openinx commented on issue #360: URL: https://github.com/apache/iceberg/issues/360#issuecomment-650694300
Well, it seems my thinking differs from yours. As I understand it, your solution divides the deletes into two kinds: 1. equality-deletion; 2. positional-deletion. An equality-deletion is applied only to data files whose sequence_number is less than that of the current delete file, while a positional-deletion is applied only to data files with the same sequence number as the delete file.

I see several problems with this:

1. As you said, keeping the index from ID to position is expensive, especially when the data cannot fit in the limited memory. In that case we may need to spill to disk, which could produce many random seeks when looking up the position for a given ID while generating the positional delete files.
2. Having both equality-deletions and positional-deletions seems to make the JOIN algorithm complex: both the read and the replay implementations need to handle both kinds. I'd prefer to use a single kind of deletion if possible.
3. If the equality-deletions keep only the primary key columns in the delta files, then replaying to the downstream Iceberg table becomes a problem. For example, suppose we have a table with two columns (a, b), where a is the primary key and b is the partition key. The operation `DELETE(a=1)` needs to be applied to all partitions in the downstream Iceberg table, while `DELETE(a=1, b=2)` only needs to be applied to `partition=2`. Keeping all columns for equality-deletions is good for replaying.

I'm writing a document about `equality-deletes` and will post it to the mailing list in the next few days.
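
To make the sequence-number rule and point 3 concrete, here is a minimal sketch in Python. It is only an illustration of the reasoning in this comment, not Iceberg code: the names `DataFile`, `DeleteFile`, `applies_to`, and `partitions_to_scan` are hypothetical, and the applicability rule is the one summarized above, not a statement about what the Iceberg spec finally adopts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file in the table."""
    path: str
    sequence_number: int


@dataclass(frozen=True)
class DeleteFile:
    """Hypothetical stand-in for a delete file.

    kind is "equality" or "positional"; predicate maps column name to the
    value being deleted, e.g. {"a": 1} or {"a": 1, "b": 2}.
    """
    kind: str
    sequence_number: int
    predicate: Dict[str, int] = field(default_factory=dict)


def applies_to(delete: DeleteFile, data: DataFile) -> bool:
    """Applicability rule as summarized in the comment (an assumption here)."""
    if delete.kind == "equality":
        # Equality deletes only affect strictly older data files.
        return data.sequence_number < delete.sequence_number
    if delete.kind == "positional":
        # Positional deletes only affect data files with the same sequence number.
        return data.sequence_number == delete.sequence_number
    raise ValueError(f"unknown delete kind: {delete.kind}")


def partitions_to_scan(
    delete: DeleteFile,
    all_partitions: List[int],
    partition_column: str = "b",
) -> List[int]:
    """Partitions a replayed equality delete must touch.

    If the delete carries a value for the partition column, only that
    partition is affected; otherwise every partition must be checked.
    """
    wanted = delete.predicate.get(partition_column)
    if wanted is None:
        return list(all_partitions)
    return [p for p in all_partitions if p == wanted]


# Usage: table (a, b) with a as primary key and b as partition key.
key_only = DeleteFile("equality", sequence_number=5, predicate={"a": 1})
all_columns = DeleteFile("equality", sequence_number=5, predicate={"a": 1, "b": 2})
print(partitions_to_scan(key_only, [1, 2, 3]))
print(partitions_to_scan(all_columns, [1, 2, 3]))
```

With only the primary key recorded, the replay has no way to narrow the delete down, so it falls back to every partition. Recording all columns lets it prune to the single matching partition.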
