openinx commented on issue #360:
URL: https://github.com/apache/iceberg/issues/360#issuecomment-650694300


   Well, it seems my thinking differs from yours. 
   
   As I understand it, your solution divides the deletes into two kinds: 
1. equality deletions; 2. positional deletions. An equality deletion is 
applied only to data files whose sequence_number is less than that of the 
delete file, while a positional deletion is applied only to data files with 
the same sequence number. 
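
   To check that I read the rule correctly, here is a minimal sketch of it in 
Python. The names `DataFile`, `DeleteFile` and `applies_to` are made up for 
illustration and are not Iceberg APIs; only the two sequence-number 
comparisons come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file and its sequence number.
    path: str
    sequence_number: int


@dataclass(frozen=True)
class DeleteFile:
    # kind is assumed to be either "equality" or "position".
    kind: str
    sequence_number: int


def applies_to(delete: DeleteFile, data: DataFile) -> bool:
    """Return True if the delete file must be applied to the data file."""
    if delete.kind == "equality":
        # Equality deletes only affect strictly older data files.
        return data.sequence_number < delete.sequence_number
    if delete.kind == "position":
        # Positional deletes only affect data files with the same sequence number.
        return data.sequence_number == delete.sequence_number
    raise ValueError(f"unknown delete kind: {delete.kind}")
```

   With this rule, a reader has to evaluate both branches for every data 
file, which is part of the complexity I mention in point 2 below.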
   
   In my view, there are several problems: 
   1.  As you said, keeping the index from ID to position is expensive, 
especially when the data doesn't fit in the limited memory. In that case, we 
may need to spill to disk, which could produce many random seeks when looking 
up the position for a given ID while generating the positional delete files.
   2.  Having both equality deletions and positional deletions seems to make 
the JOIN algorithm complex: both the read and the replay implementations need 
to handle both kinds. I'd prefer to use a single kind of deletion if possible. 
   3.  If the equality deletions keep only the primary key columns in the 
delta files, that becomes a problem when replaying to the downstream Iceberg 
table. For example, say we have a table with two columns (a,b), where a is the 
primary key and b is the partition key. The operation `DELETE(a=1)` will need 
to be applied to all partitions in the downstream Iceberg table, while 
`DELETE(a=1, b=2)` will only need to be applied to `partition=2`. Keeping all 
columns in the equality deletions makes replaying cheaper. 
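
   The example in point 3 can be sketched as below. This is only an 
illustration under my own assumptions: `partitions_to_scan` is a hypothetical 
helper, the delete is modelled as a plain dict of column to value, and the 
partition value is assumed to equal the value of column b.

```python
def partitions_to_scan(delete_predicate, partition_column, all_partitions):
    """Return the partitions a replayed equality delete has to touch.

    delete_predicate: dict of column name -> value, e.g. {"a": 1, "b": 2}
    partition_column: name of the partition key column, e.g. "b"
    all_partitions:   partition values present in the downstream table
    """
    if partition_column in delete_predicate:
        # The delete carries the partition key, so only one partition matches.
        wanted = delete_predicate[partition_column]
        return [p for p in all_partitions if p == wanted]
    # Only the primary key is known, so every partition must be checked.
    return list(all_partitions)


partitions = [1, 2, 3]
print(partitions_to_scan({"a": 1}, "b", partitions))          # DELETE(a=1)
print(partitions_to_scan({"a": 1, "b": 2}, "b", partitions))  # DELETE(a=1, b=2)
```

   The first call has to visit every partition, while the second is pruned to 
the single partition where b equals 2.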
   
   I'm writing a document on `equality-deletes` and will post it to the 
mailing list in the next few days. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



