rdblue commented on issue #2764: URL: https://github.com/apache/iceberg/issues/2764#issuecomment-890152309
Bulk operations are difficult to propagate as mutations, which is why most of the early incremental-consumption work focuses on the easy case: appends. But there are ways to make it work.

One option is to read all of the deleted files and all of the added files in an overwrite, use a full outer join to label each row as deleted, added, or kept, and then feed those rows into incremental processing. That join is expensive, though.

Another strategy is to read just the added files and treat all of the changes as an `upsert` operation. As long as you know that each incoming row replaces 0 or 1 existing rows, you may not need to know what the previous row was. The problem is that this makes assumptions about the operation that happened and doesn't apply to all cases; for your deduplication case, it wouldn't tell you when duplicate rows are removed.

I think the easier solution is turning a row-delta commit into changes, because you can read the deleted rows and added rows directly without needing to process (and usually discard) the kept rows.
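To make the join-based option concrete, here is a rough PySpark sketch (not an Iceberg API): it assumes `deleted_rows` and `added_rows` are DataFrames holding the rows from the files an overwrite removed and added (how you read those file sets is out of scope here), and that `id` is a unique row key.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: rows from the data files the overwrite removed and the
# data files it added. Producing these (e.g. from the snapshot's manifests)
# is out of scope for this sketch.
deleted_rows = spark.read.parquet("/warehouse/overwrite/removed-files")
added_rows = spark.read.parquet("/warehouse/overwrite/added-files")

# Full outer join on an assumed unique key so every row appears once, then
# label it: only on the deleted side -> deleted, only on the added side ->
# added, on both sides -> kept. (A stricter version would also compare the
# non-key columns to distinguish kept rows from updated rows.)
d = deleted_rows.alias("d")
a = added_rows.alias("a")
changes = (
    d.join(a, F.col("d.id") == F.col("a.id"), "full_outer")
     .withColumn(
         "_change_type",
         F.when(F.col("a.id").isNull(), F.lit("deleted"))
          .when(F.col("d.id").isNull(), F.lit("added"))
          .otherwise(F.lit("kept")),
     )
)

# Downstream incremental processing usually only cares about the real changes.
incremental_input = changes.where(F.col("_change_type") != "kept")
```

The cost mentioned above sits in that full outer join: every kept row from the rewritten files is read and shuffled only to be discarded, which is exactly the work the row-delta approach avoids.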
