rdblue commented on issue #2764: URL: https://github.com/apache/iceberg/issues/2764#issuecomment-890152309
Bulk operations are difficult to propagate as mutations, which is why most of the early incremental-consumption work focuses on the easy case: appends. But there are ways to make it work.

One option is to read all of the deleted files and all of the added files in an overwrite, use a full outer join to label each row as deleted, added, or kept, and then feed those rows into incremental processing. That join is expensive, though.

Another strategy is to read just the added files and treat all of the changes as an `upsert` operation. As long as you know that each incoming row replaces 0 or 1 existing rows, you may not need to know what the previous row was. The problem is that this makes assumptions about the operation that happened and doesn't apply to all cases; for your deduplication case, it wouldn't tell you when duplicate rows are removed.

I think the easier solution is turning a row-delta commit into changes, because you can read the deleted rows and added rows directly without needing to process (and usually discard) the kept rows.
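To make the join-based option concrete, here is a rough PySpark sketch (not an Iceberg API): it assumes `deleted_rows` and `added_rows` are DataFrames holding the rows from the files an overwrite removed and added (how you read those file sets is out of scope here), and that `id` is a unique row key.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: rows from the data files the overwrite removed and the
# data files it added. Producing these (e.g. from the snapshot's manifests)
# is out of scope for this sketch.
deleted_rows = spark.read.parquet("/warehouse/overwrite/removed-files")
added_rows = spark.read.parquet("/warehouse/overwrite/added-files")

# Full outer join on an assumed unique key so every row appears once, then
# label it: only on the deleted side -> deleted, only on the added side ->
# added, on both sides -> kept. (A stricter version would also compare the
# non-key columns to distinguish kept rows from updated rows.)
d = deleted_rows.alias("d")
a = added_rows.alias("a")
changes = (
    d.join(a, F.col("d.id") == F.col("a.id"), "full_outer")
     .withColumn(
         "_change_type",
         F.when(F.col("a.id").isNull(), F.lit("deleted"))
          .when(F.col("d.id").isNull(), F.lit("added"))
          .otherwise(F.lit("kept")),
     )
)

# Downstream incremental processing usually only cares about the real changes.
incremental_input = changes.where(F.col("_change_type") != "kept")
```

The cost mentioned above sits in that full outer join: every kept row from the rewritten files is read and shuffled only to be discarded, which is exactly the work the row-delta approach avoids.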
