kbendick commented on issue #2788:
URL: https://github.com/apache/iceberg/issues/2788#issuecomment-939270495


   Hi @SreeramGarlapati,
   
   I was looking into this recently while investigating existing CDC systems,
particularly since Spark has no built-in concept of it.
   
   Your proposal seems pretty good. My one concern is streaming all data
from `OVERWRITE` snapshots on V1 format tables by default. If that's behind a
flag people can opt into, I could get behind it. Otherwise it's potentially
quite a lot of duplicated data. I think we could, in theory, do some sort of
anti-join against files from past snapshots if they haven't been physically
deleted yet (expensive, but it could give much more correct results for V1
tables).
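
   To make the anti-join idea concrete, here is a rough sketch in plain
Python rather than anything Iceberg- or Spark-specific. It assumes we can
still read the rows of the data files an `OVERWRITE` snapshot removed as well
as the rows of the files it added. The function name and the sample rows are
hypothetical, purely for illustration.

```python
from collections import Counter

def new_rows_in_overwrite(added_rows, removed_rows):
    """Anti-join with multiset semantics: keep rows written by the overwrite
    that were not already present in the files it replaced."""
    remaining = Counter(removed_rows)
    result = []
    for row in added_rows:
        if remaining[row] > 0:
            remaining[row] -= 1   # row existed before; it was only rewritten
        else:
            result.append(row)    # row is genuinely new in this snapshot
    return result

# Hypothetical rows from files the OVERWRITE snapshot removed (still on disk).
removed = [(1, "a"), (2, "b"), (3, "c")]
# Hypothetical rows from files the OVERWRITE snapshot added.
added = [(1, "a"), (2, "B"), (3, "c"), (4, "d")]

print(new_rows_in_overwrite(added, removed))
```

   Only the changed and new rows would be restreamed here. The cost is
reading both sets of files for every overwrite, which is why this is
expensive, and it stops working once the old files are physically deleted.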
   
   I don't have a very strong opinion on much related to V1 tables, other than
that restreaming potential duplicates should definitely be opt-in, imo. 🙂


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]