kbendick commented on issue #2788: URL: https://github.com/apache/iceberg/issues/2788#issuecomment-939270495
Hi @SreeramGarlapati, I was looking into this recently while investigating existing CDC systems, particularly since Spark has no built-in concept of this. Your proposal seems pretty good. My one concern is streaming all data from `OVERWRITE` snapshots of V1 format tables by default. If that's a flag people can opt into, I could get behind it. But that's potentially quite a lot of duplicated data, and I think we could in theory do some sort of anti-join against files from past snapshots if they haven't been physically deleted yet (expensive, but it could give much more correct results for V1 tables). I don't have a very strong opinion on much related to V1 tables, other than that restreaming potential duplicates should definitely be opt-in imo. 🙂
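To make the anti-join idea in the comment concrete, here is a rough sketch in plain Python. It is only one reading of the suggestion and does not use any Iceberg or Spark API: `net_new_rows`, `added_rows`, and `removed_rows` are hypothetical names, and the sketch assumes the rows from the files an `OVERWRITE` snapshot replaced can still be read (that is, those files have not been physically deleted). Rows are matched as whole tuples, with duplicates counted, so a row that was merely rewritten is not streamed again.

```python
from collections import Counter


def net_new_rows(added_rows, removed_rows):
    """Return rows from the overwrite's added files that were not
    already present in the files the overwrite replaced.

    Both arguments are iterables of hashable rows (tuples here).
    This is a multiset anti-join: each removed row cancels at most
    one identical added row.
    """
    remaining = Counter(removed_rows)
    result = []
    for row in added_rows:
        if remaining[row] > 0:
            # Row existed before the overwrite: treat it as rewritten,
            # not as a new change to stream.
            remaining[row] -= 1
        else:
            result.append(row)
    return result


# Hypothetical data: the overwrite rewrote three rows and added one.
removed = [(1, "a"), (2, "b"), (2, "b")]
added = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
new_rows = net_new_rows(added, removed)
```

Holding every replaced row in memory is what makes this expensive on real tables; an actual implementation would presumably do the join in the engine, and the default without the opt-in flag would simply skip such snapshots.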
