kbendick opened a new pull request #3267: URL: https://github.com/apache/iceberg/pull/3267
This PR adds:

1. Explicit handling of `DataOperations.OVERWRITE` snapshots when the Spark read option `streaming-skip-delete-snapshots` is set to `true`.
2. More explicit handling of all possible `DataOperations`, plus a catch-all clause for unrecognized / unimplemented snapshot operation types.
3. A unit test showing the behavior when `streaming-skip-delete-snapshots` is set to `true`, to pair with the existing unit test for the default case, which sets the flag to `false`.

This closes https://github.com/apache/iceberg/issues/3265

**Detailed explanation & Motivating Use Case**

The Spark MicroBatch streaming source can currently produce a "correct" stream only from snapshots that do not mutate or delete any existing rows. This means it can presently handle two types of snapshots that generate data:

- `DataOperations.APPEND` - New data files are added to the table.
- `DataOperations.REPLACE` - Files are replaced without changing the data in the table (e.g. data file rewrites).

Users can choose to skip "delete" type snapshots via the read option `streaming-skip-delete-snapshots`, which simply skips any snapshot that potentially contains a mutation of a row in the table. OVERWRITE snapshots are a form of delete, so they should arguably be skippable when users choose to skip deletes.

Consider the case where users simply want a Spark streaming source over an append-only Iceberg table (which is more or less what they can do now), but the data is then reprocessed in a batch operation, for example to change its compression codec. There would be no easy way to skip these out-of-band, one-time OVERWRITE commits to the table. Unlike snapshots of purely DELETE type, however, OVERWRITE snapshots can also add data.
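The dispatch rules described above could be sketched roughly as follows. This is a simplified, hypothetical Python model of the decision, not Iceberg's actual Java implementation; the function name, return values, and exact skip/process choices are illustrative (the operation strings match Iceberg's `DataOperations` constants):

```python
# Hypothetical sketch of the snapshot-dispatch logic: APPEND snapshots are
# processed, REPLACE snapshots are skipped (they do not change table data),
# and with skip-delete enabled, DELETE and OVERWRITE snapshots are both
# skipped rather than failing the stream. Unrecognized operations hit a
# catch-all. Not Iceberg's real API.
def should_process_snapshot(operation: str, skip_delete: bool) -> str:
    if operation == "append":
        return "process"
    if operation == "replace":
        # e.g. data file rewrites / compaction; no row-level change
        return "skip"
    if operation in ("delete", "overwrite"):
        if skip_delete:
            return "skip"
        raise ValueError(
            f"Cannot process snapshot with operation '{operation}'; "
            "set streaming-skip-delete-snapshots=true to skip it")
    # Catch-all for unrecognized / unimplemented operation types.
    raise ValueError(f"Unrecognized snapshot operation: {operation}")
```

With this shape, an OVERWRITE snapshot behaves like a DELETE snapshot under the skip flag, which is the behavior this PR adds.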
While it would potentially be possible to read only the added data files and still skip the deletes in a limited set of cases, given that users are already skipping deletes, it seems fair to let them skip mixed delete-and-add snapshots as well. I'd personally rather the time spent implementing that go into refactoring the Spark MicroBatch stream to truly produce CDC data (something I have begun drafting a high-level API proposal for, given that Spark doesn't support it natively). But I'm very much open to discussion on this.

When we refactor the Spark streaming source to handle deletions, we will of course be sure to handle commits that both delete and add data at the same time. Hence I think this more explicit addition is good enough for now. If we don't want to take this relatively simplistic (but in line with the status quo) approach, at the very least a test should be added documenting the intended behavior when `streaming-skip-delete-snapshots` is `true`, as there is already a test showing that OVERWRITE snapshots will fail an Iceberg Spark streaming source when the option is not used / set to `false` (its default value).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
