kbendick opened a new issue #3265: URL: https://github.com/apache/iceberg/issues/3265
The Spark microbatch streaming source can currently only handle snapshots that do not mutate or delete any existing rows. This means that it can presently handle two types of snapshots that generate data:

- DataOperations.APPEND: New data files are added to the table.
- DataOperations.REPLACE: Files are removed and replaced without changing the data in the table (for example, data file rewrites that compact small files).

Users can choose to skip "delete" type snapshots via the read option "streaming-skip-delete-snapshots", which skips a snapshot if it potentially contains any kind of mutation to a row in the table.

OVERWRITE type snapshots are a form of delete, so they should arguably be skippable when users choose to skip deletes. However, they can also add data, so when we refactor the Spark streaming source to handle deletions, we should be sure to handle commits that both delete and add data.

At the very least, a test should be added indicating the intended behavior when "streaming-skip-delete-snapshots" is true, as there is already a test showing that OVERWRITE snapshots fail an Iceberg Spark streaming source when the option is not used.
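
To make the intended behavior concrete, here is a rough, self-contained sketch of the per-snapshot decision discussed in this issue. It is not the Iceberg implementation: `Action` and `plan_snapshot` are hypothetical names invented for illustration, and treating REPLACE snapshots as skipped (because they carry no row changes) is an assumption. The OVERWRITE branch encodes the proposal in this issue, not confirmed current behavior.

```python
from enum import Enum

# Read option named in this issue.
SKIP_DELETES_OPTION = "streaming-skip-delete-snapshots"


class Action(Enum):
    """Hypothetical outcomes for one snapshot in a micro-batch plan."""
    READ = "read"   # emit the snapshot's added files
    SKIP = "skip"   # ignore the snapshot and move on
    FAIL = "fail"   # stop the stream with an error


def plan_snapshot(operation: str, options: dict) -> Action:
    """Decide what a streaming read might do with a snapshot (illustrative only)."""
    skip_deletes = str(options.get(SKIP_DELETES_OPTION, "false")).lower() == "true"

    if operation == "append":
        return Action.READ
    if operation == "replace":
        # Assumption: rewritten files contain no new rows, so nothing is emitted.
        return Action.SKIP
    if operation == "delete":
        return Action.SKIP if skip_deletes else Action.FAIL
    if operation == "overwrite":
        # Proposal from this issue: treat overwrite as a form of delete.
        # Caveat: skipping also drops any rows the overwrite added.
        return Action.SKIP if skip_deletes else Action.FAIL
    raise ValueError(f"unknown snapshot operation: {operation!r}")
```

A test for the proposed behavior would then assert that `plan_snapshot("overwrite", {SKIP_DELETES_OPTION: "true"})` yields `Action.SKIP`, while the same call with no options yields `Action.FAIL`, mirroring the existing failure test mentioned above.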
