kbendick opened a new issue #3265:
URL: https://github.com/apache/iceberg/issues/3265


   The Spark microbatch streaming source can currently only handle snapshots 
that do not mutate or delete any of the existing rows.
   
   This means that it can presently handle two types of snapshots:
   - DataOperations.APPEND - New data files are added to the table.
   - DataOperations.REPLACE - Files are removed and replaced without changing 
the data in the table (such as data file rewrites that compact small files).
   
   Users can choose to skip "delete" type snapshots via the read option 
"streaming-skip-delete-snapshots", which simply skips the given snapshot if it 
potentially contains any kind of mutation to a row in the table.
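
   For reference, a minimal sketch of how the read option might be set from 
PySpark. The helper name read_iceberg_stream and the table name used below are 
hypothetical, and the sketch assumes a SparkSession that already has an Iceberg 
catalog configured.

```python
def read_iceberg_stream(spark, table_name, skip_delete_snapshots=True):
    """Open a streaming read on an Iceberg table (hypothetical helper).

    `spark` is assumed to be a SparkSession with an Iceberg catalog configured.
    """
    return (
        spark.readStream
        .format("iceberg")
        # Read options are passed as strings; "true" asks the source to skip
        # delete snapshots instead of failing on them.
        .option("streaming-skip-delete-snapshots",
                str(skip_delete_snapshots).lower())
        .load(table_name)
    )
```

   Usage would look like read_iceberg_stream(spark, "db.events"), where 
"db.events" is a placeholder table identifier.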
   
   OVERWRITE type snapshots are a form of delete and so they should arguably be 
skippable if users choose to skip deletes.
   
   However, they can also add data, so when we refactor the Spark streaming 
source to handle deletions, we should be sure to handle commits that both 
delete and add data.
   
   At the very least, a test should be added indicating the intended behavior 
when "streaming-skip-delete-snapshots" is true, as there is already a test 
showing that OVERWRITE snapshots will fail an Iceberg Spark streaming source 
when the option is not used.
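
   To make the proposed behavior concrete, here is a small self-contained model 
of the per-snapshot decision. It is only a sketch of the rules described in 
this issue, not the actual Iceberg source code: the names Op, Action and 
plan_snapshot are hypothetical, and treating REPLACE as "nothing new to emit" 
is an assumption based on the description above.

```python
from enum import Enum


class Op(Enum):
    """Snapshot operation types discussed in this issue."""
    APPEND = "append"
    REPLACE = "replace"
    DELETE = "delete"
    OVERWRITE = "overwrite"


class Action(Enum):
    """What the streaming source does with a snapshot (hypothetical names)."""
    READ = "read"   # emit the snapshot's added rows
    SKIP = "skip"   # move past the snapshot without emitting anything
    FAIL = "fail"   # stop the stream with an error


def plan_snapshot(op, skip_delete_snapshots):
    """Model of the proposed handling for a single snapshot."""
    if op is Op.APPEND:
        return Action.READ
    if op is Op.REPLACE:
        # Assumption: rewrites do not change table data, so nothing is emitted.
        return Action.SKIP
    if op in (Op.DELETE, Op.OVERWRITE):
        # Proposal: OVERWRITE is treated like DELETE when the option is set.
        return Action.SKIP if skip_delete_snapshots else Action.FAIL
    raise ValueError(f"unknown operation: {op!r}")


# The behavior a new test could pin down:
assert plan_snapshot(Op.OVERWRITE, skip_delete_snapshots=True) is Action.SKIP
assert plan_snapshot(Op.OVERWRITE, skip_delete_snapshots=False) is Action.FAIL
```

   Note that skipping an OVERWRITE snapshot in this model also drops any rows 
that the same commit added, which is the gap a later refactor would need to 
close.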


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


