kbendick opened a new pull request #3267:
URL: https://github.com/apache/iceberg/pull/3267


   This PR adds:
   
   1. Explicit handling of DataOperations.OVERWRITE snapshots when the Spark read option `streaming-skip-delete-snapshots` is set to true.
   2. More explicit handling of all possible DataOperations, plus a catch-all clause for unrecognized / unimplemented snapshot operation types.
   3. A unit test showing the behavior when `streaming-skip-delete-snapshots` is set to true, to pair with the existing unit test for the default case, which sets the flag to false.
   
   This closes https://github.com/apache/iceberg/issues/3265
   
   **Detailed explanation & Motivating Use Case**
   
   The Spark MicroBatch streaming source can currently produce a "correct" stream only from snapshots that do not mutate or delete any existing rows.
   
   This means it can presently handle the two types of snapshots that generate data:
   - DataOperations.APPEND - New data files are added to the table.
   - DataOperations.REPLACE - Files are replaced without changing the data in the table (e.g. data file rewrites).
   
   Users can choose to skip "delete" type snapshots via the read option `streaming-skip-delete-snapshots`, which simply skips a snapshot if it potentially contains any kind of mutation to a row in the table.
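   For reference, enabling the option on a streaming read looks roughly like this (a usage sketch, assuming an active SparkSession named `spark` with the Iceberg runtime on the classpath; the table name `db.table` is illustrative):
   
   ```python
   # Sketch: enable skipping of delete-type snapshots on an Iceberg stream.
   # Assumes `spark` is an existing SparkSession; `db.table` is a placeholder.
   df = (
       spark.readStream
       .format("iceberg")
       .option("streaming-skip-delete-snapshots", "true")
       .load("db.table")
   )
   ```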
   
   OVERWRITE type snapshots are a form of delete, so they should arguably be skippable when users choose to skip deletes.
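   To make the intended behavior concrete, here is a small model (in Python, not the actual Java implementation) of the per-snapshot decision, including the catch-all for unrecognized operations; the helper name `should_process` and the error messages are my own:
   
   ```python
   # Illustrative model of the per-snapshot decision, not Iceberg's actual code.
   # The operation strings mirror Iceberg's DataOperations constants.
   APPEND = "append"
   REPLACE = "replace"
   DELETE = "delete"
   OVERWRITE = "overwrite"
   
   def should_process(operation: str, skip_delete_snapshots: bool) -> bool:
       """Return True to read a snapshot, False to skip it, or raise."""
       if operation in (APPEND, REPLACE):
           # Snapshots that only add or rewrite files are always streamable.
           return True
       if operation in (DELETE, OVERWRITE):
           # Delete-type snapshots (including overwrites) are skippable only
           # when the user opted in via streaming-skip-delete-snapshots.
           if skip_delete_snapshots:
               return False
           raise ValueError(
               f"Cannot stream a {operation} snapshot; "
               "set streaming-skip-delete-snapshots=true to skip it"
           )
       # Catch-all clause for unrecognized / unimplemented operation types.
       raise ValueError(f"Unrecognized snapshot operation: {operation}")
   ```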
   
   Consider the case where users simply want to use Spark to get an Iceberg source stream from an append-only table (more or less what they can do today), but the data is later reprocessed as part of a batch operation, for example to change its compression. There would be no easy way to skip these out-of-band, one-time OVERWRITE operations committed to the table.
   
   However, OVERWRITE snapshots can also add data (as opposed to snapshots of purely DELETE type). While it would potentially be possible to grab only the added data files and still skip the deletes in a limited set of cases, given that users are already skipping deletes, it seems fair to allow them to skip mixed delete-and-append snapshots as well. I'd personally rather see the time that implementation would take go into refactoring the Spark MicroBatch stream to truly produce CDC data (something I have begun drafting a high-level API proposal for, given that Spark does not support it natively). But I'm very much open to discussion on this.
   
   When we refactor the Spark streaming source to handle deletions, we will of course be sure to handle commits that both delete and add data at the same time. Hence I think this more explicit addition is good enough for now.
   
   If we don't want to take this relatively simplistic (but in line with the status quo) approach, then at the very least a test should be added documenting the intended behavior when `streaming-skip-delete-snapshots` is true, as there is already a test showing that OVERWRITE snapshots will fail an Iceberg Spark streaming source when the option is unset or set to false (its default value).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


