[GitHub] [iceberg] SreeramGarlapati commented on a change in pull request #3517: Skip processing snapshots of type Overwrite during readStream

GitBox Tue, 04 Jan 2022 17:58:26 -0800


SreeramGarlapati commented on a change in pull request #3517:
URL: https://github.com/apache/iceberg/pull/3517#discussion_r778497242




##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -205,6 +205,11 @@ private boolean shouldProcess(Snapshot snapshot) {
             "Cannot process delete snapshot : %s. Set read option %s to allow skipping snapshots of type delete",
             snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
         return false;
+      case DataOperations.OVERWRITE:

Review comment:
   All in all, there are 2 options for reading upserts:
   1. For updates written with `copy on write`: a new data file is created that contains both the old rows and the newly updated rows. In this case, we can take a Spark read option from the user as their consent that they are okay with data replay.
   2. For updates written with `merge on read`: we will expose an option to read the change data feed, which will include a metadata column indicating whether a record is an INSERT or a DELETE.
   Does this make sense, @rdblue & @kbendick?
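
   To make the opt-in idea concrete, here is a minimal, self-contained sketch of how an `OVERWRITE` case could be gated by a read option, modeled on the delete case in the hunk above. It is an illustration only, not the code in this PR: the class name, the boolean flags, and the option name `streaming-skip-overwrite-snapshots` are assumptions, and the operation strings are local stand-ins for the `DataOperations` constants.

```java
// Hypothetical sketch: not the actual SparkMicroBatchStream implementation.
class OverwriteSkipSketch {
  // Local stand-ins for the DataOperations constants referenced in the diff.
  static final String APPEND = "append";
  static final String REPLACE = "replace";
  static final String DELETE = "delete";
  static final String OVERWRITE = "overwrite";

  // Assumed option names, used only in the error messages below.
  static final String SKIP_DELETE_OPTION = "streaming-skip-delete-snapshots";
  static final String SKIP_OVERWRITE_OPTION = "streaming-skip-overwrite-snapshots";

  /**
   * Returns true if the micro-batch stream should read the snapshot's data.
   * Delete and overwrite snapshots are never read here: they are either
   * skipped (user opted in) or rejected with an error (user did not opt in).
   */
  static boolean shouldProcess(String operation, boolean skipDelete, boolean skipOverwrite) {
    switch (operation) {
      case APPEND:
        return true;
      case REPLACE:
        // Assumed to rewrite existing rows only, so there is nothing new to emit.
        return false;
      case DELETE:
        if (!skipDelete) {
          throw new IllegalStateException(String.format(
              "Cannot process delete snapshot. Set read option %s to allow skipping snapshots of type delete",
              SKIP_DELETE_OPTION));
        }
        return false;
      case OVERWRITE:
        if (!skipOverwrite) {
          throw new IllegalStateException(String.format(
              "Cannot process overwrite snapshot. Set read option %s to allow skipping snapshots of type overwrite",
              SKIP_OVERWRITE_OPTION));
        }
        return false;
      default:
        throw new IllegalStateException("Cannot process unknown snapshot operation: " + operation);
    }
  }

  public static void main(String[] args) {
    // Append snapshots are always read; overwrite snapshots are skipped once the user opts in.
    System.out.println(shouldProcess(APPEND, false, false));
    System.out.println(shouldProcess(OVERWRITE, false, true));
  }
}
```

   Failing by default keeps the stream from silently dropping or replaying rows; the user has to state explicitly that skipping overwrite snapshots is acceptable for their pipeline.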




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.