davseitsev commented on pull request #2660:
URL: https://github.com/apache/iceberg/pull/2660#issuecomment-1020896240


   @SreeramGarlapati, thank you very much for your work. I have a few questions.
   
   We have a Spark streaming job that reads data from Kafka, processes it, and 
stores it in an Iceberg table partitioned by day.
   There is a background compaction process and a scheduled cleanup task that 
expires old snapshots to remove old small files.
   I want to build another streaming job that reads several tables produced by 
the first job, unions them, filters the necessary rows, and stores the result in 
an Iceberg table.
   So I'd like to better understand how expired snapshots are handled.
   
   In our case the source table contains `append` and `replace` snapshots.
   `MicroBatchStream.initialOffset()` always returns a `StreamingOffset` with 
`scanAllFiles=true` to process historical data. Because old snapshots are 
expired by the cleanup process, we can end up in a state where the first 
remaining snapshot is of type `replace`. Due to #2752 we ignore `replace` 
snapshots. Will this lead to a situation where we skip the initial snapshot 
with `scanAllFiles=true` and lose all the data appended in old (expired) 
snapshots?
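
   To make the concern concrete, here is a toy Python model of the scenario. 
It is not Iceberg code: `Snapshot`, `read_stream`, and the `skip_replace` flag 
are hypothetical names, and the reader logic only encodes my assumption that 
`replace` snapshots are skipped entirely and that only the first offset is read 
with `scanAllFiles=true`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str         # "append" or "replace"
    added_rows: frozenset  # rows first written by this snapshot
    live_rows: frozenset   # every row visible in the table at this snapshot


def read_stream(snapshots, skip_replace):
    """Toy reader: full scan at the first offset, incremental afterwards."""
    seen = set()
    for index, snap in enumerate(snapshots):
        if skip_replace and snap.operation == "replace":
            continue  # assumed behavior: the snapshot is skipped entirely
        scan_all_files = index == 0  # only the initial offset scans everything
        seen |= snap.live_rows if scan_all_files else snap.added_rows
    return seen


history = [
    Snapshot(1, "append", frozenset({"a", "b"}), frozenset({"a", "b"})),
    Snapshot(2, "append", frozenset({"c"}), frozenset({"a", "b", "c"})),
    Snapshot(3, "replace", frozenset(), frozenset({"a", "b", "c"})),  # compaction
    Snapshot(4, "append", frozenset({"d"}), frozenset({"a", "b", "c", "d"})),
]
after_expiry = history[2:]  # snapshots 1 and 2 expired by the cleanup task

full_history_rows = read_stream(history, skip_replace=True)
after_expiry_rows = read_stream(after_expiry, skip_replace=True)
```

   In this model the reader over the full history sees all four rows, while 
the reader that starts after expiration sees only `d`, because the full scan 
was attached to the skipped `replace` snapshot.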
   
   And one more question.
   Let's say we have 1 year of data in the source table, expire snapshots older 
than 7 days, and the cleanup job runs every hour.
   If a job starts reading this table from the initial offset, it has at most 
1 hour to process the first snapshot, doesn't it? Because the initial snapshot 
is processed with `scanAllFiles=true`, it is the biggest one: it contains the 
data for 1 year minus 7 days. If I'm correct, there is a good chance that the 
streaming job will fail midway when the cleanup job runs, because the cleanup 
will expire the snapshot that the streaming job is processing.
   Would it work if `initialOffset()` returned the latest snapshot with the 
`scanAllFiles` flag set, instead of the first snapshot?
   - `latestOffset()` -> [LatestSnapshot, LatestFile, false]
   - `initialOffset()` -> [LatestSnapshot, 0, true]
   In this case we would probably have 7 days to process the initial snapshot.
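
   As a rough sketch of the timing argument (plain Python, not Iceberg's API): 
the `Offset` tuple mirrors the [snapshot, position, scanAllFiles] triples above, 
while `processing_budget`, the snapshot ids, and the commit times are made up 
for illustration. It assumes a snapshot becomes eligible for expiration exactly 
7 days after it was committed.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Offset(NamedTuple):
    snapshot_id: int
    position: int
    scan_all_files: bool


def processing_budget(committed_at, now, retention):
    """Time left before the snapshot becomes eligible for expiration."""
    return max(committed_at + retention - now, timedelta(0))


now = datetime(2022, 1, 25, 12, 0)
retention = timedelta(days=7)

# Hypothetical snapshots: id -> commit time.
snapshots = {
    101: now - retention + timedelta(minutes=20),  # oldest retained snapshot
    205: now - timedelta(minutes=5),               # latest snapshot
}
oldest_id = min(snapshots, key=snapshots.get)
latest_id = max(snapshots, key=snapshots.get)

current_initial = Offset(oldest_id, 0, True)   # initial offset as described above
proposed_initial = Offset(latest_id, 0, True)  # initial offset as proposed

budget_current = processing_budget(snapshots[oldest_id], now, retention)
budget_proposed = processing_budget(snapshots[latest_id], now, retention)
```

   With these made-up numbers the oldest retained snapshot leaves 20 minutes 
for the full scan, while the latest snapshot leaves 7 days minus 5 minutes.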
   
   Are there any other corner cases with snapshot expiration?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


