davseitsev commented on pull request #2660: URL: https://github.com/apache/iceberg/pull/2660#issuecomment-1020896240
@SreeramGarlapati thank you very much for your work. I have a few questions. We have Spark streaming job which reads data from Kafka, process it and store to iceberg table partitioned by day. There is a background compaction process and scheduled cleanup task that expires old snapshots to remove old small files. I want to build another streaming job that reads a few tables produced by first job, unions them, filters necessary rows and stores data to iceberg table. Thus I'd like to understand better how expired snapshots are handled. In our case source table contains `append` and `replace` snapshots. `MicroBatchStream.initialOffset()` always returns `StreamingOffset` with `scanAllFiles=true` to process historical data. As old snapshots are expired by cleanup process we can get into the case when first snapshot is of type `replace`. Due to #2752 we ignore `replace` snapshots. Will it lead to the situation when we skip initial snapshot with `scanAllFiles=true` and loose all data appended in old (expired) snapshots? And one more question. Let's say have data in source table for 1 year, expire snapshots older than 7 days and cleanup job runs every 1 hour. If a job starts reading this table from initial offset it has at most 1 hour to process first snapshot, doesn't it? As initial snapshot is processed with `scanAllFiles=true`, it's the biggest one because it contains data for 1 year minus 7 days. If I'm correct, there is a big chance that streaming job will fail in the middle when cleanup job runs. Because it will expire the snapshot which is processed by streaming job. Would it work if `initialOffset()` returns latest snapshot with flag `scanAllFiles` instead of first snapshot?: - `latestOffset()` -> [LatestSnapshot, LatestFile, false] - `initialOffset()` -> [LatestSnapshot, 0, true] Probably in this case we should have 7 days to process initial snapshot. Are there any other corner cases with snapshots expiration? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
