alexprosak opened a new pull request, #16679: URL: https://github.com/apache/iceberg/pull/16679
Closes <issue #>

Implements an initial-snapshot bootstrap for streaming reads of Iceberg tables in Spark. A streaming query started with a fresh checkpoint now reads the current snapshot in full as its first batch, then continues incrementally, matching how Delta Lake's streaming source behaves by default.

Currently, a fresh Iceberg streaming query without options replays the table's history from the oldest ancestor snapshot. This is slow for long-lived tables, and expired or deleted snapshots cannot be streamed at all. The only existing workaround (bootstrap via a batch read, then cut over by timestamp) is fragile, so users can't easily express the common pattern of loading the entire table and then keeping up with newly appended data.

## About the Change

This PR adds a `stream-from-snapshot` option:

| `stream-from-snapshot` | Behavior |
|---|---|
| *(not set)* | **new default**: read the current snapshot in full, then continue with new snapshots |
| `latest` | read snapshots committed after stream startup |
| `earliest` | read from the oldest ancestor (the existing default) |
| *`<snapshot-id>`* | read snapshots after the given snapshot id (exclusive) |

`stream-from-snapshot` and `stream-from-timestamp` are mutually exclusive. The previous oldest-ancestor default can be opted into via `stream-from-snapshot=earliest`.

The implementation reuses a `scanAllFiles` flag that already existed on `StreamingOffset` but was previously always set to false. The new default returns `StreamingOffset(currentSnapshotId, 0, scanAllFiles=true)`. `SyncSparkMicroBatchPlanner.planFiles()` bypasses `shouldProcess` for the initial snapshot so that an OVERWRITE / DELETE / REPLACE current snapshot is read for its full table state rather than rejected.
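To make the option table concrete, here is a small standalone Python model of which already-committed snapshots a fresh stream would read under each setting. It is only a sketch of the semantics described above, not the Iceberg implementation: the function names `validate` and `initial_reads`, the `(snapshot_id, scan_all_files)` tuple, and the error raised for an unknown snapshot id are all assumptions made for illustration.

```python
from typing import List, Mapping, Optional, Sequence, Tuple

SNAPSHOT_OPT = "stream-from-snapshot"
TIMESTAMP_OPT = "stream-from-timestamp"

# (snapshot_id, scan_all_files): True means "read the snapshot's full table
# state", False means "read only what that snapshot added". Hypothetical type.
Read = Tuple[int, bool]


def validate(options: Mapping[str, str]) -> None:
    """Reject option maps that set both start options (hypothetical helper)."""
    if SNAPSHOT_OPT in options and TIMESTAMP_OPT in options:
        raise ValueError(f"{SNAPSHOT_OPT} and {TIMESTAMP_OPT} are mutually exclusive")


def initial_reads(ancestors: Sequence[int], start: Optional[str] = None) -> List[Read]:
    """Model which existing snapshots a fresh stream reads.

    ancestors: snapshot ids, oldest first; the last one is the current snapshot.
    start: the value of stream-from-snapshot, or None when the option is not set.
    """
    if not ancestors:
        return []  # empty table: nothing to read yet
    if start is None:
        # New default: one full read of the current snapshot.
        return [(ancestors[-1], True)]
    if start == "latest":
        # Only snapshots committed after startup are read.
        return []
    if start == "earliest":
        # Previous default: incremental replay from the oldest ancestor.
        return [(sid, False) for sid in ancestors]
    snapshot_id = int(start)
    if snapshot_id not in ancestors:
        # Assumption: an unknown id is an error in this toy model.
        raise ValueError(f"snapshot {snapshot_id} is not an ancestor")
    # Exclusive start: everything committed after the given snapshot.
    idx = list(ancestors).index(snapshot_id)
    return [(sid, False) for sid in ancestors[idx + 1:]]


history = [101, 102, 103]
print(initial_reads(history))              # [(103, True)]
print(initial_reads(history, "earliest"))  # [(101, False), (102, False), (103, False)]
print(initial_reads(history, "101"))       # [(102, False), (103, False)]
```

In every mode, snapshots committed after startup are then read incrementally; the model only covers what is read from history that already exists when the stream starts.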
**Follow-ups**:
- Docs update with new default behavior & `stream-from-snapshot` option
- Ports to other supported Spark versions

**Future Work**:
- Supporting row-level deletes in the initial-snapshot read: the streaming planner does not load delete manifests, so V2 positional/equality delete files and V3 deletion vectors aren't applied. To avoid silently emitting row-level-deleted rows, the initial snapshot load currently throws when the current snapshot has any delete manifests.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
