alexprosak opened a new pull request, #16679:
URL: https://github.com/apache/iceberg/pull/16679

   Closes <issue #>
   
   Implements an initial-snapshot bootstrap for streaming reads of Iceberg 
tables in Spark. A streaming query started with a fresh checkpoint now reads 
the current snapshot in full as its first batch, then continues incrementally - 
matching how Delta Lake's streaming source behaves by default.
   
   Currently, a fresh Iceberg streaming query with no options replays the 
table's history from the oldest ancestor snapshot. This is slow for long-lived 
tables, and data in expired or deleted snapshots cannot be streamed at all. The 
only existing workaround (bootstrap via a batch read, then cut over by 
timestamp) is fragile, so users can't easily express the common pattern of 
loading the entire table and then keeping up with newly appended data.
   
   ## About the Change
   
   This PR adds a `stream-from-snapshot` option:

   | `stream-from-snapshot` | Behavior |
   |---|---|
   | *(not set)* | **new default**: read the current snapshot in full, then continue with new snapshots |
   | `latest` | read snapshots committed after stream startup |
   | `earliest` | read from the oldest ancestor (the existing default) |
   | *`<snapshot-id>`* | read snapshots after the given snapshot id (exclusive) |
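
To make the table concrete, here is a minimal Python sketch of how an option value could map to a starting behavior. It is an illustration only, not the code in this PR: `StartPlan`, `resolve_stream_from_snapshot`, and the mode names are hypothetical, and treating the keywords as case-sensitive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StartPlan:
    """Hypothetical description of where a fresh stream starts."""
    mode: str  # "current-full", "latest", "earliest", or "after-snapshot"
    snapshot_id: Optional[int] = None


def resolve_stream_from_snapshot(value: Optional[str]) -> StartPlan:
    """Map a `stream-from-snapshot` value to a row of the table above."""
    if value is None:
        # Option not set: read the current snapshot in full, then continue.
        return StartPlan("current-full")
    if value == "latest":
        # Only snapshots committed after stream startup.
        return StartPlan("latest")
    if value == "earliest":
        # Replay from the oldest ancestor snapshot.
        return StartPlan("earliest")
    try:
        # Anything else is treated as a snapshot id (exclusive start).
        return StartPlan("after-snapshot", int(value))
    except ValueError:
        raise ValueError(f"Invalid stream-from-snapshot value: {value!r}") from None
```

Returning a small immutable value keeps the four cases easy to compare in tests; the real source presumably resolves directly to a `StreamingOffset`.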
   
   `stream-from-snapshot` and `stream-from-timestamp` are mutually exclusive. 
The previous default (start from the oldest ancestor) remains available by 
setting `stream-from-snapshot=earliest`.
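
The mutual-exclusion rule can be sketched as a small helper that builds reader options. This is a hedged illustration: `build_stream_options` is a hypothetical name, the error message is made up, and where the PR actually performs this check is not stated in the description.

```python
from typing import Dict, Optional


def build_stream_options(
    from_snapshot: Optional[str] = None,
    from_timestamp_ms: Optional[int] = None,
) -> Dict[str, str]:
    """Build streaming read options, rejecting the conflicting combination."""
    if from_snapshot is not None and from_timestamp_ms is not None:
        raise ValueError(
            "stream-from-snapshot and stream-from-timestamp are mutually exclusive"
        )
    options: Dict[str, str] = {}
    if from_snapshot is not None:
        options["stream-from-snapshot"] = from_snapshot
    if from_timestamp_ms is not None:
        # Existing option: a timestamp in milliseconds to stream from.
        options["stream-from-timestamp"] = str(from_timestamp_ms)
    return options
```

With a Spark session available, the resulting dict could be passed along as `spark.readStream.format("iceberg").options(**opts).load("db.table")`, where `db.table` is a placeholder table name. Calling `build_stream_options(from_snapshot="earliest")` would opt back into the previous default.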
   
   The implementation reuses the `scanAllFiles` flag that already existed on 
`StreamingOffset` but had always been set to false. The new default returns 
`StreamingOffset(currentSnapshotId, 0, scanAllFiles=true)`. 
`SyncSparkMicroBatchPlanner.planFiles()` bypasses `shouldProcess` for the 
initial snapshot, so a current snapshot produced by an OVERWRITE / DELETE / 
REPLACE operation is read for its full table state rather than rejected.
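
The planning rule described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not the Java implementation: `Offset`, `Snapshot`, and `plan_snapshot` are hypothetical stand-ins, and `should_process` here simply accepts appends, whereas the real check has more cases (for example, skipping versus throwing depending on options).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offset:
    """Stand-in for StreamingOffset(snapshotId, position, scanAllFiles)."""
    snapshot_id: int
    position: int
    scan_all_files: bool


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str  # "append", "overwrite", "delete", or "replace"


def initial_offset(current_snapshot_id: int) -> Offset:
    # New default: start at the current snapshot and scan every live file.
    return Offset(current_snapshot_id, 0, True)


def should_process(snapshot: Snapshot) -> bool:
    # Simplified assumption: only appends qualify for incremental reads.
    return snapshot.operation == "append"


def plan_snapshot(snapshot: Snapshot, offset: Offset) -> str:
    """Decide how one snapshot is handled relative to the starting offset."""
    if offset.scan_all_files and snapshot.snapshot_id == offset.snapshot_id:
        # Initial snapshot: bypass should_process and read full table state.
        return "scan-all-files"
    if should_process(snapshot):
        return "scan-added-files"
    return "rejected"
```

The bypass applies only to the snapshot named by the offset, so later non-append snapshots still go through the normal check.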
   
   **Follow-ups**:
   - Docs update covering the new default behavior & `stream-from-snapshot` option
   - Ports to the other supported Spark versions
   
   **Future Work**:
   - Supporting row-level deletes in the initial-snapshot read: the streaming 
planner does not load delete manifests, so V2 positional/equality delete files 
and V3 deletion vectors aren't applied. To avoid silently emitting rows that 
have been deleted at the row level, the initial snapshot load currently throws 
when the current snapshot has any delete manifests.
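
The guard described in the Future Work item could look roughly like the following Python sketch. It assumes each manifest is represented only by its content type (`"data"` or `"deletes"`); `check_initial_snapshot` and `UnsupportedInitialSnapshotError` are hypothetical names, and the exception type the PR actually throws is not stated here.

```python
from typing import Sequence


class UnsupportedInitialSnapshotError(Exception):
    """Hypothetical error for an initial snapshot that has row-level deletes."""


def check_initial_snapshot(manifest_contents: Sequence[str]) -> None:
    """Fail fast if the snapshot to be fully scanned has delete manifests.

    The streaming planner does not apply delete files, so scanning such a
    snapshot would emit rows that have already been deleted.
    """
    if any(content == "deletes" for content in manifest_contents):
        raise UnsupportedInitialSnapshotError(
            "Initial snapshot has delete manifests; "
            "row-level deletes are not applied by the streaming read"
        )
```

Failing loudly is the conservative choice: a query that refuses to start is easier to diagnose than one that silently emits deleted rows.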
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

