xanderbailey opened a new issue, #2152:
URL: https://github.com/apache/iceberg-rust/issues/2152

   ### Is your feature request related to a problem or challenge?
   
   ## Feature Request: Incremental Snapshot Scanning (from/to snapshot-id)
   
   Currently, iceberg-rust only supports reading a table at a specific snapshot 
(time-travel) or the current snapshot. There is no way to read only the data 
that was added between two snapshots.
   
   The Java Iceberg client supports this via the legacy `IncrementalDataTableScan` and the newer `IncrementalAppendScan`:
   
   ```java
   // Newer Java API
   IncrementalAppendScan scan = table.newIncrementalAppendScan()
       .fromSnapshotExclusive(100)
       .toSnapshot(200);
   
   // Legacy API (backed by IncrementalDataTableScan)
   TableScan legacyScan = table.newScan().appendsBetween(100, 200);
   ```
   
   This is a critical feature for:
   - **Change Data Capture (CDC)** - Reading only new/changed data for 
downstream systems
   - **Incremental ETL pipelines** - Processing only new data since the last 
checkpoint
   - **Efficient data synchronization** - Syncing only deltas between systems
   - **Streaming workloads** - Reading appends as they happen
   
   ### Describe the solution you'd like
   
   Add incremental scan support to `TableScanBuilder` with methods similar to 
the Java client:
   
   ```rust
   // Scan changes between two snapshots (from exclusive, to inclusive)
   let scan = table.scan()
       .from_snapshot_exclusive(from_id)
       .to_snapshot(to_id)
       .build()?;
   
   // Scan with inclusive from
   let scan = table.scan()
       .from_snapshot_inclusive(from_id)
       .to_snapshot(to_id)
       .build()?;
   
   // Convenience methods
   let scan = table.scan().appends_after(from_id).build()?;
   let scan = table.scan().appends_between(from_id, to_id).build()?;
   ```
   
   Additionally, expose this feature through the DataFusion integration:
   
   ```rust
   // DataFusion integration
   let provider = IcebergStaticTableProvider::try_new_incremental(table, 
from_id, to_id).await?;
   ctx.register_table("changes", Arc::new(provider))?;
   let df = ctx.sql("SELECT * FROM changes").await?;
   ```
   
   ### Implementation Notes
   
   Based on the Java implementation (`IncrementalDataTableScan.java`):
   
   1. **Snapshot Range Validation** - Walk the snapshot ancestry chain to 
validate that `from_snapshot` is an ancestor of `to_snapshot`
   
   2. **Manifest Entry Filtering** - Only include manifest entries where:
      - `status == ADDED` (not EXISTING or DELETED)
      - `snapshot_id` is within the specified range
   
   3. **Operation Validation** - Initially only support `APPEND` operations 
(same as Java). `OVERWRITE` and `DELETE` operations require additional handling 
for delete files.
   
   4. **Mutual Exclusivity** - The `snapshot_id()` method (for time-travel) 
should be mutually exclusive with incremental scan methods.
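
   The ancestry walk in point 1 could be sketched as below; `Snapshot` and the snapshot map are minimal stand-ins for illustration (field names follow the Iceberg spec's `snapshot-id` / `parent-snapshot-id`), not the actual iceberg-rust types:
   
   ```rust
   use std::collections::HashMap;
   
   // Minimal stand-in for a snapshot record, keyed by snapshot id.
   struct Snapshot {
       parent_snapshot_id: Option<i64>,
   }
   
   /// Walk parent links backwards from `to_id` and report whether
   /// `from_id` lies on that ancestry chain.
   fn is_ancestor(snapshots: &HashMap<i64, Snapshot>, from_id: i64, to_id: i64) -> bool {
       let mut current = Some(to_id);
       while let Some(id) = current {
           if id == from_id {
               return true;
           }
           // Follow the parent pointer; a missing snapshot or a root
           // snapshot (no parent) ends the walk.
           current = snapshots.get(&id).and_then(|s| s.parent_snapshot_id);
       }
       false
   }
   ```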
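
   The filter in point 2 could look like the following sketch; `Entry` and `EntryStatus` mirror the manifest-entry fields from the Iceberg spec, not the actual iceberg-rust `ManifestEntry` type:
   
   ```rust
   // Manifest-entry status values from the Iceberg spec.
   #[derive(Debug, PartialEq)]
   enum EntryStatus {
       Existing,
       Added,
       Deleted,
   }
   
   struct Entry {
       status: EntryStatus,
       snapshot_id: i64,
   }
   
   /// Keep only entries ADDED by a snapshot in the half-open range
   /// (from_id, to_id]: `from` exclusive, `to` inclusive, matching the
   /// proposed `from_snapshot_exclusive` / `to_snapshot` semantics.
   fn filter_incremental(entries: Vec<Entry>, from_id: i64, to_id: i64) -> Vec<Entry> {
       entries
           .into_iter()
           .filter(|e| e.status == EntryStatus::Added)
           .filter(|e| e.snapshot_id > from_id && e.snapshot_id <= to_id)
           .collect()
   }
   ```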
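
   Point 3's operation check could be a simple scan over the snapshot summaries in the range; `Operation` here is a stand-in enum for the spec's snapshot summary operations:
   
   ```rust
   /// Snapshot summary operations from the Iceberg spec.
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum Operation {
       Append,
       Replace,
       Overwrite,
       Delete,
   }
   
   /// Reject the range if any snapshot in it is not an append,
   /// mirroring the Java scan's initial restriction.
   fn validate_append_only(ops: &[Operation]) -> Result<(), String> {
       match ops.iter().find(|op| **op != Operation::Append) {
           Some(op) => Err(format!(
               "incremental scan only supports append snapshots, found {:?}",
               op
           )),
           None => Ok(()),
       }
   }
   ```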
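
   The mutual-exclusivity rule in point 4 amounts to a build-time check on the builder state; `ScanConfig` is a hypothetical slice of what `TableScanBuilder` would hold, not its real definition:
   
   ```rust
   /// Hypothetical builder state illustrating the point-4 check.
   #[derive(Default)]
   struct ScanConfig {
       snapshot_id: Option<i64>,      // time-travel
       from_snapshot_id: Option<i64>, // incremental range start
       to_snapshot_id: Option<i64>,   // incremental range end
   }
   
   /// Fail at build time if time-travel and incremental options are mixed.
   fn validate(cfg: &ScanConfig) -> Result<(), String> {
       let incremental = cfg.from_snapshot_id.is_some() || cfg.to_snapshot_id.is_some();
       if cfg.snapshot_id.is_some() && incremental {
           return Err(
               "snapshot_id (time-travel) cannot be combined with an incremental scan range"
                   .to_string(),
           );
       }
       Ok(())
   }
   ```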
   
   ### Describe alternatives you've considered
   
   1. **Post-filtering** - Users could scan the full table and filter by 
`_snapshot_id` metadata column, but this is inefficient as it scans all data.
   
   2. **Manual manifest parsing** - Users could manually read manifests and 
filter entries, but this defeats the purpose of having a scan API.
   
   ### Additional context
   
   - Java implementation reference: `IncrementalDataTableScan.java` in 
apache/iceberg
   - This feature is commonly requested for building CDC pipelines with Iceberg
   - Spark's `start-snapshot-id` / `end-snapshot-id` read options provide similar functionality
   
   
   ### Willingness to contribute
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
