[PR] Feature: Incremental Append Scan [iceberg-python]

via GitHub Mon, 15 Jun 2026 18:47:48 -0700


smaheshwar-pltr opened a new pull request, #3512:
URL: https://github.com/apache/iceberg-python/pull/3512


   Closes #2634.
   
   # Rationale for this change
   
   Adds `IncrementalAppendScan`, which reads the data appended between two 
snapshots — the building block for incremental ingestion. Largely a revival of 
the work in #2235; see #2634 and the previous PRs for motivation.
   
   Split out of #3364 at the reviewers' request. This is PR 2 of 2 and is 
**based on #3511** (the `BaseScan` / `ManifestGroupPlanner` refactor).
   
   > [!NOTE]
   > Stacked on #3511. GitHub won't let a PR into `apache/iceberg-python` use a 
fork branch as its base, so this PR targets `main` and its branch carries the 
refactor commit too — until #3511 merges, the diff here shows both. The 
append-scan change itself is the second commit (`Feature: Incremental Append 
Scan`). Please review #3511 first; once it lands, this diff collapses to the 
feature alone.
   
   References: https://github.com/apache/iceberg (Iceberg-Java and Spark) and 
https://github.com/apache/iceberg-cpp/pull/590. I've left review-aid comments 
inline (prefixed `[AI]`) pointing at the relevant reference code.
   
   # Changes
   
   - `Table.incremental_append_scan(...)` builds an `IncrementalAppendScan` 
over the `(from_snapshot_id_exclusive, to_snapshot_id_inclusive]` range; 
`StagedTable` overrides it to raise, mirroring `scan()`.
   - Planning walks the append-only ancestors in the range, deduplicates the 
data manifests whose `added_snapshot_id` is in range (set semantics via 
`ManifestFile.__eq__` / `__hash__`), and keeps only the manifest entries that 
are `ADDED` by a snapshot in range, via a new `manifest_entry_filter` on 
`ManifestGroupPlanner.plan_files`.
   - Projects onto the table's **current** schema (matching Java/C++), so rows 
written under an older schema in the range get `NULL` for newer columns.
   - `from_snapshot_id_exclusive` is validated with `is_parent_ancestor_of`, so 
an expired start cursor is accepted as long as the lineage still passes through 
it; equal `from`/`to` is rejected. Adds the snapshot helpers 
`ancestors_between_ids` and `is_parent_ancestor_of`.
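
For reviewers who want the planning steps above in one place, here is a minimal, self-contained sketch of the idea. It is a toy model, not the code in this PR: `Snap`, `Manifest`, `Entry`, `ancestors_between`, and `plan_appended_files` are hypothetical stand-ins, and the toy `is_parent_ancestor_of` only borrows the name of the real helper; its signature and behaviour here are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass(frozen=True)
class Snap:  # hypothetical stand-in for a table snapshot
    snapshot_id: int
    parent_id: Optional[int]
    operation: str  # "append", "overwrite", ...


@dataclass(frozen=True)  # frozen dataclasses are hashable, so set() deduplicates
class Manifest:  # hypothetical stand-in for a data manifest
    path: str
    added_snapshot_id: int


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for a manifest entry
    status: str  # "ADDED", "EXISTING", or "DELETED"
    snapshot_id: int
    file_path: str


def _parent(snaps: Dict[int, Snap], snap: Snap) -> Optional[Snap]:
    return snaps.get(snap.parent_id) if snap.parent_id is not None else None


def ancestors_between(snaps: Dict[int, Snap], from_excl: int, to_incl: int) -> Iterator[Snap]:
    """Walk parent pointers from to_incl back to, but not including, from_excl."""
    current = snaps.get(to_incl)
    while current is not None and current.snapshot_id != from_excl:
        yield current
        current = _parent(snaps, current)


def is_parent_ancestor_of(snaps: Dict[int, Snap], snapshot_id: int, parent_id: int) -> bool:
    """True if some ancestor of snapshot_id (itself included) has parent_id as its parent.
    This still holds when the parent snapshot itself has been expired."""
    current = snaps.get(snapshot_id)
    while current is not None:
        if current.parent_id == parent_id:
            return True
        current = _parent(snaps, current)
    return False


def plan_appended_files(snaps, manifest_lists, entries, from_excl: int, to_incl: int) -> List[str]:
    if from_excl == to_incl:
        raise ValueError("from and to snapshots must differ")
    if not is_parent_ancestor_of(snaps, to_incl, from_excl):
        raise ValueError("from snapshot is not on the lineage of the to snapshot")
    # 1. Append-only ancestors in (from_excl, to_incl].
    in_range = {s.snapshot_id for s in ancestors_between(snaps, from_excl, to_incl)
                if s.operation == "append"}
    # 2. Manifests added in range; the set removes manifests carried forward.
    manifests = {m for sid in in_range for m in manifest_lists[sid]
                 if m.added_snapshot_id in in_range}
    # 3. Keep only entries ADDED by a snapshot in range.
    return sorted(e.file_path for m in manifests for e in entries[m.path]
                  if e.status == "ADDED" and e.snapshot_id in in_range)


# Lineage 1 -> 2 -> 3 -> 4, where snapshot 1 has been expired (absent from the map)
# and snapshot 3 is an overwrite.
snaps = {2: Snap(2, 1, "append"), 3: Snap(3, 2, "overwrite"), 4: Snap(4, 3, "append")}
m1, m2, m3, m4 = (Manifest(f"m{i}.avro", i) for i in (1, 2, 3, 4))
manifest_lists = {2: [m1, m2], 3: [m1, m2, m3], 4: [m1, m2, m3, m4]}
entries = {
    "m1.avro": [Entry("ADDED", 1, "a.parquet")],
    "m2.avro": [Entry("ADDED", 2, "b.parquet")],
    "m3.avro": [Entry("ADDED", 3, "c.parquet")],
    "m4.avro": [Entry("ADDED", 4, "d.parquet"), Entry("EXISTING", 2, "b2.parquet")],
}
result = plan_appended_files(snaps, manifest_lists, entries, from_excl=1, to_incl=4)
```

In this toy run the expired start cursor (snapshot 1) is accepted because snapshot 2 still names it as its parent, the overwrite snapshot contributes nothing, and `m2.avro` is planned once even though it appears in two manifest lists.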
   
   # Out of scope (tracked follow-ups)
   
   Per @kevinjqliu's follow-up list on #3364: deciding how to handle an unset 
start snapshot, branch/ref overloads (`use_ref`), `from_snapshot_inclusive`, 
`count()`, REST server-side planning, and user-facing doc examples.
   
   # Are these changes tested?
   
   Yes — unit tests (validation paths, current-schema projection, type 
preservation through chaining, expired-`from`) and integration tests 
(append-only, non-append snapshots ignored, schema evolution within range, 
partition-/metrics-evaluator pruning, disconnected snapshots), plus the 
`test_incremental_read` provision fixture.
   
   # Are there any user-facing changes?
   
   Yes — the new `Table.incremental_append_scan(...)` API and 
`IncrementalAppendScan` class. No changes to existing public surface.
   

