smaheshwar-pltr opened a new pull request, #3512: URL: https://github.com/apache/iceberg-python/pull/3512
<!-- Closes #2634 -->
Closes #2634.

# Rationale for this change

Adds `IncrementalAppendScan`, which reads the data appended between two snapshots and is the building block for incremental ingestion. Largely a revival of the work in #2235; see #2634 and the previous PRs for motivation.

Split out of #3364 at the reviewers' request. This is PR 2 of 2 and is **based on #3511** (the `BaseScan` / `ManifestGroupPlanner` refactor).

> [!NOTE]
> Stacked on #3511. GitHub won't let a PR into `apache/iceberg-python` use a fork branch as its base, so this PR targets `main` and its branch carries the refactor commit too. Until #3511 merges, the diff here shows both. The append-scan change itself is the second commit (`Feature: Incremental Append Scan`). Please review #3511 first; once it lands, this diff collapses to the feature alone.

References: https://github.com/apache/iceberg (Iceberg-Java and Spark) and https://github.com/apache/iceberg-cpp/pull/590. I've left review-aid comments inline (prefixed `[AI]`) pointing at the relevant reference code.

# Changes

- `Table.incremental_append_scan(...)` builds an `IncrementalAppendScan` over the `(from_snapshot_id_exclusive, to_snapshot_id_inclusive]` range; `StagedTable` overrides it to raise, mirroring `scan()`.
- Planning walks the append-only ancestors in the range, deduplicates the data manifests whose `added_snapshot_id` is in range (set semantics via `ManifestFile.__eq__` / `__hash__`), and filters manifest entries to `ADDED`-in-range via a new `manifest_entry_filter` on `ManifestGroupPlanner.plan_files`.
- Projects onto the table's **current** schema (matching Java/C++), so rows written under an older schema in the range get `NULL` for newer columns.
- `from_snapshot_id_exclusive` is validated with `is_parent_ancestor_of`, so an expired start cursor is accepted as long as the lineage still passes through it; equal `from`/`to` is rejected. Adds the snapshot helpers `ancestors_between_ids` and `is_parent_ancestor_of`.
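
For reviewers who want a mental model of the planning and validation steps listed under Changes, here is a minimal, self-contained sketch. It does not use PyIceberg: `Entry`, `Manifest`, and `Snapshot` are toy stand-ins, and `ancestors_of`, `ancestors_between`, and `plan_appended_files` are hypothetical names chosen for illustration. `is_parent_ancestor_of` borrows the helper's name from this PR, but its body here is an assumption about the intended semantics, not the code in the diff.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:  # toy stand-in for a manifest entry
    status: str  # "ADDED", "EXISTING" or "DELETED"
    snapshot_id: int
    file_path: str


@dataclass(frozen=True)
class Manifest:  # frozen, so hashable: gives the set semantics used below
    path: str
    added_snapshot_id: int
    entries: Tuple[Entry, ...]


@dataclass(frozen=True)
class Snapshot:  # toy stand-in for a table snapshot
    snapshot_id: int
    parent_id: Optional[int]
    operation: str
    manifests: Tuple[Manifest, ...]


def ancestors_of(snapshots: Dict[int, Snapshot], snapshot_id: int) -> Iterator[Snapshot]:
    """Yield the snapshot and its ancestors, stopping where metadata ends."""
    current = snapshots.get(snapshot_id)
    while current is not None:
        yield current
        parent = current.parent_id
        current = snapshots.get(parent) if parent is not None else None


def is_parent_ancestor_of(snapshots: Dict[int, Snapshot], snapshot_id: int, parent_id: int) -> bool:
    """True if some snapshot in the lineage names parent_id as its parent.

    parent_id itself need not exist any more, so an expired start is accepted.
    """
    return any(s.parent_id == parent_id for s in ancestors_of(snapshots, snapshot_id))


def ancestors_between(snapshots: Dict[int, Snapshot], to_id: int, from_id_exclusive: int) -> List[Snapshot]:
    """Snapshots in the (from, to] range, newest first."""
    result = []
    for snapshot in ancestors_of(snapshots, to_id):
        if snapshot.snapshot_id == from_id_exclusive:
            break
        result.append(snapshot)
    return result


def plan_appended_files(snapshots: Dict[int, Snapshot], from_id_exclusive: int, to_id_inclusive: int) -> List[str]:
    if not is_parent_ancestor_of(snapshots, to_id_inclusive, from_id_exclusive):
        raise ValueError("from snapshot is not a parent ancestor of to snapshot")
    # Keep only append snapshots; overwrites and deletes in the range are ignored.
    appends = [s for s in ancestors_between(snapshots, to_id_inclusive, from_id_exclusive)
               if s.operation == "append"]
    ids = {s.snapshot_id for s in appends}
    # A manifest is carried forward into later snapshots, so deduplicate with a set.
    manifests = {m for s in appends for m in s.manifests if m.added_snapshot_id in ids}
    # Entry-level filter: only files ADDED by an in-range append snapshot.
    return sorted(e.file_path for m in manifests for e in m.entries
                  if e.status == "ADDED" and e.snapshot_id in ids)


# Toy lineage: 1 (append) -> 2 (append) -> 3 (overwrite) -> 4 (append).
m1 = Manifest("m1", 1, (Entry("ADDED", 1, "a.parquet"),))
m2 = Manifest("m2", 2, (Entry("ADDED", 2, "b.parquet"),))
m3 = Manifest("m3", 3, (Entry("ADDED", 3, "c.parquet"),))
# m4 rewrites m1: it carries a.parquet as EXISTING and adds d.parquet.
m4 = Manifest("m4", 4, (Entry("EXISTING", 1, "a.parquet"), Entry("ADDED", 4, "d.parquet")))
snapshots = {
    1: Snapshot(1, None, "append", (m1,)),
    2: Snapshot(2, 1, "append", (m1, m2)),
    3: Snapshot(3, 2, "overwrite", (m1, m2, m3)),
    4: Snapshot(4, 3, "append", (m2, m3, m4)),
}

print(plan_appended_files(snapshots, 1, 4))  # ['b.parquet', 'd.parquet']
```

In this toy lineage `m2` is reachable from both snapshot 2 and snapshot 4 but contributes `b.parquet` once, the overwrite in snapshot 3 is skipped, and dropping snapshot 1 from the dictionary (simulating expiry) still passes validation because snapshot 2 names it as its parent. Passing the same ID for both ends raises, since no snapshot is its own parent.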
# Out of scope (tracked follow-ups)

Per @kevinjqliu's follow-up list on #3364: deciding on an unset start snapshot, branch/ref overloads (`use_ref`), `from_snapshot_inclusive`, `count()`, REST server-side planning, and user-facing doc examples.

# Are these changes tested?

Yes: unit tests (validation paths, current-schema projection, type preservation through chaining, expired-`from`) and integration tests (append-only, non-append snapshots ignored, schema evolution within range, partition-/metrics-evaluator pruning, disconnected snapshots), plus the `test_incremental_read` provision fixture.

# Are there any user-facing changes?

Yes: the new `Table.incremental_append_scan(...)` API and `IncrementalAppendScan` class. No changes to existing public surface.
