[PR] Spark: Add drop_partition_from_refs procedure [iceberg]

via GitHub Mon, 15 Jun 2026 05:28:41 -0700


Adisok opened a new pull request, #16820:
URL: https://github.com/apache/iceberg/pull/16820


   ## Problem
   
   Iceberg tags and branches are useful for maintaining named historical 
snapshots (e.g. daily audit refs, compliance checkpoints). When a partition 
must be purged from all historical refs (for GDPR/right-to-erasure compliance, 
data-quality remediation, or accidental PII writes), there is currently no 
first-class way to do it. Users must manually branch → delete → retag each 
ref, which is error-prone and takes one full delete-and-retag cycle per ref, 
i.e. O(N) work for N refs.
   
   ## Solution
   
   This PR adds a `drop_partition_from_refs` Spark SQL procedure (and the 
underlying `DropPartitionFromRefs` Action) that surgically removes all data 
files matching a partition filter from a configurable set of tags and/or 
branches, never touching `main`.
   
   ```sql
   CALL catalog.system.drop_partition_from_refs(
       table   => 'db.events',
       where   => 'dt = "2024-01-01"',
       refs    => 'tags',    -- TAGS | BRANCHES | ALL, default TAGS
       dry_run => false
   );
   -- returns: ref_name, previous_snapshot_id, new_snapshot_id
   -- (one row per updated ref)
   ```
   
   ## Design
   
   **Deduplication by snapshot ID** — refs that share the same underlying 
snapshot are grouped. The `DeleteFiles` commit runs once per unique snapshot 
(using a temporary branch as staging target), then all sharing refs are 
advanced to the resulting snapshot in a single `ManageSnapshots` commit. This 
makes the operation O(unique snapshots), not O(refs).
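
The grouping step above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the PR's Java implementation: the `refs` mapping, the ref names, and the helper names `group_refs_by_snapshot` and `plan_commits` are all hypothetical, and the actual `DeleteFiles` / `ManageSnapshots` commits are only represented as plan entries.

```python
from collections import defaultdict


def group_refs_by_snapshot(refs):
    """Group ref names by the snapshot ID they point to, skipping `main`.

    `refs` is a hypothetical {ref_name: snapshot_id} mapping.
    """
    groups = defaultdict(list)
    for name, snapshot_id in refs.items():
        if name == "main":
            continue  # main is never rewritten
        groups[snapshot_id].append(name)
    return dict(groups)


def plan_commits(refs):
    """Return one (snapshot_id, ref_names) entry per unique snapshot.

    Each entry stands for one delete commit on a staging branch followed by
    one ref-update commit that moves every listed ref to the new snapshot.
    """
    groups = group_refs_by_snapshot(refs)
    return [(sid, sorted(names)) for sid, names in sorted(groups.items())]


# Three non-main refs, but only two distinct snapshots to rewrite.
refs = {
    "main": 300,
    "audit-2024-01-01": 100,
    "audit-2024-01-02": 100,
    "compliance-q1": 200,
}
plan = plan_commits(refs)
```

With three non-main refs sharing two snapshots, the plan has two entries, which is where the O(unique snapshots) bound comes from.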
   
   **Manifest fast-path** — `ManifestEvaluator.forRowFilter` is used to skip 
manifests whose partition-field summaries provably have no overlap with the 
filter, avoiding unnecessary Avro reads.
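
To illustrate the idea behind the fast-path, here is a simplified sketch of range-based manifest pruning. It is not the `ManifestEvaluator` API: it assumes a single string-typed partition field, an equality filter, and a hypothetical `PartitionSummary` type holding only lower and upper bounds, whereas the real evaluator handles arbitrary expressions, nulls, and multiple fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionSummary:
    """Hypothetical stand-in for a manifest's partition-field summary."""
    lower: str
    upper: str


def may_contain(summary, value):
    """True if `value` falls inside the summary's [lower, upper] range.

    A False result proves the manifest holds no matching files; a True
    result only means the manifest has to be opened and checked.
    """
    return summary.lower <= value <= summary.upper


def manifests_to_read(manifests, value):
    """Return the manifest paths that cannot be skipped for `dt = value`."""
    return [path for path, summary in manifests.items()
            if may_contain(summary, value)]


# ISO-8601 date strings compare correctly as plain strings.
manifests = {
    "m1.avro": PartitionSummary("2023-12-01", "2023-12-31"),
    "m2.avro": PartitionSummary("2023-12-25", "2024-01-05"),
    "m3.avro": PartitionSummary("2024-02-01", "2024-02-29"),
}
to_read = manifests_to_read(manifests, "2024-01-01")
```

Only `m2.avro` overlaps the filter value here, so the other two manifests would never be read.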
   
   **Safety invariants**:
   - `main` is always excluded regardless of the `refs` parameter.
   - Temp branches are cleaned up best-effort on failure to avoid leaving 
orphaned refs.
   - `dry_run=true` reports what would change (affected refs + estimated file 
counts) without committing.
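
The first two invariants can be sketched as follows. This is an illustration under stated assumptions rather than the PR's code: the `{name: kind}` ref mapping, the helper names `select_refs` and `run_with_cleanup`, and the case-insensitive handling of the `refs` value are all hypothetical. The cleanup helper also runs after a successful `work()`, which is an assumption; the description above only promises cleanup on failure.

```python
VALID_MODES = {"TAGS", "BRANCHES", "ALL"}


def select_refs(refs, mode="TAGS"):
    """Pick the refs to rewrite; `main` is excluded in every mode.

    `refs` is a hypothetical {ref_name: "tag" | "branch"} mapping.
    """
    mode = mode.upper()  # assumption: the value is matched case-insensitively
    if mode not in VALID_MODES:
        raise ValueError(f"Invalid refs value: {mode}")
    wanted = {"TAGS": {"tag"}, "BRANCHES": {"branch"}, "ALL": {"tag", "branch"}}[mode]
    return sorted(name for name, kind in refs.items()
                  if name != "main" and kind in wanted)


def run_with_cleanup(work, cleanup):
    """Run `work`, then attempt `cleanup` without masking the original error."""
    try:
        return work()
    finally:
        try:
            cleanup()
        except Exception:
            pass  # best-effort: a failed cleanup must not hide the real failure


refs = {"main": "branch", "audit-2024-01-01": "tag", "dev": "branch"}
tags_only = select_refs(refs, "tags")
everything = select_refs(refs, "ALL")
```

Swallowing cleanup errors keeps the original exception visible to the caller, at the cost of possibly leaving a temporary branch behind when cleanup itself fails.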
   
   ## Files changed
   
   | File | Change |
   |------|--------|
   | `api/.../actions/DropPartitionFromRefs.java` | New public Action interface |
   | `core/.../actions/BaseDropPartitionFromRefs.java` | Immutables binding for `Result` |
   | `spark/v{3.5,4.0,4.1}/.../DropPartitionFromRefsSparkAction.java` | Core implementation |
   | `spark/v{3.5,4.0,4.1}/.../DropPartitionFromRefsProcedure.java` | Spark SQL procedure surface |
   | `spark/v{3.5,4.0,4.1}/.../SparkActions.java` | Wire `dropPartitionFromRefs()` factory method |
   | `spark/v{3.5,4.0,4.1}/.../SparkProcedures.java` | Register procedure |
   | `spark/v3.5/.../TestDropPartitionFromRefsProcedure.java` | 9 parameterized tests (4 catalogs = 36 cases) |
   | `docs/docs/spark-procedures.md` | Procedure documentation |
   
   ## Testing
   
   - 36 test cases (9 methods × 4 catalog configs: testhive, testhadoop, 
testrest, spark_catalog)
   - Covers: single tag, shared-snapshot deduplication, multi-manifest tables, 
dry run, main-branch safety, branches mode, named args, invalid `refs` value, 
no-match empty result
   
   _Authored with Claude Code_

