Adisok opened a new pull request, #16820:
URL: https://github.com/apache/iceberg/pull/16820
## Problem
Iceberg tags and branches are useful for maintaining named historical
snapshots (e.g. daily audit refs, compliance checkpoints). When a partition
must be purged from all historical refs — for GDPR/right-to-erasure compliance,
data-quality remediation, or accidental PII writes — there is currently no
first-class way to do it. Users are forced to manually branch → delete → retag
for each ref, which is error-prone and O(N) in the number of refs.
## Solution
This PR adds a `drop_partition_from_refs` Spark SQL procedure (and the
underlying `DropPartitionFromRefs` Action) that surgically removes all data
files matching a partition filter from a configurable set of tags and/or
branches, never touching `main`.
```sql
CALL catalog.system.drop_partition_from_refs(
table => 'db.events',
where => 'dt = "2024-01-01"',
refs => 'tags', -- TAGS | BRANCHES | ALL, default TAGS
dry_run => false
);
-- returns: ref_name, previous_snapshot_id, new_snapshot_id
-- (one row per updated ref)
```
## Design
**Deduplication by snapshot ID** — refs that share the same underlying
snapshot are grouped. The `DeleteFiles` commit runs once per unique snapshot
(using a temporary branch as staging target), then all sharing refs are
advanced to the resulting snapshot in a single `ManageSnapshots` commit. This
makes the operation O(unique snapshots), not O(refs).
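
The grouping step described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the PR's code: `RefDeduplication` and `groupBySnapshot` are hypothetical names, and the input map (ref name to snapshot ID) stands in for whatever the action reads from the table's refs after `main` has already been filtered out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups ref names by the snapshot ID they point to. */
final class RefDeduplication {
  private RefDeduplication() {}

  /**
   * refToSnapshotId is an assumed stand-in for the table's refs (ref name -> snapshot ID).
   * Returns snapshot ID -> names of the refs sharing it, in encounter order.
   */
  static Map<Long, List<String>> groupBySnapshot(Map<String, Long> refToSnapshotId) {
    Map<Long, List<String>> groups = new LinkedHashMap<>();
    for (Map.Entry<String, Long> entry : refToSnapshotId.entrySet()) {
      // One bucket per unique snapshot; every ref pointing at it joins the bucket.
      groups.computeIfAbsent(entry.getValue(), id -> new ArrayList<>()).add(entry.getKey());
    }
    return groups;
  }
}
```

Each key of the returned map would then need one delete commit, and each value lists the refs to advance afterwards, which is where the O(unique snapshots) bound comes from.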
**Manifest fast-path** — `ManifestEvaluator.forRowFilter` is used to skip
manifests whose partition-field summaries provably have no overlap with the
filter, avoiding unnecessary Avro reads.
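
To illustrate the idea behind the fast-path, here is a simplified range-overlap check in plain Java. It does not call `ManifestEvaluator`; it assumes an equality filter on a single string-valued partition field, and the `Summary` class, `mightContain`, and `manifestsToRead` are hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of pruning manifests by partition-field bounds. */
final class ManifestPruning {
  private ManifestPruning() {}

  /** Assumed stand-in for a manifest's partition summary, with inclusive bounds. */
  static final class Summary {
    final String path;
    final String lower;
    final String upper;

    Summary(String path, String lower, String upper) {
      this.path = path;
      this.lower = lower;
      this.upper = upper;
    }
  }

  /** True when the value lies within [lower, upper], so the manifest must be read. */
  static boolean mightContain(Summary summary, String value) {
    return summary.lower.compareTo(value) <= 0 && value.compareTo(summary.upper) <= 0;
  }

  /** Returns the paths of manifests that cannot be skipped for the given value. */
  static List<String> manifestsToRead(List<Summary> manifests, String value) {
    List<String> paths = new ArrayList<>();
    for (Summary summary : manifests) {
      if (mightContain(summary, value)) {
        paths.add(summary.path);
      }
    }
    return paths;
  }
}
```

ISO-8601 date strings such as `2024-01-01` sort lexicographically in date order, which is why plain string comparison is enough here. A manifest that passes the check may still contain no matching files; the bounds only prove absence, never presence.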
**Safety invariants**:
- `main` is always excluded regardless of the `refs` parameter.
- Temp branches are cleaned up best-effort on failure to avoid leaving
orphaned refs.
- `dry_run=true` reports what would change (affected refs + estimated file
counts) without committing.
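
The first invariant, together with the `refs` modes, could look roughly like the following. This is a sketch, not the PR's implementation: `RefSelection`, `parseMode`, and `select` are hypothetical names, the boolean map (ref name to "is a tag") stands in for the table's ref metadata, and case-insensitive parsing is an assumption based on the example passing `'tags'` in lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of choosing which refs the procedure may rewrite. */
final class RefSelection {
  private RefSelection() {}

  enum Mode { TAGS, BRANCHES, ALL }

  /** Parses the refs argument; null falls back to the documented default, TAGS. */
  static Mode parseMode(String value) {
    if (value == null) {
      return Mode.TAGS;
    }
    try {
      return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Invalid refs value '" + value + "': expected TAGS, BRANCHES, or ALL");
    }
  }

  /** refIsTag maps ref name -> true for a tag, false for a branch (assumed input shape). */
  static List<String> select(Map<String, Boolean> refIsTag, Mode mode) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Boolean> entry : refIsTag.entrySet()) {
      if ("main".equals(entry.getKey())) {
        continue; // main is excluded in every mode
      }
      boolean isTag = entry.getValue();
      if (mode == Mode.ALL || (mode == Mode.TAGS && isTag) || (mode == Mode.BRANCHES && !isTag)) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }
}
```

Placing the `main` check before the mode check keeps the exclusion independent of the `refs` argument, so a later change to the modes cannot accidentally expose `main`.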
## Files changed
| File | Change |
|------|--------|
| `api/.../actions/DropPartitionFromRefs.java` | New public Action interface |
| `core/.../actions/BaseDropPartitionFromRefs.java` | Immutables binding for `Result` |
| `spark/v{3.5,4.0,4.1}/.../DropPartitionFromRefsSparkAction.java` | Core implementation |
| `spark/v{3.5,4.0,4.1}/.../DropPartitionFromRefsProcedure.java` | Spark SQL procedure surface |
| `spark/v{3.5,4.0,4.1}/.../SparkActions.java` | Wire `dropPartitionFromRefs()` factory method |
| `spark/v{3.5,4.0,4.1}/.../SparkProcedures.java` | Register procedure |
| `spark/v3.5/.../TestDropPartitionFromRefsProcedure.java` | 9 parameterized tests (4 catalogs = 36 cases) |
| `docs/docs/spark-procedures.md` | Procedure documentation |
## Testing
- 36 test cases (9 methods × 4 catalog configs: testhive, testhadoop,
testrest, spark_catalog)
- Covers: single tag, shared-snapshot deduplication, multi-manifest tables,
dry run, main-branch safety, branches mode, named args, invalid `refs` value,
no-match empty result
_Authored with Claude Code_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]