pratikpandey21 opened a new issue, #15369:
URL: https://github.com/apache/iceberg/issues/15369
### Feature Request / Improvement
Rewriting data files with `removeDanglingDeletes` currently works only on
the main branch of an Iceberg table.
The RemoveDanglingDeleteFiles action has two issues:
1. Unpartitioned tables are silently skipped
The execute() method returns early with an empty result for unpartitioned
tables:
```
if (table.specs().size() == 1 && table.spec().isUnpartitioned()) {
  return ImmutableRemoveDanglingDeleteFiles.Result.builder()
      .removedDeleteFiles(Collections.emptyList())
      .build();
}
```
2. No branch support
The action always operates on the main branch. There is no API to target a
specific branch, and the Spark implementation reads metadata tables without
branch scoping. This also means that when RewriteDataFilesSparkAction invokes
RemoveDanglingDeleteFiles internally (via the remove-dangling-deletes option),
it ignores the branch that the rewrite is targeting.
Proposed Changes
API (RemoveDanglingDeleteFiles):
- Add a toBranch(String branch) method to allow targeting a specific
branch.
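
To make the two points above concrete, here is a minimal in-memory sketch of how a branch-scoped action could behave. It is only an illustration under stated assumptions: `FileEntry` and `RemoveDanglingDeletesSketch` are hypothetical stand-ins, not Iceberg classes, and the dangling check assumes the usual sequence-number rule (a position delete is dangling when its sequence number is lower than the minimum data sequence number in its partition, an equality delete when it is lower or equal, and any delete is dangling when its partition has no data files). An unpartitioned table is modelled as a single group with an empty partition key rather than being skipped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical file model for illustration; not an Iceberg class. */
final class FileEntry {
    enum Content { DATA, POSITION_DELETE, EQUALITY_DELETE }

    final String path;
    final Content content;
    final String partition; // "" models an unpartitioned table
    final long sequenceNumber;

    FileEntry(String path, Content content, String partition, long sequenceNumber) {
        this.path = path;
        this.content = content;
        this.partition = partition;
        this.sequenceNumber = sequenceNumber;
    }
}

/** Hypothetical action; only toBranch(String) comes from the proposal. */
final class RemoveDanglingDeletesSketch {
    private final Map<String, List<FileEntry>> filesByBranch;
    private String branch = "main"; // default mirrors today's behaviour

    RemoveDanglingDeletesSketch(Map<String, List<FileEntry>> filesByBranch) {
        this.filesByBranch = filesByBranch;
    }

    /** Scopes the action to the given branch instead of main. */
    RemoveDanglingDeletesSketch toBranch(String branchName) {
        if (!filesByBranch.containsKey(branchName)) {
            throw new IllegalArgumentException("Unknown branch: " + branchName);
        }
        this.branch = branchName;
        return this;
    }

    /** Returns the paths of delete files that are dangling on the target branch. */
    List<String> execute() {
        List<FileEntry> files =
            filesByBranch.getOrDefault(branch, Collections.<FileEntry>emptyList());

        // Step 1: minimum data sequence number per partition.
        // An unpartitioned table is simply one group keyed by "".
        Map<String, Long> minDataSeq = new HashMap<>();
        for (FileEntry f : files) {
            if (f.content == FileEntry.Content.DATA) {
                minDataSeq.merge(f.partition, f.sequenceNumber, Long::min);
            }
        }

        // Step 2: compare each delete file against its partition's minimum.
        List<String> dangling = new ArrayList<>();
        for (FileEntry f : files) {
            if (f.content == FileEntry.Content.DATA) {
                continue;
            }
            Long min = minDataSeq.get(f.partition);
            boolean isDangling =
                min == null
                    || (f.content == FileEntry.Content.POSITION_DELETE && f.sequenceNumber < min)
                    || (f.content == FileEntry.Content.EQUALITY_DELETE && f.sequenceNumber <= min);
            if (isDangling) {
                dangling.add(f.path);
            }
        }
        return dangling;
    }
}
```

With this shape, a caller would write something like `new RemoveDanglingDeletesSketch(files).toBranch("audit").execute()`, and omitting `toBranch` keeps the existing main-branch behaviour. The branch name `audit` is only an example.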
**Background:**
- We use Flink to write to Iceberg in streaming fashion, following the
Write-Audit-Publish pattern, so Flink writes to a branch.
- A periodic Spark job reads the latest changes on the branch, runs an audit,
and tries to merge/fast-forward to main.
- The table is then synced to Snowflake.
Because Flink streaming in upsert mode generates `equality-deletes` on the
branch, they are also present in the metadata on main.
Snowflake, however, doesn't support equality deletes for managed tables, so
we have to remove equality deletes from the main branch. That is why we need
the ability to remove dangling deletes from the branch before we
fast-forward to main.
### Query engine
Spark
### Willingness to contribute
- [ ] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]