joyhaldar opened a new pull request, #15154:
URL: https://github.com/apache/iceberg/pull/15154
This PR optimizes `ExpireSnapshotsSparkAction` by replacing driver-side
collection with distributed Spark operations for manifest filtering.
The previous implementation read content files from ALL manifests in expired
snapshots. This change filters at the manifest level first, reading content
files only from orphaned manifests, similar to the approach used in
`ReachableFileCleanup` but implemented with distributed Spark operations.
**Optimizations**
1. Early exit when no snapshots were expired or when no manifests are orphaned.
2. Join-based filtering to recover the details of orphaned manifests once their paths have been identified via `except`.
3. Read content files only from orphaned manifests instead of all expired
manifests.
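
The three steps above can be modeled in plain Java with sets, as a rough sketch only. This is not the Spark code from this PR: `OrphanedFilesSketch`, `readManifest`, and the in-memory `store` map are hypothetical stand-ins, and the real action works on distributed Datasets and also returns manifests, manifest lists, and statistics, which this model ignores.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java model of manifest-level filtering. Hypothetical names throughout;
// this is not the ExpireSnapshotsSparkAction implementation.
public class OrphanedFilesSketch {

  // Stand-in for reading a manifest: maps a manifest path to its content file paths.
  static Set<String> readManifest(Map<String, Set<String>> store, String manifestPath) {
    return store.getOrDefault(manifestPath, Set.of());
  }

  static Set<String> orphanedFiles(
      Map<String, Set<String>> store,
      Set<String> expiredManifests,
      Set<String> liveManifests) {

    // Early exit 1: nothing was expired, so nothing can be orphaned.
    if (expiredManifests.isEmpty()) {
      return Set.of();
    }

    // Manifest-level filter: expired manifest paths EXCEPT live manifest paths.
    Set<String> orphanedManifests = new HashSet<>(expiredManifests);
    orphanedManifests.removeAll(liveManifests);

    // Early exit 2: every expired manifest is still referenced, so no content
    // files need to be read (this model tracks content files only).
    if (orphanedManifests.isEmpty()) {
      return Set.of();
    }

    // Read content files ONLY from orphaned manifests.
    Set<String> candidates = new HashSet<>();
    for (String manifest : orphanedManifests) {
      candidates.addAll(readManifest(store, manifest));
    }

    // EXCEPT with live content files: a file that a live manifest still
    // references must not be deleted.
    Set<String> liveFiles = new HashSet<>();
    for (String manifest : liveManifests) {
      liveFiles.addAll(readManifest(store, manifest));
    }
    candidates.removeAll(liveFiles);
    return candidates;
  }
}
```

In this model, a manifest that appears in both the expired and live sets is never opened on the expired side, which is the saving the manifest-level filter is meant to provide.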
**Code Changes**
- Added `emptyFileInfoDS()` helper to `BaseSparkAction`
- Changed `ReadManifest` visibility to `protected` in `BaseSparkAction`
- Added `contentFilesFromManifestDF()` method for targeted manifest reading
**Before**
```
All Expired Files ──────┐
                        ├──► EXCEPT ──► Orphaned Files
All Live Files ─────────┘
(reads all manifests)
```
**After**
```
                    ┌─► No expired snapshots? ──► Return empty (EARLY EXIT)
                    │
Expired Snapshots ──┤
                    │
                    └─► Find orphaned manifest PATHS via except
                            │
                            ├─► No orphaned manifests? ──► Return manifest
                            │   lists + stats (EARLY EXIT)
                            │
                            └─► JOIN to get orphaned manifest details
                                    │
                                    └─► Read content files ONLY from
                                        orphaned manifests
                                            │
                                            └─► EXCEPT with live
                                                content files ──► Orphaned Files
```
All existing `TestExpireSnapshotsAction` tests pass.