maoli67660 opened a new pull request, #17094:
URL: https://github.com/apache/iceberg/pull/17094
## Problem
`filteredCompareToFileList()` in `DeleteOrphanFilesSparkAction` filters the
caller-provided `file_list_view` dataset using:
```java
files = files.filter(files.col(FILE_PATH).startsWith(location));
```
Because `location` has no trailing `/`, a sibling path that shares the table
location as a raw string prefix is incorrectly included. For example, when the
table is at `s3://bucket/my_table`, files under `s3://bucket/my_table_backup/`
also satisfy `startsWith("s3://bucket/my_table")` and get pulled into orphan
detection for the wrong table.
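The collision can be reproduced with plain `String.startsWith`, outside Spark. The following is a minimal sketch, not code from the action: the class name `PrefixCollisionDemo`, the helper `rawMatch`, and the file names are hypothetical, chosen only to mirror the example paths above.

```java
// Illustration only: PrefixCollisionDemo and rawMatch are hypothetical names,
// and the file paths are made-up examples based on the locations above.
class PrefixCollisionDemo {
  // Mirrors the original check: a raw string prefix match on the table location.
  static boolean rawMatch(String filePath, String location) {
    return filePath.startsWith(location);
  }

  public static void main(String[] args) {
    String location = "s3://bucket/my_table"; // no trailing slash
    String inTable = "s3://bucket/my_table/data/a.parquet";
    String sibling = "s3://bucket/my_table_backup/data/b.parquet";

    System.out.println(rawMatch(inTable, location)); // true (expected)
    System.out.println(rawMatch(sibling, location)); // true (the false positive)
  }
}
```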
Fixes #16493.
## Solution
Append `/` to `location`, unless it already ends with one, before the prefix match:
```java
String locationPrefix = location.endsWith("/") ? location : location + "/";
files = files.filter(files.col(FILE_PATH).startsWith(locationPrefix));
```
Applied to Spark 3.5, 4.0, and 4.1.
## Testing
Added `testRemoveOrphanFilesFileListViewDoesNotMatchSiblingPaths` to
`TestRemoveOrphanFilesProcedure` in all three Spark versions. The test:
1. Creates an empty Iceberg table at a known location
2. Builds a `file_list_view` that includes an orphan file inside the table
   directory **and** a file under a sibling path (`table-location + "-sibling"`)
3. Runs `remove_orphan_files` with `file_list_view` and `dry_run => true`
4. Asserts the sibling file is **not** identified as an orphan, and the
   in-table orphan **is** identified
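
The property the test checks can also be sketched without Spark or Iceberg. This is a simplified stand-in for the real procedure test, assuming an in-memory list in place of `file_list_view`; the class `SiblingPathFilterSketch`, its methods, and the `file:/tmp/...` paths are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for the test scenario: all names and paths here are
// hypothetical, and a plain list replaces the Spark file_list_view dataset.
class SiblingPathFilterSketch {
  // Same normalization as the fix: ensure the prefix ends with "/".
  static String withTrailingSlash(String location) {
    return location.endsWith("/") ? location : location + "/";
  }

  // Keep only the paths that live under the table location.
  static List<String> filterToLocation(List<String> paths, String location) {
    String prefix = withTrailingSlash(location);
    return paths.stream()
        .filter(path -> path.startsWith(prefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String location = "file:/tmp/tbl";
    List<String> fileList = List.of(
        location + "/data/orphan.parquet",          // inside the table directory
        location + "-sibling/data/other.parquet");  // sibling path, must be ignored

    // Only the in-table file should remain as an orphan candidate.
    System.out.println(filterToLocation(fileList, location));
  }
}
```

Normalizing inside a helper keeps the behavior the same whether or not the caller's location already ends with `/`.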