Hello Team,

I'm writing to propose a change to the orphan file removal logic in this PR
<https://github.com/apache/iceberg/pull/12278>.

Currently, the orphan file removal process lists files at the root of the
table location to identify orphan files.
This can lead to unintended consequences in scenarios where multiple tables
share a common root directory.
Example:
*tbl1* -> /dir1/tbl1
*tbl2* -> /dir1
Orphan removal on tbl2 can delete files under the tbl1 directory, since the
listing happens at /dir1.

I propose modifying the orphan file removal logic to list only the `data`
and `metadata` directories of the target table. This would ensure that only
files within those directories, which in most cases belong to the table
itself, are considered for removal.
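
To make the difference concrete, here is a minimal sketch on a local
filesystem. It is not the Iceberg implementation: `candidate_files`,
`find_orphans`, and `demo` are hypothetical names, the referenced-file set is
passed in by hand, and details such as the file age threshold are left out.

```python
import os
import tempfile


def list_files(root):
    """Recursively list every file under root (empty set if root is missing)."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.add(os.path.join(dirpath, name))
    return found


def candidate_files(table_location, scoped):
    """Files considered for orphan detection.

    scoped=False lists the whole table location (behaviour described above).
    scoped=True lists only data/ and metadata/ (the proposal).
    """
    if not scoped:
        return list_files(table_location)
    found = set()
    for sub in ("data", "metadata"):
        found |= list_files(os.path.join(table_location, sub))
    return found


def find_orphans(table_location, referenced, scoped):
    """Candidates that the table's metadata does not reference."""
    return candidate_files(table_location, scoped) - set(referenced)


def touch(path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


def demo():
    with tempfile.TemporaryDirectory() as base:
        dir1 = os.path.join(base, "dir1")          # tbl2 -> /dir1
        tbl1_file = os.path.join(dir1, "tbl1", "data", "a.parquet")  # tbl1 -> /dir1/tbl1
        tbl2_file = os.path.join(dir1, "data", "b.parquet")
        tbl2_stale = os.path.join(dir1, "data", "stale.parquet")
        for path in (tbl1_file, tbl2_file, tbl2_stale):
            touch(path)

        referenced_by_tbl2 = {tbl2_file}
        unscoped = find_orphans(dir1, referenced_by_tbl2, scoped=False)
        scoped = find_orphans(dir1, referenced_by_tbl2, scoped=True)
        # (tbl1 file flagged by root listing, tbl1 file flagged by scoped
        #  listing, genuine tbl2 orphan still flagged by scoped listing)
        return (tbl1_file in unscoped, tbl1_file in scoped, tbl2_stale in scoped)


if __name__ == "__main__":
    print(demo())
```

With the root listing, tbl1's data file shows up as an orphan of tbl2; with
the scoped listing it does not, while tbl2's own stale file is still found.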

Are there any potential drawbacks or edge cases that I haven't considered?

*Note:* This proposal does not address the following scenarios:
1. Tables nested within the `data` or `metadata` directories of another
table.
Example:
*tbl1* -> dir/tbl1
*tbl2* -> dir/tbl1/data/tbl2
2. Two tables that share the same location.
There are related discussions on location ownership here
<https://github.com/apache/iceberg/issues/4159> and here
<https://github.com/apache/iceberg/issues/9133>.

Eager to hear your feedback here or on the PR. Thank you!

- Karuppayya
