Hi Everyone,
I would like to add support for repartitioning data files that were
written under an old partition spec, as described in detail below. Please
take a look at my PR: https://github.com/apache/iceberg/pull/16190.

Improvement

Problem:
How to efficiently and reliably repartition only the data files belonging
to the old partition specification so they conform to the new partition
specification, without unnecessarily rewriting or impacting data files
already written using the new spec?

Example:
Suppose we have evolved the table's partition specification by adding a new
partition field, day, on top of an existing field, month. After a few
months, we want to repartition all the old data files, which are
partitioned by month only, so that they are also partitioned by day.
Currently, if those files are already at the target size, the rewrite job
does not pick them up, so they remain partitioned under the old spec.

Explored solution using existing code and feature flags:

Since our use case is to rewrite data files from the old partition spec
into the new spec, we have to use rewrite-all=true. Otherwise the rewrite
job skips files that are already at the target size (512 MB by default)
and file groups that contain only one file, even though those files still
need to be rewritten to the new spec.
Based on a suggestion by @pvary <https://github.com/pvary> on an old PR,
#12083 (comment)
<https://github.com/apache/iceberg/pull/12083#issuecomment-2751808447>, and
on the current code, I thought we could combine rewrite-all=true with a
filter on some column value, for example a timestamp filter such as
month <= 2025-06, so that only the old data files are selected for
rewriting. To rewrite a huge number of data files efficiently, we also have
to use partial-progress.enabled and partial-progress.max-commits, so that
if the job fails halfway we don't need to start from scratch.

Why won't this work?

Suppose there are many files to rewrite and the job fails halfway. When we
rerun it with the same filter, it picks up the same files again, even if,
say, 50% of them were already rewritten successfully. We could refine the
filter after every iteration so that it selects only the remaining old
files, but that puts a lot of work on the end user, because currently we
can't filter data files by spec ID.
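
To make the rerun problem concrete, here is a small simulation in plain
Python. It is only a toy model, not Iceberg code: DataFile,
select_by_filter and the file names are made up for illustration, and the
"rewrite" is just a list transformation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int  # partition spec the file was written with
    month: str    # value of the month partition column, e.g. "2025-03"


def select_by_filter(files, max_month):
    """Model of rewrite-all=true plus a column filter: every match is selected."""
    return [f for f in files if f.month <= max_month]


OLD_SPEC, NEW_SPEC = 0, 1
files = [DataFile(f"f{i}.parquet", OLD_SPEC, "2025-03") for i in range(4)]

# First run: two of the four files are rewritten and committed
# (partial progress), then the job fails.
first_run = select_by_filter(files, "2025-06")
rewritten = [DataFile(f"r-{f.path}", NEW_SPEC, f.month) for f in first_run[:2]]
files = rewritten + first_run[2:]

# Second run with the same filter: the rewritten files still match the
# month predicate, so they are selected again.
second_run = select_by_filter(files, "2025-06")
redundant = [f for f in second_run if f.spec_id == NEW_SPEC]
print(len(second_run), len(redundant))
```

In this model the second run selects all four files, two of which already
use the new spec, so the committed progress from the first run is not
taken into account.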

Suggested code change

Based on the reasons above, I suggest enabling this use case with a new
flag, rewrite-partition-spec-mismatch, used together with
partial-progress.enabled.
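
As a rough illustration of the intended behavior (a toy model in plain
Python, not the code in the PR), selecting files by spec mismatch makes a
rerun resume where the failed run stopped. DataFile and
select_spec_mismatch are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int  # partition spec the file was written with


def select_spec_mismatch(files, current_spec_id):
    """Select only files whose spec differs from the table's current spec."""
    return [f for f in files if f.spec_id != current_spec_id]


OLD_SPEC, NEW_SPEC = 0, 1
files = [DataFile(f"f{i}.parquet", OLD_SPEC) for i in range(4)]

# First run: all four old-spec files are selected, two are rewritten and
# committed (partial progress), then the job fails.
first_run = select_spec_mismatch(files, NEW_SPEC)
rewritten = [DataFile(f"r-{f.path}", NEW_SPEC) for f in first_run[:2]]
files = rewritten + first_run[2:]

# Rerun: only the two files still on the old spec are selected.
second_run = select_spec_mismatch(files, NEW_SPEC)
print([f.path for f in second_run])
```

Because the selection depends only on each file's spec ID, the user does
not need to adjust any filter between runs.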

Happy to try out any other suggestions for achieving this use case.


Thanks,

Mukund
