Could someone please take a look at this and provide some feedback? Thanks. I am also working on an optimized repartitioning algorithm in PR #16515.

On Wed, May 13, 2026 at 1:40 PM Mukund Thakur <[email protected]> wrote:

> Hi Everyone,
>
> I would like to add support for repartitioning data files written with
> an old partition spec, as described in detail below. Please take a look
> at my PR https://github.com/apache/iceberg/pull/16190.
>
> Improvement
>
> Problem:
> How can we efficiently and reliably repartition only the data files
> belonging to the old partition spec so that they conform to the new
> partition spec, without unnecessarily rewriting or impacting data files
> already written using the new spec?
>
> Example:
> Suppose we have evolved the table's partition spec by adding a new
> partition field, day, on top of an existing field, month. After a few
> months, we want to repartition all the old month data files so that
> they are partitioned by day. Currently, if those files are already of
> the desired size, they won't get picked up and will remain partitioned
> by the old spec.
>
> Explored solution using existing code and feature flags:
>
> As our use case is to rewrite the old partition spec data files into
> new spec data files, we have to use rewrite-all=true. Without it, the
> rewrite job skips files that are already of the desired size (for
> example, 512 MB by default) and groups that contain only one file, but
> we would still need to rewrite those files to the desired spec.
> Based on a suggestion by @pvary <https://github.com/pvary> on an old PR,
> #12083 (comment)
> <https://github.com/apache/iceberg/pull/12083#issuecomment-2751808447>,
> and looking at the current code, I thought we could apply
> rewrite-all=true together with a filter on some column value, for
> example a timestamp (month <= 2025-06), to select only the old data
> files for rewriting. To efficiently rewrite a huge number of data
> files, we also have to use partial-progress.enabled and
> partial-progress.max-commits, so that if the job fails halfway we don't
> need to start from scratch.
>
> Why won't this work?
>
> Suppose there are many files to rewrite and the job fails halfway. When
> we rerun it with the same filter, it will pick up the same files again,
> even if we have already rewritten, say, 50% of them successfully. We
> could refine the filter after every iteration so that it picks up only
> the remaining old files, but that puts a lot of work on the end user,
> because currently we can't filter data files by spec ID.
>
> Suggested code change
>
> For the above reasons, I suggest enabling this use case with the new
> flag rewrite-partition-spec-mismatch, used together with
> partial-progress.enabled.
>
> Happy to try out any other suggestions for achieving this use case.
>
> Thanks,
> Mukund
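
To make the rerun problem above concrete, here is a toy Java model. It is only a sketch and not Iceberg code: FileInfo, selectByFilter, selectBySpecMismatch, rewriteFirst and sampleTable are hypothetical names made up for illustration, and the spec-mismatch selection is my assumption of how the proposed flag would behave, not what the PR actually implements. It compares a column filter under rewrite-all with selection by spec ID after a job that committed only part of its work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Toy model only: none of these types or methods are Iceberg classes.
class SpecMismatchSketch {
    // A data file reduced to the attributes that matter for this argument.
    static final class FileInfo {
        final String path;
        final int specId;
        final String month; // e.g. "2025-06"

        FileInfo(String path, int specId, String month) {
            this.path = path;
            this.specId = specId;
            this.month = month;
        }
    }

    // Approach 1: rewrite-all plus a column filter (month <= cutoff).
    // The spec ID is ignored, so already rewritten files still match.
    static List<FileInfo> selectByFilter(List<FileInfo> files, String cutoffMonth) {
        return files.stream()
            .filter(f -> f.month.compareTo(cutoffMonth) <= 0)
            .collect(Collectors.toList());
    }

    // Approach 2 (assumed behaviour of the proposed flag): select only
    // files whose spec ID differs from the table's current spec ID.
    static List<FileInfo> selectBySpecMismatch(List<FileInfo> files, int currentSpecId) {
        return files.stream()
            .filter(f -> f.specId != currentSpecId)
            .collect(Collectors.toList());
    }

    // Simulates a job that committed partial progress and then failed:
    // the first `count` old-spec files are replaced by current-spec files.
    static List<FileInfo> rewriteFirst(List<FileInfo> files, int count, int currentSpecId) {
        List<FileInfo> result = new ArrayList<>();
        int rewritten = 0;
        for (FileInfo f : files) {
            if (f.specId != currentSpecId && rewritten < count) {
                result.add(new FileInfo(f.path + "-rewritten", currentSpecId, f.month));
                rewritten++;
            } else {
                result.add(f);
            }
        }
        return result;
    }

    // Four files under the old spec (ID 0) and one under the new spec (ID 1).
    static List<FileInfo> sampleTable() {
        List<FileInfo> files = new ArrayList<>();
        files.add(new FileInfo("a", 0, "2025-03"));
        files.add(new FileInfo("b", 0, "2025-04"));
        files.add(new FileInfo("c", 0, "2025-05"));
        files.add(new FileInfo("d", 0, "2025-06"));
        files.add(new FileInfo("e", 1, "2025-07"));
        return files;
    }

    public static void main(String[] args) {
        // The job rewrites two of the four old files, then fails.
        List<FileInfo> afterPartialRun = rewriteFirst(sampleTable(), 2, 1);
        // The rerun with the same column filter selects all four files again.
        System.out.println(selectByFilter(afterPartialRun, "2025-06").size()); // prints 4
        // Selection by spec mismatch picks up only the two remaining files.
        System.out.println(selectBySpecMismatch(afterPartialRun, 1).size()); // prints 2
    }
}
```

In this model the column filter cannot tell a rewritten file from an old one because both carry the same month value, whereas the spec ID changes as soon as a file is rewritten, so a rerun naturally resumes where the failed job stopped.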
