Hi everyone,

I would like to add support for repartitioning data files that were written with an old partition spec, as described in detail below. Please take a look at my PR: https://github.com/apache/iceberg/pull/16190

Improvement

Problem: How can we efficiently and reliably repartition only the data files written with the old partition spec so that they conform to the new spec, without unnecessarily rewriting or otherwise affecting data files already written with the new spec?

Example: Suppose we have evolved the table's partition spec by adding a new partition field, day, on top of an existing field, month. After a few months, we want to repartition all the old month-partitioned data files so that they follow the day partitioning. Currently, if those files are already of the desired size, they are not picked up by the rewrite and therefore remain partitioned by the old spec.

Explored solution using existing code and feature flags: Since our use case is to rewrite data files from the old spec into the new spec, we have to use rewrite-all=true. Otherwise the rewrite job skips files that are already of the desired size (for example, 512 MB by default) and groups that contain only one file, even though we still need to rewrite them to the desired spec. Based on a suggestion by @pvary <https://github.com/pvary> on an older PR, #12083 (comment) <https://github.com/apache/iceberg/pull/12083#issuecomment-2751808447>, and after looking at the current code, I thought we could combine rewrite-all=true with a filter on some column value, for example a timestamp filter such as month <= 2025-06, to select only the old data files for rewriting. To rewrite a huge number of data files efficiently, we also have to use partial-progress.enabled and partial-progress.max-commits, so that if the job fails halfway we don't need to start from scratch.

Why this won't work: Suppose there are a large number of files to rewrite and the job fails halfway. When we rerun it with the same filter, it picks up the same files again, even if, say, 50% of them were already rewritten successfully.
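
To make the rerun problem concrete, here is a toy model in plain Python. It is not Iceberg code: the DataFile class, the two select functions, and the spec IDs 0 and 1 are made up for illustration, and the "failure" is simulated by rewriting only the first 5 of 10 selected files.

```python
from dataclasses import dataclass, replace

OLD_SPEC, NEW_SPEC = 0, 1  # hypothetical spec IDs, for illustration only


@dataclass(frozen=True)
class DataFile:
    path: str
    month: str    # value of the column used in the filter, e.g. "2025-03"
    spec_id: int  # partition spec the file was written with


def select_by_column(files, max_month):
    """Models rewrite-all=true plus a column filter: the spec is ignored."""
    return [f for f in files if f.month <= max_month]


def select_by_spec(files, current_spec):
    """Models selecting only files whose spec differs from the current one."""
    return [f for f in files if f.spec_id != current_spec]


def rewrite(files, selected, limit):
    """Rewrite at most `limit` selected files into the new spec, then stop."""
    done = {f.path for f in selected[:limit]}
    return [replace(f, spec_id=NEW_SPEC) if f.path in done else f for f in files]


files = [DataFile(f"f{i}", "2025-03", OLD_SPEC) for i in range(10)]

# First run: 5 of the 10 files are committed before the job fails.
after_failure = rewrite(files, select_by_column(files, "2025-06"), limit=5)

# Rerun: what does each selection strategy pick up?
rerun_column = select_by_column(after_failure, "2025-06")
rerun_spec = select_by_spec(after_failure, NEW_SPEC)
print(len(rerun_column), len(rerun_spec))  # prints "10 5"
```

In this model the column filter selects all 10 files again on the rerun, because the 5 rewritten files still match month <= 2025-06, while a spec-based selection picks up only the 5 files that are still on the old spec.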
We could refine the filter after every iteration so that it picks only the remaining old files, but that puts a lot of work on the end user, because currently we can't filter data files by spec ID.

Suggested code change: For the reasons above, I suggest enabling this use case with a new flag, rewrite-partition-spec-mismatch, used together with partial-progress.enabled. I'm happy to try out any other suggestions for achieving this use case.

Thanks,
Mukund
