mukund-thakur opened a new issue, #16514:
URL: https://github.com/apache/iceberg/issues/16514

   ### Feature Request / Improvement
   
   **Problem**: 
   Let’s assume we have a large existing Iceberg table which is currently 
partitioned by month. After a few years, we would like to evolve the partition 
to have month and day as well. And now we want to rewrite all old data files 
using the current partition spec which is (month, day). As per the current 
algorithm, all the old partition spec data files are grouped in a single big 
group.  If the data to be partitioned is large, a user will want to enable 
partial-progress. But then, the files are randomly split into multiple spark 
jobs.  Thus, one partition gets processed in multiple spark jobs, which leads 
to small files in the resulting partition.  These small files often require yet 
another round of compaction.
   
   
   **Why it creates small files:**
   Suppose there are 15TB of old spec data files. It will get broken into 150 
spark shuffle jobs each processing 100GB of data. As the files are random in 
each group, every job can write files to all new partitions thus potentially 
leading to max of 150 files in each output spec partition. 
   
   **Current Solution:** 
   We have to run a separate compaction job to reduce the number of output 
files in each output partition. 
   
   **Proposed Solution:**
   We can optimize the algorithm to create smaller groups of files per old 
partition even for older spec files if the current spec satisfies the older 
spec. By satisfies, we mean whether the new partition spec has the same 
ordering as the old partition spec. For example, the new partition by day on a 
timestamp field satisfies the old partition by month on the same timestamp 
field but vice-versa is not true. 
   
   
   
   ### Query engine
   
   None
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to