davseitsev commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-668660110


   @rdblue the logic you described doesn't work for me. I have a Spark 
Structured Streaming job that writes to a daily-partitioned table and produces 
files of varying sizes, mostly small ones. I ran `RewriteDataFilesAction` a few 
times on a single partition.
   The first run produced 86 files, each up to 128 MB as I configured. The 
second run took all of these files and compacted them again, merging them with 
the very small files. The third run again took all the files produced by the 
previous run and compacted them once more.
   
   I understand that eventually the compacted files will get so close to 
`targetSize` that they will be ignored, but until that happens we have to 
rewrite gigabytes of data over and over. It also doesn't work for us because 
compaction is only relevant for the current day. At the start of each new day 
we run a major compaction with deduplication, sorting, etc., which effectively 
"closes" the previous day's partition; it is never modified again. The 
intermediate minor compactions exist only to keep clients from reading 
thousands of small files.
   
   Applying a filter on a row timestamp could improve the situation, but since 
we have many tables of completely different sizes, we would have to choose the 
right compaction period for each table to get output files of a reasonable 
size. That is really hard to manage because table sizes vary, new tables get 
added, etc. We also have late-arriving records that would never be considered 
for compaction if the file contains no fresh records.
   
   It would be really nice to have a separate configuration that limits the 
size of files eligible for rewriting, or a predicate like @JingsongLi 
suggested. 
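   Such a cutoff could be as simple as skipping files that are already close 
to the target size. Below is a minimal sketch of that selection logic. All 
names (`RewriteCandidateFilter`, `selectCandidates`, the 75% cutoff) are 
hypothetical illustrations, not Iceberg's actual `RewriteDataFilesAction` 
implementation:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: pick only files worth rewriting. A file is a
// candidate when it is smaller than minRewriteSizeBytes (e.g. 75% of a
// 128 MB target), so files already near targetSize are left alone
// instead of being rewritten on every run.
public class RewriteCandidateFilter {

    // Returns the sizes of the files that should be rewritten.
    public static List<Long> selectCandidates(List<Long> fileSizesBytes,
                                              long minRewriteSizeBytes) {
        return fileSizesBytes.stream()
                .filter(size -> size < minRewriteSizeBytes)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long targetSize = 128L * 1024 * 1024;      // 128 MB target
        long minRewriteSize = targetSize * 3 / 4;  // hypothetical 75% cutoff

        // One compacted ~120 MB file plus two small streaming-commit files.
        List<Long> sizes = List.of(120L * 1024 * 1024,
                                   5L * 1024 * 1024,
                                   2L * 1024 * 1024);
        List<Long> candidates = selectCandidates(sizes, minRewriteSize);
        System.out.println(candidates.size()); // prints 2: only the small files
    }
}
```

   With a cutoff like this, the second and third runs described above would 
leave the already-compacted ~128 MB files untouched and only merge the new 
small files.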
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


