RussellSpitzer commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3993965117
@shangxinli I understand the benefit of having fewer files. I'm speaking mostly about the code here https://github.com/apache/iceberg/blob/50d310aef17908f03f595d520cd751527483752a/core/src/main/java/org/apache/iceberg/BaseContentScanTask.java#L99-L115

When we break this up into tasks, we end up with a single task per row group, so adding row groups to the same file should have the same data file read performance as having them in separate files. It would reduce the manifest space used, but I'm wondering whether it's really that much better than just compacting all the data files at a regular interval.

I'm not sure how Trino would behave in a similar situation, but what I'm worried about is that we are essentially creating a different kind of small file problem by making parquet files with very tiny row groups inside them. We are essentially just moving "data file entries" into "row group entries": the metadata still exists, it's just in the parquet footers and manifest offsets.
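To make the per-row-group splitting concrete, here is a minimal sketch of the idea (this is not Iceberg's actual `BaseContentScanTask` code; the class and method names are hypothetical): splitting a data file at the row-group byte offsets recorded in its parquet footer yields one scan task per row group, so N tiny row groups in one file produce roughly the same number of scan tasks as N separate small files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Iceberg's real API.
public class RowGroupSplitSketch {
  // splitOffsets: byte offset where each row group starts (as recorded in the
  // parquet footer / manifest split offsets); fileLength: total file size.
  // Returns one {offset, length} pair per row group, i.e. one task each.
  public static List<long[]> split(long[] splitOffsets, long fileLength) {
    List<long[]> tasks = new ArrayList<>();
    for (int i = 0; i < splitOffsets.length; i++) {
      long start = splitOffsets[i];
      long end = (i + 1 < splitOffsets.length) ? splitOffsets[i + 1] : fileLength;
      tasks.add(new long[] {start, end - start}); // one scan task per row group
    }
    return tasks;
  }

  public static void main(String[] args) {
    // Three row groups in a single 10,000-byte file -> three scan tasks,
    // the same task count as three separate small files.
    List<long[]> tasks = split(new long[] {4, 2048, 8192}, 10_000);
    System.out.println(tasks.size() + " tasks");
  }
}
```

The point of the sketch is that merging small files without growing the row groups does not reduce the number of scan tasks; it only relocates the per-unit metadata from manifest entries into parquet footers and split offsets.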
