RussellSpitzer commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3993965117
@shangxinli I understand the benefit of having fewer files. I'm speaking mostly about the code here https://github.com/apache/iceberg/blob/50d310aef17908f03f595d520cd751527483752a/core/src/main/java/org/apache/iceberg/BaseContentScanTask.java#L99-L115

When we break this up into tasks, we end up with a single task per row group, so adding row groups to the same file should have the same data file read performance as having them in separate files. It would reduce the manifest space used, but I'm wondering whether it's really that much better than just compacting all the data files at a regular interval.

I'm not sure how Trino would behave in a similar situation, but what I'm worried about is that we are essentially creating a different kind of small file problem by making parquet files with very tiny row groups inside them. We are essentially just moving "data file entries" into "row group entries": the metadata still exists, it's just in the parquet footers and manifest offsets.
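To make the per-row-group splitting concrete, here is a minimal sketch of the idea (this is not Iceberg's actual `BaseContentScanTask` code; the class and method names are hypothetical): splitting a data file at the row-group byte offsets recorded in its parquet footer yields one scan task per row group, so N tiny row groups in one file produce roughly the same number of scan tasks as N separate small files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Iceberg's real API.
public class RowGroupSplitSketch {
  // splitOffsets: byte offset where each row group starts (as recorded in the
  // parquet footer / manifest split offsets); fileLength: total file size.
  // Returns one {offset, length} pair per row group, i.e. one task each.
  public static List<long[]> split(long[] splitOffsets, long fileLength) {
    List<long[]> tasks = new ArrayList<>();
    for (int i = 0; i < splitOffsets.length; i++) {
      long start = splitOffsets[i];
      long end = (i + 1 < splitOffsets.length) ? splitOffsets[i + 1] : fileLength;
      tasks.add(new long[] {start, end - start}); // one scan task per row group
    }
    return tasks;
  }

  public static void main(String[] args) {
    // Three row groups in a single 10,000-byte file -> three scan tasks,
    // the same task count as three separate small files.
    List<long[]> tasks = split(new long[] {4, 2048, 8192}, 10_000);
    System.out.println(tasks.size() + " tasks");
  }
}
```

The point of the sketch is that merging small files without growing the row groups does not reduce the number of scan tasks; it only relocates the per-unit metadata from manifest entries into parquet footers and split offsets.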
