RussellSpitzer commented on a change in pull request #2591: URL: https://github.com/apache/iceberg/pull/2591#discussion_r635712203
##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -162,4 +163,27 @@ private void validateOptions() {
         "Cannot set %s is less than 1. All values less than 1 have the same effect as 1. %d < 1",
         MIN_INPUT_FILES, minInputFiles);
   }
+
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Ideally every Spark Task that is generated will be less than or equal to our target size but
+   * in practice this is not the case. When we actually write our files, they may exceed the target
+   * size and end up being split. This would end up producing 2 files out of one task, one target sized
+   * and one very small file. Since the output file can vary in size, it is better to
+   * use a slightly larger (but still within threshold) size for actually writing the tasks out.

Review comment:
At least on Parquet, differences in compression and encoding seem to be issues here. @aokolnychyi has more info, but one hypothesis was that smaller files used dictionary encoding while larger files did not. Most of the experience with this comes from production use cases where users had large numbers of small files.
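
To make the reasoning in the quoted Javadoc concrete, here is a minimal sketch of the arithmetic. It is not the code from this pull request: the class `WriteSizeSketch`, the helpers `writeMaxFileSize` and `filesWritten`, the midpoint rule, and the 512 MB / 768 MB / 530 MB figures are all assumptions chosen for illustration.

```java
// Hypothetical sketch, not the implementation from the pull request.
final class WriteSizeSketch {
  private WriteSizeSketch() {}

  /**
   * Assumed rule: roll files over at the midpoint between the target size and
   * the largest size still considered acceptable.
   */
  static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
    if (targetFileSize <= 0 || maxFileSize < targetFileSize) {
      throw new IllegalArgumentException("Expected 0 < targetFileSize <= maxFileSize");
    }
    return targetFileSize + (maxFileSize - targetFileSize) / 2;
  }

  /** Number of files produced when outputBytes are written with a roll-over at limit. */
  static long filesWritten(long outputBytes, long limit) {
    return (outputBytes + limit - 1) / limit; // ceiling division
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long target = 512 * mb; // assumed target file size
    long max = 768 * mb;    // assumed upper threshold
    long output = 530 * mb; // a task planned near the target that wrote slightly more

    // Rolling over at the target splits the output into one full file and one tiny file.
    System.out.println(filesWritten(output, target)); // prints 2

    // Rolling over at the larger write limit keeps the output in a single file.
    System.out.println(filesWritten(output, writeMaxFileSize(target, max))); // prints 1
  }
}
```

With these assumed numbers the write limit is 640 MB, so output that overshoots the 512 MB target by a few percent still lands in one file rather than a 512 MB file plus an 18 MB remainder.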