amogh-jahagirdar commented on code in PR #13555:
URL: https://github.com/apache/iceberg/pull/13555#discussion_r2212581830
##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -300,7 +303,7 @@ public void testBinPackAfterPartitionChange() {
Integer.toString(averageFileSize(table) + 1000))
.option(
RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
- Integer.toString(averageFileSize(table) + 1001))
+ Integer.toString(averageFileSize(table) + 11000))
Review Comment:
This is one of three test cases where I made a similar change to
increase the target file size or max file write size.
It's not the real solution. After the changes in this PR to preserve
lineage, we write slightly more data for the extra lineage fields on
materialization. They compress well on disk, but they still slightly
throw off the number of output files from the rewrite.
Specifically, with the extra columns present we hit the max write file size
sooner, at which point the writer rolls over. Most of the output files are
the appropriate size, but we also produce additional small files.
Slightly increasing the target file size avoids the small files that come
from rolling over a little more aggressively than needed.
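
To illustrate the rollover effect described above, here is a minimal sketch. It is a simplified model, not Iceberg's actual writer: `RollOverSketch`, `outputFileSizes`, and all byte counts are hypothetical, and it assumes the writer fills each file up to a fixed limit before rolling over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a rolling writer (not Iceberg's implementation).
class RollOverSketch {

  // Returns the sizes of the files produced when totalBytes are written and the
  // writer rolls over to a new file each time the current one reaches maxWriteSize.
  static List<Long> outputFileSizes(long totalBytes, long maxWriteSize) {
    if (maxWriteSize <= 0) {
      throw new IllegalArgumentException("maxWriteSize must be positive");
    }
    List<Long> sizes = new ArrayList<>();
    long remaining = totalBytes;
    while (remaining > 0) {
      long size = Math.min(remaining, maxWriteSize);
      sizes.add(size);
      remaining -= size;
    }
    return sizes;
  }

  public static void main(String[] args) {
    // Baseline: the data fits exactly into four full files.
    System.out.println(outputFileSizes(4000, 1000));
    // Assumed 1% overhead from extra columns: a fifth, tiny file appears.
    System.out.println(outputFileSizes(4040, 1000));
    // A slightly larger limit absorbs the overhead: four files again.
    System.out.println(outputFileSizes(4040, 1010));
  }
}
```

In this model a 1% increase in written bytes is enough to add a fifth file of only 40 bytes, and raising the limit by the same 1% removes it, which is the effect the adjusted test constants work around.
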
I think for V3+ tables we should re-evaluate the default 1.8x ratio, because
the lineage fields will be required, and probably bump it up so we don't
regress on specific compaction workloads. cc @aokolnychyi @stevenzwu
@RussellSpitzer
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]