openinx commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r708216986
##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
- Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+ // Check the size of delete file as well to avoid unbalanced bin-packing
+ Function<FileScanTask, Long> weightFunc = file -> Math.max(
+ file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),
Review comment:
Should we also consider the cost of the different join algorithms used
between delete files and data files?
For equality delete files, the cost of joining with a data file is:
`file.recordCount() * sum(eqDeleteFile.recordCount()) * avgRecordByteSize`.
For positional delete files, the cost of joining with a data file is:
`file.length() + sum(posFiles.fileSizeInBytes())`.
The current approach seems to treat all delete files as positional
delete files...
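
To make the two cost formulas above concrete, here is a rough sketch of a weight
function that separates the two kinds of delete files. It is not the PR's code and
does not use Iceberg types: `SplitWeightSketch`, `DeleteInfo`, and `DeleteKind` are
hypothetical stand-ins. Deriving `avgRecordByteSize` as data bytes divided by data
records, and adding the two costs together, are both assumptions of mine.

```java
import java.util.List;

// Hypothetical sketch; none of these names exist in Iceberg.
class SplitWeightSketch {
  enum DeleteKind { POSITION, EQUALITY }

  // Minimal stand-in for the delete-file metadata the formulas need.
  static final class DeleteInfo {
    final DeleteKind kind;
    final long sizeInBytes;
    final long recordCount;

    DeleteInfo(DeleteKind kind, long sizeInBytes, long recordCount) {
      this.kind = kind;
      this.sizeInBytes = sizeInBytes;
      this.recordCount = recordCount;
    }
  }

  // Multiply non-negative values, clamping to Long.MAX_VALUE on overflow.
  static long saturatingMultiply(long a, long b) {
    try {
      return Math.multiplyExact(a, b);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE;
    }
  }

  // Add non-negative values, clamping to Long.MAX_VALUE on overflow.
  static long saturatingAdd(long a, long b) {
    try {
      return Math.addExact(a, b);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE;
    }
  }

  static long weight(long dataSizeInBytes, long dataRecordCount,
                     List<DeleteInfo> deletes, long openFileCost) {
    long posDeleteBytes = 0L;
    long eqDeleteRecords = 0L;
    for (DeleteInfo delete : deletes) {
      if (delete.kind == DeleteKind.POSITION) {
        posDeleteBytes = saturatingAdd(posDeleteBytes, delete.sizeInBytes);
      } else {
        eqDeleteRecords = saturatingAdd(eqDeleteRecords, delete.recordCount);
      }
    }

    // Positional cost: file.length() + sum(posFiles.fileSizeInBytes())
    long posCost = saturatingAdd(dataSizeInBytes, posDeleteBytes);

    // Assumption: average record size = data bytes / data records.
    long avgRecordByteSize = dataRecordCount == 0 ? 0L : dataSizeInBytes / dataRecordCount;

    // Equality cost: recordCount * sum(eqDeleteFile.recordCount()) * avgRecordByteSize
    long eqCost = saturatingMultiply(
        saturatingMultiply(dataRecordCount, eqDeleteRecords), avgRecordByteSize);

    // Assumption: the two costs are summed, then floored at openFileCost.
    return Math.max(saturatingAdd(posCost, eqCost), openFileCost);
  }
}
```

With no delete files this reduces to `Math.max(file.length(), openFileCost)`, the
behavior before the change. The equality term grows multiplicatively, so it may
dominate the weight for large data files and would probably need a cap or scaling
before being used for bin-packing.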
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]