rdblue commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915646119
This doesn't seem like the right way to solve the underlying problem: some splits
are too expensive to process because they reference too many delete files. One
symptom that this is the wrong layer is that there's no obvious way to choose the
maximum number of items per bin. I think the right approach is to modify the
weight function that we use for splits, _not_ the bin packing algorithm itself.
A straightforward fix is to make the weight function account for the penalty
that each delete file adds. Right now, the weight function in `TableScanUtil`
is:
```java
Function<FileScanTask, Long> weightFunc =
    file -> Math.max(file.length(), openFileCost);
```
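For context, that weight is what the bin packing utility accumulates against the target split size when combining tasks. Roughly, the wiring looks like the sketch below; the exact `BinPacking.ListPacker` call shape is from memory, so treat it as an approximation rather than a verbatim quote of `TableScanUtil`:
```java
// Approximate sketch: bins are filled until the sum of task weights
// reaches the target split size (read.split.target-size).
List<List<FileScanTask>> bins =
    new BinPacking.ListPacker<FileScanTask>(splitSize, lookback, true)
        .pack(splitFiles, weightFunc);
```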
That could easily be updated so that each delete file that must be opened
contributes an additional open cost:
```java
Function<FileScanTask, Long> weightFunc =
    file -> Math.max(file.length(), (1 + file.deletes().size()) * openFileCost);
```
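To make the effect concrete, here is a quick back-of-the-envelope check (the numbers are illustrative, assuming the 4 MB default for `read.split.open-file-cost`):
```java
// Illustrative values, assuming the 4 MB default open file cost.
long openFileCost = 4L * 1024 * 1024;
long dataFileLength = 1L * 1024 * 1024; // a small 1 MB data file
int deleteFileCount = 3;                // with three attached delete files

// Current weight: max(1 MB, 4 MB) = 4 MB, so 32 such tasks fit in a 128 MB bin.
long currentWeight = Math.max(dataFileLength, openFileCost);

// Delete-aware weight: max(1 MB, (1 + 3) * 4 MB) = 16 MB, so only 8 fit,
// which caps how many delete files a single split must open.
long deleteAwareWeight = Math.max(dataFileLength, (1L + deleteFileCount) * openFileCost);
```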
Another option is to account for the size of the delete data as well:
```java
Function<FileScanTask, Long> weightFunc = file -> Math.max(
    file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),
    (1 + file.deletes().size()) * openFileCost);
```
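To show that adjusting the weight alone is enough to spread delete-heavy files across bins, here is a self-contained toy; the `Task` record and the first-fit packer are stand-ins invented for illustration, not Iceberg's `BinPacking`:
```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy stand-ins for FileScanTask and the packer, only to demonstrate that the
// weight function alone keeps delete-heavy files from piling into one bin.
public class DeleteAwareWeightSketch {
  record Task(long length, int deleteCount, long deleteBytes) {}

  // Naive first-fit packing: place each task in the first bin with room left.
  static List<List<Task>> pack(List<Task> tasks, long targetWeight, Function<Task, Long> weight) {
    List<List<Task>> bins = new ArrayList<>();
    List<Long> totals = new ArrayList<>();
    for (Task task : tasks) {
      long w = weight.apply(task);
      int bin = -1;
      for (int i = 0; i < bins.size(); i += 1) {
        if (totals.get(i) + w <= targetWeight) {
          bin = i;
          break;
        }
      }
      if (bin < 0) {
        bins.add(new ArrayList<>());
        totals.add(0L);
        bin = bins.size() - 1;
      }
      bins.get(bin).add(task);
      totals.set(bin, totals.get(bin) + w);
    }
    return bins;
  }

  public static void main(String[] args) {
    long openFileCost = 4L << 20;   // assuming the 4 MB default open cost
    long targetWeight = 128L << 20; // assuming a 128 MB target split size

    // The delete-aware weight from above: delete bytes count toward the split,
    // and every delete file adds another open cost to the floor.
    Function<Task, Long> weightFunc = task -> Math.max(
        task.length() + task.deleteBytes(),
        (1L + task.deleteCount()) * openFileCost);

    List<Task> tasks = new ArrayList<>();
    for (int i = 0; i < 64; i += 1) {
      tasks.add(new Task(2L << 20, 5, 10L << 20)); // 2 MB files, 5 delete files each
    }

    // Each task weighs max(12 MB, 24 MB) = 24 MB, so a 128 MB bin holds 5 tasks
    // and the 64 tasks spread across 13 bins instead of 2 under the old weight.
    System.out.println("bins: " + pack(tasks, targetWeight, weightFunc).size());
  }
}
```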
Let's try those options instead of modifying bin packing. I think modifying
the packing algorithm is the wrong direction.