rdblue commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915646119


   This doesn't seem like the right way to solve the underlying problem: some splits 
are too expensive because they reference too many delete files. One symptom of the 
problem is that there is no obvious way to set the maximum number of items per bin.
   
   I think the right approach is to modify the weight function that we use for 
splits, _not_ the bin packing algorithm itself.
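   
   For context, here's a minimal stand-alone sketch of how a weight function drives bin packing. This is an illustrative first-fit packer, not Iceberg's actual `BinPacking` implementation, and `WeightedPacker` / `targetWeight` are made-up names:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.Function;
   
   class WeightedPacker<T> {
     private final long targetWeight;
     private final Function<T, Long> weightFunc;
   
     WeightedPacker(long targetWeight, Function<T, Long> weightFunc) {
       this.targetWeight = targetWeight;
       this.weightFunc = weightFunc;
     }
   
     // First-fit: place each item into the first bin that still has capacity.
     List<List<T>> pack(Iterable<T> items) {
       List<List<T>> bins = new ArrayList<>();
       List<Long> binWeights = new ArrayList<>();
       for (T item : items) {
         long weight = weightFunc.apply(item);
         boolean placed = false;
         for (int i = 0; i < bins.size(); i += 1) {
           if (binWeights.get(i) + weight <= targetWeight) {
             bins.get(i).add(item);
             binWeights.set(i, binWeights.get(i) + weight);
             placed = true;
             break;
           }
         }
         if (!placed) {
           List<T> newBin = new ArrayList<>();
           newBin.add(item);
           bins.add(newBin);
           binWeights.add(weight);
         }
       }
       return bins;
     }
   }
   ```
   
   Because bins are bounded by total weight, a weight function that charges for delete files naturally limits how much delete work lands in a single split, with no separate cap on items per bin.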
   
   A straightforward fix is to make the weight function account for the penalty 
that each delete file adds. Right now, the weight function in `TableScanUtil` is:
   
   ```java
       Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
   ```
   
   That could easily be updated so that each delete file that must be opened is 
added to the open cost:
   
   ```java
       Function<FileScanTask, Long> weightFunc = file ->
           Math.max(file.length(), (1 + file.deletes().size()) * openFileCost);
   ```
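   
   To make the effect concrete (all numbers here are hypothetical): with a 4 MB open cost, a 16 MB data file with 5 delete files would weigh 24 MB instead of 16 MB, so fewer such tasks pack into a single bin:
   
   ```java
   long openFileCost = 4L * 1024 * 1024;     // hypothetical: 4 MB open cost
   long dataFileLength = 16L * 1024 * 1024;  // hypothetical: 16 MB data file
   int numDeletes = 5;                       // hypothetical: 5 delete files
   
   long currentWeight = Math.max(dataFileLength, openFileCost);                     // 16 MB
   long proposedWeight = Math.max(dataFileLength, (1 + numDeletes) * openFileCost); // 24 MB
   ```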
   
   Another option is to account for the size of the delete data as well:
   
   ```java
       Function<FileScanTask, Long> weightFunc = file -> Math.max(
           file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),
           (1 + file.deletes().size()) * openFileCost);
   ```
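   
   Continuing the same hypothetical numbers, this version also counts the delete bytes that have to be read, which matters when the delete files themselves are large:
   
   ```java
   long openFileCost = 4L * 1024 * 1024;     // hypothetical: 4 MB open cost
   long dataFileLength = 16L * 1024 * 1024;  // hypothetical: 16 MB data file
   int numDeletes = 5;                       // hypothetical: 5 delete files
   long deleteBytes = 20L * 1024 * 1024;     // hypothetical: 20 MB of delete data in total
   
   // total bytes to read, floored by the per-file open costs
   long weight = Math.max(dataFileLength + deleteBytes, (1 + numDeletes) * openFileCost);
   // max(36 MB, 24 MB) = 36 MB
   ```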
   
   Let's try those options instead of modifying bin packing. I think modifying 
the packing algorithm is the wrong direction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


