aokolnychyi commented on issue #77: Include the cost to open a file during 
split planning
URL: 
https://github.com/apache/incubator-iceberg/issues/77#issuecomment-464699643
 
 
   Once I implemented this locally, I started to wonder if we should add the 
opening cost to every file.
   
   For example, we might have two Parquet files of 62 MB each. Right now, Iceberg 
will pack them into one bin, which is the right thing to do. If we simply add 
4 MB to the weight of each file, each file weighs 66 MB, and the pair 
(132 MB) no longer fits into a single bin, assuming a 128 MB target split 
size. I am not sure it is a big deal, but I think having one bin is better in 
this case.
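
To make the arithmetic concrete, here is a minimal sketch of the effect. It is not Iceberg's bin-packing code: `pack` is a hypothetical greedy packer, and the 128 MB target is an assumption inferred from the numbers used in this comment.

```python
MB = 1024 * 1024


def pack(sizes, target, weight):
    """Greedy sequential packing: open a new bin when the next item would overflow."""
    bins, current, current_weight = [], [], 0
    for size in sizes:
        w = weight(size)
        if current and current_weight + w > target:
            bins.append(current)
            current, current_weight = [], 0
        current.append(size)
        current_weight += w
    if current:
        bins.append(current)
    return bins


files = [62 * MB, 62 * MB]
target = 128 * MB      # assumed target split size
open_cost = 4 * MB     # assumed per-file opening cost

# Weight is the raw file size: 62 + 62 = 124 MB fits into one bin.
no_cost = pack(files, target, lambda size: size)

# Weight is size plus opening cost: 66 + 66 = 132 MB needs two bins.
with_cost = pack(files, target, lambda size: size + open_cost)

print(len(no_cost), len(with_cost))
```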
   
   As our main goal is to avoid straggler tasks, we could add the opening cost 
only to files that are smaller than a configurable threshold, say 10 MB.
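
A possible shape for the threshold idea, as a sketch only. The function name `weight` and its parameters are hypothetical, and the 4 MB cost and 10 MB threshold are the example values from this comment, not existing Iceberg settings.

```python
MB = 1024 * 1024


def weight(size, open_cost=4 * MB, threshold=10 * MB):
    """Charge the opening cost only to files smaller than the threshold."""
    return size + open_cost if size < threshold else size


# Large files keep their raw size, so two 62 MB files still total 124 MB.
large_pair = weight(62 * MB) + weight(62 * MB)

# Small files are padded: a 1 MB file weighs 5 MB, which caps how many
# tiny files can land in one 128 MB bin (25 instead of 128).
small = weight(1 * MB)
max_small_files_per_bin = (128 * MB) // small

print(large_pair // MB, small // MB, max_small_files_per_bin)
```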
   
   Also, Spark adds the opening cost only once a file has been assigned to a 
partition. For example, if we have a bin with one 62 MB file, we will still 
place the second file into the same bin, because 62 + 4 + 62 = 128 MB.
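
The behaviour described above can be sketched as follows. This is an illustration of the idea, not Spark's actual code: `pack_cost_after_assignment` is a hypothetical name, and the 128 MB target is an assumption taken from the arithmetic in this comment.

```python
MB = 1024 * 1024


def pack_cost_after_assignment(sizes, target, open_cost):
    """Check the fit using the raw file size; charge the opening cost after assignment."""
    bins, current, current_weight = [], [], 0
    for size in sizes:
        # The incoming file is compared without its own opening cost.
        if current and current_weight + size > target:
            bins.append(current)
            current, current_weight = [], 0
        current.append(size)
        # Only files already in the bin carry the opening cost.
        current_weight += size + open_cost
    if current:
        bins.append(current)
    return bins


# (62 + 4) + 62 = 128 MB, so both files share one bin.
two_files = pack_cost_after_assignment([62 * MB, 62 * MB], 128 * MB, 4 * MB)

# A third 62 MB file does not fit and starts a second bin.
three_files = pack_cost_after_assignment([62 * MB] * 3, 128 * MB, 4 * MB)

print(len(two_files), len(three_files))
```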
   
   @rdblue @mccheah what do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
