aokolnychyi commented on issue #77: Include the cost to open a file during split planning
URL: https://github.com/apache/incubator-iceberg/issues/77#issuecomment-464699643

Once I implemented this locally, I started to wonder whether we should add the opening cost to every file. For example, we might have two Parquet files of 62 MB each. Right now, Iceberg will pack them into one bin, which is the right thing to do. If we simply add 4 MB to the weight of each file, those files will be placed into separate bins, since 66 + 66 = 132 exceeds a 128 MB bin. I am not sure it is a big deal, but I think having one bin is better in this case.

As our main goal is to avoid straggler tasks, we could add the opening cost only to files that are smaller than a configurable threshold, say 10 MB.

Also, Spark only adds the opening cost when a file is already assigned to a partition. For example, if we have a bin with one 62 MB file, we will still place the second file into the same bin, as 62 + 4 + 62 = 128.

@rdblue @mccheah what do you think?
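To make the three options concrete, here is a minimal Java sketch that counts the bins each one produces. It is not Iceberg's or Spark's actual planning code: the class `OpenCostSketch`, its method names, the simple sequential packing loop, and the 128 MB / 4 MB / 10 MB constants are all assumptions chosen to match the numbers in the comment.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch; names and constants are illustrative assumptions.
final class OpenCostSketch {
  static final long MB = 1024L * 1024L;
  static final long TARGET = 128 * MB;    // assumed target bin (split) size
  static final long OPEN_COST = 4 * MB;   // assumed cost of opening a file
  static final long SMALL_FILE = 10 * MB; // assumed "small file" threshold

  // Option 1: add the open cost to the weight of every file.
  static long weightAlways(long size) {
    return size + OPEN_COST;
  }

  // Option 2: add the open cost only to files below the threshold.
  static long weightIfSmall(long size) {
    return size < SMALL_FILE ? size + OPEN_COST : size;
  }

  // Sequential packing: start a new bin when the next weight does not fit.
  static int countBins(long[] sizes, LongUnaryOperator weight) {
    int bins = 0;
    long current = 0;
    for (long size : sizes) {
      long w = weight.applyAsLong(size);
      if (bins == 0 || current + w > TARGET) {
        bins++;
        current = w;
      } else {
        current += w;
      }
    }
    return bins;
  }

  // Option 3: charge the open cost only when a file joins a non-empty bin,
  // which is how the comment describes Spark's behavior.
  static int countBinsChargeOnAppend(long[] sizes) {
    int bins = 0;
    long current = 0;
    for (long size : sizes) {
      if (bins == 0 || current + size + OPEN_COST > TARGET) {
        bins++;
        current = size; // first file in a bin carries no open cost
      } else {
        current += size + OPEN_COST;
      }
    }
    return bins;
  }
}
```

For two 62 MB files, `countBins(files, OpenCostSketch::weightAlways)` yields two bins, while `weightIfSmall` and `countBinsChargeOnAppend` both keep them in one bin, matching the arithmetic in the comment.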
