Hi all, I am wondering if we should modify the split planning logic to schedule the most expensive tasks first.
For example, suppose we have one 128 MB file and two 65 MB files. If the max split size is 128 MB, we will have 3 tasks. Now let’s assume we have only two executors, and that it takes X amount of time to process 65 MB and 2X to process 128 MB. Right now, the biggest task can be scheduled last: the two 65 MB tasks run in parallel first, and the 128 MB task starts only once one of them finishes. Consequently, the overall time will be X + 2X = 3X. If we schedule the most expensive task first, it runs on one executor while the other processes both 65 MB tasks back to back, so the overall runtime will be just 2X.

I haven’t thought a lot about this, but one option is to sort all files by size in descending order before packing them into bins during split planning. This has its own trade-offs, however. Another idea is to modify PackingIterator to return the biggest bin instead of the first one when the number of bins exceeds `lookback`.

What does the community think?

Thanks,
Anton
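
To make the timing argument above concrete, here is a minimal Python sketch. It is not Iceberg code and does not model the real Spark scheduler: it assumes a simple greedy scheduler that hands each task, in planning order, to whichever executor frees up first. The function name `makespan` and the unit cost `X = 1` are made up for illustration.

```python
import heapq


def makespan(task_costs, num_executors):
    """Total runtime when tasks are assigned, in the given order,
    to whichever executor becomes free first (greedy list scheduling)."""
    if num_executors < 1:
        raise ValueError("need at least one executor")
    # Min-heap of the times at which each executor becomes free.
    free_at = [0] * num_executors
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)         # earliest available executor
        heapq.heappush(free_at, start + cost)  # busy until start + cost
    return max(free_at)


X = 1  # assumed time to process one 65 MB task

# Planning order from the example: the two 65 MB tasks, then the 128 MB task.
planned = [X, X, 2 * X]
print(makespan(planned, 2))                        # 3, i.e. X + 2X

# Same tasks, most expensive first.
print(makespan(sorted(planned, reverse=True), 2))  # 2, i.e. 2X
```

Under these assumptions, ordering only changes when the large task starts, not the total amount of work. The sketch says nothing about the trade-offs of sorting during bin packing itself.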