Hi all,

I am wondering if we should modify the split planning logic to schedule the 
most expensive tasks first.

For example, say we have one 128 MB file and two 65 MB files. If the max split 
size is 128 MB, we will have 3 tasks. Now let’s assume we have only two executors 
and it takes X amount of time to process 65 MB and 2X to process 128 MB. Right 
now, the biggest task can be scheduled last: the two 65 MB tasks run first, and 
the 128 MB task starts only once one of them finishes. Consequently, the overall 
time will be X + 2X = 3X. If we schedule the most expensive task first, the two 
small tasks run back to back on the other executor and the overall runtime will 
be just 2X.
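
To make the arithmetic concrete, here is a rough Python sketch of the example. 
It assumes a simple greedy scheduler in which each task goes to whichever 
executor frees up first; `makespan` is a made-up helper name and this is not 
how the real scheduler is implemented.

```python
import heapq

def makespan(task_costs, num_executors):
    """Total runtime when tasks are handed out in the given order.

    Assumption: each task is assigned to the executor that becomes free first.
    """
    free_at = [0] * num_executors  # time at which each executor becomes idle
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)      # earliest available executor
        heapq.heappush(free_at, start + cost)
    return max(free_at)

X = 1
small, big = X, 2 * X  # 65 MB task and 128 MB task

print(makespan([small, small, big], 2))  # biggest last  -> 3 (X + 2X)
print(makespan([big, small, small], 2))  # biggest first -> 2 (2X)
```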

I haven’t thought a lot about this, but one option is to sort all files by their 
size in descending order before packing into bins during split planning. This 
has its own trade-offs, however.

Another idea is to modify PackingIterator to return the biggest bin instead of 
the first one when the number of bins exceeds `lookback`.
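
A minimal sketch of that second idea, in Python for brevity. This is a 
simplified model of lookback-based packing, not the actual PackingIterator 
code: `pack`, `_pick` and the `largest_first` flag are hypothetical names, and 
I am assuming first-fit placement into the open bins.

```python
def _pick(bins, largest_first):
    """Index of the bin to emit: the oldest one, or the heaviest if requested."""
    if not largest_first:
        return 0
    return max(range(len(bins)), key=lambda i: sum(bins[i]))

def pack(sizes, target, lookback, largest_first=False):
    """Yield bins of file sizes, keeping at most `lookback` bins open."""
    bins = []  # open bins, oldest first
    for size in sizes:
        for b in bins:
            if sum(b) + size <= target:  # first open bin with enough room
                b.append(size)
                break
        else:
            bins.append([size])          # nothing fits, open a new bin
            if len(bins) > lookback:
                yield bins.pop(_pick(bins, largest_first))
    while bins:                          # flush whatever is still open
        yield bins.pop(_pick(bins, largest_first))

sizes = [65, 65, 128]  # MB, as in the example above
print(list(pack(sizes, target=128, lookback=2)))
# [[65], [65], [128]]
print(list(pack(sizes, target=128, lookback=2, largest_first=True)))
# [[128], [65], [65]]
```

Note that the sketch also emits the heaviest remaining bin first when flushing 
at the end; whether we would want that is part of the question.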

What does the community think?

Thanks,
Anton
