Rajesh Balamohan created SPARK-13059:
----------------------------------------
Summary: Sort inputsplits by size in HadoopRDD to avoid long tails
Key: SPARK-13059
URL: https://issues.apache.org/jira/browse/SPARK-13059
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Rajesh Balamohan
HadoopRDD.getPartitions calls getSplits on the InputFormat and returns the
resulting HadoopPartitions. The generated input splits are not always of equal
size, and some can be much smaller than others. If the larger splits are
scheduled at the end of the job, the job can end up with a long tail. Sorting
the input splits by size in descending order would schedule the larger splits
first. This could also help speculation.
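
To illustrate the long-tail effect described above, here is a minimal simulation
sketch. It is not Spark code: it assumes that task time is proportional to split
size and that each task is handed to whichever slot frees up first. The function
name, the split sizes, and the slot count are all hypothetical.

```python
import heapq

def makespan(split_sizes, slots):
    """Simulate greedy scheduling and return the job finish time.

    Assumption: a task's duration equals its split size, and each split
    is assigned, in list order, to the slot that becomes free earliest.
    """
    free_at = [0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    for size in split_sizes:
        start = heapq.heappop(free_at)      # earliest available slot
        heapq.heappush(free_at, start + size)
    return max(free_at)

# Hypothetical split sizes: six small splits, with one large split last.
splits = [1, 1, 1, 1, 1, 1, 8]

unsorted_time = makespan(splits, slots=2)
sorted_time = makespan(sorted(splits, reverse=True), slots=2)

print("original order:  ", unsorted_time)
print("descending order:", sorted_time)
```

In this toy case the large split, when scheduled last, runs alone while the
other slot sits idle; scheduling it first lets the small splits fill the
remaining slot in parallel. Real jobs may behave differently, since locality
and per-task overhead also affect scheduling.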