[ https://issues.apache.org/jira/browse/SPARK-13059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-13059:
---------------------------------
Labels: bulk-closed (was: )
> Sort inputsplits by size in HadoopRDD to avoid long tails
> ---------------------------------------------------------
>
> Key: SPARK-13059
> URL: https://issues.apache.org/jira/browse/SPARK-13059
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: bulk-closed
>
> HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns
> the resulting splits as HadoopPartitions. The generated input splits are
> not always of equal size, and some splits can be much smaller than others.
> If the larger splits are scheduled near the end of the job, the job can
> finish with a long tail. Sorting the input splits by size (in descending
> order) would schedule the larger splits first. This could also help
> speculation.
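
The long-tail effect described above can be illustrated with a small simulation. This is only a sketch, not Spark code: it assumes task runtime is proportional to split size, that tasks are dispatched in list order to whichever slot frees up first, and it uses hypothetical split sizes and a hypothetical helper name (makespan).

```python
import heapq


def makespan(split_sizes, num_slots):
    """Return the job finish time when splits are dispatched in list order,
    each to the slot that becomes free earliest (runtime assumed == size)."""
    slots = [0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for size in split_sizes:
        free_at = heapq.heappop(slots)          # earliest-free slot
        heapq.heappush(slots, free_at + size)   # slot is busy until then
    return max(slots)


# Hypothetical split sizes in arbitrary units; the one large split comes last.
splits = [1, 1, 1, 1, 1, 1, 8]

unsorted_time = makespan(splits, 2)
sorted_time = makespan(sorted(splits, reverse=True), 2)

print("original order:  ", unsorted_time)
print("descending order:", sorted_time)
```

Under these assumptions, the large split started last keeps one slot busy long after the other has gone idle, while starting it first lets the small splits fill in alongside it.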
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)