Rajesh Balamohan created SPARK-13059:
----------------------------------------

             Summary: Sort inputsplits by size in HadoopRDD to avoid long tails
                 Key: SPARK-13059
                 URL: https://issues.apache.org/jira/browse/SPARK-13059
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Rajesh Balamohan


HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns the 
resulting splits as HadoopPartitions.  The generated input splits are not 
always of equal size, and some can be much smaller than others.  If the larger 
splits are scheduled at the end of the job, the job can end up with a long 
tail.  Sorting the input splits by size (in descending order) would schedule 
the larger splits first.  This could also help with speculation.
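
The long-tail effect can be illustrated with a small simulation. This is 
only a sketch and not Spark code: it assumes that task run time is 
proportional to split size, that a fixed number of task slots is available, 
and that tasks are launched in partition order. The function name makespan, 
the split sizes, and the slot count are all made up for illustration.

```python
import heapq


def makespan(split_sizes, num_slots):
    """Greedy list scheduling: each split runs on the slot that frees up first.

    Assumes (hypothetically) that run time is proportional to split size.
    Returns the time at which the last slot finishes.
    """
    finish_times = [0] * num_slots
    heapq.heapify(finish_times)
    for size in split_sizes:
        earliest = heapq.heappop(finish_times)  # next slot to become idle
        heapq.heappush(finish_times, earliest + size)
    return max(finish_times)


# Hypothetical split sizes; the single large split happens to come last.
splits = [16, 16, 16, 16, 16, 16, 16, 16, 128]
slots = 4

print(makespan(splits, slots))                        # 160
print(makespan(sorted(splits, reverse=True), slots))  # 128
```

In the original order the large split starts only after every slot has 
already processed two small splits, so the job finishes at 160.  With the 
splits sorted in descending order the large split starts immediately and 
the small splits fill the remaining slots around it, so the job finishes 
at 128, which is the run time of the large split itself.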




