[jira] [Updated] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

Hyukjin Kwon (JIRA) Mon, 20 May 2019 21:51:59 -0700


     [ https://issues.apache.org/jira/browse/SPARK-13059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon updated SPARK-13059:
---------------------------------
    Labels: bulk-closed  (was: )

> Sort inputsplits by size in HadoopRDD to avoid long tails
> ---------------------------------------------------------
>
>                 Key: SPARK-13059
>                 URL: https://issues.apache.org/jira/browse/SPARK-13059
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: bulk-closed
>
> HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns the
> resulting HadoopPartitions. The generated input splits are not always of
> equal size, and some can be much smaller than others. If the larger splits
> are scheduled at the end of the job, the job can develop a long tail.
> Sorting the input splits by size in descending order would let the larger
> splits be scheduled first. This could also help with speculation.
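
The scheduling effect described above can be illustrated with a small simulation. This is only a sketch, not Spark code: it assumes task duration is proportional to split size, that each task goes to whichever slot frees up first, and that tasks are launched in list order. The names `makespan` and `sort_splits_descending` and the split sizes are hypothetical.

```python
import heapq


def makespan(split_sizes, num_slots):
    """Finish time of the last slot under greedy, in-order scheduling.

    Assumption: a split's size stands in for its task duration.
    """
    # Each heap entry is the time at which a slot becomes free.
    slots = [0] * num_slots
    heapq.heapify(slots)
    for size in split_sizes:
        # Give the next split to the slot that frees up earliest.
        earliest_free = heapq.heappop(slots)
        heapq.heappush(slots, earliest_free + size)
    return max(slots)


def sort_splits_descending(split_sizes):
    """Order splits largest first, as the issue proposes."""
    return sorted(split_sizes, reverse=True)


# Hypothetical split sizes (arbitrary units): one large split arrives last.
splits = [1, 1, 1, 1, 1, 1, 8]

unsorted_time = makespan(splits, num_slots=2)
sorted_time = makespan(sort_splits_descending(splits), num_slots=2)
print(unsorted_time, sorted_time)  # prints "11 8"
```

With the large split last, one slot sits idle while the other works through it alone; with the large split first, the small splits fill the other slot in the meantime. The gain depends on the size distribution, and sorting does not always change the result.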



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)