[jira] [Commented] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

holdenk (JIRA) Mon, 01 Feb 2016 14:20:19 -0800

    [ https://issues.apache.org/jira/browse/SPARK-13059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127153#comment-15127153 ]


holdenk commented on SPARK-13059:
---------------------------------

This sounds interesting - although having first() and take(1) still work as 
expected seems important too.
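
To make the concern concrete, here is a minimal sketch in plain Python (not Spark code). It assumes, for illustration only, that take(n) reads partitions in index order and that sorting would renumber the partitions. The split tuples and the take() helper are hypothetical stand-ins, not Spark APIs.

```python
# Hypothetical splits: (position in file order, size in bytes, records).
splits = [
    (0, 10, ["header"]),
    (1, 90, ["row-1", "row-2"]),
    (2, 40, ["row-3"]),
]


def take(partitions, n):
    """Collect the first n records, reading partitions in the given order."""
    out = []
    for _, _, records in partitions:
        for record in records:
            if len(out) == n:
                return out
            out.append(record)
    return out


# Partitions in file order: the first record of the input comes back.
in_file_order = take(splits, 1)

# Partitions reordered by descending size: a different record comes back.
by_size = sorted(splits, key=lambda s: s[1], reverse=True)
in_size_order = take(by_size, 1)
```

Under these assumptions, in_file_order holds the first record of the input while in_size_order holds the first record of the largest split, which is the behaviour change worth guarding against.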

> Sort inputsplits by size in HadoopRDD to avoid long tails
> ---------------------------------------------------------
>
>                 Key: SPARK-13059
>                 URL: https://issues.apache.org/jira/browse/SPARK-13059
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Rajesh Balamohan
>
> HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns 
> the splits as HadoopPartition objects.  The generated input splits are not 
> always of equal size, and some can be much smaller than others.  If the 
> bigger splits are scheduled at the end of the job, the job can end up with 
> a long tail.  Sorting the input splits by size (in descending order) would 
> schedule the larger splits up front.  This could also help with 
> speculation.
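
The long-tail effect described above can be illustrated with a small simulation in plain Python (not Spark code). It is only a sketch under stated assumptions: task time is taken to be proportional to split size, each task goes to whichever slot frees up first, and the split sizes and the makespan() helper are hypothetical.

```python
import heapq


def makespan(split_sizes, num_slots):
    """Finish time of the last task when each split, taken in the given
    order, is assigned to the slot that becomes free earliest."""
    slots = [0] * num_slots          # time at which each slot becomes free
    heapq.heapify(slots)
    for size in split_sizes:
        earliest = heapq.heappop(slots)
        heapq.heappush(slots, earliest + size)
    return max(slots)


# Hypothetical split sizes, with the one large split listed last.
sizes = [1, 1, 1, 1, 1, 1, 8]

original = makespan(sizes, 2)                           # large split runs last
sorted_desc = makespan(sorted(sizes, reverse=True), 2)  # large split runs first
```

With these made-up sizes and two slots, the original order finishes at time 11 because the large split starts only after the small ones are done, while the descending order finishes at time 8, which is the size of the largest split and therefore the best possible.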



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
