Rajesh Balamohan created SPARK-13059:
----------------------------------------
             Summary: Sort inputsplits by size in HadoopRDD to avoid long tails
                 Key: SPARK-13059
                 URL: https://issues.apache.org/jira/browse/SPARK-13059
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Rajesh Balamohan

HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns the resulting splits as HadoopPartitions. The input splits generated are not always of equal size, and some can be much smaller than others. If the larger splits are scheduled at the end of the job, the job can end up with a long tail. Sorting the input splits by size in descending order would help schedule the larger splits first. This could also help with speculation.
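To illustrate the long-tail effect described above, here is a minimal simulation sketch. It is not Spark's scheduler: it assumes task run time is proportional to split size and that each split is handed to whichever slot frees up first. The function name makespan, the split sizes, and the slot count are all hypothetical, chosen only for illustration.

```python
import heapq

def makespan(split_sizes, num_slots):
    """Simulate greedy scheduling: each split goes to the slot that frees up first.

    Returns the finish time of the last task, assuming run time is
    proportional to split size (an assumption of this sketch).
    """
    slots = [0] * num_slots  # finish time per slot; a list of zeros is a valid heap
    for size in split_sizes:
        earliest = heapq.heappop(slots)
        heapq.heappush(slots, earliest + size)
    return max(slots)

# Hypothetical split sizes in MB: many small splits, with one large split last.
splits = [16, 16, 16, 16, 16, 16, 16, 16, 128]
by_size_desc = sorted(splits, reverse=True)

unsorted_makespan = makespan(splits, num_slots=4)
sorted_makespan = makespan(by_size_desc, num_slots=4)
print(unsorted_makespan, sorted_makespan)  # prints "160 128"
```

With the large split last, it only starts after the small splits have filled every slot, so the job finishes at 160 units. With the splits sorted in descending order, the large split starts immediately and the small ones pack around it, so the job finishes at 128 units, which is the size of the largest split and therefore the best achievable under these assumptions.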