[
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066539#comment-14066539
]
Mark Hamstra commented on SPARK-2568:
-------------------------------------
At least as much of a problem as the three passes through the data is that the
count and the sample run as separate, hidden jobs inside the RangePartitioner;
they are not launched by RDD actions under the user's control. This not only
breaks Spark's "transformations are lazy; jobs are only launched by actions"
model, but also messes up the construction of FutureActions on sorted RDDs,
the accounting of resource usage for jobs that include a sort, etc.
> RangePartitioner should go through the data only once
> -----------------------------------------------------
>
> Key: SPARK-2568
> URL: https://issues.apache.org/jira/browse/SPARK-2568
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Reynold Xin
> Assignee: Xiangrui Meng
>
> As of Spark 1.0, RangePartitioner goes through data twice: once to compute
> the count and once to do sampling. As a result, to do sortByKey, Spark goes
> through data 3 times (once to count, once to sample, and once to sort).
> RangePartitioner should go through data only once (remove the count step).
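
One way to remove the count step is to let the sampling pass count elements as
it goes. The following is a minimal sketch in plain Python, not Spark code, and
not necessarily how the issue gets resolved. It assumes reservoir sampling is
an acceptable sampler; reservoir_sample_and_count and the toy partitions are
hypothetical names used only for illustration.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample_and_count(
    items: Iterable[T], k: int, rng: random.Random
) -> Tuple[List[T], int]:
    """Hypothetical helper: sample up to k items and count them in ONE pass."""
    reservoir: List[T] = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count by overwriting
            # a uniformly chosen slot.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


# Toy stand-ins for RDD partitions (assumption: each is iterated exactly once).
partitions = [range(0, 1000), range(1000, 1010)]
rng = random.Random(42)

# One pass per partition yields both the sample and the partition size.
sketches = [reservoir_sample_and_count(p, 20, rng) for p in partitions]
total_count = sum(n for _, n in sketches)
```

Each partition is read once, and the per-partition counts come back together
with the samples, so no separate counting pass is needed before choosing range
bounds. This addresses only the number of passes; it does not by itself change
when that pass is launched.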