[
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080935#comment-14080935
]
Apache Spark commented on SPARK-1021:
-------------------------------------
User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/1689
> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 0.8.0, 0.9.0
> Reporter: Andrew Ash
> Assignee: Mark Hamstra
> Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the
> documentation. But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would
> fix this
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
> We'd need to make sure that rangeBounds() is never called before an action
> is performed. This could be tricky because it's called in the
> RangePartitioner.equals() method. Maybe it's sufficient to just compare the
> number of partitions, the ids of the RDDs used to create the
> RangePartitioner, and the sort ordering. This still supports the case where
> I range-partition one RDD and pass the same partitioner to a different RDD.
> It breaks support for the case where two range partitioners created on
> different RDDs happened to have the same rangeBounds(), but it seems unlikely
> that this would really harm performance since it's probably unlikely that the
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen? I'll send a PR on GitHub to start the
> discussion and testing.
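
A minimal sketch of the idea quoted above, in plain Scala with no Spark
dependency. It is not Spark's RangePartitioner and not the code in the linked
pull request. The class name LazyRangePartitioner, the sourceId field (standing
in for the id of the RDD the partitioner was built from), the sample function
(standing in for the cluster job that counts and samples the RDD), and the
boundsComputations counter are all hypothetical names introduced for
illustration. The sketch shows two things: the bounds are computed on first use
rather than at construction, and equals() compares only the partition count,
the source id, and the sort direction, so it never forces the bounds.

```scala
// Hypothetical stand-in for a range partitioner; NOT Spark's actual class.
// `sample` is an assumption: it stands in for the expensive cluster job.
final class LazyRangePartitioner(
    val numPartitions: Int,
    val sourceId: Int,
    val ascending: Boolean,
    sample: () => Seq[Int]) {

  // Counts how often the expensive computation runs (illustration only).
  var boundsComputations: Int = 0

  // lazy val: the body runs on first access, not when the object is built.
  lazy val rangeBounds: Array[Int] = {
    boundsComputations += 1
    val sorted = sample().sorted
    if (numPartitions <= 1 || sorted.isEmpty) Array.empty[Int]
    else {
      // Pick numPartitions - 1 roughly evenly spaced split points.
      (1 until numPartitions).map { i =>
        sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions))
      }.toArray
    }
  }

  // First real use of the partitioner; this is what forces rangeBounds.
  def getPartition(key: Int): Int = {
    val idx = rangeBounds.count(key > _) // number of bounds below the key
    if (ascending) idx else rangeBounds.length - idx
  }

  // Equality deliberately avoids touching rangeBounds.
  override def equals(other: Any): Boolean = other match {
    case p: LazyRangePartitioner =>
      p.numPartitions == numPartitions &&
        p.sourceId == sourceId &&
        p.ascending == ascending
    case _ => false
  }

  override def hashCode(): Int = (numPartitions, sourceId, ascending).hashCode()
}
```

With this shape, constructing two partitioners over the same source and
comparing them leaves boundsComputations at 0; the counter only moves to 1 on
the first getPartition call. As the quoted comment notes, the trade-off is that
two partitioners built from different sources no longer compare equal even if
their bounds happen to match.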
--
This message was sent by Atlassian JIRA
(v6.2#6252)