[
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin resolved SPARK-1021.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.2.0
> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
> Reporter: Andrew Ash
> Assignee: Erik Erlandson
> Labels: starter
> Fix For: 1.2.0
>
>
> The sortByKey() method is listed as a transformation, not an action, in the
> documentation. But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would
> fix this
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
> We'd need to make sure that rangeBounds() is never called before an action
> is performed. This could be tricky because it's called in the
> RangePartitioner.equals() method. Maybe it's sufficient to just compare the
> number of partitions, the ids of the RDDs used to create the
> RangePartitioner, and the sort ordering. This still supports the case where
> I range-partition one RDD and pass the same partitioner to a different RDD.
> It breaks support for the case where two range partitioners created on
> different RDDs happened to have the same rangeBounds(), but it seems unlikely
> that this would really harm performance since it's probably unlikely that the
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen? I'll send a PR on GitHub to start the
> discussion and testing.
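
The proposal quoted above can be illustrated with a small sketch in plain Python, without Spark. It is not Spark's RangePartitioner: the class name, the `dataset_id` field (standing in for an RDD id), the `sample_keys` callable (standing in for the job that samples keys), and the `jobs_launched` counter are all hypothetical names introduced for illustration. The sketch only shows the two ideas under discussion: computing the range bounds lazily on first use, and defining equality without touching those bounds.

```python
from bisect import bisect_left
from functools import cached_property


class LazyRangePartitioner:
    """Toy stand-in for a range partitioner; not Spark's actual class."""

    def __init__(self, num_partitions, dataset_id, sample_keys, ascending=True):
        self.num_partitions = num_partitions
        self.dataset_id = dataset_id        # hypothetical stand-in for an RDD id
        self.ascending = ascending
        self._sample_keys = sample_keys     # zero-argument callable; stands in for the cluster job
        self.jobs_launched = 0              # counts how often the expensive step ran

    @cached_property
    def range_bounds(self):
        # Evaluated at most once, on first access (the analogue of a lazy val).
        self.jobs_launched += 1
        keys = sorted(self._sample_keys())
        n = self.num_partitions
        if n <= 1 or not keys:
            return []
        # Pick n - 1 evenly spaced keys as upper bounds of the first n - 1 partitions.
        return [keys[(i * len(keys)) // n] for i in range(1, n)]

    def get_partition(self, key):
        # Number of bounds strictly below the key gives the ascending partition index.
        bounds = self.range_bounds
        index = bisect_left(bounds, key)
        return index if self.ascending else len(bounds) - index

    def __eq__(self, other):
        # Compare only cheap metadata so that equality never forces the bounds.
        if not isinstance(other, LazyRangePartitioner):
            return NotImplemented
        return (
            self.num_partitions == other.num_partitions
            and self.dataset_id == other.dataset_id
            and self.ascending == other.ascending
        )

    def __hash__(self):
        return hash((self.num_partitions, self.dataset_id, self.ascending))
```

In this sketch, constructing two partitioners and comparing them leaves `jobs_launched` at 0 on both; the sampling step runs only when `get_partition` or `range_bounds` is first used, and later calls reuse the cached bounds. As the quoted comment notes, the trade-off is that two partitioners built on different datasets compare as unequal even if their bounds happen to coincide.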