[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

Marcelo Vanzin (JIRA) Wed, 09 Apr 2014 13:59:35 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964658#comment-13964658
 ]


Marcelo Vanzin commented on SPARK-1021:
---------------------------------------

I actually tried this, and simply turning the {{rangeBounds}} variable into a 
lazy val doesn't work. The variable is then evaluated only when the 
transformation is executed on the worker nodes, and at that point you can't 
execute actions (which are needed to compute {{rangeBounds}}).
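
To make the timing issue concrete, here is a minimal, Spark-free sketch of what 
a lazy {{rangeBounds}} changes. Every name in it ({{LazyBoundsSketch}}, 
{{sampleBounds}}, {{EagerPartitioner}}, {{LazyPartitioner}}) is hypothetical, 
and the counter only stands in for the cluster job. It is not the real 
RangePartitioner code.

```scala
object LazyBoundsSketch {
  // Stand-in for the cluster job (sampling / rdd.count()) that computes the bounds.
  var jobsLaunched = 0

  def sampleBounds(): Array[Int] = {
    jobsLaunched += 1
    Array(10, 20, 30)
  }

  // Eager: the bounds are computed in the constructor, i.e. as soon as the
  // partitioner is created by the transformation.
  class EagerPartitioner {
    val rangeBounds: Array[Int] = sampleBounds()
  }

  // Lazy: the bounds are computed on first access, wherever that access happens.
  class LazyPartitioner {
    lazy val rangeBounds: Array[Int] = sampleBounds()

    // Simplified lookup: the partition index is the number of bounds below the key.
    def getPartition(key: Int): Int = rangeBounds.count(_ < key)
  }

  def main(args: Array[String]): Unit = {
    val eager = new EagerPartitioner
    println(s"after eager construction: $jobsLaunched job(s)")

    val deferred = new LazyPartitioner
    println(s"after lazy construction:  $jobsLaunched job(s)")

    deferred.getPartition(15) // first access forces the lazy val
    println(s"after first getPartition: $jobsLaunched job(s)")
  }
}
```

In this sketch the first access is an ordinary local call. In Spark, the first 
access would be a {{getPartition}} call made from a task on a worker, which is 
exactly where an action cannot be run.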

One way to work around this would be a hook that is evaluated on each RDD when 
the scheduler walks the graph, before it submits jobs to the workers. I'm not 
aware of such functionality in the code, though. Or maybe there's something 
cleaner that can be done here?

> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
>                 Key: SPARK-1021
>                 URL: https://issues.apache.org/jira/browse/SPARK-1021
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Andrew Ash
>              Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.
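
The equality check proposed in the quoted description (compare the number of 
partitions, the id of the source RDD, and the sort ordering, without touching 
{{rangeBounds}}) could look roughly like the sketch below. The class 
{{SketchRangePartitioner}}, its {{sourceRddId}} field, and the Boolean 
{{ascending}} flag are assumptions made for illustration. This is not the 
actual Spark implementation.

```scala
object EqualsSketch {
  // Counts how often the (stand-in) bounds computation runs.
  var boundsComputed = 0

  class SketchRangePartitioner(
      val numPartitions: Int,
      val sourceRddId: Int,   // hypothetical: id of the RDD the partitioner was built from
      val ascending: Boolean  // simplified stand-in for the sort ordering
  ) {
    // Deferred and never forced by equals() or hashCode().
    lazy val rangeBounds: Array[Int] = {
      boundsComputed += 1
      Array(10, 20, 30).take(numPartitions - 1)
    }

    override def equals(other: Any): Boolean = other match {
      case p: SketchRangePartitioner =>
        p.numPartitions == numPartitions &&
          p.sourceRddId == sourceRddId &&
          p.ascending == ascending
      case _ => false
    }

    override def hashCode(): Int = (numPartitions, sourceRddId, ascending).hashCode()
  }

  def main(args: Array[String]): Unit = {
    val a = new SketchRangePartitioner(4, sourceRddId = 7, ascending = true)
    val b = new SketchRangePartitioner(4, sourceRddId = 7, ascending = true)
    val c = new SketchRangePartitioner(4, sourceRddId = 8, ascending = true)

    println(a == b)          // same source RDD, same settings
    println(a == c)          // different source RDD
    println(boundsComputed)  // the comparisons never read rangeBounds
  }
}
```

As the quote notes, two partitioners built from different RDDs compare as 
unequal here even if their bounds happen to match.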



