[jira] [Comment Edited] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior

Ilya Ganelin (JIRA) Wed, 11 Feb 2015 17:46:34 -0800

    [ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317334#comment-14317334 ]


Ilya Ganelin edited comment on SPARK-4423 at 2/12/15 1:46 AM:
--------------------------------------------------------------

Hi [~pwendell] and [~joshrosen], how would you feel about my adding a section 
to the Spark Programming Guide that covers this issue specifically: local 
execution on the driver (in {{local}} mode) versus the division of labor 
between the driver and the executors (in {{cluster}} mode)? In particular, I'd 
like to discuss where the data that the executors operate on actually lives. 
This is also useful for performance tuning, for example using mapPartitions 
to avoid shuffle operations, since it ties in with how data is aggregated on 
the executors. 

This section could be referenced from the shorter descriptions of foreach, 
map, mapPartitions, mapPartitionsWithIndex, and flatMap, or whatever other set 
of operators we care about.




was (Author: ilganeli):
Hi [~pwendell] and [~joshrosen], how would you feel about my adding a section 
to the Spark Programming Guide that covers this issue specifically: local 
execution on the driver (in {{local}} mode) versus the division of labor 
between the driver and the executors (in {{cluster}} mode)? This is a little 
unintuitive, and understanding it is vital to understanding Spark. It is also 
useful for performance tuning (for example using mapPartitions to avoid 
shuffle operations). 

This section could be referenced from the shorter descriptions of foreach, 
map, mapPartitions, mapPartitionsWithIndex, and flatMap, or whatever other set 
of operators we care about.



> Improve foreach() documentation to avoid confusion between local- and 
> cluster-mode behavior
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4423
>                 URL: https://issues.apache.org/jira/browse/SPARK-4423
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Josh Rosen
>            Assignee: Ilya Ganelin
>
> {{foreach}} seems to be a common source of confusion for new users: in 
> {{local}} mode, {{foreach}} can be used to update local variables on the 
> driver, but programs that do this will not work properly when executed on 
> clusters, since the {{foreach}} will update per-executor variables (note that 
> this _will_ work correctly for accumulators, but not for other types of 
> mutable objects).
> Similarly, I've seen users become confused when {{.foreach(println)}} doesn't 
> print to the driver's standard output.
> At a minimum, we should improve the documentation to warn users against 
> unsafe uses of {{foreach}} that won't work properly when transitioning from 
> local mode to a real cluster.
> We might also consider changes to local mode so that its behavior more 
> closely matches the cluster modes; this will require some discussion, though, 
> since any change of behavior here would technically be a user-visible 
> backwards-incompatible change (I don't think that we made any explicit 
> guarantees about the current local-mode behavior, but someone might be 
> relying on the current implicit behavior).
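
The difference described in the issue can be illustrated with a small sketch. 
This is plain Python that only simulates the behavior; it does not use Spark. 
It assumes that in cluster mode each task works on its own copy of any 
variable captured by the function passed to foreach, and that an accumulator 
behaves like per-task partial results merged back on the driver. The names 
Counter, foreach_local, foreach_cluster, and foreach_with_accumulator are 
hypothetical and are not part of any Spark API.

```python
import copy


class Counter:
    """A plain mutable object, standing in for a driver-side variable."""

    def __init__(self):
        self.value = 0


def foreach_local(data, counter):
    # Local mode, as described in the issue: the function runs in the same
    # process as the driver, so it mutates the driver's own object.
    for x in data:
        counter.value += x


def foreach_cluster(partitions, counter):
    # Cluster mode, simulated: each task works on its own copy of the
    # captured variable, and that copy is discarded when the task ends.
    for part in partitions:
        task_copy = copy.deepcopy(counter)
        for x in part:
            task_copy.value += x


def foreach_with_accumulator(partitions):
    # Accumulator-style, simulated: each task produces a partial result
    # that is explicitly sent back and merged on the driver.
    partials = [sum(part) for part in partitions]
    return sum(partials)


data = [1, 2, 3, 4]
partitions = [[1, 2], [3, 4]]  # the same data, split across two "tasks"

local_counter = Counter()
foreach_local(data, local_counter)

cluster_counter = Counter()
foreach_cluster(partitions, cluster_counter)

accumulated = foreach_with_accumulator(partitions)

print(local_counter.value, cluster_counter.value, accumulated)
# prints: 10 0 10
```

In the simulated cluster case the driver's counter stays at 0 because only 
the per-task copies were updated, which is the kind of surprise the issue 
asks the documentation to warn about.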



