Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/10385#discussion_r48314002
--- Diff: docs/programming-guide.md ---
@@ -806,7 +806,7 @@ However, in `cluster` mode, what happens is more complicated, and the above may
What is happening here is that the variables within the closure sent to
each executor are now copies, and thus, when **counter** is referenced within
the `foreach` function, it is no longer the **counter** on the driver node.
There is still a **counter** in the memory of the driver node but this is no
longer visible to the executors! The executors only see the copy from the
serialized closure. Thus, the final value of **counter** will still be zero
since all operations on **counter** were referencing the value within the
serialized closure.
-To ensure well-defined behavior in these sorts of scenarios one should use
an [`Accumulator`](#AccumLink). Accumulators in Spark are used specifically to
provide a mechanism for safely updating a variable when execution is split up
across worker nodes in a cluster. The Accumulators section of this guide
discusses these in more detail.
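
The copy-on-serialization behavior described above can be sketched without Spark at all. The snippet below is an illustrative simulation (the function names and the dict-based "state" are invented for this sketch, not Spark APIs): each simulated "executor" receives a deep copy of the driver-side state, just as a serialized closure would, so increments on the copy never reach the driver. The accumulator-style fix at the end has executors return local deltas that the driver merges, which is the safe pattern the `Accumulator` API provides.

```python
import copy

# Driver-side state and input data (hypothetical names for this sketch).
counter = {"value": 0}
data = [1, 2, 3, 4, 5]

def run_on_executor(partition, driver_state):
    # Simulates closure serialization: the executor works on a copy,
    # so mutations are invisible to the driver.
    state = copy.deepcopy(driver_state)
    for x in partition:
        state["value"] += x
    return state["value"]

partitions = [data[:2], data[2:]]

# Broken pattern: mutate the (copied) closure variable and discard the result.
for p in partitions:
    run_on_executor(p, counter)
print(counter["value"])  # still 0: only the copies were updated

# Accumulator-style fix: each executor computes a local delta,
# and the driver merges the deltas.
total = sum(run_on_executor(p, {"value": 0}) for p in partitions)
print(total)  # 15
```

The key design point is that the merge happens on the driver, which is exactly what Spark's `Accumulator` does for you under the hood.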
--- End diff --
nice!