Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/10385#discussion_r48314002
--- Diff: docs/programming-guide.md ---
@@ -806,7 +806,7 @@ However, in `cluster` mode, what happens is more complicated, and the above may
What is happening here is that the variables within the closure sent to
each executor are now copies, and thus, when **counter** is referenced within
the `foreach` function, it is no longer the **counter** on the driver node.
There is still a **counter** in the memory of the driver node but this is no
longer visible to the executors! The executors only see the copy from the
serialized closure. Thus, the final value of **counter** will still be zero
since all operations on **counter** were referencing the value within the
serialized closure.
-To ensure well-defined behavior in these sorts of scenarios one should use
an [`Accumulator`](#AccumLink). Accumulators in Spark are used specifically to
provide a mechanism for safely updating a variable when execution is split up
across worker nodes in a cluster. The Accumulators section of this guide
discusses these in more detail.
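
The copy-on-serialization behavior described above can be sketched without Spark at all. The snippet below is an illustrative simulation (the function names and the dict-based "state" are invented for this sketch, not Spark APIs): each simulated "executor" receives a deep copy of the driver-side state, just as a serialized closure would, so increments on the copy never reach the driver. The accumulator-style fix at the end has executors return local deltas that the driver merges, which is the safe pattern the `Accumulator` API provides.

```python
import copy

# Driver-side state and input data (hypothetical names for this sketch).
counter = {"value": 0}
data = [1, 2, 3, 4, 5]

def run_on_executor(partition, driver_state):
    # Simulates closure serialization: the executor works on a copy,
    # so mutations are invisible to the driver.
    state = copy.deepcopy(driver_state)
    for x in partition:
        state["value"] += x
    return state["value"]

partitions = [data[:2], data[2:]]

# Broken pattern: mutate the (copied) closure variable and discard the result.
for p in partitions:
    run_on_executor(p, counter)
print(counter["value"])  # still 0: only the copies were updated

# Accumulator-style fix: each executor computes a local delta,
# and the driver merges the deltas.
total = sum(run_on_executor(p, {"value": 0}) for p in partitions)
print(total)  # 15
```

The key design point is that the merge happens on the driver, which is exactly what Spark's `Accumulator` does for you under the hood.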
--- End diff --
nice!