Hi, Matei,   

Can you give some hint on how the current implementation guarantee the 
accumulator is only applied for once?

There is a pending PR trying to achieving this 
(https://github.com/apache/spark/pull/228/files), but from the current 
implementation, I didn’t see this has been done? (maybe I missed something)

Best,  

--  
Nan Zhu


On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:

> Hey Sandy,
>  
> On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com 
> (mailto:sandy.r...@cloudera.com)) wrote:
>  
> Hey All,  
>  
> A couple questions came up about shared variables recently, and I wanted to  
> confirm my understanding and update the doc to be a little more clear.  
>  
> *Broadcast variables*  
> Now that tasks data is automatically broadcast, the only occasions where it  
> makes sense to explicitly broadcast are:  
> * You want to use a variable from tasks in multiple stages.  
> * You want to have the variable stored on the executors in deserialized  
> form.  
> * You want tasks to be able to modify the variable and have those  
> modifications take effect for other tasks running on the same executor  
> (usually a very bad idea).  
>  
> Is that right?  
> Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> matters. (We might later factor tasks in a different way to avoid 2, but it's 
> hard due to things like Hadoop JobConf objects in the tasks).
>  
>  
> *Accumulators*  
> Values are only counted for successful tasks. Is that right? KMeans seems  
> to use it in this way. What happens if a node goes away and successful  
> tasks need to be resubmitted? Or the stage runs again because a different  
> job needed it.  
> Accumulators are guaranteed to give a deterministic result if you only 
> increment them in actions. For each result stage, the accumulator's update 
> from each task is only applied once, even if that task runs multiple times. 
> If you use accumulators in transformations (i.e. in a stage that may be part 
> of multiple jobs), then you may see multiple updates, from each run. This is 
> kind of confusing but it was useful for people who wanted to use these for 
> debugging.
>  
> Matei
>  
>  
>  
>  
>  
> thanks,  
> Sandy  
>  
>  


Reply via email to