Hey Sandy,

On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:

Hey All, 

A couple questions came up about shared variables recently, and I wanted to 
confirm my understanding and update the doc to be a little more clear. 

*Broadcast variables* 
Now that task data is automatically broadcast, the only occasions where it 
makes sense to explicitly broadcast are: 
* You want to use the same variable in tasks across multiple stages. 
* You want the variable stored on the executors in deserialized form. 
* You want tasks to be able to modify the variable and have those 
modifications take effect for other tasks running on the same executor 
(usually a very bad idea). 

Is that right? 
Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also matters. 
(We might later factor tasks in a different way to avoid 2, but it's hard due 
to things like Hadoop JobConf objects in the tasks).
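
To illustrate reason 1, here's a minimal sketch (variable names and data are hypothetical, and `sc` is an existing SparkContext) of reusing one explicitly broadcast lookup table across two stages:

```scala
// Sketch: one explicitly broadcast table used by tasks in two stages.
// `sc` is an existing SparkContext; data and names are made up.
val lookup = Map("a" -> 1, "b" -> 2)
val bcLookup = sc.broadcast(lookup)   // shipped to each executor once

val rdd = sc.parallelize(Seq("a", "b", "a"))

// Stage 1 (map side of the shuffle) reads the broadcast value...
val mapped = rdd.map(k => (k, bcLookup.value(k)))

// ...and stage 2 (reduce side) reuses the same per-executor copy,
// instead of re-shipping `lookup` inside every task closure.
val weighted = mapped.reduceByKey(_ + _)
                     .map { case (k, v) => (k, v * bcLookup.value(k)) }
weighted.collect()
```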


*Accumulators* 
Values are only counted for successful tasks. Is that right? KMeans seems 
to use them this way. What happens if a node goes away and successful 
tasks need to be resubmitted? Or if the stage runs again because a different 
job needs it? 
Accumulators are guaranteed to give a deterministic result if you only 
increment them in actions. For each result stage, the accumulator's update from 
each task is applied only once, even if that task runs multiple times. If you 
use accumulators in transformations (i.e. in a stage that may be part of 
multiple jobs), then you may see multiple updates, one from each run. This is 
kind of confusing, but it was useful for people who wanted to use accumulators 
for debugging.
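
A sketch of that distinction, using the `sc.accumulator` API of that era (accumulator names and data are hypothetical):

```scala
// Sketch: accumulator updates in a transformation vs. in an action.
// Uses the classic sc.accumulator API; names and data are made up.
val inTransform = sc.accumulator(0)
val inAction = sc.accumulator(0)
val rdd = sc.parallelize(1 to 100)

// Incremented inside a transformation: if this stage is recomputed
// (another job needs it, or cached partitions are lost), the
// increments are applied again, so the final value can overcount.
val doubled = rdd.map { x => inTransform += 1; x * 2 }
doubled.count()

// Incremented inside an action: each task's update is applied only
// once per result stage, even if a failed task is resubmitted, so
// the result is deterministic.
rdd.foreach { x => inAction += 1 }
```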

Matei

thanks, 
Sandy 
