If you think it is necessary to fix this, I would like to resubmit that PR (it 
seems to have some conflicts with the current DAGScheduler)  

My suggestion is to make it an option on the accumulator: some algorithms 
use accumulators for result calculation and need a deterministic 
accumulator, while others, implementing something like Hadoop counters, may 
want the current behavior (count everything that happened, including the 
duplicates)
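
A rough sketch of what that option could look like (the deterministic flag 
here is made up for illustration; it is not in the PR or in the current API):

    // assuming an existing SparkContext sc
    // hypothetical flag: apply each task's update at most once per result stage
    val resultAcc = sc.accumulator(0, deterministic = true)
    // default: Hadoop-counter style, counts re-runs and speculative copies too
    val counterAcc = sc.accumulator(0)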

Your thoughts?  

--  
Nan Zhu


On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:

> Hmm, good point, this seems to have been broken by refactorings of the 
> scheduler, but it worked in the past. Basically the solution is simple -- in 
> a result stage, we should not apply the update for each task ID more than 
> once -- the same way we don't call job.listener.taskSucceeded more than once. 
> Your PR also tried to avoid this for resubmitted shuffle stages, but I don't 
> think we need to do that necessarily (though we could).
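>  
> A minimal sketch of that dedup logic (simplified, not the actual 
> DAGScheduler code):
>  
>     // track which partitions of the result stage have already had their
>     // accumulator updates applied; skip duplicates from re-run tasks
>     val updatedPartitions = scala.collection.mutable.HashSet[Int]()
>  
>     def applyAccumUpdates(partitionId: Int, accumUpdates: Map[Long, Any]): Unit = {
>       if (updatedPartitions.add(partitionId)) {
>         Accumulators.add(accumUpdates)  // first completion of this partition
>       }
>       // otherwise a resubmitted or speculative duplicate: ignore its updates
>     }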
>  
> Matei  
>  
> On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
> (mailto:zhunanmcg...@gmail.com)) wrote:
>  
> > Hi, Matei,  
> >  
> > Can you give some hint on how the current implementation guarantees the 
> > accumulator is only applied once?  
> >  
> > There is a pending PR trying to achieve this 
> > (https://github.com/apache/spark/pull/228/files), but from the current 
> > implementation, I don't see that this has been done (maybe I missed 
> > something) 
> >  
> >  
> > Best,  
> >  
> > --   
> > Nan Zhu
> >  
> >  
> > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
> >  
> > > Hey Sandy,
> > >  
> > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com 
> > > (mailto:sandy.r...@cloudera.com)) wrote:  
> > >  
> > > Hey All,   
> > >  
> > > A couple questions came up about shared variables recently, and I wanted 
> > > to   
> > > confirm my understanding and update the doc to be a little more clear.  
> > >  
> > > *Broadcast variables*   
> > > Now that task data is automatically broadcast, the only occasions where 
> > > it  
> > > makes sense to explicitly broadcast are:  
> > > * You want to use a variable from tasks in multiple stages.  
> > > * You want to have the variable stored on the executors in deserialized  
> > > form.  
> > > * You want tasks to be able to modify the variable and have those  
> > > modifications take effect for other tasks running on the same executor  
> > > (usually a very bad idea).  
> > >  
> > > Is that right?   
> > > Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> > > matters. (We might later factor tasks in a different way to avoid 2, but 
> > > it's hard due to things like Hadoop JobConf objects in the tasks).
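> > >  
> > > For example, a broadcast value reused across two stages (a minimal 
> > > sketch, assuming an existing SparkContext sc):
> > >  
> > >     // broadcast once; every task in both stages reads the same
> > >     // executor-side copy instead of getting its own serialized one
> > >     val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
> > >     val pairs = sc.parallelize(Seq("a", "b", "a")).map(k => (lookup.value(k), k))
> > >     // groupByKey adds a shuffle, so mapValues runs in a second stage,
> > >     // yet it can still read the same broadcast value without re-shipping it
> > >     pairs.groupByKey().mapValues(_.size).collect()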
> > >  
> > >  
> > > *Accumulators*   
> > > Values are only counted for successful tasks. Is that right? KMeans seems 
> > > to use it in this way. What happens if a node goes away and successful 
> > > tasks need to be resubmitted? Or the stage runs again because a different 
> > > job needed it.  
> > > Accumulators are guaranteed to give a deterministic result if you only 
> > > increment them in actions. For each result stage, the accumulator's 
> > > update from each task is only applied once, even if that task runs 
> > > multiple times. If you use accumulators in transformations (i.e. in a 
> > > stage that may be part of multiple jobs), then you may see multiple 
> > > updates, one from each run. This is kind of confusing, but it was useful 
> > > for people who wanted to use these for debugging.
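> > >  
> > > A small illustration (assuming an existing SparkContext sc):
> > >  
> > >     val acc = sc.accumulator(0)
> > >  
> > >     // in an action: each partition's update is applied exactly once,
> > >     // even if tasks are re-run, so acc.value is deterministically 100
> > >     sc.parallelize(1 to 100).foreach(_ => acc += 1)
> > >  
> > >     // in a transformation: the increment happens on every recomputation
> > >     val data = sc.parallelize(1 to 100).map { x => acc += 1; x }
> > >     data.count()
> > >     data.count()  // not cached, so the map re-runs and increments again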
> > >  
> > > Matei  
> > >  
> > > thanks,   
> > > Sandy  
