GitHub user CodingCat commented on the pull request:
https://github.com/apache/spark/pull/228#issuecomment-39641216
Hi @pwendell, thank you for your comments; here is my reply.
First, on "whether the accumulator value should be subject to change in the
case of failure": I think no, even though Spark has run this way for a long
time (this patch actually resolves a very old TODO in DAGScheduler.scala). I
think accumulators are usually used for tasks more complicated than rdd.count
or rdd.filter.count (maybe I'm wrong), e.g. computing the sum of distances
from the points to a candidate center in K-means (see mllib). In such a case,
the health status of the cluster should be transparent to the user, i.e. the
final result of K-means should not depend on whether an executor was lost,
etc.
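To make the failure-semantics point concrete, here is a toy model in pure Python (this is an illustrative sketch, not Spark's actual implementation; the class and function names are invented for the example): the driver counts each partition's accumulator contribution at most once per stage, so a retried task after an executor loss cannot double-count.

```python
# Toy model: deduplicating accumulator updates so the final value
# does not depend on task retries (names are hypothetical).

class DriverAccumulator:
    def __init__(self):
        self.value = 0.0
        self._seen = set()  # (stage_id, partition_id) pairs already counted

    def receive(self, stage_id, partition_id, update):
        # Count each partition's contribution at most once per stage,
        # so a re-executed (retried) task cannot double-count.
        key = (stage_id, partition_id)
        if key in self._seen:
            return
        self._seen.add(key)
        self.value += update

def sum_of_distances(points, center):
    # The K-means-style per-partition computation from the example above.
    return sum(abs(p - center) for p in points)

acc = DriverAccumulator()
partitions = [[1.0, 2.0], [4.0, 8.0]]
for pid, part in enumerate(partitions):
    acc.receive(stage_id=0, partition_id=pid,
                update=sum_of_distances(part, 0.0))

# Simulate a retry of partition 1 after an executor loss:
acc.receive(stage_id=0, partition_id=1,
            update=sum_of_distances(partitions[1], 0.0))

print(acc.value)  # 15.0 with or without the retry
```

With this bookkeeping the user sees the same accumulator value whether or not an executor was lost mid-stage, which is the "transparent to the user" behavior argued for above.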
Second, good point; I understand the use scenario, but would you mind
providing more details on how to implement this in Spark? I guess it could be
addressed by providing an approximateValue API in Accumulator or
SparkContext.
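Purely as a sketch of what such an approximateValue API might look like (this is hypothetical; no such API exists in Spark or in this patch, and all names here are invented): report the partial sum together with the fraction of partitions that have contributed, so the caller can extrapolate while some tasks are still running or have failed.

```python
# Hypothetical sketch of an approximate accumulator value: scale the
# partial sum by the fraction of partitions reported so far.

class ApproxAccumulator:
    def __init__(self, total_partitions):
        self.total = total_partitions
        self._done = set()   # partition ids that have reported
        self._sum = 0.0

    def add(self, partition_id, update):
        # Each partition contributes at most once.
        if partition_id not in self._done:
            self._done.add(partition_id)
            self._sum += update

    def approximate_value(self):
        # Crude mean-based extrapolation: assume unreported partitions
        # look like the ones seen so far.
        if not self._done:
            return 0.0
        return self._sum * self.total / len(self._done)

acc = ApproxAccumulator(total_partitions=4)
acc.add(0, 5.0)
acc.add(1, 7.0)
print(acc.approximate_value())  # 24.0: (5 + 7) * 4 / 2
```

This is only one possible design; the actual shape of such an API (and whether it belongs on Accumulator or SparkContext) is exactly the open question here.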
Third, actually, this patch ensures that the accumulator value from a stage
only becomes available once that stage becomes independent (meaning no job
needs it any more); if one job finishes but another job still needs that
stage, the accumulator value calculated in the stage will not yet be counted.
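The stage-independence bookkeeping can be sketched like this (again a toy model with invented names, not the actual DAGScheduler logic): buffer a stage's accumulator updates, and fold them into the visible value only when the last job that needs the stage finishes.

```python
# Toy sketch: count a stage's buffered accumulator updates exactly once,
# at the moment no running job depends on that stage any more.

class StageScopedAccumulator:
    def __init__(self):
        self.value = 0
        self._pending = {}     # stage_id -> buffered partial sum
        self._jobs_using = {}  # stage_id -> set of job ids still needing it

    def register(self, stage_id, job_ids):
        self._jobs_using[stage_id] = set(job_ids)
        self._pending[stage_id] = 0

    def add(self, stage_id, update):
        self._pending[stage_id] += update

    def job_finished(self, job_id):
        # When the last job needing a stage finishes, the stage becomes
        # independent and its buffered updates are counted exactly once.
        for stage_id, jobs in self._jobs_using.items():
            jobs.discard(job_id)
            if not jobs:
                self.value += self._pending.pop(stage_id, 0)
        self._jobs_using = {s: j for s, j in self._jobs_using.items() if j}

acc = StageScopedAccumulator()
acc.register(stage_id=0, job_ids=[1, 2])  # stage 0 shared by jobs 1 and 2
acc.add(stage_id=0, update=10)

acc.job_finished(1)
print(acc.value)  # 0: job 2 still needs stage 0

acc.job_finished(2)
print(acc.value)  # 10: stage 0 is now independent, updates are counted
```

The key property is that the shared stage's contribution is neither dropped nor double-counted: it is applied exactly once, at the point the stage becomes independent.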