GitHub user CodingCat commented on the pull request:
https://github.com/apache/spark/pull/228#issuecomment-39641216
Hi @pwendell, thank you for your comments; here is my reply.
First, on "whether the accumulator value should be subject to change in the
case of failure": I think no, even though Spark has run this way for a long
time (this patch actually resolves a very old TODO in DAGScheduler.scala). I
think accumulators are usually used for tasks more complicated than rdd.count
or rdd.filter.count (maybe I'm wrong), e.g. computing the sum of distances
from the points to a candidate center in K-means (see mllib). In such a case,
the health status of the cluster should be transparent to the user, i.e. the
final result of K-means should not depend on whether an executor was lost,
etc.
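To make the failure-semantics point concrete, here is a toy model in pure Python (this is an illustrative sketch, not Spark's actual implementation; the class and function names are invented for the example): the driver counts each partition's accumulator contribution at most once per stage, so a retried task after an executor loss cannot double-count.

```python
# Toy model: deduplicating accumulator updates so the final value
# does not depend on task retries (names are hypothetical).

class DriverAccumulator:
    def __init__(self):
        self.value = 0.0
        self._seen = set()  # (stage_id, partition_id) pairs already counted

    def receive(self, stage_id, partition_id, update):
        # Count each partition's contribution at most once per stage,
        # so a re-executed (retried) task cannot double-count.
        key = (stage_id, partition_id)
        if key in self._seen:
            return
        self._seen.add(key)
        self.value += update

def sum_of_distances(points, center):
    # The K-means-style per-partition computation from the example above.
    return sum(abs(p - center) for p in points)

acc = DriverAccumulator()
partitions = [[1.0, 2.0], [4.0, 8.0]]
for pid, part in enumerate(partitions):
    acc.receive(stage_id=0, partition_id=pid,
                update=sum_of_distances(part, 0.0))

# Simulate a retry of partition 1 after an executor loss:
acc.receive(stage_id=0, partition_id=1,
            update=sum_of_distances(partitions[1], 0.0))

print(acc.value)  # 15.0 with or without the retry
```

With this bookkeeping the user sees the same accumulator value whether or not an executor was lost mid-stage, which is the "transparent to the user" behavior argued for above.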
Second, good point; I understand the use scenario, but would you mind
providing more details on how to implement this in Spark? I guess it could be
addressed by providing an approximateValue API in Accumulator or
SparkContext.
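Purely as a sketch of what such an approximateValue API might look like (this is hypothetical; no such API exists in Spark or in this patch, and all names here are invented): report the partial sum together with the fraction of partitions that have contributed, so the caller can extrapolate while some tasks are still running or have failed.

```python
# Hypothetical sketch of an approximate accumulator value: scale the
# partial sum by the fraction of partitions reported so far.

class ApproxAccumulator:
    def __init__(self, total_partitions):
        self.total = total_partitions
        self._done = set()   # partition ids that have reported
        self._sum = 0.0

    def add(self, partition_id, update):
        # Each partition contributes at most once.
        if partition_id not in self._done:
            self._done.add(partition_id)
            self._sum += update

    def approximate_value(self):
        # Crude mean-based extrapolation: assume unreported partitions
        # look like the ones seen so far.
        if not self._done:
            return 0.0
        return self._sum * self.total / len(self._done)

acc = ApproxAccumulator(total_partitions=4)
acc.add(0, 5.0)
acc.add(1, 7.0)
print(acc.approximate_value())  # 24.0: (5 + 7) * 4 / 2
```

This is only one possible design; the actual shape of such an API (and whether it belongs on Accumulator or SparkContext) is exactly the open question here.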
Third, actually, this patch ensures that the accumulator value from a stage
only becomes available once that stage becomes independent (meaning no job
needs it any more); if one job finishes but another job still needs that
stage, the accumulator value calculated in the stage will not yet be counted.
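The stage-independence bookkeeping can be sketched like this (again a toy model with invented names, not the actual DAGScheduler logic): buffer a stage's accumulator updates, and fold them into the visible value only when the last job that needs the stage finishes.

```python
# Toy sketch: count a stage's buffered accumulator updates exactly once,
# at the moment no running job depends on that stage any more.

class StageScopedAccumulator:
    def __init__(self):
        self.value = 0
        self._pending = {}     # stage_id -> buffered partial sum
        self._jobs_using = {}  # stage_id -> set of job ids still needing it

    def register(self, stage_id, job_ids):
        self._jobs_using[stage_id] = set(job_ids)
        self._pending[stage_id] = 0

    def add(self, stage_id, update):
        self._pending[stage_id] += update

    def job_finished(self, job_id):
        # When the last job needing a stage finishes, the stage becomes
        # independent and its buffered updates are counted exactly once.
        for stage_id, jobs in self._jobs_using.items():
            jobs.discard(job_id)
            if not jobs:
                self.value += self._pending.pop(stage_id, 0)
        self._jobs_using = {s: j for s, j in self._jobs_using.items() if j}

acc = StageScopedAccumulator()
acc.register(stage_id=0, job_ids=[1, 2])  # stage 0 shared by jobs 1 and 2
acc.add(stage_id=0, update=10)

acc.job_finished(1)
print(acc.value)  # 0: job 2 still needs stage 0

acc.job_finished(2)
print(acc.value)  # 10: stage 0 is now independent, updates are counted
```

The key property is that the shared stage's contribution is neither dropped nor double-counted: it is applied exactly once, at the point the stage becomes independent.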