Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/228#issuecomment-39628971
This patch is a little confusing to me because I always thought that part
of the defined semantics of an accumulator is that its value might differ
in the case of failures. Maybe the issue is that we don't document this
clearly enough. Or maybe it's never been established as such.
The approach taken here changes a property of accumulators - it makes it so
the value of an accumulator cannot be accessed until the job is complete. This
seems like a loss of functionality. E.g. I could imagine people using
accumulators to decide when to terminate a stage (in the future) for things
like approximate calculations where you want to stop executing at some point
based on an accumulated value.
The approach here also seems to assume that once a job finishes the stages
couldn't ever be re-computed later on. Is that valid? For instance, if you have
executor failures that cause shuffle output or RDD partitions to be lost, won't
the scheduler need to re-compute partitions in a stage anyway? What if the
tasks update accumulator values that are necessary for re-creating that data?
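To make the re-computation concern concrete, here is a minimal sketch (plain Java, no Spark; the task/retry model and all names are hypothetical, not Spark's actual scheduler code) of why re-running a task after an executor failure re-applies its accumulator update and changes the final value:

```java
import java.util.List;

public class AccumulatorRetrySketch {
    // Stand-in for a driver-side accumulator value.
    static long acc = 0;

    // Each "task" processes one partition and adds its record count.
    static void runTask(List<Integer> partition) {
        acc += partition.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
            List.of(List.of(1, 2, 3), List.of(4, 5));

        // First run of the stage: every partition counted once.
        for (List<Integer> p : partitions) {
            runTask(p);
        }
        System.out.println(acc); // 5

        // Now suppose an executor is lost and the scheduler re-runs the
        // first task to regenerate its shuffle output: the accumulator
        // update from that task is applied a second time.
        runTask(partitions.get(0));
        System.out.println(acc); // 8: the re-computed partition double-counts
    }
}
```

The sketch assumes a naive "always re-apply updates" model; whether the scheduler could instead de-duplicate updates per task attempt is exactly the semantics question being debated here.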