Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/228#issuecomment-39628971
This patch is a little confusing to me because I always thought that part
of the defined semantics of an accumulator is that its value might differ
in the case of failures. Maybe the issue is that we don't document this
clearly enough. Or maybe it's never been established as such.
The approach taken here changes a property of accumulators - it makes it so
the value of an accumulator cannot be accessed until the job is complete. This
seems like a loss of functionality. E.g. I could imagine people using
accumulators to decide when to terminate a stage (in the future) for things
like approximate calculations where you want to stop executing at some point
based on an accumulated value.
The approach here also seems to assume that once a job finishes the stages
couldn't ever be re-computed later on. Is that valid? For instance, if you have
executor failures that cause shuffle output or RDD partitions to be lost, won't
the scheduler need to re-compute partitions in a stage anyway? What if the
tasks update accumulator values that are necessary for re-creating that data?
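To make the re-computation concern concrete, here is a minimal sketch (plain Java, no Spark; the task/retry model and all names are hypothetical, not Spark's actual scheduler code) of why re-running a task after an executor failure re-applies its accumulator update and changes the final value:

```java
import java.util.List;

public class AccumulatorRetrySketch {
    // Stand-in for a driver-side accumulator value.
    static long acc = 0;

    // Each "task" processes one partition and adds its record count.
    static void runTask(List<Integer> partition) {
        acc += partition.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
            List.of(List.of(1, 2, 3), List.of(4, 5));

        // First run of the stage: every partition counted once.
        for (List<Integer> p : partitions) {
            runTask(p);
        }
        System.out.println(acc); // 5

        // Now suppose an executor is lost and the scheduler re-runs the
        // first task to regenerate its shuffle output: the accumulator
        // update from that task is applied a second time.
        runTask(partitions.get(0));
        System.out.println(acc); // 8: the re-computed partition double-counts
    }
}
```

The sketch assumes a naive "always re-apply updates" model; whether the scheduler could instead de-duplicate updates per task attempt is exactly the semantics question being debated here.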