GitHub user squito commented on the pull request:
https://github.com/apache/spark/pull/7770#issuecomment-127411771
Other concerns aside ... can somebody explain the logic for resetting the
accumulator values to me?
IIUC, you reset the values:
a) on the initial instance
b) if the RDD in the dependency has been entirely evicted from the cache
c) if a stage fails before any task completes (though any remaining tasks
from that stage attempt may still increment the accumulators; I'm not certain)
d) if a stage fails in a way that leads us to believe, in one go, that all
map output data has been lost
and you do not reset the values if:
e) only some of the parent RDD has been evicted from the cache
f) the stage fails after some map output has already been registered
g) a stage has lost all of its map output, but Spark doesn't "realize" this
until after a few rounds of stage attempts (which is not that far-fetched --
perhaps even more likely than (d), though I'm only saying this anecdotally)
h) a stage has lost 99% of its map output data, but there is still one
partition hanging on for dear life.
Is my understanding wrong? Is this all intended behavior, or just "good
enough" behavior?