Github user squito commented on the pull request:
https://github.com/apache/spark/pull/8090#issuecomment-129690335
Hi @carsonwang, thanks for reporting this & suggesting a fix. This is
clearly important to fix for 1.5, and I'm glad you marked it as a blocker --
but I'd also like to understand it a bit better. It's not clear to me how this
happens. It seems like the first submission should always need to compute all
the partitions, which will create the accumulators, and then the same stage
should get reused during the retry. I just tried one example and it didn't
occur there either (though it was basically the simplest possible thing). Any
chance you can share the complete driver logs showing how this happened? I'd
like us to add a test to make sure it's right.
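For what it's worth, here's a rough sketch of the sort of "simplest possible
thing" I tried -- an accumulator updated in a map stage that feeds a shuffle.
The names and setup here are illustrative only, not my exact code, and the
part this doesn't exercise is the stage retry itself, since that needs a
fetch failure:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: an accumulator updated in a shuffle map stage.
// A fetch failure in the result stage would force a retry of the map stage,
// which is where the reported problem is supposed to show up.
object AccumulatorRetrySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("accum-retry-sketch").setMaster("local[2]"))
    // named accumulator, created on the driver before the stage is submitted
    val acc = sc.accumulator(0, "mapSideCounter")

    val counts = sc.parallelize(1 to 100, 4)
      .map { i => acc += 1; (i % 10, i) }  // accumulator updated in the map stage
      .reduceByKey(_ + _)                  // shuffle => separate result stage
      .collect()

    println(s"accumulator = ${acc.value}, groups = ${counts.length}")
    sc.stop()
  }
}
```

Against a local master this completes without any retry, so it never hits the
code path in question -- which is why the driver logs from an actual failed
run would really help.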
It's also disappointing that our existing DAGScheduler tests didn't catch
this, but I suppose that's because this error comes from the interaction
between the DAGScheduler and task execution with failures, and we don't have
proper tests for that in place.