[
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557620#comment-14557620
]
Josh Rosen commented on SPARK-7308:
-----------------------------------
I think that properly fixing this set of issues will involve both scheduler and
shuffle write path changes.
Spark's task cancellation is best-effort, so even if we fix the scheduler
issues, it is still possible for a delayed task from an earlier stage attempt
to conflict with a task from a subsequent attempt. I think we should focus
first on making it safe for multiple attempts of the same task to run
concurrently on the same executor, and then on making the scheduler changes
to prevent this scenario from happening. I like Marcelo's suggestion that
different task attempts write their output to different files. Note, however,
that the name of the shuffle output file is an implicit interface that's used
by our external shuffle service. As a result, I think that we need to ensure
that the final "winning" task attempt renames its temporary / staging files to
the filenames that we're using now. To do this, I think that we can implement
some simple synchronization within an Executor JVM to implement
last-writer-wins atomic renaming / commit of output files (this is similar in
spirit to OutputCommitCoordinator, but _much_ simpler since it's local
coordination).
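A minimal sketch of the kind of JVM-local coordination I have in mind (the
object and method names here are hypothetical, not existing Spark APIs): each
attempt writes to its own temporary file, and commit atomically renames that
file to the final shuffle output name under an executor-local lock.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical sketch of executor-local, last-writer-wins commit of shuffle
// output files. Each task attempt writes to a per-attempt temporary file;
// committing renames it to the final name that the external shuffle service
// expects, under a lock that is local to this executor JVM.
object LocalCommitCoordinator {
  private val commitLock = new Object

  def commit(attemptTemp: File, finalOutput: File): Unit = commitLock.synchronized {
    // REPLACE_EXISTING makes the last committing attempt win over any
    // output already renamed into place by an earlier attempt.
    Files.move(attemptTemp.toPath, finalOutput.toPath,
      StandardCopyOption.REPLACE_EXISTING)
  }
}
```

Since this lock only lives inside one executor JVM, it is much cheaper than
the driver-side OutputCommitCoordinator, but it only protects against
concurrent attempts that land on the same executor.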
Once we fix the safety issue, we can then address the scheduler logic changes.
I think that addressing these two pieces in this order makes the most sense,
since scheduler changes have historically been very hard to perform correctly.
Since these sets of changes are largely orthogonal, splitting them into
separate patches will significantly lower our review burden and make things
easier for component-specific maintainers (e.g. it'll be easier for the
scheduler maintainers to review a smaller patch without a bunch of unrelated
changes to executor commit coordination).
Since it sounds like you already have a good test that reproduces all of the
bugs, I would welcome a patch which commits a failing test (we could just wrap
it in a try-catch block or add an expected exception to the test declaration).
This will help to keep the testing work that you've done so far from bitrotting
while we work on the fix.
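One way to sketch the try-catch wrapping (the helper name and the `repro`
parameter are illustrative, not part of any existing test suite): the
expected failure is swallowed until the fix lands, and an unexpected pass is
surfaced so we remember to unwrap the test.

```scala
// Hypothetical sketch of committing a known-failing reproduction without
// breaking the build. `repro` stands in for the actual reproduction body.
def knownFailing(repro: () => Unit): String =
  try {
    repro()
    "UNEXPECTED PASS: the bug may be fixed; unwrap the test"
  } catch {
    case _: Exception => "expected failure (bug still open)"
  }
```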
> Should there be multiple concurrent attempts for one stage?
> -----------------------------------------------------------
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.3.1
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple
> concurrent attempts for the same stage. Is this intended? At best, it leads
> to some very confusing behavior, and it makes it hard for the user to make
> sense of what is going on. At worst, I think this is the cause of some very
> strange errors we've seen from users, where stages start executing before
> all the dependent stages have completed.
> This can happen in the following scenario: there is a fetch failure in
> attempt 0, so the stage is retried and attempt 1 starts. But tasks from
> attempt 0 are still running -- some of them can also hit fetch failures
> after attempt 1 starts, causing additional stage attempts to get fired up.
> There is an attempt to handle this already
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running. It really should
> check whether that **attempt** is still running, but there isn't enough info
> to do that.
> I'll also post some info on how to reproduce this.
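The stage-vs-attempt distinction described above can be sketched as follows
(the `TaskInfo` case class and both predicates are illustrative, not the
actual DAGScheduler types): the existing check asks whether the *stage* has
running tasks, when it should ask whether a specific *attempt* does.

```scala
// Hypothetical sketch: a stage-level check cannot distinguish a stale task
// from an earlier attempt from a live task of the current attempt.
case class TaskInfo(stageId: Int, attemptId: Int)

def stageIsRunning(running: Set[TaskInfo], stageId: Int): Boolean =
  running.exists(_.stageId == stageId)

def attemptIsRunning(running: Set[TaskInfo], stageId: Int, attemptId: Int): Boolean =
  running.contains(TaskInfo(stageId, attemptId))
```

With only attempt 0's tasks still running, the stage-level check reports the
stage as running even though attempt 1 has no live tasks, which is exactly
the information gap the issue describes.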
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)