Josh Rosen created SPARK-10381:
----------------------------------
Summary: Infinite loop when OutputCommitCoordination is enabled
and OutputCommitter.commitTask throws exception
Key: SPARK-10381
URL: https://issues.apache.org/jira/browse/SPARK-10381
Project: Spark
Issue Type: Bug
Components: Scheduler
Affects Versions: 1.4.1, 1.3.1, 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
When speculative execution is enabled, consider a scenario where the authorized
committer of a particular output partition fails during the
OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is
supposed to release that committer's exclusive lock on committing once that
task fails. However, due to a unit mismatch (an attempt number is used in one
place and a task attempt ID in another), the lock is never released, causing
Spark to go into an infinite retry loop.
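To make the failure mode concrete, here is a minimal toy model in Python. It
is only a sketch under stated assumptions: ToyCommitCoordinator,
retry_can_commit and the ID value 42 are hypothetical names and values chosen
for illustration, and none of this is Spark's actual OutputCommitCoordinator
code. The toy records the lock holder by attempt number, so reporting the
failure with a different kind of identifier leaves the lock in place.

```python
from typing import Dict


class ToyCommitCoordinator:
    """Toy stand-in for a commit coordinator; not Spark's real class."""

    def __init__(self) -> None:
        # Maps partition -> attempt number of the task holding the commit lock.
        self.authorized: Dict[int, int] = {}

    def can_commit(self, partition: int, attempt_number: int) -> bool:
        # Grant the lock only if nobody holds it for this partition.
        if partition not in self.authorized:
            self.authorized[partition] = attempt_number
            return True
        return False

    def task_failed(self, partition: int, failed_attempt: int) -> None:
        # Release the lock only if the failed attempt is the recorded holder.
        if self.authorized.get(partition) == failed_attempt:
            del self.authorized[partition]


def retry_can_commit(report_task_attempt_id: bool) -> bool:
    """Return True if a retry gets the lock after the holder fails."""
    coordinator = ToyCommitCoordinator()
    partition = 0
    attempt_number = 0      # per-task counter: 0, 1, 2, ...
    task_attempt_id = 42    # hypothetical globally unique ID of that attempt

    assert coordinator.can_commit(partition, attempt_number)
    # The authorized committer now fails inside commitTask().
    reported = task_attempt_id if report_task_attempt_id else attempt_number
    coordinator.task_failed(partition, reported)
    # The next attempt of the same task asks for permission to commit.
    return coordinator.can_commit(partition, attempt_number + 1)


if __name__ == "__main__":
    print("mismatched identifiers:", retry_can_commit(True))
    print("matching identifiers:  ", retry_can_commit(False))
```

In this toy, retry_can_commit(True) returns False because the stale lock blocks
every later attempt, while retry_can_commit(False) returns True. A scheduler
that keeps retrying the denied task would therefore never make progress.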
This bug was masked by the fact that the OutputCommitCoordinator does not have
enough end-to-end tests (the current tests use many mocks). Another
contributing factor is that we have many similarly-named identifiers with
different semantics but the same data type (e.g. attemptNumber and
taskAttemptId), and inconsistent variable naming makes them difficult to
distinguish.
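As an illustration of why sharing one data type makes such identifiers easy to
confuse, the sketch below wraps each kind of identifier in its own type so
that a mix-up fails loudly instead of silently comparing unequal. This is an
assumption-laden example, not something this report proposes and not Spark
code: AttemptNumber, TaskAttemptId and should_release_lock are hypothetical
names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttemptNumber:
    """Per-task retry counter (0, 1, 2, ...). Hypothetical wrapper type."""
    value: int


@dataclass(frozen=True)
class TaskAttemptId:
    """Globally unique ID of one task attempt. Hypothetical wrapper type."""
    value: int


def should_release_lock(holder: AttemptNumber, failed: AttemptNumber) -> bool:
    """Return True if the failed attempt is the one holding the commit lock."""
    # Reject the wrong kind of identifier instead of silently comparing it.
    if not isinstance(failed, AttemptNumber):
        raise TypeError(
            f"expected AttemptNumber, got {type(failed).__name__}"
        )
    return holder == failed


if __name__ == "__main__":
    print(should_release_lock(AttemptNumber(0), AttemptNumber(0)))
    try:
        should_release_lock(AttemptNumber(0), TaskAttemptId(0))
    except TypeError as exc:
        print("rejected:", exc)
```

With plain integers, passing a task attempt ID where an attempt number is
expected simply yields a failed comparison. With distinct types, the same
mistake raises a TypeError at the call site, and a static type checker could
flag it before the code runs.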