[
https://issues.apache.org/jira/browse/TEZ-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Badger updated TEZ-3696:
-----------------------------
Attachment: TEZ-3696.004.patch
bq. pendingAttempt can use TaskAttemptId as a key instead of TaskAttempt.
Uploading a new patch that uses TezTaskAttemptID instead of TaskAttempt as the key for pendingAttempt.
> Jobs can hang when both concurrency and speculation are enabled
> ---------------------------------------------------------------
>
> Key: TEZ-3696
> URL: https://issues.apache.org/jira/browse/TEZ-3696
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
> Attachments: TEZ-3696.001.patch, TEZ-3696.002.patch,
> TEZ-3696.003.patch, TEZ-3696.004.patch
>
>
> We can reproduce the hung job by doing the following:
> 1. Run a sleep job with a concurrency of 1, speculation enabled, and 3 tasks
> {noformat}
> HADOOP_CLASSPATH="$TEZ_HOME/*:$TEZ_HOME/lib/*:$TEZ_CONF_DIR" yarn jar
> $TEZ_HOME/tez-tests-*.jar mrrsleep -Dtez.am.vertex.max-task-concurrency=1
> -Dtez.am.speculation.enabled=true -Dtez.task.timeout-ms=60000 -m 3 -mt 60000
> -ir 0 -irt 0 -r 0 -rt 0
> {noformat}
> 2. Let the 1st task run to completion and then stop the 2nd task so that a
> speculative attempt is scheduled. Once the speculative attempt is scheduled
> for the 2nd task, continue the original attempt and let it complete.
> {noformat}
> kill -STOP <pid>
> // wait a few seconds for a speculative attempt to kick off
> kill -CONT <pid>
> {noformat}
> 3. Kill the 3rd task, which will create a 2nd attempt
> {noformat}
> kill -9 <pid>
> {noformat}
> 4. The next thing to be drawn off of the queue will be the speculative
> attempt of the 2nd task. However, it is already completed, so it will just
> sit in the final state and the job will hang.
> Basically, for the failure to happen, the number of speculative attempts
> that are scheduled but have not yet run has to be >= the concurrency of the
> job, and there has to be at least 1 task failure.
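
To make the condition above concrete, here is a toy model of the hang. It is not Tez code: the class, the method, and the FIFO queue are hypothetical names made up for illustration. It assumes that every concurrency slot is free at the moment the failed task's retry is queued, and that an already-finished speculative attempt drawn from the queue takes a slot and never gives it back.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only; none of these names come from the Tez code base.
class SpeculationHangSketch {

    /** A queued attempt. 'stale' means its task finished before the attempt was drawn. */
    static final class Attempt {
        final String name;
        final boolean stale;

        Attempt(String name, boolean stale) {
            this.name = name;
            this.stale = stale;
        }
    }

    /**
     * Returns true if the retry of the failed task never gets a slot.
     * The pending queue holds the stale speculative attempts first,
     * followed by the retry, mirroring step 4 of the reproduction.
     */
    static boolean retryStarves(int concurrency, int staleSpeculativeAttempts) {
        Deque<Attempt> pending = new ArrayDeque<>();
        for (int i = 0; i < staleSpeculativeAttempts; i++) {
            pending.add(new Attempt("spec-" + i, true));
        }
        pending.add(new Attempt("retry", false));

        int freeSlots = concurrency;
        while (freeSlots > 0 && !pending.isEmpty()) {
            Attempt next = pending.poll();
            freeSlots--; // the drawn attempt occupies a slot
            if (!next.stale) {
                return false; // the retry got a slot and can run
            }
            // Stale attempt: it sits in its final state and never reports
            // completion, so (in this model) the slot is never released.
        }
        return true; // all slots are held by stale attempts
    }

    public static void main(String[] args) {
        System.out.println(retryStarves(1, 1)); // prints true: the case from the description
        System.out.println(retryStarves(2, 1)); // prints false: one slot is still usable
    }
}
```

In this model the retry starves exactly when the number of stale speculative attempts is >= the concurrency, which matches the case in the description (concurrency of 1, one stale speculative attempt, one killed task).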