[
https://issues.apache.org/jira/browse/TEZ-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995911#comment-15995911
]
TezQA commented on TEZ-3696:
----------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12866276/TEZ-3696.004.patch
against master revision 7b30785.
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new
or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new
Findbugs (version 3.0.1) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.tez.common.TestTezSharedExecutor
org.apache.tez.runtime.library.common.writers.TestUnorderedPartitionedKVWriter
Test results:
https://builds.apache.org/job/PreCommit-TEZ-Build/2411//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2411//console
This message is automatically generated.
> Jobs can hang when both concurrency and speculation are enabled
> ---------------------------------------------------------------
>
> Key: TEZ-3696
> URL: https://issues.apache.org/jira/browse/TEZ-3696
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
> Attachments: TEZ-3696.001.patch, TEZ-3696.002.patch,
> TEZ-3696.003.patch, TEZ-3696.004.patch
>
>
> We can reproduce the hung job by doing the following:
> 1. Run a sleep job with a concurrency of 1, speculation enabled, and 3 tasks
> {noformat}
> HADOOP_CLASSPATH="$TEZ_HOME/*:$TEZ_HOME/lib/*:$TEZ_CONF_DIR" yarn jar \
>   $TEZ_HOME/tez-tests-*.jar mrrsleep -Dtez.am.vertex.max-task-concurrency=1 \
>   -Dtez.am.speculation.enabled=true -Dtez.task.timeout-ms=60000 -m 3 -mt 60000 \
>   -ir 0 -irt 0 -r 0 -rt 0
> {noformat}
> 2. Let the 1st task run to completion and then stop the 2nd task so that a
> speculative attempt is scheduled. Once the speculative attempt is scheduled
> for the 2nd task, continue the original attempt and let it complete.
> {noformat}
> kill -STOP <pid>
> # wait a few seconds for a speculative attempt to kick off
> kill -CONT <pid>
> {noformat}
> 3. Kill the 3rd task, which will create a 2nd attempt
> {noformat}
> kill -9 <pid>
> {noformat}
> 4. The next thing to be drawn off the queue will be the speculative
> attempt of the 2nd task. However, the 2nd task has already completed, so
> the speculative attempt will just sit in its final state and the job will hang.
> Basically, for the failure to happen, the number of speculative tasks that
> are scheduled but have not yet run has to be >= the concurrency of the job,
> and there has to be at least 1 task failure.
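
Step 2 of the reproduction and the failure condition summarized above can be sketched as two small shell functions. This is only an illustration: the function names `pause_then_resume` and `hang_possible` are hypothetical and are not part of Tez or of the attached patches, and the default pause of 5 seconds is an assumption standing in for "a few seconds".

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Tez or TEZ-3696.*.patch.

# Pause a process, wait, then resume it (step 2 of the reproduction).
# $1 = pid of the running task attempt
# $2 = seconds to stay paused (assumed default: 5)
pause_then_resume() {
    pid=$1
    delay=${2:-5}
    kill -STOP "$pid" || return 1
    # While the attempt is stopped it makes no progress, which gives
    # a speculative attempt time to be scheduled.
    sleep "$delay"
    kill -CONT "$pid"
}

# Exit status 0 when the condition stated in the report holds:
# speculative tasks scheduled but not yet run >= concurrency,
# and at least one task failure.
# $1 = speculative tasks scheduled but not yet run
# $2 = value of tez.am.vertex.max-task-concurrency
# $3 = number of task failures
hang_possible() {
    [ "$1" -ge "$2" ] && [ "$3" -ge 1 ]
}
```

With the concurrency of 1 used in the reproduction, a single pending speculative attempt plus the one task failure from step 3 satisfies the condition, so `hang_possible 1 1 1` succeeds, while `hang_possible 1 1 0` does not.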
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)