[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496452#comment-14496452 ]
Rohini Palaniswamy commented on TEZ-2322:
-----------------------------------------

No. Have only seen
   - TotalTasks come down when a new vertex is starting and tasks reduced due to auto parallelism with ShuffleVertexManager.
   - If the AM gets killed and a new one is launched, Succeeded goes to 0 and then increases as recovery kicks in.

Have not seen Succeeded reduce to a non-zero count. But I have only seen AM relaunch due to OOM or other issues with very big jobs (30K+ tasks). So worthwhile to check if there is a second AM attempt launched. Pig prints that status every 20 secs and it is possible a new AM was launched and recovery recovered 181 tasks by then.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --------------------------------------------------------------
>
>                 Key: TEZ-2322
>                 URL: https://issues.apache.org/jira/browse/TEZ-2322
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.2
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Priority: Minor
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
> {code}
> Now this may be because the tasks failed, some certainly did due to space
> exceptions having checked the logs, but surely once a task has finished
> successfully and is marked as succeeded it cannot then later be removed from
> the succeeded count? Perhaps the succeeded counter is incremented too early
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this
> doesn't account for the large drop in number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same
> time, which is suspicious. I'm pretty sure there was no contending job to
> finish and release so much more resource to this Tez job, so it's also
> unclear how the running count could have jumped up so significantly given the
> cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
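
For anyone wanting to check their own client logs for the pattern discussed in this thread, here is a minimal sketch in Python that scans Pig's "DAG Status" lines and reports any point where Succeeded goes down between two consecutive reports. It assumes the line layout shown in the quoted log; the names parse_status, find_succeeded_drops and SAMPLE_LOG are made up for illustration and are not part of Pig or Tez. A reported drop only says where to look (for example, whether a second AM attempt was launched around that time); it does not by itself show the cause.

```python
import re

# Counter names as they appear in the quoted "progress=" text. Longer names are
# listed first so "FailedTaskAttempts" is not confused with "Failed".
COUNTER_RE = re.compile(
    r"(FailedTaskAttempts|KilledTaskAttempts|TotalTasks|Succeeded|Running|Failed|Killed): (\d+)"
)
# Leading timestamp, e.g. "2015-04-15 15:10:56,992" (milliseconds are dropped).
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")


def parse_status(line):
    """Return (timestamp, counters) for a DAG status line, or None otherwise."""
    if "DAG Status:" not in line:
        return None
    ts = TIMESTAMP_RE.match(line)
    counters = {name: int(value) for name, value in COUNTER_RE.findall(line)}
    return (ts.group(1) if ts else None, counters)


def find_succeeded_drops(lines):
    """List (prev_time, time, prev_succeeded, succeeded) where Succeeded decreased."""
    drops = []
    previous = None
    for line in lines:
        parsed = parse_status(line)
        if parsed is None or "Succeeded" not in parsed[1]:
            continue  # not a status line, or truncated before the counter
        if previous is not None and parsed[1]["Succeeded"] < previous[1]["Succeeded"]:
            drops.append(
                (previous[0], parsed[0], previous[1]["Succeeded"], parsed[1]["Succeeded"])
            )
        previous = parsed
    return drops


# Two consecutive reports taken from the quoted log, one entry per line.
SAMPLE_LOG = [
    "2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=",
    "2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=",
]

if __name__ == "__main__":
    for prev_time, time, before, after in find_succeeded_drops(SAMPLE_LOG):
        print(f"Succeeded dropped {before} => {after} between {prev_time} and {time}")
```

Because Pig prints the status every 20 secs, the two timestamps bound the window in which to look for a new AM attempt in the YARN application's attempt list or AM logs.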