[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496452#comment-14496452
 ] 

Rohini Palaniswamy commented on TEZ-2322:
-----------------------------------------

No. I have only seen the counts change in two cases:
   - TotalTasks comes down when a new vertex is starting and its tasks are 
reduced by auto parallelism with ShuffleVertexManager. 
   - If the AM gets killed and a new one is launched, Succeeded goes to 0 and 
then increases as recovery kicks in. 

I have not seen Succeeded reduce to a non-zero count. But I have only seen AM 
relaunch due to OOM or other issues with very big jobs (30K+ tasks). So it is 
worthwhile to check whether a second AM attempt was launched. Pig prints that 
status every 20 secs, so it is possible a new AM was launched and recovery 
had already recovered 181 tasks by the time the next status was printed.
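
To narrow down when such a drop happened, the Pig client log can be scanned for 
consecutive status samples where Succeeded went down. The following is only a 
minimal sketch: the function name find_succeeded_drops and the regular 
expression are illustrative and not part of Pig or Tez, and the sample text is 
two status lines trimmed from the log quoted below.

```python
import re

# Assumed layout of the "DAG Status" lines Pig prints every 20 seconds.
# \s* and \s+ tolerate status lines that were wrapped across several lines.
STATUS_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+.*?"
    r"TotalTasks:\s*(\d+)\s+Succeeded:\s*(\d+)\s+Running:\s*(\d+)",
    re.DOTALL,
)


def find_succeeded_drops(log_text):
    """Return (timestamp, previous, current) for each sample where Succeeded fell."""
    drops = []
    previous = None
    for match in STATUS_RE.finditer(log_text):
        timestamp = match.group(1)
        succeeded = int(match.group(3))
        if previous is not None and succeeded < previous:
            drops.append((timestamp, previous, succeeded))
        previous = succeeded
    return drops


# Two status samples trimmed from the reported log.
sample = (
    "2015-04-15 15:10:36,992 [Timer-0] INFO  "
    "org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: "
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0\n"
    "2015-04-15 15:10:56,992 [Timer-0] INFO  "
    "org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: "
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0\n"
)

for timestamp, before, after in find_succeeded_drops(sample):
    print(f"{timestamp}: Succeeded dropped {before} -> {after}")
```

The timestamp of a reported drop gives the 20 second window to look at in the 
YARN application's attempt history. If a second AM attempt started inside that 
window, the drop is consistent with an AM relaunch followed by recovery.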

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --------------------------------------------------------------
>
>                 Key: TEZ-2322
>                 URL: https://issues.apache.org/jira/browse/TEZ-2322
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.2
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Priority: Minor
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
> 0 
> {code}
> Now this may be because the tasks failed. Some certainly did, due to space 
> exceptions (I have checked the logs), but surely once a task has finished 
> successfully and is marked as succeeded it cannot later be removed from 
> the succeeded count? Perhaps the succeeded counter is incremented too early, 
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
> doesn't account for the large drop in the number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same 
> time, which is suspicious. I'm pretty sure there was no contending job that 
> finished and released so much more resource to this Tez job, so it's also 
> unclear how the running count could have jumped up so significantly given the 
> cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
