[
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hari Sekhon updated TEZ-2322:
-----------------------------
Attachment: attempt2_syslog_dag_1427546104095_0146_1_post
attempt2_syslog_dag_1427546104095_0146_1
attempt2_syslog
attempt1_syslog_dag_1427546104095_0146_1
IIRC Ambari still doesn't support the Job History Server, so that command fails,
but I've copied the logs out via the RM.
> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --------------------------------------------------------------
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Environment: HDP 2.2
> Reporter: Hari Sekhon
> Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1,
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1,
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed:
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed:
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed:
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed:
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed:
> 0
> {code}
> Now this may be because tasks failed (having checked the logs, some certainly
> did, due to space exceptions), but surely once a task has finished
> successfully and is marked as succeeded it cannot later be removed from
> the succeeded count? Perhaps the succeeded counter is incremented too early,
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this
> increase of 73 doesn't account for the drop of 199 in succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same
> time, which is suspicious. I'm pretty sure there was no contending job that
> finished and released so much more resource to this Tez job, so it's also
> unclear how the running count could have jumped up so significantly, given the
> cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
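For anyone wanting to check the figures in the quoted log, here is a minimal sketch that parses two of the "DAG Status" lines and reports how each counter moved. The helper names (`parse_progress`, `find_regressions`) are hypothetical and not part of Pig or Tez, and the two sample lines are the 15:10:36 and 15:10:56 entries with the mail wrapping removed. It only does arithmetic on the logged numbers and says nothing about the cause.

```python
import re

# Matches "Name: 123" pairs such as "Succeeded: 380" in a status line.
# Timestamps like 15:10:36 are not matched because the name must be letters.
COUNTER_RE = re.compile(r"([A-Za-z]+):\s*(\d+)")


def parse_progress(line):
    """Return the counters in one 'DAG Status' line as a dict of ints."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


def find_regressions(snapshots, counter="Succeeded"):
    """Return (previous, current) pairs where the given counter decreased."""
    return [
        (prev, cur)
        for prev, cur in zip(snapshots, snapshots[1:])
        if cur[counter] < prev[counter]
    ]


# Two consecutive status lines copied from the log in this issue (unwrapped).
before = parse_progress(
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16"
)
after = parse_progress(
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89"
)

regressions = find_regressions([before, after])
prev, cur = regressions[0]

# Change in every counter between the two snapshots.
deltas = {name: cur[name] - prev[name] for name in prev}
print(deltas)
```

On these two lines the sketch gives a change of -199 for Succeeded, +666 for Running and +73 for KilledTaskAttempts. The same numbers show that Succeeded + Running equals TotalTasks after the drop (181 + 724 = 905) but not before it (380 + 58 = 438).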
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)