[
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hari Sekhon updated TEZ-2322:
-----------------------------
Description:
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}
This may be because tasks failed. Having checked the logs, some certainly did,
due to space exceptions. But surely once a task has finished successfully and
is marked as succeeded, it cannot later be removed from the succeeded count?
Perhaps the succeeded counter is incremented too early, before the task results
are really saved?
KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't
account for the large drop in number of succeeded tasks.
There was also a noticeable jump in Running tasks from 58 => 724 at the same
time, which is suspicious. I'm pretty sure there was no contending job to finish
and release so much more resource to this Tez job, so it's also unclear how the
running count could have jumped up so significantly.
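For anyone wanting to check their own client logs for the same symptom, here is a minimal Python sketch that parses the DAG Status entries and reports every sample where Succeeded went down. The helper names `parse_status` and `succeeded_drops` are hypothetical and not part of Pig or Tez; only the field names are taken from the log output above, and the sample input is two of the entries shown there.

```python
import re

# Field names are taken from the log lines above; everything else here
# (function names, structure) is an illustrative assumption.
STATUS_RE = re.compile(
    r"TotalTasks: (?P<total>\d+) Succeeded: (?P<succeeded>\d+) "
    r"Running: (?P<running>\d+) Failed: (?P<failed>\d+)"
)


def parse_status(text):
    """Return one dict of integer counters per DAG Status entry in text."""
    # Entries wrap across lines, so collapse all whitespace before matching.
    flat = " ".join(text.split())
    return [
        {name: int(value) for name, value in match.groupdict().items()}
        for match in STATUS_RE.finditer(flat)
    ]


def succeeded_drops(samples):
    """Return (index, previous, current) wherever Succeeded decreased."""
    return [
        (i, prev["succeeded"], cur["succeeded"])
        for i, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur["succeeded"] < prev["succeeded"]
    ]


LOG = """
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
"""

samples = parse_status(LOG)
for i, before, after in succeeded_drops(samples):
    print(f"sample {i}: Succeeded fell {before} -> {after} (delta {after - before})")
```

Run against the entries above, this flags the 380 => 181 step. One further observation from the same numbers, which may or may not be related: in every sample after the drop, Succeeded + Running equals TotalTasks (for example 181 + 724 = 905), whereas before the drop the two sum to only 438.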
was:
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}
This may be because tasks failed. Having checked the logs, some certainly did,
due to space exceptions. But surely once a task has finished successfully and
is marked as succeeded, it cannot later be removed from the succeeded count?
Perhaps the succeeded counter is incremented too early, before the task results
are really saved?
KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't
account for the large drop in number of succeeded tasks.
There was also a noticeable jump in Running tasks from 58 => 724 at the same
time, which is suspicious. I'm pretty sure there was no contending job to finish
and release so much more resource to this Tez job.
> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --------------------------------------------------------------
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Environment: HDP 2.2
> Reporter: Hari Sekhon
> Priority: Minor