[
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated TEZ-2317:
------------------------------------
Attachment: AM-taskkill.log
For a complex DAG that generated a lot of events, the AM could not process
them fast enough. We ([~bikassaha] and I) saw many tasks killed because only
TA_SCHEDULE had been processed, and before the RUNNING event was processed
the task received a commit go/no-go request, which is a separate async call
that does not go via the event queue. These issues were mostly with the
ONE-ONE edges Pig uses for distributed order by with sampling. Since those
tasks do little except partitioning, they were also finishing very quickly.
Issues to fix:
- Optimize by not sending a commit go/no-go request if there is no HDFS
output (DataSink) involved. In the above case, the output is always
intermediate.
- Handle the commit go/no-go request only after the events already in the
event queue have been processed. Maybe something like asking the task to
come back after some time.
- For 3058 KilledTaskAttempts we saw 383519 TA_KILL_REQUEST events, roughly
125 per killed attempt. This is far too high.
- The attached AM-taskkill.log, which contains the grepped statements for a
single task that was killed, has 327 repeats of the message below. Need to
find out why there are so many and fix that.
{code}
2015-04-13 23:19:11,126 INFO [IPC Server handler 22 on 53043] app.TaskAttemptListenerImpTezDag: Commit go/no-go request from attempt_1428329756093_374362_1_29_008426_0
2015-04-13 23:19:11,126 INFO [IPC Server handler 22 on 53043] impl.TaskImpl: Task not running. Issuing kill to bad commit attempt attempt_1428329756093_374362_1_29_008426_0
{code}
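A rough sketch of what the first two items could look like, for discussion
only. Every name here (CommitGateSketch, CommitReply, RETRY_LATER,
needsCommitRequest, onTaskState) is made up for illustration and is not an
existing Tez API. The idea is that a commit request arriving before the
RUNNING event has been processed gets a "come back later" reply instead of a
kill, and that a task with no DataSink never asks at all.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are real Tez classes or methods.
class CommitGateSketch {
    enum TaskState { SCHEDULED, RUNNING }
    enum CommitReply { GO, NO_GO, RETRY_LATER }

    // Task state as last recorded by the event-queue thread.
    private final Map<String, TaskState> taskStates = new ConcurrentHashMap<>();
    // First attempt that was granted the commit for each task.
    private final Map<String, String> commitOwner = new ConcurrentHashMap<>();

    // Called from the event-queue thread once a task event has been processed.
    void onTaskState(String taskId, TaskState state) {
        taskStates.put(taskId, state);
    }

    // AM side: called from an IPC handler thread, outside the event queue.
    CommitReply canCommit(String taskId, String attemptId) {
        TaskState state = taskStates.get(taskId);
        if (state != TaskState.RUNNING) {
            // The RUNNING event has not been processed yet, so ask the
            // attempt to come back later instead of killing it.
            return CommitReply.RETRY_LATER;
        }
        // Only the first attempt to ask is allowed to commit.
        String owner = commitOwner.putIfAbsent(taskId, attemptId);
        return (owner == null || owner.equals(attemptId))
                ? CommitReply.GO
                : CommitReply.NO_GO;
    }

    // Task side: skip the go/no-go round trip for purely intermediate output.
    static boolean needsCommitRequest(int dataSinkCount) {
        return dataSinkCount > 0;
    }
}
```

With something like this, a fast-finishing attempt on a ONE-ONE edge would
either skip the request entirely (no DataSink) or keep retrying until the AM
has caught up on its event queue.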
Please create separate jiras as required.
> Successful task attempts getting killed
> ---------------------------------------
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Attachments: AM-taskkill.log
>
>