[jira] [Commented] (MAPREDUCE-4818) Easier identification of tasks that timeout during localization

Jason Lowe (JIRA) Mon, 25 Aug 2014 06:31:20 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109102#comment-14109102
 ]


Jason Lowe commented on MAPREDUCE-4818:
---------------------------------------

It introduces a slight amount of extra overhead per container, but we only need 
to consider the containers running on a single node.  This is a single-line 
write to a local-disk log file for each file localized, so we're not talking 
about a lot of write traffic here and other nodes are not impacted.  This log 
file will be aggregated with the other logs of the application, so it should 
have no impact on the number of log aggregation files in HDFS and only a 
minimal impact on the log aggregation itself as it picks up one more local log 
file during aggregation (a file that will only be a few dozen KBytes in almost 
all cases).

A problem with only creating logs for failures is that we don't know for sure 
which ones have failed.  Currently containers are often killed by the 
application framework during localization, but YARN doesn't see this as a 
localization failure, only as a container that was killed.  We'd have to 
treat each container killed during localization as a "failed" case and 
write out the pertinent details of what was being localized at the time.  
Logging only on failure also means we don't get any progress reports in the 
logs showing which file is being localized.  If we write this out for each 
container then the user can visit the logs of the container and see what 
files have been localized so far and which one is currently being processed.
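
To make the per-file logging idea concrete, here is a minimal sketch of 
appending one line per localization event to a file in the container's local 
log directory.  The class name, the "localization.log" file name, the 
START/DONE states and the line layout are all assumptions made for 
illustration; they are not taken from YARN or from any patch on this issue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical helper, not part of YARN: records localization progress as
// one appended line per event in the container's local log directory.
class LocalizationProgressLog {
    private final Path logFile;

    LocalizationProgressLog(Path containerLogDir) {
        // Assumed file name; aggregation would pick it up with the other logs.
        this.logFile = containerLogDir.resolve("localization.log");
    }

    // Assumed line layout: "<timestampMillis> <state> <resource>".
    static String formatLine(long timestampMillis, String state, String resource) {
        return timestampMillis + " " + state + " " + resource;
    }

    // Appends a single line; creates the file on first use.
    void record(long timestampMillis, String state, String resource) throws IOException {
        String line = formatLine(timestampMillis, state, resource) + System.lineSeparator();
        Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Reads the log back, as a user viewing the container logs would see it.
    List<String> readAll() throws IOException {
        return Files.readAllLines(logFile, StandardCharsets.UTF_8);
    }
}
```

With a layout like this, a trailing START line with no matching DONE would 
identify the file that was still being localized when the container was 
killed.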

> Easier identification of tasks that timeout during localization
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4818
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4818
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.3-alpha
>            Reporter: Jason Lowe
>              Labels: usability
>
> When a task is taking too long to localize and is killed by the AM due to 
> task timeout, the job UI/history is not very helpful.  The attempt simply 
> lists a diagnostic stating it was killed due to timeout, but there are no 
> logs for the attempt since it never actually got started.  There are log 
> messages on the NM that show the container never made it past localization by 
> the time it was killed, but users often do not have access to those logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
