[ https://issues.apache.org/jira/browse/MAPREDUCE-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109102#comment-14109102 ]
Jason Lowe commented on MAPREDUCE-4818:
---------------------------------------
It introduces a slight amount of extra overhead per container, but we only need
to consider the containers running on a single node. This is a single-line
write to a local-disk log file for each file localized, so we're not talking
about a lot of write traffic here and other nodes are not impacted. This log
file will be aggregated with the other logs of the application, so it should
have no impact on the number of log aggregation files in HDFS and only a
minimal impact on the log aggregation itself as it picks up one more local log
file during aggregation (a file that will only be a few dozen KBytes in almost
all cases).
A problem with only creating logs for failures is that we don't know for sure
which ones have failed. Currently containers are often killed by the
application framework during localization, but YARN doesn't see this as a
localization failure, just as a container that was killed. We'd have to
treat each container being killed during localization as a "failed" case and
write out the pertinent details of what was being localized at the time. It
also means we don't get any progress reports in the logs showing which file is
being localized. If we write this out for each container then the user can
visit the logs of the container and see what files have been localized so far
and which one is currently being processed.
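To make the idea concrete, here is a rough sketch of the kind of per-container
progress log described above. This is only an illustration and not the actual
NodeManager code: the class name LocalizationProgressLog, its methods, and the
"START"/"DONE" line format are all made up for this example. It appends one
line to a local file per localization event and shows how a reader of the log
could tell which file was still being processed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any YARN or NodeManager API.
class LocalizationProgressLog {
    private final Path logFile;

    public LocalizationProgressLog(Path logFile) {
        this.logFile = logFile;
    }

    // Record that localization of a resource has begun.
    public void started(String resource) {
        append("START " + resource);
    }

    // Record that localization of a resource has completed.
    public void finished(String resource) {
        append("DONE " + resource);
    }

    // Return every line written so far (empty if nothing has been logged yet).
    public List<String> entries() {
        if (!Files.exists(logFile)) {
            return new ArrayList<>();
        }
        try {
            return Files.readAllLines(logFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Return the resource that was started but not yet finished, or null.
    public String currentlyLocalizing() {
        String current = null;
        for (String line : entries()) {
            // Assumed line format: "<millis> <START|DONE> <resource>"
            String[] parts = line.split(" ", 3);
            if (parts.length < 3) {
                continue;
            }
            if (parts[1].equals("START")) {
                current = parts[2];
            } else if (parts[1].equals("DONE") && parts[2].equals(current)) {
                current = null;
            }
        }
        return current;
    }

    // Append a single timestamped line to the local log file.
    private synchronized void append(String event) {
        String line = System.currentTimeMillis() + " " + event + System.lineSeparator();
        try {
            Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If a container is killed partway through, the last "START" line with no
matching "DONE" identifies the file it was stuck on, regardless of whether
YARN classified the kill as a localization failure.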
> Easier identification of tasks that timeout during localization
> ---------------------------------------------------------------
>
> Key: MAPREDUCE-4818
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4818
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am
> Affects Versions: 0.23.3, 2.0.3-alpha
> Reporter: Jason Lowe
> Labels: usability
>
> When a task is taking too long to localize and is killed by the AM due to
> task timeout, the job UI/history is not very helpful. The attempt simply
> lists a diagnostic stating it was killed due to timeout, but there are no
> logs for the attempt since it never actually got started. There are log
> messages on the NM that show the container never made it past localization by
> the time it was killed, but users often do not have access to those logs.