[ https://issues.apache.org/jira/browse/MAPREDUCE-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109102#comment-14109102 ]
Jason Lowe commented on MAPREDUCE-4818:
---------------------------------------

It introduces a slight amount of extra overhead per container, but we only need to consider the containers running on a single node. This is a single-line write to a local-disk log file for each file localized, so we're not talking about a lot of write traffic here, and other nodes are not impacted. This log file will be aggregated with the other logs of the application, so it should have no impact on the number of log aggregation files in HDFS and only a minimal impact on the log aggregation itself, which picks up one more local log file during aggregation (a file that will only be a few dozen KBytes in almost all cases).

A problem with only creating logs for failures is that we don't know for sure which containers have failed. Currently containers are often killed by the application framework during localization, but YARN doesn't see this as a localization failure, only as a container that was killed. We'd have to treat each container killed during localization as a "failed" case and write out the pertinent details of what was being localized at the time. It also means we don't get any progress reports in the logs showing which file is being localized. If we write this out for each container, then the user can visit the logs of the container and see what files have been localized so far and which one is currently being processed.

> Easier identification of tasks that timeout during localization
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4818
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4818
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.3-alpha
>            Reporter: Jason Lowe
>              Labels: usability
>
> When a task is taking too long to localize and is killed by the AM due to
> task timeout, the job UI/history is not very helpful. The attempt simply
> lists a diagnostic stating it was killed due to timeout, but there are no
> logs for the attempt since it never actually got started. There are log
> messages on the NM that show the container never made it past localization by
> the time it was killed, but users often do not have access to those logs.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
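
For illustration only, the per-container localization log proposed in the comment above could look roughly like the following. This is a minimal sketch in Python, not the NodeManager's actual (Java) code: the `LocalizationLog` class, the `START`/`DONE` line format, the `localization.log` file name, and the resource URIs are all hypothetical. It appends one short line per localization event and shows how a reader could work out which file was still being localized when a container was killed.

```python
import os
import tempfile
import time


class LocalizationLog:
    """Hypothetical per-container log: one appended line per localization event."""

    def __init__(self, path):
        self.path = path

    def record(self, resource, status):
        # A single short line per event keeps local-disk write traffic small.
        line = f"{int(time.time())} {status} {resource}\n"
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(line)

    def in_progress(self):
        """Return resources that have a START line but no matching DONE line."""
        pending = []
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                _, status, resource = line.rstrip("\n").split(" ", 2)
                if status == "START":
                    pending.append(resource)
                elif status == "DONE" and resource in pending:
                    pending.remove(resource)
        return pending


def demo():
    # Assumed layout: one log file inside the container's local log directory.
    log_dir = tempfile.mkdtemp()
    log = LocalizationLog(os.path.join(log_dir, "localization.log"))
    log.record("hdfs:///apps/job.jar", "START")
    log.record("hdfs:///apps/job.jar", "DONE")
    log.record("hdfs:///apps/big-archive.tgz", "START")
    # The container is killed at this point, so no DONE line is written.
    return log.in_progress()
```

Because every container writes these lines regardless of outcome, nothing has to decide after the fact whether a kill counts as a localization failure; calling `demo()` returns the one resource that was still in flight.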