[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435603#comment-13435603
 ] 

Rahul Jain commented on MAPREDUCE-4559:
---------------------------------------

In our lab tests, we have a namenode that performs much more slowly than regular 
namenodes because it sits on a slower network.

This caused the application master to take a long time committing the job (the last 
step after all maps and reduces were done); in some cases the commit took more than 
10 minutes, which exceeds yarn.am.liveness-monitor.expiry-interval-ms 
(default 10 minutes).
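
As a stop-gap on our side (not a fix for this issue), we can raise that expiry interval 
above the longest commit time we expect. A minimal sketch follows; the property is 
normally set in yarn-site.xml on the ResourceManager, and the 20-minute value is just 
an example, not a recommendation:

{code}
// Sketch only: yarn.am.liveness-monitor.expiry-interval-ms is normally configured in
// yarn-site.xml on the ResourceManager; this snippet just illustrates the property
// name and an example 20-minute value against the 10-minute default.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmExpirySketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Example value: 20 minutes in milliseconds.
    conf.setLong("yarn.am.liveness-monitor.expiry-interval-ms", 20 * 60 * 1000L);
    System.out.println("AM expiry interval (ms): "
        + conf.getLong("yarn.am.liveness-monitor.expiry-interval-ms", 600000L));
  }
}
{code}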

Here is the RM logs snippet: 
{code}
05_0002    CONTAINERID=container_1344459886205_0002_01_000825
2012-08-08 23:57:34,881 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1344459886205_0002_01_000825 of capacity memory: 4096 on host sjc1-spr-msip-grid08.sjc1.carrieriq.com:26020, which currently has 0 containers, memory: 0 used and memory: 80000 available, release resources=true
2012-08-08 23:57:34,881 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Application appattempt_1344459886205_0002_000001 released container container_1344459886205_0002_01_000825 on node: host: sjc1-spr-msip-grid08.sjc1.carrieriq.com:26020 #containers=0 available=80000 used=0 with event: FINISHED
2012-08-09 00:08:10,256 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:appattempt_1344459886205_0002_000001 Timed out after 600 secs
2012-08-09 00:08:10,256 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1344459886205_0002_000001 State change from RUNNING to FAILED
2012-08-09 00:08:10,256 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1344459886205_0002 failed 1 times due to . Failing the application.
2012-08-09 00:08:10,257 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1344459886205_0002 State change from RUNNING to FAILED
{code}


On the application master:

{code}
2012-08-08 23:57:33,871 INFO [ContainerLauncher #13] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1344459886205_0002_01_000825 taskAttempt attempt_1344459886205_0002_m_000754_0
2012-08-08 23:57:33,871 INFO [ContainerLauncher #13] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1344459886205_0002_m_000754_0
2012-08-08 23:57:33,874 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1344459886205_0002_m_000754_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2012-08-08 23:57:33,874 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1344459886205_0002_m_000754_0
2012-08-08 23:57:33,875 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1344459886205_0002_m_000754 Task Transitioned from RUNNING to SUCCEEDED
2012-08-08 23:57:33,875 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 862
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler.
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled was : true
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that isSignalled was true
2012-08-09 00:08:10,263 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
2012-08-09 00:08:10,263 INFO [Thread-50] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning
{code}

We'd expect these failure conditions to be exactly the ones where job history 
availability matters most.

The job history can still be accessed manually through the aggregated logs on 
HDFS, but the job history server has no idea about the above job after the timeout.
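
For reference, this is roughly how we dig the aggregated logs out of HDFS today. A 
minimal sketch, assuming the default remote-app-log-dir layout 
(<remote-app-log-dir>/<user>/<suffix>/<appId>); the user name below is a placeholder 
and the application id is the one from the RM log above:

{code}
// Sketch: list the aggregated log files for the failed application on HDFS.
// Assumes the default layout <remote-app-log-dir>/<user>/<suffix>/<appId>;
// "hadoopuser" is a placeholder for the submitting user.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAggregatedLogs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String remoteLogDir = conf.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs");
    String suffix = conf.get("yarn.nodemanager.remote-app-log-dir-suffix", "logs");
    Path appLogDir = new Path(remoteLogDir + "/hadoopuser/" + suffix
        + "/application_1344459886205_0002");
    for (FileStatus status : fs.listStatus(appLogDir)) {
      System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
    }
  }
}
{code}

That only shows the per-node log files are still there; it gives the job history 
server no knowledge of the job, which is the gap this issue is about.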

                
> Job logs not accessible through job history server for AM killed due to 
> am.liveness-monitor expiry
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4559
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4559
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
