[ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gera Shegalov updated MAPREDUCE-5044:
-------------------------------------
    Attachment: MAPREDUCE-5044.v04.patch

v04, to apply on top of YARN-1515.v05. It now makes sure that a thread dump is created in uber mode. Added unit tests for a normal MR job and an uber MR job.

While working on this I realized that we actually need to discuss how mapreduce.task.timeout is treated in uber mode. Right now it is basically ignored: the AM does not kill itself, and LocalContainerLauncher processes CONTAINER_REMOTE_CLEANUP inline with the stuck task in SubtaskRunner. The liveness monitor for the AM in the RM does not catch the problem either, because RMCommunicator heartbeats in a separate allocator thread.

I am considering two options:
- move heartbeat() into SubtaskRunner for uber mode, so that the liveness monitor catches the stuck ubertask.
- do System.exit(errorcode) when TA_TIMEOUT occurs.

> Have AM trigger jstack on task attempts that timeout before killing them
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5044
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Assignee: Gera Shegalov
>         Attachments: MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack output via SIGQUIT before killing the task attempt. This would be invaluable for helping users debug their hung tasks, especially if they do not have shell access to the nodes.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
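
As a rough illustration of the thread dump discussed above for uber mode, where subtasks run inside the AM's own JVM: the sketch below collects a jstack-like dump in-process through the standard java.lang.management API. It is not taken from the MAPREDUCE-5044 patches, which may well do this differently. The class name ThreadDumpSketch, the method dumpAllThreads, and the thread name stuck-subtask are hypothetical and exist only for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Hadoop: builds a jstack-like dump of the current JVM.
class ThreadDumpSketch {

    // Returns the name, state, and stack frames of every live thread in this JVM.
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Simulate a hung subtask with a sleeping daemon thread (assumed name).
        Thread stuck = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "stuck-subtask");
        stuck.setDaemon(true);
        stuck.start();
        Thread.sleep(200); // give the thread time to start sleeping

        // In a real AM this would be written to the task log before acting on the timeout.
        System.out.println(dumpAllThreads());
    }
}
```

For a task running in its own container the dump would instead come from the task JVM itself, for example via SIGQUIT as the issue description suggests; an in-process dump like this only applies when the stuck code shares the JVM with whatever detects the timeout.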