[ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776003#action_12776003 ]
Aaron Kimball commented on MAPREDUCE-1119:
------------------------------------------

Actually, I suppose that if the kill request comes from the JT, then
it's definitely a speculative task attempt, right? Task attempt timeouts
are handled between the attempt and the TT, and the JT isn't involved at
all.

In the event of a timeout, the call chain is:

  markUnresponsiveTasks()
    -> TaskTracker.purgeTask(tip, wasFailure=true)
    -> tip.jobHasFinished(wasFailure)
    -> tip.kill(wasFailure)

Unfortunately, this is where the failure/non-failure information about
why the task is being killed stops being passed along. tip.kill()
continues with:

  TaskRunner.kill()
    -> JvmManager.taskKilled(this)
    -> JvmManagerForType.taskKilled(taskRunner)
    -> JvmManagerForType.killJvm(jvmId)
    -> JvmRunner.kill()
    -> TaskController.destroyTaskJvm(TaskControllerContext)

(Someone please correct me if I'm wrong.)

But TaskRunner.kill() doesn't take a reason code like wasFailure. This
could be changed, but then we'd also need to modify JvmManager and add a
synchronized method or volatile field to hand this data off to the
TaskControllerContext object. Is all this worth it just to avoid stack
dumps in aborted speculative task attempts?

> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow
> gather a stack dump for the task. This could be done either by sending a
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to
> gather the stack directly from Java.
> This may be somewhat tricky since the
> child may be running as another user (so the SIGQUIT would have to go through
> LinuxTaskController). This feature would make debugging these kinds of
> failures much easier, especially if we could somehow get it into the
> TaskDiagnostic message

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
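
For illustration, the change discussed in the comment (carrying the kill
reason down to the point where the JVM is destroyed) might look roughly
like the sketch below. Runner, Controller and ControllerContext are
hypothetical stand-ins, not Hadoop's actual TaskRunner, TaskController
or TaskControllerContext; the intermediate JvmManager layers are left
out, and the "dump-stack" entry only marks where a SIGQUIT would be sent.

```java
import java.util.ArrayList;
import java.util.List;

public class KillReasonSketch {

    /** Hypothetical stand-in for the context handed to the task controller. */
    static class ControllerContext {
        // volatile: set by the thread requesting the kill and read by
        // whichever thread actually tears the JVM down.
        volatile boolean wasFailure;
    }

    /** Hypothetical stand-in for the component that destroys the task JVM. */
    static class Controller {
        final List<String> actions = new ArrayList<>();

        void destroyTaskJvm(ControllerContext ctx) {
            // Only a failed (e.g. timed-out) attempt gets a stack dump;
            // an aborted speculative attempt is killed quietly.
            if (ctx.wasFailure) {
                actions.add("dump-stack"); // placeholder for sending SIGQUIT
            }
            actions.add("kill");
        }
    }

    /** Hypothetical stand-in for the runner whose kill() gains a reason flag. */
    static class Runner {
        final ControllerContext ctx = new ControllerContext();
        final Controller controller;

        Runner(Controller controller) {
            this.controller = controller;
        }

        void kill(boolean wasFailure) {
            ctx.wasFailure = wasFailure; // hand the reason off to the context
            controller.destroyTaskJvm(ctx);
        }
    }

    /** Runs one kill and returns the actions the controller took, in order. */
    static List<String> simulateKill(boolean wasFailure) {
        Controller controller = new Controller();
        new Runner(controller).kill(wasFailure);
        return controller.actions;
    }

    public static void main(String[] args) {
        System.out.println(simulateKill(true));  // [dump-stack, kill]
        System.out.println(simulateKill(false)); // [kill]
    }
}
```

The sketch only shows the extra parameter and the hand-off field; it
does not address whether that plumbing is worth the cost, which is the
question the comment raises.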