[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776003#action_12776003
 ] 

Aaron Kimball commented on MAPREDUCE-1119:
------------------------------------------

Actually, I suppose that if it comes from the JT, then it's definitely a 
speculative task attempt, right? Task attempt timeouts are actually between the 
attempt and the TT, and the JT isn't involved at all.

In the event of a timeout, markUnresponsiveTasks() calls 
TaskTracker.purgeTask(tip, wasFailure=true), which calls 
tip.jobHasFinished(wasFailure), which in turn calls tip.kill(wasFailure).

Unfortunately, this is where the failure/non-failure information about why the 
task should be killed stops being passed along. tip.kill() calls 
TaskRunner.kill(), which calls JvmManager.taskKilled(this), which calls 
JvmManagerForType.taskKilled(taskRunner), then 
JvmManagerForType.killJvm(jvmId), then JvmRunner.kill(), and finally 
TaskController.destroyTaskJvm(TaskControllerContext). (Someone please correct 
me if I'm wrong.)

But TaskRunner.kill() doesn't get a reason code like wasFailure. This could be 
changed, but then we'd also need to modify JvmManager, and add a 
synchronized/volatile call to hand off this data into the TaskControllerContext 
object. Is all this worth it just to avoid stack dumps in aborted speculative 
task attempts?
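
For concreteness, here is a rough sketch of what that hand-off might look 
like. It is not taken from any patch on this issue: SketchContext, 
SketchController, SketchRunner and the KillReason enum are hypothetical 
stand-ins for TaskControllerContext, TaskController and TaskRunner, and the 
intermediate JvmManager hops are collapsed into a single call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reason code; TaskRunner.kill() has no such parameter today. */
enum KillReason { FAILURE, NOT_A_FAILURE }

/** Stand-in for TaskControllerContext with one extra, assumed field. */
class SketchContext {
    // volatile so the thread that destroys the JVM sees the value
    // written by the thread that requested the kill.
    private volatile KillReason reason = KillReason.NOT_A_FAILURE;

    void setReason(KillReason r) { reason = r; }
    KillReason getReason() { return reason; }
}

/** Stand-in for the TaskController end of the chain. */
class SketchController {
    // Records what would be done, instead of touching a real process.
    final List<String> actions = new ArrayList<>();

    void destroyTaskJvm(SketchContext ctx) {
        // Only request a stack dump when the kill was due to a failure,
        // so aborted speculative attempts are killed quietly.
        if (ctx.getReason() == KillReason.FAILURE) {
            actions.add("dump-stack");
        }
        actions.add("kill");
    }
}

/** Stand-in for TaskRunner, with kill() now accepting the reason. */
class SketchRunner {
    private final SketchContext ctx = new SketchContext();
    private final SketchController controller;

    SketchRunner(SketchController controller) {
        this.controller = controller;
    }

    void kill(boolean wasFailure) {
        ctx.setReason(wasFailure ? KillReason.FAILURE : KillReason.NOT_A_FAILURE);
        // In the real code this would pass through JvmManager,
        // JvmManagerForType and JvmRunner before reaching the controller.
        controller.destroyTaskJvm(ctx);
    }
}
```

With this shape, kill(true) would record a dump followed by a kill, while 
kill(false) would record only the kill. The cost is the one raised above: 
every layer between TaskRunner and TaskController has to carry or expose the 
reason.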


> When tasks fail to report status, show tasks' stack dump before killing
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
