[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Kimball updated MAPREDUCE-1119:
-------------------------------------

    Attachment: MAPREDUCE-1119.patch

Attaching a patch which performs this function.

The stack trace is written to the task's own stdout by sending the task 
{{SIGQUIT}}, so it is naturally collected in the task's {{stdout}} log.

This patch modifies the API of {{TaskController}} to include a 
{{dumpTaskStack()}} method that sends {{SIGQUIT}} to the task.

In {{DefaultTaskController}}, the signal is actually sent by {{ProcessTree}}.  
The {{LinuxTaskController}} sends a new opcode, 
{{TaskCommands.QUIT_TASK_JVM}}, to the {{task-controller}} module, which then 
sends the {{SIGQUIT}} signal itself to the task JVM.
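
As a rough illustration only (this is not code from the patch), sending 
{{SIGQUIT}} to a JVM from Java can be sketched by shelling out to {{kill}}. 
The class and method names below are hypothetical; the sketch assumes a 
POSIX {{kill}} on the PATH and a HotSpot JVM, which prints a thread dump to 
its stdout on {{SIGQUIT}} and keeps running.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the patch or of Hadoop's ProcessTree.
class StackDumpSignal {

    // Builds the command line that asks the process to dump its threads.
    static List<String> quitCommand(String pid) {
        return Arrays.asList("kill", "-QUIT", pid);
    }

    // Runs the command and returns kill's exit code (0 means the signal was sent).
    static int sendQuit(String pid) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(quitCommand(pid)).inheritIO().start();
        return p.waitFor();
    }
}
```

Calling {{StackDumpSignal.sendQuit(pid)}} only works when the caller is 
allowed to signal the target process, which is why the setuid 
{{task-controller}} path is needed when the task runs as another user.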

The existing behavior of {{TaskController.destroyTaskJVM()}} is to send 
{{SIGTERM}}, sleep for {{context.sleeptimeBeforeSigkill}} and then send 
{{SIGKILL}}; I've modified this method so that it goes 
SIGQUIT/sleep/SIGTERM/sleep/SIGKILL. The sleep is necessary after the SIGQUIT 
to give the task time to actually do the stack dump before it has to handle 
SIGTERM.
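
The escalation order described above can be sketched as follows. This is a 
minimal illustration, not the patch itself: {{SignalEscalation}}, 
{{SignalSender}} and both wait parameters are hypothetical names, and whether 
the two sleeps share one duration is an assumption left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the SIGQUIT/sleep/SIGTERM/sleep/SIGKILL order.
class SignalEscalation {

    // Abstraction over "deliver this signal to this pid" (illustrative only).
    interface SignalSender {
        void send(String pid, String signal);
    }

    static void destroyWithStackDump(String pid, long dumpWaitMillis,
                                     long killWaitMillis, SignalSender sender) {
        sender.send(pid, "QUIT");   // ask the JVM for a thread dump
        pause(dumpWaitMillis);      // give it time to write the dump
        sender.send(pid, "TERM");   // polite shutdown request
        pause(killWaitMillis);      // grace period before the hard kill
        sender.send(pid, "KILL");   // forced termination
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    // Records the signals in the order they would be sent, using zero waits.
    static List<String> recordedSequence() {
        List<String> sent = new ArrayList<>();
        destroyWithStackDump("4242", 0L, 0L, (pid, signal) -> sent.add(signal));
        return sent;
    }
}
```

Passing the sender in as an interface keeps the ordering testable without 
signalling a real process; a real sender would delegate to whatever mechanism 
the controller uses.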

I tested this by running some jobs which time out and verified that they got 
the stack dumps in their task stdout logs; jobs which succeed do not. I did 
this with both the DefaultTaskController and the LinuxTaskController. I also 
added a unit test to the patch which checks that evidence of a stack dump 
appears in the stdout log for a task which is killed by the unit test.

While I was in the {{task-controller}} c++ module, I discovered a segfault, 
which is also fixed in this patch. If {{HADOOP_CONF_DIR}} isn't defined, the 
module expects {{argv[0]}} to be the full path to {{task-controller}} so that 
it can find the {{conf}} dir based on this. If you just run 
{{./task-controller}}, it tries to malloc a negative amount of space. I 
changed it to exit gracefully with an error message in this case. (Simple 
fix; no unit test case.)



> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>         Attachments: MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
