[ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776648#action_12776648 ]

Aaron Kimball commented on MAPREDUCE-1119:
------------------------------------------

I modeled this parameter after the fact that TaskTracker already uses this same 
name (see {{setTaskFailState()}}, {{jobHasFinished()}}, {{kill()}}, 
{{cleanUpOverMemoryTask()}}) to indicate whether a kill was failure-based or 
for other purposes (cleanup, preemption, etc.).

I think a more systemic overhaul of failure-reason tracking should perhaps 
happen as a separate issue.

As for your table... if you look at {{TaskController.destroyTaskJVM()}} (line 
151), you can see that the generates-stack column is true iff {{wasFailure}} is 
true.
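In other words, the decision there boils down to something like the following. This is a simplified, hypothetical Java sketch of the {{destroyTaskJVM()}} behavior, not the actual Hadoop code; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a SIGQUIT (which makes the JVM print a thread
// dump to stdout) is sent iff the kill is failure-based; every kill
// ends by destroying the task JVM with SIGKILL.
public class DestroyTaskSketch {
    static List<String> signalsToSend(boolean wasFailure) {
        List<String> signals = new ArrayList<>();
        if (wasFailure) {
            signals.add("SIGQUIT"); // JVM dumps thread stacks to stdout
        }
        signals.add("SIGKILL");     // then the task JVM is destroyed
        return signals;
    }

    public static void main(String[] args) {
        System.out.println(signalsToSend(true));  // failure-based kill
        System.out.println(signalsToSend(false)); // kill for other reasons
    }
}
```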

I ran some tests using a sleep job that slept for 60 seconds in each call to 
{{map()}}. Results follow:

|*Test case*|*Stack dump?*|
|set {{mapreduce.task.timeout}} to 10000 (task timeout)|yes|
|ran {{bin/mapred job -kill-task}} on attempts|no|
|ran {{bin/mapred job -fail-task}} on attempts|no|
|Let it complete successfully|no|
|ran {{bin/mapred job -kill}} on the job itself|no|
|threw a RuntimeException in the mapper|no|

Thus, I believe that translates into the following for your table:

|*Reason*|*wasFailure*|*generates stack*|
|Child exception|maybe|maybe*|
|Other task failures|false|false|
|Task timeout|true|true|
|Task killed by user|false|false|
|Task failed by user|false|false|
|Job killed by user|false|false|
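Treating the table above as a classification, it could be sketched like this. This is illustrative code only; the enum names are mine and are not actual Hadoop identifiers:

```java
// Hypothetical sketch of the wasFailure classification from the table.
// Child exceptions are the racy "maybe" case and are handled in Child
// itself, so they are omitted here.
public class KillReasonSketch {
    enum Reason { TASK_TIMEOUT, KILLED_BY_USER, FAILED_BY_USER, JOB_KILLED }

    // Only a task timeout is a failure-based kill, so only it
    // triggers a SIGQUIT stack dump.
    static boolean wasFailure(Reason r) {
        return r == Reason.TASK_TIMEOUT;
    }

    public static void main(String[] args) {
        for (Reason r : Reason.values()) {
            System.out.println(r + " -> wasFailure=" + wasFailure(r));
        }
    }
}
```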

Looking at {{org.apache.hadoop.mapred.Child}}, there are a few different catch 
blocks in there:
* If a task throws an {{FSError}}, this triggers 
{{TaskUmbilicalProtocol.fsError()}}, which causes a {{purgeTask(tip, 
wasFailure=true)}}.
* If a task throws any other sort of {{Exception}}, this does not trigger a 
particular response through the TUP; the exception string is passed to 
{{TaskTracker.reportDiagnosticInfo()}}, which simply logs it and takes no 
further action.
* If a map task throws any other {{Throwable}}, this triggers 
{{TUP.fatalError()}}, which also calls {{purgeTask(tip, wasFailure=true)}}.

But immediately after these catch blocks, the child closes the RPC proxy, shuts 
down the logging thread, and exits the JVM. So {{fsError()}} and 
{{fatalError()}} *may* cause a stack dump if the TT processes the request fast 
enough and issues a SIGQUIT within the next few microseconds. But that is 
racing against the fact that the child task's next action is "exit 
immediately."
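As a toy model of that race (hypothetical code; the method and parameter names are mine), the dump is captured only when the TT's SIGQUIT lands before the child exits:

```java
// Toy model of the race described above: the child's fatalError()/
// fsError() report triggers a SIGQUIT from the TT, but the child
// exits immediately after reporting. The dump is captured only if
// the TT's signal arrives before the child's exit.
public class DumpRaceSketch {
    static boolean dumpCaptured(long ttSignalDelayMicros, long childExitDelayMicros) {
        return ttSignalDelayMicros < childExitDelayMicros;
    }

    public static void main(String[] args) {
        // A very fast TT reaction can still win the race...
        System.out.println(dumpCaptured(5, 50));  // true
        // ...but since the child's next action is "exit immediately",
        // it usually loses.
        System.out.println(dumpCaptured(50, 5));  // false
    }
}
```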

Note that job kill, task timeout, and the task exception cases are all covered 
in the unit test provided in this patch.


> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
