[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778288#action_12778288
 ] 

Vinod K V commented on MAPREDUCE-1119:
--------------------------------------

The patch looks very clean now! Thanks! It is very close; I have only a few 
comments on the latest patch, most of them minor:
 - Care to explain the changes to {{src/c++/task-controller/main.c}} w.r.t. 
conf_dir_len? Both for my confirmation and for the record.
 - Change the C comments for {{kill_user_task()}} in 
{{src/c++/task-controller/task-controller.c}} to mention that it can 
terminate/kill or dump-stack?
 - Now that the semantics have changed, I am not very sure we want to use the 
same configuration property for sleeping after dump-stack. (Thinking aloud..) 
Do we even need a sleep here? The signalling order is 
SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their 
arrival? If so, then we will not need another sleep. If not, we may need a 
sleep here, but it may or may not be driven by the same config item. What do 
you think?
 - All the three newly added methods in {{JvmManager}} can be package-private 
or private.
 - ProcessTree.java:
   -- A lot of refactoring. Nice!
   -- The variables SIG* and SIG*_STR can all be private, as can 
{{maybeSignalProcess()}} and {{maybeSignalProcessGroup()}}.
 - TestJobKillAndFail
   -- Are we sure "PSPermGen" will always be there in the dump? Instead, how 
about passing our own {{TaskController}} that does custom actions in 
{{TaskController.dumpStacks()}}, simplifying our verification that dump-stack 
is indeed called?
   -- The test now takes a very long time. The test time can be more than 
halved if we set max-map-attempts to one in both the tests via 
{{conf.setMaxMapAttempts(1);}}
 - We need a similar test for {{LinuxTaskController}} to test stack-dump when 
multiple users are involved. You can look at 
{{TestLocalizationWithLinuxTaskController}} and/or 
{{TestJobExecutionAsDifferentUser}} for inspiration.
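
To make the signalling-order question and the test suggestion above concrete, here is a rough, self-contained sketch. It is not code from the patch: {{SignalSender}}, {{RecordingSignalSender}} and {{TaskKiller}} are hypothetical stand-ins for the real {{TaskController}} machinery, and the two sleep parameters are assumptions that only illustrate where a separate config item could plug in. The point is that a recording stub lets a test assert that dump-stack was requested, and in what order, without searching the task output for a string such as "PSPermGen".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the component that delivers signals to a task. */
interface SignalSender {
    void send(String pid, String signal);
}

/** Test double: records the signals it is asked to send instead of delivering them. */
class RecordingSignalSender implements SignalSender {
    final List<String> sent = new ArrayList<>();

    @Override
    public void send(String pid, String signal) {
        sent.add(signal);
    }
}

/** Sketch of the SIGQUIT -> SIGTERM -> SIGKILL escalation under discussion. */
class TaskKiller {
    private final SignalSender sender;
    private final long sleepAfterDumpMs;   // assumption: may be 0 if no sleep is needed
    private final long sleepBeforeKillMs;  // assumption: delay between TERM and KILL

    TaskKiller(SignalSender sender, long sleepAfterDumpMs, long sleepBeforeKillMs) {
        this.sender = sender;
        this.sleepAfterDumpMs = sleepAfterDumpMs;
        this.sleepBeforeKillMs = sleepBeforeKillMs;
    }

    /** Requests a stack dump, then asks the task to terminate, then force-kills it. */
    void dumpStackAndKill(String pid) {
        sender.send(pid, "QUIT");
        pause(sleepAfterDumpMs);
        sender.send(pid, "TERM");
        pause(sleepBeforeKillMs);
        sender.send(pid, "KILL");
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller and continue escalating.
            Thread.currentThread().interrupt();
        }
    }
}
```

A test would construct {{TaskKiller}} with a {{RecordingSignalSender}} and zero sleeps, call {{dumpStackAndKill()}}, and assert on the recorded list. Whether the first sleep is needed at all is exactly the open question in the comment above.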

> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, 
> MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow 
> gather a stack dump for the task. This could be done either by sending a 
> SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to 
> gather the stack directly from Java. This may be somewhat tricky since the 
> child may be running as another user (so the SIGQUIT would have to go through 
> LinuxTaskController). This feature would make debugging these kinds of 
> failures much easier, especially if we could somehow get it into the 
> TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
