[ https://issues.apache.org/jira/browse/MAPREDUCE-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778288#action_12778288 ]

Vinod K V commented on MAPREDUCE-1119:
--------------------------------------

The patch looks very clean now! Thanks! It is very close; I have only a few comments on the latest patch, most of them minor:

 - Care to explain the changes to {{src/c++/task-controller/main.c}} w.r.t. conf_dir_len? Both for my confirmation and for the record.
 - Change the C comments for {{kill_user_task()}} in {{src/c++/task-controller/task-controller.c}} to mention that it can terminate/kill or dump-stack?
 - Now that the semantics have changed, I am not very sure we want to use the same configuration property for sleeping after dump-stack. (Thinking aloud..) Do we even need a sleep here? The signalling order is SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their arrival? If so, then we will not need another sleep. If not, we may need a sleep here, but it may or may not be driven by the same config item. What do you think?
 - All three newly added methods in {{JvmManager}} can be package-private or private.
 - ProcessTree.java:
 -- Lots of refactoring. Nice!
 -- The variables SIG* and SIG*_STR can all be private, and so can {{maybeSignalProcess()}} and {{maybeSignalProcessGroup()}}.
 - TestJobKillAndFail
 -- Are we sure "PSPermGen" will always be there in the dump? Instead, how about passing our own {{TaskController}} that does custom actions in {{TaskController.dumpStacks()}}, simplifying our verification that dump-stack is indeed called?
 -- The test now takes a very long time. The test time can be more than halved if we set max-map-attempts to one in both the tests via {{conf.setMaxMapAttempts(1);}}
 - We need a similar test for {{LinuxTaskController}} to test stack-dump when multiple users are involved. You can look at {{TestLocalizationWithLinuxTaskController}} and/or {{TestJobExecutionAsDifferentUser}} for inspiration.
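
To make the sleep question and the test suggestion concrete, here is a minimal sketch, not taken from the patch. It assumes the pause after SIGQUIT is driven by its own value rather than the existing kill delay, and it uses a recording fake to check that the dump request comes before the kill signals. As far as I know, POSIX leaves the delivery order of several pending standard signals unspecified, which is why the sketch keeps an explicit pause. All names here ({{SignalSequenceSketch}}, {{Signaller}}, {{Sleeper}}, {{Recorder}}, {{dumpThenKill}}) are hypothetical and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Hadoop. */
public class SignalSequenceSketch {

    /** Stand-in for whatever actually delivers a signal to the task. */
    interface Signaller {
        void signal(String pid, String signalName);
    }

    /** Stand-in for Thread.sleep so a test does not have to wait. */
    interface Sleeper {
        void sleep(long millis);
    }

    /**
     * Sends SIGQUIT, pauses, sends SIGTERM, pauses, then sends SIGKILL.
     * The two pauses are separate parameters, i.e. the sketch assumes
     * they are NOT driven by the same configuration item.
     */
    static void dumpThenKill(String pid, long sleepAfterDumpMs,
                             long sleepBeforeKillMs,
                             Signaller signaller, Sleeper sleeper) {
        signaller.signal(pid, "SIGQUIT");   // ask the JVM for a stack dump
        sleeper.sleep(sleepAfterDumpMs);    // give the dump time to be written
        signaller.signal(pid, "SIGTERM");   // polite termination request
        sleeper.sleep(sleepBeforeKillMs);   // grace period before the hard kill
        signaller.signal(pid, "SIGKILL");   // forced kill
    }

    /** Recording fake: remembers every signal and sleep, in call order. */
    static class Recorder implements Signaller, Sleeper {
        final List<String> events = new ArrayList<String>();

        public void signal(String pid, String signalName) {
            events.add(signalName + ":" + pid);
        }

        public void sleep(long millis) {
            events.add("sleep:" + millis);
        }
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        dumpThenKill("1234", 250L, 5000L, recorder, recorder);
        // prints [SIGQUIT:1234, sleep:250, SIGTERM:1234, sleep:5000, SIGKILL:1234]
        System.out.println(recorder.events);
    }
}
```

A test along these lines checks that dump-stack was requested by inspecting what the fake recorded, instead of searching the task output for a string such as "PSPermGen".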

> When tasks fail to report status, show tasks's stack dump before killing
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1119
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1119
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1119.2.patch, MAPREDUCE-1119.3.patch, MAPREDUCE-1119.4.patch, MAPREDUCE-1119.5.patch, MAPREDUCE-1119.patch
>
>
> When the TT kills tasks that haven't reported status, it should somehow gather a stack dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in stdout) or perhaps something like JDI to gather the stack directly from Java. This may be somewhat tricky since the child may be running as another user (so the SIGQUIT would have to go through LinuxTaskController). This feature would make debugging these kinds of failures much easier, especially if we could somehow get it into the TaskDiagnostic message

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.