[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287558#comment-13287558
 ] 

Daryn Sharp commented on MAPREDUCE-4302:
----------------------------------------

For a little background, the problem was first detected because of an NN token 
issue.  The NMs all went down because log aggregation init failed to connect to 
the NN to create its log dirs.  When the NMs were restarted, they all went down 
again because the AMs were retrying the tasks.  The problem was also induced by 
restricting permissions on the log dir and stopping the NN.
                
> NM goes down if error encountered during log aggregation
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-4302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.0, 2.0.0-alpha, trunk
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: MAPREDUCE-4302.patch
>
>
> When a container launch request is sent to the NM, if _any_ exception occurs 
> during the init of log aggregation then the NM goes down.  The problem can be 
> induced by situations including, but certainly not limited to: transient RPC 
> connection issues, missing tokens, expired tokens, permission problems, a full 
> or over-quota DFS, etc.  The problem may occur with or without security enabled.
> The ramification is that an entire cluster can rather easily be brought down, 
> whether maliciously, accidentally, or via a submission bug.
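
To illustrate the failure mode described above, here is a minimal sketch of 
containing an exception raised during log aggregation init so that it affects 
only the application being initialized rather than the whole daemon.  This is 
not the attached patch and does not use the NodeManager's real classes: 
LogAggregationSketch, LogDirCreator, createAppLogDir, initApp and getFailedApps 
are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: these are NOT Hadoop's classes or the actual fix.
class LogAggregationSketch {

    /** Stand-in for whatever creates the per-application remote log dir. */
    interface LogDirCreator {
        void createAppLogDir(String appId) throws IOException;
    }

    private final LogDirCreator creator;
    private final List<String> failedApps = new ArrayList<>();

    LogAggregationSketch(LogDirCreator creator) {
        this.creator = creator;
    }

    /**
     * Tries to set up log aggregation for one application.
     * Returns true on success, false if init failed for this app.
     */
    boolean initApp(String appId) {
        try {
            creator.createAppLogDir(appId);
            return true;
        } catch (Exception e) {
            // Contain the failure: record it against this application and
            // keep running, instead of letting the exception propagate up
            // and take the whole daemon down.
            failedApps.add(appId);
            return false;
        }
    }

    /** Applications whose log aggregation init failed. */
    List<String> getFailedApps() {
        return failedApps;
    }
}
```

In this sketch a permissions error or an unreachable NN would surface as an 
exception from createAppLogDir, and initApp reports it as a per-application 
failure.  What the caller should then do (fail the container, or run it without 
aggregation) is a design choice this sketch leaves open.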


        
