[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270766#comment-13270766
 ] 

Robert Joseph Evans commented on MAPREDUCE-4233:
------------------------------------------------

@Bikas,

One of our clusters is currently stuck in this state, so what we are seeing is 
not an intermittent race, although a race is also a possibility.

I was curious to see when the scheduler's node list is updated.  It is updated 
whenever a NODE_ADDED or NODE_REMOVED event is sent to the scheduler.  A 
NODE_ADDED event is sent when a node registers for the first time, when a node 
reconnects, and when a node's status transitions from unhealthy to healthy.  
Similarly, a NODE_REMOVED event is sent when a node transitions from healthy to 
unhealthy, when the node is deactivated, or when a node reconnects (it is 
removed and then added back in).  From that it appears that the scheduler is 
intended to store only the list of healthy nodes.  By contrast, the list of 
nodes in the RM Context is updated when nodes register, reconnect, or 
deactivate.  The difference between the two is the unhealthy/healthy 
transitions.
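
To make the divergence concrete, here is a minimal sketch of the two lists.  
It is only a model, not the ResourceManager code: the class NodeListModel, its 
methods, and the host name used with it are hypothetical.  Only the 
NODE_ADDED/NODE_REMOVED event names and the transitions come from the 
description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the two node lists; not the actual RM classes.
class NodeListModel {
    // Nodes in the RM context: changed on register, reconnect, deactivate.
    final Set<String> contextNodes = new HashSet<String>();
    // Nodes in the scheduler: changed on NODE_ADDED / NODE_REMOVED.
    final Set<String> schedulerNodes = new HashSet<String>();

    void register(String node) {
        contextNodes.add(node);
        schedulerNodes.add(node);      // NODE_ADDED
    }

    void markUnhealthy(String node) {
        schedulerNodes.remove(node);   // NODE_REMOVED; the context is untouched
    }

    void markHealthy(String node) {
        if (contextNodes.contains(node)) {
            schedulerNodes.add(node);  // NODE_ADDED
        }
    }

    void deactivate(String node) {
        contextNodes.remove(node);
        schedulerNodes.remove(node);   // NODE_REMOVED
    }

    // Nodes the context still knows about but the scheduler does not.
    Set<String> knownOnlyToContext() {
        Set<String> diff = new HashSet<String>(contextNodes);
        diff.removeAll(schedulerNodes);
        return diff;
    }
}
```

In this model, a node that registers and then becomes unhealthy shows up in 
knownOnlyToContext() until it becomes healthy again or is deactivated.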

I did not dig much further to see whether there is more of a disconnect between 
the two lists, mainly because the other places that access either of these node 
lists check for null return values, so I assumed that this was simply a missed 
null check as well.
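
As a rough illustration of the kind of null check meant here (a sketch only, 
and not necessarily what the attached patch does): the class LiveNodeSummary 
and its summarize method are hypothetical, and a plain Map stands in for the 
scheduler lookup, returning null for nodes the scheduler does not know about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for code that joins the context's node list with
// per-node reports from the scheduler.
class LiveNodeSummary {
    // contextNodes: node ids the RM context knows about.
    // schedulerReports: used memory per node as seen by the scheduler;
    // get() returns null for a node the scheduler has removed.
    static List<String> summarize(List<String> contextNodes,
                                  Map<String, Integer> schedulerReports) {
        List<String> out = new ArrayList<String>();
        for (String node : contextNodes) {
            Integer usedMemory = schedulerReports.get(node);
            if (usedMemory == null) {
                // The scheduler has no report for this node (for example,
                // it is unhealthy), so skip it instead of dereferencing null.
                continue;
            }
            out.add(node + ":" + usedMemory);
        }
        return out;
    }
}
```

Whether such a node should be skipped entirely or reported without the 
scheduler-side fields is a separate choice; the sketch only shows the guard.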
                
> NPE can happen in RMNMNodeInfo.
> -------------------------------
>
>                 Key: MAPREDUCE-4233
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4233
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4233.txt
>
>
> {noformat}
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo.getLiveNodeManagers(RMNMInfo.java:96)
>         at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
>         at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
>         at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
>         at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
>         at javax.management.StandardMBean.getAttribute(StandardMBean.java:358)
>         at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
> {noformat}
> It looks like rmcontext.getRMNodes() is not kept in sync with 
> scheduler.getNodeReport(), so the report can be null even though the 
> context still knows about the node.
> The simple fix is to add a null check.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
