[ https://issues.apache.org/jira/browse/MAPREDUCE-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270766#comment-13270766 ]
Robert Joseph Evans commented on MAPREDUCE-4233:
------------------------------------------------
@Bikas,
The situation we are in with one of our clusters is not an intermittent race.
The cluster is currently stuck in this state, although a race is also a
possibility.
I was curious about when the scheduler's node list is updated. It is updated
whenever a NODE_ADDED or a NODE_REMOVED event is sent to the scheduler.
NODE_ADDED events happen when a node registers for the first time, when a node
reconnects, and when a node's status transitions from unhealthy to healthy.
Similarly, a NODE_REMOVED event is sent when a node transitions from healthy to
unhealthy, when the node is deactivated, or when a node reconnects (it is
removed and then added back in). From that it appears that the scheduler is
intended to store only the list of healthy nodes. By contrast, the list of
nodes in the RM Context is updated when nodes register, reconnect, or
deactivate. The difference between the two is the unhealthy/healthy
transitions.
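
To make the divergence concrete, here is a minimal sketch of the two lists as
described above. It is only a toy model: the class, field, and method names
are hypothetical stand-ins, not the actual YARN classes, and the event names
appear only in comments.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical toy model of the two node lists; not the real YARN code. */
class NodeListsSketch {
    /** Stand-in for the node list held in the RM Context. */
    final Set<String> contextNodes = new LinkedHashSet<>();
    /** Stand-in for the node list held by the scheduler. */
    final Set<String> schedulerNodes = new LinkedHashSet<>();

    void register(String node) {
        contextNodes.add(node);
        schedulerNodes.add(node);      // NODE_ADDED
    }

    void reconnect(String node) {
        schedulerNodes.remove(node);   // NODE_REMOVED
        contextNodes.add(node);        // context entry is refreshed
        schedulerNodes.add(node);      // NODE_ADDED
    }

    void deactivate(String node) {
        contextNodes.remove(node);
        schedulerNodes.remove(node);   // NODE_REMOVED
    }

    void markUnhealthy(String node) {
        schedulerNodes.remove(node);   // NODE_REMOVED; the context is untouched
    }

    void markHealthy(String node) {
        schedulerNodes.add(node);      // NODE_ADDED; the context is untouched
    }

    public static void main(String[] args) {
        NodeListsSketch rm = new NodeListsSketch();
        rm.register("nm1");
        rm.register("nm2");
        rm.markUnhealthy("nm2");
        // nm2 is still known to the context but no longer to the scheduler.
        System.out.println("context:   " + rm.contextNodes);
        System.out.println("scheduler: " + rm.schedulerNodes);
    }
}
```

In this model, an unhealthy node stays in the context list while it is absent
from the scheduler list, which is the state where a scheduler lookup for that
node would come back empty.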
I did not dig much further to see if there was more of a disconnect between the
two lists, especially because the other places that access either of these
node lists check for null return values, so I assumed that it was simply a
missed null check here as well.
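
For illustration, the kind of null check I mean looks roughly like the sketch
below. This is not the attached patch and not the real RMNMInfo code: the
class, the map-based lookup, and the "usedMemoryMB" field are all hypothetical,
and skipping the node (rather than reporting partial information) is just one
possible choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins; not the actual RMNMInfo code or the attached patch. */
class LiveNodeReportSketch {
    /**
     * Stand-in for a scheduler lookup: returns null when the scheduler
     * does not track the node (for example, while it is unhealthy).
     */
    static Integer getNodeReport(Map<String, Integer> schedulerReports, String node) {
        return schedulerReports.get(node);
    }

    /** Builds one summary line per context node that the scheduler also knows. */
    static List<String> liveNodeSummaries(List<String> contextNodes,
                                          Map<String, Integer> schedulerReports) {
        List<String> summaries = new ArrayList<>();
        for (String node : contextNodes) {
            Integer usedMemoryMb = getNodeReport(schedulerReports, node);
            if (usedMemoryMb == null) {
                // The context knows the node but the scheduler does not:
                // skip it instead of dereferencing a null report.
                continue;
            }
            summaries.add(node + " usedMemoryMB=" + usedMemoryMb);
        }
        return summaries;
    }
}
```

With a context list of nm1 and nm2 and a scheduler that only has a report for
nm1, this returns a single entry for nm1 instead of throwing.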
> NPE can happen in RMNMNodeInfo.
> -------------------------------
>
> Key: MAPREDUCE-4233
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4233
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.23.3
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
> Priority: Critical
> Attachments: MR-4233.txt
>
>
> {noformat}
> Caused by: java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo.getLiveNodeManagers(RMNMInfo.java:96)
> at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
> at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
> at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
> at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
> at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
> at javax.management.StandardMBean.getAttribute(StandardMBean.java:358)
> at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
> {noformat}
> Looks like rmcontext.getRMNodes() is not kept in sync with
> scheduler.getNodeReport(), so the report can be null even though the
> context still knows about the node.
> The simple fix is to add in a null check.