[ 
https://issues.apache.org/jira/browse/HDFS-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202341#comment-16202341
 ] 

Chen Liang commented on HDFS-12415:
-----------------------------------

I looked into this a little bit too. What seems to be happening is that 
{{SCMCommonPolicy#chooseDatanodes}} calls 
{{nodeManager.getNodes(OzoneProtos.NodeState.HEALTHY);}}, but the returned list 
contains a {{null}} datanode id entry, so the 
{{hasEnoughSpace(d, sizeRequired)}} call on the null {{d}} fails with an NPE. 
The list with the null entry comes from {{SCMNodeManager#getNodes}}, where 
there seems to be some datanode id in {{healthyNodes}} that is not present in 
the {{nodes}} map.
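
To illustrate the failure mode, here is a minimal sketch of how resolving ids through a map can produce a null entry. It is not the actual {{SCMNodeManager}} code: the class {{NullEntrySketch}}, the methods {{resolve}} and {{resolveNonNull}}, and the use of plain strings for datanode ids are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class NullEntrySketch {
  // Hypothetical stand-in for getNodes: resolve each healthy id through the
  // nodes map. Map#get returns null for a missing id, so the result can
  // carry a null entry.
  public static List<String> resolve(Map<String, String> nodes,
                                     List<String> healthyIds) {
    return healthyIds.stream().map(nodes::get).collect(Collectors.toList());
  }

  // Defensive variant: drop the ids that could not be resolved.
  public static List<String> resolveNonNull(Map<String, String> nodes,
                                            List<String> healthyIds) {
    return healthyIds.stream()
        .map(nodes::get)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, String> nodes = new HashMap<>();
    nodes.put("dn-1", "datanode-1");
    // "dn-2" is tracked as healthy but was never stored in nodes.
    List<String> healthyIds = Arrays.asList("dn-1", "dn-2");

    System.out.println(resolve(nodes, healthyIds));        // second entry is null
    System.out.println(resolveNonNull(nodes, healthyIds)); // only datanode-1
  }
}
```

Any caller that dereferences the second entry of the first list would hit an NPE, which matches the stack trace in the description. Filtering nulls would only hide the symptom; the inconsistency between the two maps would still be there.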

I don't see how a datanode id could be present in {{healthyNodes}} but not in 
{{nodes}}, because the first thing register does is always to add that 
datanode to {{nodes}}, before {{healthyNodes}}. The only explanation I can 
think of is the one [~msingh] mentioned: it is probably due to some unexpected 
race condition behaviour when two register calls happen and modify the HashMap 
{{nodes}} at the same time. So I would +1 on Mukul's change. Additionally, I 
ran {{TestXceiverClientManager}} several tens of times with the v005 patch 
applied, and the test did not fail.
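
For reference, one common way to rule out this kind of race is to back both maps with a thread-safe map. The sketch below shows that general technique only and is not necessarily what the patch does. The class {{RegisterSketch}} and its methods are hypothetical, and only the ordering (add to {{nodes}} first, then to {{healthyNodes}}) is taken from the description above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RegisterSketch {
  // Hypothetical stand-ins for the SCM maps. ConcurrentHashMap keeps
  // concurrent puts from being lost, which a plain HashMap does not guarantee.
  private final Map<String, String> nodes = new ConcurrentHashMap<>();
  private final Map<String, Long> healthyNodes = new ConcurrentHashMap<>();

  // Same ordering as described above: nodes first, then healthyNodes.
  public void register(String id) {
    nodes.put(id, "datanode-" + id);
    healthyNodes.put(id, System.currentTimeMillis());
  }

  // True if some healthy id has no entry in nodes.
  public boolean hasDanglingHealthyId() {
    return healthyNodes.keySet().stream().anyMatch(id -> !nodes.containsKey(id));
  }

  public int size() {
    return nodes.size();
  }

  // Run two registering threads side by side and wait for both to finish.
  public static RegisterSketch registerConcurrently(int perThread) {
    RegisterSketch m = new RegisterSketch();
    Thread a = new Thread(() -> {
      for (int i = 0; i < perThread; i++) m.register("a" + i);
    });
    Thread b = new Thread(() -> {
      for (int i = 0; i < perThread; i++) m.register("b" + i);
    });
    a.start();
    b.start();
    try {
      a.join();
      b.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return m;
  }

  public static void main(String[] args) {
    RegisterSketch m = registerConcurrently(10_000);
    System.out.println(m.size() + " registered, dangling=" + m.hasDanglingHealthyId());
  }
}
```

With thread-safe maps every registered id ends up in both maps. Swapping in a plain {{HashMap}} makes the outcome undefined under concurrent writes, which is consistent with the occasional failures seen in the test.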

> Ozone: TestXceiverClientManager and TestAllocateContainer occasionally fails
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-12415
>                 URL: https://issues.apache.org/jira/browse/HDFS-12415
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7240
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>         Attachments: HDFS-12415-HDFS-7240.001.patch, 
> HDFS-12415-HDFS-7240.002.patch, HDFS-12415-HDFS-7240.003.patch, 
> HDFS-12415-HDFS-7240.004.patch, HDFS-12415-HDFS-7240.005.patch
>
>
> TestXceiverClientManager seems to be occasionally failing in some jenkins 
> jobs,
> {noformat}
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.ozone.scm.node.SCMNodeManager.getNodeStat(SCMNodeManager.java:828)
>  at 
> org.apache.hadoop.ozone.scm.container.placement.algorithms.SCMCommonPolicy.hasEnoughSpace(SCMCommonPolicy.java:147)
>  at 
> org.apache.hadoop.ozone.scm.container.placement.algorithms.SCMCommonPolicy.lambda$chooseDatanodes$0(SCMCommonPolicy.java:125)
> {noformat}
> see more from [this 
> report|https://builds.apache.org/job/PreCommit-HDFS-Build/21065/testReport/]


