[
https://issues.apache.org/jira/browse/HDFS-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247743#comment-15247743
]
Kuhu Shukla commented on HDFS-9958:
-----------------------------------
Thanks [~walter.k.su]. I looked at the test failure. I think using
'corruptReplicas' to derive numMachines may have been the wrong choice. In a case
where something leads to an inconsistent state, e.g. {{findAndMarkBlockAsCorrupt}}
adding the replica to the corruptReplicas map whether or not it is present in the
blocksMap, the WARN log shows up but should not cause an out-of-bounds exception.
IMHO, we should rely on the blocksMap as much as possible to decide the size of the
array, which means fixing {{countNodes().corruptReplicas()}} to return the count
after going through all storages (irrespective of their state) that hold corrupt
replicas, since {{numNodes()}} is storage-state agnostic.
Would appreciate your comments on this and please correct me if I am missing
something here. Thanks a lot!
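To make the idea concrete, here is a rough sketch of the direction I have in mind
(illustration only, not a patch; the helper name is hypothetical and the real
iteration inside {{countNodes()}} may differ): count a storage as corrupt whenever
its node is in the corruptReplicas map, regardless of storage state, so the result
lines up with what {{blocksMap.numNodes(blk)}} counts.
{code}
// Hypothetical helper (inside BlockManager), for illustration only: count corrupt
// replicas across ALL storages known to blocksMap (NORMAL, FAILED, ...), so the
// count stays consistent with blocksMap.numNodes(blk).
private int countCorruptAcrossAllStorages(Block blk) {
  final Collection<DatanodeDescriptor> nodesCorrupt = corruptReplicas.getNodes(blk);
  if (nodesCorrupt == null) {
    return 0;
  }
  int corrupt = 0;
  // Iterate every storage, irrespective of its state.
  for (DatanodeStorageInfo storage : blocksMap.getStorages(blk)) {
    if (nodesCorrupt.contains(storage.getDatanodeDescriptor())) {
      corrupt++;
    }
  }
  return corrupt;
}
{code}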
> BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed
> storages.
> ------------------------------------------------------------------------------------
>
> Key: HDFS-9958
> URL: https://issues.apache.org/jira/browse/HDFS-9958
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: HDFS-9958-Test-v1.txt, HDFS-9958.001.patch, HDFS-9958.002.patch
>
>
> In a scenario where the corrupt replica is on a failed storage, before it is
> taken out of blocksMap, there is a race that causes a LocatedBlock to be created
> from a {{machines}} array element that is never populated.
> Following is the root cause:
> {code}
> final int numCorruptNodes = countNodes(blk).corruptReplicas();
> {code}
> countNodes only looks at storages whose state is NORMAL, so in the case where the
> corrupt replica is on a failed storage, numCorruptNodes comes out as zero.
> {code}
> final int numNodes = blocksMap.numNodes(blk);
> {code}
> However, numNodes counts all nodes/storages irrespective of the storage state,
> so numMachines includes such (failed) nodes. The assert fails only if the JVM is
> run with assertions enabled; otherwise the code goes ahead and tries to build a
> LocatedBlock from a slot of the {{machines}} array that was never populated.
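> The mismatch can be sketched roughly as follows (an annotated paraphrase of the
> relevant createLocatedBlock logic, not the exact source):
> {code}
> // numCorruptNodes comes from countNodes(), which only walks NORMAL storages,
> // so a corrupt replica sitting on a FAILED storage is not counted here.
> final int numCorruptNodes = countNodes(blk).corruptReplicas();   // 0 in this scenario
> // numNodes counts every storage in blocksMap, FAILED ones included.
> final int numNodes = blocksMap.numNodes(blk);
> final boolean isCorrupt = numCorruptNodes == numNodes;           // false
> final int numMachines = isCorrupt ? numNodes : numNodes - numCorruptNodes;
> final DatanodeStorageInfo[] machines = new DatanodeStorageInfo[numMachines];
> // The fill loop skips the replica that corruptReplicas still marks as corrupt,
> // leaving one slot of machines[] null; LocatedBlock later NPEs on that slot.
> {code}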
> Here is the stack trace:
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:45)
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:40)
> at org.apache.hadoop.hdfs.protocol.LocatedBlock.<init>(LocatedBlock.java:84)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:878)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:826)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:799)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:899)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1849)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)