[ https://issues.apache.org/jira/browse/HDFS-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262694#comment-15262694 ]
Brahma Reddy Battula commented on HDFS-9958:
--------------------------------------------

I think we can fix this in a simple way, where we include HDFS-10343 here as well:

{code}
-    final DatanodeStorageInfo[] machines = new DatanodeStorageInfo[numMachines];
+    List<DatanodeStorageInfo> machinesList = new ArrayList<>(numMachines);
     final byte[] blockIndices = blk.isStriped() ? new byte[numMachines] : null;
     int j = 0, i = 0;
     if (numMachines > 0) {
@@ -1048,7 +1049,9 @@ private LocatedBlock createLocatedBlock(final BlockInfo blk, final long pos)
         final DatanodeDescriptor d = storage.getDatanodeDescriptor();
         final boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blk, d);
         if (isCorrupt || (!replicaCorrupt)) {
-          machines[j++] = storage;
+          j++;
+          machinesList.add(storage);
           // TODO this can be more efficient
           if (blockIndices != null) {
             byte index = ((BlockInfoStriped) blk).getStorageBlockIndex(storage);
@@ -1058,6 +1061,7 @@ private LocatedBlock createLocatedBlock(final BlockInfo blk, final long pos)
           }
         }
       }
+    final DatanodeStorageInfo[] machines = machinesList.toArray(new DatanodeStorageInfo[j]);
{code}

Correct me if I am wrong, thanks.

> BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed
> storages.
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-9958
>                 URL: https://issues.apache.org/jira/browse/HDFS-9958
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: HDFS-9958-Test-v1.txt, HDFS-9958.001.patch,
>                      HDFS-9958.002.patch, HDFS-9958.003.patch,
>                      HDFS-9958.004.patch, HDFS-9958.005.patch
>
> In a scenario where the corrupt replica is on a failed storage, before it is
> taken out of blocksMap, there is a race which causes the creation of a
> LocatedBlock on a {{machines}} array element that is not populated.
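The idea in the diff above can be sketched in isolation: sizing a fixed array from a raw storage count leaves null slots when some storages are filtered out, while collecting into a list and converting at the end cannot. The {{Storage}} type below is a hypothetical stand-in for DatanodeStorageInfo, not the real HDFS class:

```java
import java.util.ArrayList;
import java.util.List;

public class LocatedBlockSketch {

    // Hypothetical stand-in for DatanodeStorageInfo; 'failed' marks a storage
    // that must be skipped (analogous to a corrupt replica on failed storage).
    record Storage(String name, boolean failed) {}

    // Buggy pattern: array sized from the raw count. Skipped storages leave
    // trailing null elements, which is the source of the NPE.
    static Storage[] collectIntoArray(List<Storage> all) {
        Storage[] machines = new Storage[all.size()];
        int j = 0;
        for (Storage s : all) {
            if (!s.failed()) {
                machines[j++] = s;
            }
        }
        return machines; // contains nulls if any storage was skipped
    }

    // Proposed pattern: grow a list with only the accepted storages, then
    // convert to an array of exactly the right length.
    static Storage[] collectIntoList(List<Storage> all) {
        List<Storage> machinesList = new ArrayList<>(all.size());
        for (Storage s : all) {
            if (!s.failed()) {
                machinesList.add(s);
            }
        }
        return machinesList.toArray(new Storage[0]);
    }

    public static void main(String[] args) {
        List<Storage> storages = List.of(
                new Storage("dn1", false),
                new Storage("dn2", true),   // failed storage, must be skipped
                new Storage("dn3", false));

        Storage[] buggy = collectIntoArray(storages);
        Storage[] fixed = collectIntoList(storages);

        System.out.println(buggy.length);     // 3, with buggy[2] == null
        System.out.println(buggy[2] == null); // true
        System.out.println(fixed.length);     // 2, no null entries
    }
}
```

Any later consumer that iterates the buggy array and dereferences each element (as LocatedBlock's constructor does) hits the null slot.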
> Following is the root cause:
> {code}
> final int numCorruptNodes = countNodes(blk).corruptReplicas();
> {code}
> countNodes only looks at nodes whose storage state is NORMAL, which in the
> case where the corrupt replica is on a failed storage will amount to
> numCorruptNodes being zero.
> {code}
> final int numNodes = blocksMap.numNodes(blk);
> {code}
> However, numNodes counts all nodes/storages irrespective of the state of the
> storage, so numMachines will include such (failed) nodes. The assert fails
> only if the JVM is run with assertions enabled; otherwise the code goes ahead
> and tries to create a LocatedBlock object for an entry that was never put in
> the {{machines}} array.
> Here is the stack trace:
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:45)
>         at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:40)
>         at org.apache.hadoop.hdfs.protocol.LocatedBlock.<init>(LocatedBlock.java:84)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:878)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:826)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:799)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:899)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1849)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)