[
https://issues.apache.org/jira/browse/HDFS-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262694#comment-15262694
]
Brahma Reddy Battula commented on HDFS-9958:
--------------------------------------------
I think we can fix this in a simple way, where we include the HDFS-10343 change here as well..
{code}
-    final DatanodeStorageInfo[] machines = new DatanodeStorageInfo[numMachines];
+    //final DatanodeStorageInfo[] machines = new DatanodeStorageInfo[numMachines];
+    List<DatanodeStorageInfo> machinesList = new ArrayList<>(numMachines);
     final byte[] blockIndices = blk.isStriped() ? new byte[numMachines] : null;
     int j = 0, i = 0;
     if (numMachines > 0) {
@@ -1048,7 +1049,9 @@ private LocatedBlock createLocatedBlock(final BlockInfo blk, final long pos)
         final DatanodeDescriptor d = storage.getDatanodeDescriptor();
         final boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blk, d);
         if (isCorrupt || (!replicaCorrupt)) {
-          machines[j++] = storage;
+          //machines[j++] = storage;
+          j++;
+          machinesList.add(storage);
           // TODO this can be more efficient
           if (blockIndices != null) {
             byte index = ((BlockInfoStriped) blk).getStorageBlockIndex(storage);
@@ -1058,6 +1061,7 @@ private LocatedBlock createLocatedBlock(final BlockInfo blk, final long pos)
           }
         }
       }
+    final DatanodeStorageInfo[] machines = machinesList.toArray(new DatanodeStorageInfo[j]);
{code}
correct me if I am wrong.. thanks..
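To make the idea concrete, here is a minimal, self-contained sketch of the pattern the patch proposes (stand-in {{Storage}} class and method names, not the real HDFS types): collecting into a growable list and calling {{toArray}} yields an array sized exactly to the entries that passed the check, while the fixed-size array can be left with null slots.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the HDFS code path, for illustration only.
public class LocatedBlockSketch {

    // Stand-in for DatanodeStorageInfo; only a "skip this one" flag matters here.
    static class Storage {
        final boolean failed;
        Storage(boolean failed) { this.failed = failed; }
    }

    // Old approach: array sized by the total node count; skipped storages
    // leave trailing null slots behind.
    static Storage[] fixedArray(Storage[] all) {
        Storage[] machines = new Storage[all.length];
        int j = 0;
        for (Storage s : all) {
            if (!s.failed) {
                machines[j++] = s;
            }
        }
        return machines; // may contain nulls -> later NPE when dereferenced
    }

    // Proposed approach: collect into a list, then size the array exactly.
    static Storage[] listBased(Storage[] all) {
        List<Storage> machinesList = new ArrayList<>(all.length);
        for (Storage s : all) {
            if (!s.failed) {
                machinesList.add(s);
            }
        }
        return machinesList.toArray(new Storage[0]); // no null slots
    }

    public static void main(String[] args) {
        Storage[] all = { new Storage(false), new Storage(true) };
        // Fixed array keeps a null in slot 1; list-based array has length 1.
        System.out.println(fixedArray(all)[1] == null); // true
        System.out.println(listBased(all).length);      // 1
    }
}
```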
> BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.
> ------------------------------------------------------------------------------------
>
> Key: HDFS-9958
> URL: https://issues.apache.org/jira/browse/HDFS-9958
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: HDFS-9958-Test-v1.txt, HDFS-9958.001.patch,
> HDFS-9958.002.patch, HDFS-9958.003.patch, HDFS-9958.004.patch,
> HDFS-9958.005.patch
>
>
> In a scenario where the corrupt replica is on a failed storage, before it is
> taken out of blocksMap, there is a race which causes the creation of a
> LocatedBlock from a {{machines}} array element that is not populated.
> Following is the root cause,
> {code}
> final int numCorruptNodes = countNodes(blk).corruptReplicas();
> {code}
> countNodes only looks at nodes whose storage state is NORMAL, which, in the
> case where the corrupt replica is on a failed storage, means
> numCorruptNodes comes out as zero.
> {code}
> final int numNodes = blocksMap.numNodes(blk);
> {code}
> However, numNodes counts all nodes/storages irrespective of the state of
> the storage, so numMachines will include such (failed) nodes. The assert
> fails only when the JVM runs with assertions enabled; otherwise the code
> goes ahead and tries to create a LocatedBlock from an element that was
> never put in the {{machines}} array.
> Here is the stack trace:
> {code}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:45)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:40)
> 	at org.apache.hadoop.hdfs.protocol.LocatedBlock.<init>(LocatedBlock.java:84)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:878)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:826)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:799)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:899)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1849)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}
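As a self-contained illustration of the race described above (stand-in types and method names, not the real HDFS classes), the following sketch reproduces the count mismatch: the countNodes analogue sees only NORMAL storages, the numNodes analogue counts every storage, so the machines array ends up larger than the number of slots the fill loop actually writes.

```java
import java.util.Arrays;

// Hypothetical repro sketch of the sizing mismatch, for illustration only.
public class NpeMismatchSketch {
    enum State { NORMAL, FAILED }

    static class Storage {
        final State state;
        final boolean corrupt; // the replica on this storage is corrupt
        Storage(State state, boolean corrupt) { this.state = state; this.corrupt = corrupt; }
    }

    // Analogue of countNodes(blk).corruptReplicas(): NORMAL storages only.
    static int countCorruptOnNormal(Storage[] storages) {
        return (int) Arrays.stream(storages)
            .filter(s -> s.state == State.NORMAL && s.corrupt)
            .count();
    }

    // Analogue of the sizing plus fill loop: when the corrupt replica sits on
    // a FAILED storage, numCorruptNodes is 0 but numNodes still counts it, so
    // the array gets an extra slot the loop never fills.
    static Storage[] fillMachines(Storage[] storages) {
        int numCorruptNodes = countCorruptOnNormal(storages); // 0 in this scenario
        int numNodes = storages.length;                       // counts FAILED too
        boolean isCorrupt = numCorruptNodes != 0 && numCorruptNodes == numNodes;
        int numMachines = isCorrupt ? numNodes : numNodes - numCorruptNodes;
        Storage[] machines = new Storage[numMachines];
        int j = 0;
        for (Storage s : storages) {
            if (isCorrupt || !s.corrupt) {
                machines[j++] = s; // corrupt replica skipped, slot stays null
            }
        }
        return machines;
    }

    public static void main(String[] args) {
        // One healthy replica plus a corrupt replica on a FAILED storage.
        Storage[] storages = {
            new Storage(State.NORMAL, false),
            new Storage(State.FAILED, true),
        };
        Storage[] machines = fillMachines(storages);
        // Length 2 but only slot 0 was written; dereferencing the null slot
        // is the analogue of the NPE in toDatanodeInfos.
        System.out.println(machines.length + " " + (machines[1] == null));
    }
}
```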
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)