[jira] [Commented] (HDFS-9958) BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.

Walter Su (JIRA) Tue, 19 Apr 2016 22:53:14 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249312#comment-15249312
 ]


Walter Su commented on HDFS-9958:
---------------------------------

bq. we fix countNodes().corruptReplicas() to return the number after going thru 
all storages( irrespective of their state) that have the corruptNodes (in this 
case), since numNodes() is storage state agnostic.
I think {{countNodes(blk)}} going thru all storages is unnecessary. Also I 
think {{numMachines}} should only include NORMAL and READ_ONLY. So 
{{createLocatedBlock(..)}} going thru all storages is unnecessary.
{code}
    if (numMachines > 0) {
      for(DatanodeStorageInfo storage : blocksMap.getStorages(blk)) {
{code}

btw, which is not related to this topic, I think 
{{findAndMarkBlockAsCorrupt(..)}} shouldn't support adding blk to the map if 
the storage is not found.

ping [~jingzhao] to check if he has any comment.

> BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed 
> storages.
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-9958
>                 URL: https://issues.apache.org/jira/browse/HDFS-9958
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: HDFS-9958-Test-v1.txt, HDFS-9958.001.patch, 
> HDFS-9958.002.patch
>
>
> In a scenario where the corrupt replica is on a failed storage, before it is 
> taken out of blocksMap, there is a race which causes the creation of 
> LocatedBlock on a {{machines}} array element that is not populated. 
> Following is the root cause,
> {code}
> final int numCorruptNodes = countNodes(blk).corruptReplicas();
> {code}
> countNodes only looks at nodes with storage state as NORMAL, which in the 
> case where corrupt replica is on failed storage will amount to 
> numCorruptNodes being zero. 
> {code}
> final int numNodes = blocksMap.numNodes(blk);
> {code}
> However, numNodes will count all nodes/storages irrespective of the state of 
> the storage. Therefore numMachines will include such (failed) nodes. The 
> assert would fail only if the system is enabled to catch Assertion errors, 
> otherwise it goes ahead and tries to create LocatedBlock object for that is 
> not put in the {{machines}} array.
> Here is the stack trace:
> {code}
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:45)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:40)
>       at 
> org.apache.hadoop.hdfs.protocol.LocatedBlock.<init>(LocatedBlock.java:84)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:878)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:826)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:799)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:899)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1849)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-9958) BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.

Reply via email to