[ 
https://issues.apache.org/jira/browse/HDFS-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9361:
----------------------------------
    Description: 
TestReplaceDatanodeOnFailure sometimes fails (see HDFS-6101).
(For background, the test case sets up a cluster with three datanodes, adds two more datanodes, removes one datanode, and verifies that clients correctly recover from the failure and end up with three replicas.)
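(For readers unfamiliar with the test, here is a rough sketch of that scenario using MiniDFSCluster. This is not the actual test code; the file-writing steps are elided and the class/method wrapper is mine.)
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Rough sketch of the test scenario; not the actual test code.
public class ReplaceDatanodeScenarioSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(3)        // start with three datanodes
        .build();
    try {
      // ... start writing a file with replication = 3 ...
      cluster.startDataNodes(conf, 2, true, null, null); // add two datanodes
      cluster.stopDataNode(0);                           // remove one datanode
      // ... keep writing; the client should rebuild the write pipeline so
      // that the file ends up with three replicas ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}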

I traced it down and found that sometimes a client sets up a pipeline with only two datanodes, one fewer than the test case configures, even though the test case is configured to always replace failed nodes.
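(That "always replace" behavior is governed by the client-side replace-datanode-on-failure settings. Below is a minimal sketch of such a configuration; I am assuming the test sets the ALWAYS policy via these standard HDFS client keys.)
{code:java}
import org.apache.hadoop.conf.Configuration;

// Client-side pipeline-recovery settings (standard HDFS client keys).
Configuration conf = new Configuration();
conf.setBoolean(
    "dfs.client.block.write.replace-datanode-on-failure.enable", true);
// ALWAYS: whenever a datanode in the write pipeline fails, the client
// must find a replacement datanode (unlike DEFAULT or NEVER).
conf.set(
    "dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");
{code}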

Digging into the log, I saw:
{noformat}
2015-11-02 12:07:38,634 [IPC Server handler 8 on 50673] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(355)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException: [
Node /rack0/127.0.0.1:32931 [
  Datanode 127.0.0.1:32931 is not chosen since the rack has too many chosen nodes .
]
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:723)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:624)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:429)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:342)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:220)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:105)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:120)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1727)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2457)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:796)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}

So from the log, it seems the placement policy causes pipeline recovery to give up on that datanode.
I wonder whether this is appropriate. If the load factor exceeds a certain threshold, but the file still has fewer replicas than required, should the client accept the pipeline as is, or should it attempt to acquire more replicas?
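(For context, the "rack has too many chosen nodes" message appears to come from the per-rack limit check in BlockPlacementPolicyDefault#isGoodTarget. The sketch below is paraphrased from memory of the branch-2 sources, not verbatim; treat the exact formula and identifiers as approximations.)
{code:java}
// Paraphrased sketch of the rack-limit check in
// BlockPlacementPolicyDefault#isGoodTarget; not verbatim code.
// maxTargetPerRack is derived from something like:
//   maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;
// e.g. 2 racks, 3 replicas: (3 - 1) / 2 + 2 = 3 chosen nodes allowed per rack.
String rackName = node.getNetworkLocation();
int counter = 1;                          // count the candidate itself
for (DatanodeStorageInfo resultStorage : results) {
  if (rackName.equals(
      resultStorage.getDatanodeDescriptor().getNetworkLocation())) {
    counter++;                            // nodes already chosen on this rack
  }
}
if (counter > maxTargetPerRack) {
  // This rejection produces the "rack has too many chosen nodes" line above.
  logNodeIsNotChosen(storage, "the rack has too many chosen nodes ");
  return false;
}
{code}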

I am filing this JIRA for discussion. I am not very familiar with block placement, so my hypothesis may be wrong.

  was:
TestReplaceDatanodeOnFailure sometimes fails (see HDFS-6101).
(For background, the test case sets up a cluster with three datanodes, adds two more datanodes, removes one datanode, and verifies that clients correctly recover from the failure and end up with three replicas.)

I traced it down and found that sometimes a client sets up a pipeline with only two datanodes, one fewer than the test case configures, even though the test case is configured to always replace failed nodes.

Digging into the log, I saw:
{noformat}
2015-11-02 12:07:38,634 [IPC Server handler 8 on 50673] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(355)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException: [
Node /rack0/127.0.0.1:32931 [
  Datanode 127.0.0.1:32931 is not chosen since the rack has too many chosen nodes .
]
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:723)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:624)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:429)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:342)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:220)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:105)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:120)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1727)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2457)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:796)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}

So from the log, I wonder whether this is appropriate. If the load factor exceeds a certain threshold, but the file still has fewer replicas than required, should the client accept the pipeline as is, or should it attempt to acquire more replicas?

I am filing this JIRA for discussion.


> Default block placement policy causes TestReplaceDatanodeOnFailure to fail 
> intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-9361
>                 URL: https://issues.apache.org/jira/browse/HDFS-9361
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: HDFS
>            Reporter: Wei-Chiu Chuang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
