[
https://issues.apache.org/jira/browse/HDFS-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz Wo Nicholas Sze updated HDFS-9361:
--------------------------------------
Component/s: (was: HDFS)
namenode
> Default block placement policy causes TestReplaceDatanodeOnFailure to fail
> intermittently
> -----------------------------------------------------------------------------------------
>
> Key: HDFS-9361
> URL: https://issues.apache.org/jira/browse/HDFS-9361
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Wei-Chiu Chuang
>
> TestReplaceDatanodeOnFailure sometimes fails (see HDFS-6101).
> (For background: the test case sets up a cluster with three data nodes, adds
> two more data nodes, removes one data node, and verifies that clients can
> correctly recover from the failure and set up three replicas.)
> I traced it down and found that sometimes a client sets up a pipeline with
> only two data nodes, one fewer than configured in the test case, even though
> the test case is configured to always replace failed nodes.
> Digging into the log, I saw:
> {noformat}
> 2015-11-02 12:07:38,634 [IPC Server handler 8 on 50673] WARN
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseTarget(355)) - Failed to place enough
> replicas, still in need of 1 to reach 3 (unavailableStorages=[],
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
> [
> Node /rack0/127.0.0.1:32931 [
> Datanode 127.0.0.1:32931 is not chosen since the rack has too many chosen
> nodes.
> ]
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:723)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:624)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:429)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:342)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:220)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:105)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:120)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1727)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2457)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:796)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> So from the log, it appears the placement policy causes pipeline selection to
> give up on that data node.
> I wonder whether this is appropriate or not. If the load factor exceeds a
> certain threshold but the file still has fewer replicas than configured,
> should the policy accept the pipeline as is, or should it attempt to acquire
> more replicas?
> I am filing this JIRA for discussion. I am not very familiar with block
> placement, so my hypothesis may be wrong.
> (Edit: I turned on the DEBUG option for Log4j and changed the logging message
> a bit to make it show the stack trace.)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)