Wei-Chiu Chuang created HDFS-9361:
-------------------------------------
Summary: Default block placement policy causes
TestReplaceDatanodeOnFailure to fail intermittently
Key: HDFS-9361
URL: https://issues.apache.org/jira/browse/HDFS-9361
Project: Hadoop HDFS
Issue Type: Improvement
Components: HDFS
Reporter: Wei-Chiu Chuang
TestReplaceDatanodeOnFailure sometimes fails (see HDFS-6101).
(For background: the test case sets up a cluster with three data nodes, adds
two more data nodes, removes one data node, and verifies that the client
correctly recovers from the failure and re-establishes three replicas.)
I traced it down and found that the client sometimes sets up a pipeline with
only two data nodes, one fewer than configured in the test case, even though
the test case is configured to always replace failed nodes.
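(For reference, the "always replace failed nodes" behavior the test relies on is controlled by the client-side replace-datanode-on-failure properties. The fragment below is an illustrative hdfs-site.xml sketch of that configuration, not copied from the test source:)

{code:xml}
<!-- Illustrative fragment: client-side knobs that make the writer replace
     a failed datanode in the write pipeline. The ALWAYS policy is what the
     test scenario implies; values are not copied from the test itself. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
{code}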
Digging into the log, I saw:
{noformat}
2015-11-02 12:07:38,634 [IPC Server handler 8 on 50673] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(355)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException: [
Node /rack0/127.0.0.1:32931 [
  Datanode 127.0.0.1:32931 is not chosen since the rack has too many chosen nodes.
]
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:723)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:624)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:429)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:342)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:220)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:105)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:120)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1727)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2457)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:796)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}
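The "rack has too many chosen nodes" rejection comes from the per-rack cap that BlockPlacementPolicyDefault enforces when choosing targets. As a rough standalone sketch of that rule (the formula mirrors getMaxNodesPerRack as of Hadoop 2.x, but this is a hypothetical reproduction, not the actual Hadoop source, and exact details vary by version):

{code:java}
// Hypothetical standalone sketch of the per-rack replica cap used by
// BlockPlacementPolicyDefault; not the actual Hadoop source.
public class MaxNodesPerRackSketch {

    // totalReplicas: replicas already chosen plus replicas still needed.
    // The policy spreads replicas across racks, allowing a small surplus
    // of 2 on any single rack; a rack at the cap rejects further nodes
    // with "the rack has too many chosen nodes".
    static int maxNodesPerRack(int totalReplicas, int numRacks) {
        return (totalReplicas - 1) / numRacks + 2;
    }

    public static void main(String[] args) {
        // Single-rack cluster, replication 3: cap is (3-1)/1 + 2 = 4.
        System.out.println(maxNodesPerRack(3, 1)); // prints 4
        // Two racks, replication 3: cap is (3-1)/2 + 2 = 3.
        System.out.println(maxNodesPerRack(3, 2)); // prints 3
    }
}
{code}

Whether this cap alone should cause a pipeline to be accepted with fewer replicas than configured is exactly the question below.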
From the log, I wonder whether this behavior is appropriate. If the load
factor exceeds a certain threshold but the file still has fewer replicas than
configured, should the shortfall be accepted as is, or should the client
attempt to acquire more replicas?
I am filing this JIRA for discussion.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)