Wei-Chiu Chuang created HDFS-9619:
-------------------------------------
Summary: DataNode sometimes can not find blockpool for the correct
namenode
Key: HDFS-9619
URL: https://issues.apache.org/jira/browse/HDFS-9619
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 3.0.0
Environment: Jenkins
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
We sometimes see TestBalancerWithMultipleNameNodes.testBalancer fail to
replicate a file because a datanode is excluded.
{noformat}
File /tmp.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}
Relevant logs suggest the root cause is that the block pool could not be found.
{noformat}
2016-01-03 22:11:43,174 [DataXceiver for client DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47318 dst: /127.0.0.1:49997
java.io.IOException: Non existent blockpool BP-1927700312-172.26.2.1-1451887902222
at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
at java.lang.Thread.run(Thread.java:745)
{noformat}
For a bit more context, this test starts a cluster with two namenodes and one
datanode. Both block pools are added to the datanode, but one of them cannot be
found after being added. The root cause is undetected concurrent access to a
HashMap in SimulatedFSDataset. The fix would be to use a thread-safe class
instead, such as ConcurrentHashMap.
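As a rough illustration only (this is not the actual SimulatedFSDataset code; the class name, field names, and value types below are simplified placeholders), the proposed change would look roughly like this:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: replaces an unsynchronized HashMap, keyed by block pool
 * ID, with a ConcurrentHashMap so that a block pool added by one namenode's
 * heartbeat thread is reliably visible to DataXceiver threads serving the
 * other namenode. The real SimulatedFSDataset fields and types differ.
 */
public class BlockPoolMapSketch {
  // Before (racy): a plain HashMap mutated from multiple threads.
  // After: a thread-safe map keyed by block pool ID.
  private final Map<String, Map<Long, byte[]>> blockMap =
      new ConcurrentHashMap<>();

  /** Register a block pool; safe to call concurrently for different namenodes. */
  void addBlockPool(String bpid) {
    blockMap.putIfAbsent(bpid, new ConcurrentHashMap<Long, byte[]>());
  }

  /** Look up the per-pool map; fails like the stack trace above if missing. */
  Map<Long, byte[]> getMap(String bpid) throws java.io.IOException {
    Map<Long, byte[]> map = blockMap.get(bpid);
    if (map == null) {
      throw new java.io.IOException("Non existent blockpool " + bpid);
    }
    return map;
  }
}
{code}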