[
https://issues.apache.org/jira/browse/HDFS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinayakumar B updated HDFS-9619:
--------------------------------
    Summary: SimulatedFSDataset sometimes can not find blockpool for the correct namenode  (was: DataNode sometimes can not find blockpool for the correct namenode)
> SimulatedFSDataset sometimes can not find blockpool for the correct namenode
> ----------------------------------------------------------------------------
>
> Key: HDFS-9619
> URL: https://issues.apache.org/jira/browse/HDFS-9619
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, test
> Affects Versions: 3.0.0
> Environment: Jenkins
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Labels: test
> Attachments: HDFS-9619.001.patch, HDFS-9619.002.patch
>
>
> We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} fail to
> replicate a file because a datanode is excluded.
> {noformat}
> File /tmp.txt could only be replicated to 0 nodes instead of minReplication
> (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this
> operation.
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
> at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> Relevant logs suggest the root cause is that the block pool was not found.
> {noformat}
> 2016-01-03 22:11:43,174 [DataXceiver for client DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47318 dst: /127.0.0.1:49997
> java.io.IOException: Non existent blockpool BP-1927700312-172.26.2.1-1451887902222
> at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
> at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
> at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> For a bit more context: this test starts a cluster with two namenodes and
> one datanode. Both block pools are added, but one of them cannot be found
> afterwards. The root cause is undetected concurrent access to a hash map in
> SimulatedFSDataset when the two block pools are added simultaneously. After
> adding logs to print blockMap, I saw a few ConcurrentModificationExceptions.
> The fix is to use a thread-safe class instead, such as ConcurrentHashMap.
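The proposed fix can be sketched as below. This is a minimal standalone illustration, not the actual SimulatedFSDataset code: the class, method names, and map layout here are hypothetical simplifications. It shows why a ConcurrentHashMap keeps both block pools visible when two namenodes register them at the same time, whereas a plain HashMap can lose an entry or throw ConcurrentModificationException under the same race.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class Main {
    // Hypothetical stand-in for the per-block-pool map in SimulatedFSDataset.
    // ConcurrentHashMap makes concurrent addBlockPool() calls safe; a plain
    // HashMap here is exactly the bug described in this issue.
    private final Map<String, Map<Long, String>> blockMap = new ConcurrentHashMap<>();

    void addBlockPool(String bpid) {
        // putIfAbsent is atomic on ConcurrentHashMap, so two simultaneous
        // registrations cannot corrupt the map or drop a pool.
        blockMap.putIfAbsent(bpid, new ConcurrentHashMap<>());
    }

    Map<Long, String> getMap(String bpid) {
        Map<Long, String> map = blockMap.get(bpid);
        if (map == null) {
            // Mirrors the "Non existent blockpool" IOException in the trace above.
            throw new IllegalStateException("Non existent blockpool " + bpid);
        }
        return map;
    }

    public static void main(String[] args) throws InterruptedException {
        Main demo = new Main();
        // Simulate two namenodes registering their block pools concurrently.
        CountDownLatch start = new CountDownLatch(1);
        Thread t1 = new Thread(() -> { awaitQuietly(start); demo.addBlockPool("BP-1"); });
        Thread t2 = new Thread(() -> { awaitQuietly(start); demo.addBlockPool("BP-2"); });
        t1.start();
        t2.start();
        start.countDown();  // release both threads at once
        t1.join();
        t2.join();
        // Both pools must be visible afterwards; getMap throws if one is missing.
        System.out.println(demo.getMap("BP-1") != null && demo.getMap("BP-2") != null);
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a plain HashMap the same two-thread registration can intermittently leave one pool unreachable, which is why the test failure only shows up sometimes on Jenkins.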
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)