[
https://issues.apache.org/jira/browse/HDFS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HDFS-9619:
----------------------------------
Attachment: HDFS-9619.002.patch
Rev02: Added a test case.
The new test case, {{TestSimulatedFSDataset.testConcurrentAddBlockPool()}},
starts two threads which add different block pools concurrently, and then
attempts to add a block into each pool. If a block pool is not found, it
throws an IOException.
Without the rev01 patch that uses ConcurrentHashMap, this test case
consistently fails because it cannot find an added block pool; with the
patch, I have not seen any failures.
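For illustration, the shape of the test is roughly the following sketch (this is not the actual TestSimulatedFSDataset code; the class, the simplified blockMap, and the helper methods are hypothetical stand-ins for the real SimulatedFSDataset internals):

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class ConcurrentAddBlockPoolSketch {
    // Thread-safe map, standing in for SimulatedFSDataset's blockMap.
    static final Map<String, Map<Long, Object>> blockMap = new ConcurrentHashMap<>();

    // Mimics addBlockPool(): register a new, empty block pool.
    static void addBlockPool(String bpid) {
        blockMap.putIfAbsent(bpid, new ConcurrentHashMap<>());
    }

    // Mimics adding a block: throws if the pool cannot be found.
    static void addBlock(String bpid, long blockId) throws IOException {
        Map<Long, Object> pool = blockMap.get(bpid);
        if (pool == null) {
            throw new IOException("Non existent blockpool " + bpid);
        }
        pool.put(blockId, new Object());
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch start = new CountDownLatch(1);
        Thread t1 = new Thread(() -> {
            try { start.await(); addBlockPool("BP-1"); } catch (InterruptedException ignored) {}
        });
        Thread t2 = new Thread(() -> {
            try { start.await(); addBlockPool("BP-2"); } catch (InterruptedException ignored) {}
        });
        t1.start();
        t2.start();
        start.countDown();      // release both threads at once
        t1.join();
        t2.join();
        addBlock("BP-1", 1L);   // would throw IOException if a pool was lost
        addBlock("BP-2", 2L);
        System.out.println("both pools found");
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, one of the addBlock() calls can fail because an added pool is not visible.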
> DataNode sometimes can not find blockpool for the correct namenode
> ------------------------------------------------------------------
>
> Key: HDFS-9619
> URL: https://issues.apache.org/jira/browse/HDFS-9619
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, test
> Affects Versions: 3.0.0
> Environment: Jenkins
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Labels: test
> Attachments: HDFS-9619.001.patch, HDFS-9619.002.patch
>
>
> We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} failed to
> replicate a file, because a data node is excluded.
> {noformat}
> File /tmp.txt could only be replicated to 0 nodes instead of minReplication
> (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this
> operation.
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> Relevant logs suggest the root cause is a block pool that could not be found.
> {noformat}
> 2016-01-03 22:11:43,174 [DataXceiver for client
> DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block
> BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR
> datanode.DataNode (DataXceiver.java:run(280)) -
> host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src:
> /127.0.0.1:47318 dst: /127.0.0.1:49997
> java.io.IOException: Non existent blockpool
> BP-1927700312-172.26.2.1-1451887902222
> at
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
> at
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
> at
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> For a bit more context, this test starts a cluster with two name nodes and
> one data node. The block pools are added, but one of them cannot be found
> afterwards. The root cause is undetected concurrent access to a hash map in
> SimulatedFSDataset (the two block pools are added simultaneously). I added
> some logging to print blockMap and saw a few ConcurrentModificationExceptions.
> The fix is to use a thread-safe class such as ConcurrentHashMap instead.
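The proposed fix could look roughly like the sketch below (field and method names are illustrative, not the actual SimulatedFSDataset code): the unsynchronized HashMap is replaced by a ConcurrentHashMap, so concurrent addBlockPool() calls no longer corrupt the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlockMapFixSketch {
    // before (unsafe under concurrent addBlockPool() calls):
    //   private final Map<String, Map<Long, Object>> blockMap = new HashMap<>();
    private final Map<String, Map<Long, Object>> blockMap = new ConcurrentHashMap<>();

    void addBlockPool(String bpid) {
        // putIfAbsent is atomic on ConcurrentHashMap, so two threads adding
        // pools at the same time cannot lose or corrupt an entry.
        blockMap.putIfAbsent(bpid, new ConcurrentHashMap<>());
    }

    boolean hasBlockPool(String bpid) {
        return blockMap.containsKey(bpid);
    }

    public static void main(String[] args) {
        BlockMapFixSketch ds = new BlockMapFixSketch();
        ds.addBlockPool("BP-1");
        ds.addBlockPool("BP-2");
        System.out.println(ds.hasBlockPool("BP-1") && ds.hasBlockPool("BP-2"));
    }
}
```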
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)