WenJin Ma created HDFS-4515:
-------------------------------

Summary: ReplicaMap thread-safe synchronization lead to a large number of threads blocking
Key: HDFS-4515
URL: https://issues.apache.org/jira/browse/HDFS-4515
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.0.3-alpha
Environment: CentOS release 6.3
Reporter: WenJin Ma

1. I used a program to simulate 3000 users writing 100K small files to HDFS (3 datanodes).
2. After a period of time, the client and the datanodes reported many timeout errors and errors caused by socket timeouts (example: log1.txt).
3. I dumped the datanode Java thread stacks (see file java.log):

cat java.log |grep BLOCKED|wc -l
2635

A large number of threads are blocked in ReplicaMap, because it uses synchronized to guarantee thread safety.

{code}
"DataXceiver for client DFSClient_NONMAPREDUCE_-1528766469_27 at /10.28.171.254:59064 [Receiving block BP-560172827-10.28.171.226-1360119691522:blk_4662142724658079555_330234]" daemon prio=10 tid=0x00000000428d2800 nid=0x9e32 waiting for monitor entry [0x00007f20ea56d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:670)
    - waiting to lock <0x00000000f81c05d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:89)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:159)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:393)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:662)
{code}

The Linux user's resources may be exhausted after running for some time. In our test environment, this implementation of HDFS runs stably only with up to 800 users.

4. I changed the synchronized mechanism to read-write locks and ran the test with 4500 users; a large number of blocked threads no longer appeared.
5. I dumped the Java thread stacks again (see file java3.log):

cat java3.log |grep BLOCKED|wc -l
0
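
For illustration, here is a minimal sketch of the kind of change described in step 4: a single object monitor is replaced with a read-write lock, so that lookups can proceed concurrently. It is not the actual patch and not the real ReplicaMap or FsDatasetImpl code. The class name ReplicaIndex, its methods, and the use of a String as the replica state are all hypothetical, chosen only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for a block-ID-to-replica map.
// This is NOT org.apache.hadoop.hdfs ReplicaMap; it only illustrates the locking pattern.
final class ReplicaIndex {
    private final Map<Long, String> replicas = new HashMap<Long, String>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Lookups take the shared read lock, so concurrent readers do not block each other.
    String get(long blockId) {
        lock.readLock().lock();
        try {
            return replicas.get(blockId);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Mutations take the exclusive write lock; returns the previous state, or null.
    String put(long blockId, String state) {
        lock.writeLock().lock();
        try {
            return replicas.put(blockId, state);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Removal is also a mutation and needs the write lock; returns the removed state, or null.
    String remove(long blockId) {
        lock.writeLock().lock();
        try {
            return replicas.remove(blockId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reading the size only needs the read lock.
    int size() {
        lock.readLock().lock();
        try {
            return replicas.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this pattern, callers that only read (get, size) no longer serialize on one monitor, while writers (put, remove) still exclude everyone else. How much this helps depends on the read/write mix: in the stack trace above, the blocked threads are waiting on the FsDatasetImpl monitor inside createRbw, which adds a replica, so that path would still need the write lock.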