WenJin Ma created HDFS-4515:
-------------------------------
Summary: ReplicaMap thread-safe synchronization leads to a large
number of blocked threads
Key: HDFS-4515
URL: https://issues.apache.org/jira/browse/HDFS-4515
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.0.3-alpha
Environment: CentOS release 6.3
Reporter: WenJin Ma
1. I used a program to simulate 3000 users writing 100K small files to HDFS
(3 datanodes).
2. After a period of time, the client and the datanodes reported a large
number of timeout errors, or errors caused by socket timeouts.
Example: log1.txt
3. I dumped the datanode Java thread stacks (see file: java.log):
cat java.log |grep BLOCKED|wc -l
2635
A large number of threads are blocked in ReplicaMap, because synchronized is
used there to guarantee thread safety.
{code}
"DataXceiver for client DFSClient_NONMAPREDUCE_-1528766469_27 at /10.28.171.254:59064 [Receiving block BP-560172827-10.28.171.226-1360119691522:blk_4662142724658079555_330234]" daemon prio=10 tid=0x00000000428d2800 nid=0x9e32 waiting for monitor entry [0x00007f20ea56d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:670)
	- waiting to lock <0x00000000f81c05d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:89)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:159)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:393)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
	at java.lang.Thread.run(Thread.java:662)
{code}
The Linux user's resources may be exhausted after running for some time.
In our test environment, this implementation in HDFS only runs stably with
up to 800 users.
4. I changed the synchronized mechanism to read-write locks and ran the test
with 4500 users; a large number of blocked threads no longer appeared.
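The actual patch is not included in this report. As a rough illustration only, the general idea of replacing a synchronized map with one guarded by a read-write lock might look like the sketch below. The class name ReadWriteLockedReplicaMap, its methods, and the use of String as the replica type are all hypothetical and are not the HDFS ReplicaMap API; the sketch assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a replica map; NOT the real HDFS ReplicaMap class.
class ReadWriteLockedReplicaMap {
    // Block pool id -> (block id -> replica description).
    private final Map<String, Map<Long, String>> map = new HashMap<>();
    // Many readers may hold the read lock at once; writers are exclusive.
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Lookup takes only the read lock, so concurrent lookups do not block each other.
    String get(String bpid, long blockId) {
        lock.readLock().lock();
        try {
            Map<Long, String> replicas = map.get(bpid);
            return replicas == null ? null : replicas.get(blockId);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Mutation takes the write lock; returns the previous replica, or null.
    String add(String bpid, long blockId, String replica) {
        lock.writeLock().lock();
        try {
            return map.computeIfAbsent(bpid, k -> new HashMap<>()).put(blockId, replica);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Removal also takes the write lock; returns the removed replica, or null.
    String remove(String bpid, long blockId) {
        lock.writeLock().lock();
        try {
            Map<Long, String> replicas = map.get(bpid);
            return replicas == null ? null : replicas.remove(blockId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Number of replicas recorded for a block pool (0 if the pool is unknown).
    int size(String bpid) {
        lock.readLock().lock();
        try {
            Map<Long, String> replicas = map.get(bpid);
            return replicas == null ? 0 : replicas.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this shape, threads that only look up replicas can proceed in parallel, while add and remove still serialize. Note that the stack trace above shows threads waiting on the FsDatasetImpl monitor, so a real change would presumably also need to address the synchronized methods that call into the map.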
5. I dumped the Java thread stacks again (see file: java3.log):
cat java3.log |grep BLOCKED|wc -l
0