[ https://issues.apache.org/jira/browse/HDFS-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371926#comment-14371926 ]
Frode Halvorsen commented on HDFS-7212:
---------------------------------------
No, this is not the same. When I earlier experienced that the number of
connections exceeded the maximum, I increased the maximum.
My issue is the same as in this bug entry.
My datanodes run fine with 70-80 threads, then suddenly one node with a lot of
blocks just stops writing the received blocks, and the thread keeps hanging in
the receiver. The threads then accumulate until I have at least 600 blocked
threads.
I get one more line like this for each thread:
2015-03-20 20:15:56,102 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-874555352-10.34.17.40-1403595404176:blk_1120355987_46618200 src: /62.148.41.204:38037 dest: /62.148.41.209:50010
But it never reports the block as received, as it normally does.
Then I can go in via Mission Control and look at the thread graph, and find it
rising massively. After 650 seconds the namenode declares the datanode dead,
but it actually is not. I can stop/start metrics (via MBeans), and sometimes
the datanode then flushes (kills) all the blocked threads and reconnects to the
namenode. Many times, however, I have to restart the datanode. It spends a good
half hour on the step where it adds the blocks to the pool, and when it
reconnects to the namenode, they first of all clean up the over-replicated
blocks. The namenode, of course, stops all other processing while the datanode
'arrives', so any process adding files to the cluster is put 'on hold' by the
namenode.
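For reference, the blocked-thread count I read off the Mission Control graph can also be sampled in-process with the standard ThreadMXBean API; a minimal sketch, assuming the threads keep their usual "DataXceiver..." names (the name prefix is only my heuristic, not something the DataNode exposes):
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: count BLOCKED DataXceiver-looking threads in the local JVM.
// Run inside the DataNode JVM (e.g. from a javaagent) or adapt it to a
// remote JMX connection; the name filter is only a heuristic.
public class BlockedXceiverCount {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds());
        long blocked = 0;
        for (ThreadInfo info : infos) {
            if (info != null
                    && info.getThreadState() == Thread.State.BLOCKED
                    && info.getThreadName().startsWith("DataXceiver")) {
                blocked++;
            }
        }
        System.out.println("Blocked DataXceiver threads: " + blocked);
    }
}
{code}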
Very often, during the cleanup of one datanode, another one starts the same
process by just starting the receive threads, and piles up a few hundred of
them in a blocked state.
My stack trace (on a blocked thread) is like this:
DataXceiver for client at /62.148.41.204:38919 [Receiving block BP-874555352-10.34.17.40-1403595404176:blk_1128803518_55065733] [51396] (BLOCKED)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary line: 1226
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary line: 114
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init> line: 179
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock line: 615
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock line: 137
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp line: 74
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run line: 235
java.lang.Thread.run line: 745
And just now, my datanode has approximately 700 of those threads.
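To illustrate the pattern the trace points at (this is not the actual FsDatasetImpl code, just a toy model with made-up names): if every incoming write has to take one dataset-wide lock in createTemporary and something holds that lock for minutes, the receiver threads pile up in BLOCKED state exactly like the ~700 above.
{code}
import java.util.ArrayList;
import java.util.List;

// Toy model of the pile-up: one coarse lock guards "add block", and a single
// slow holder (standing in for whatever stalls the datanode) keeps it for a
// minute. Every new receiver thread then parks in BLOCKED state. Names are
// illustrative only, not Hadoop's.
public class CoarseLockPileUp {
    private static final Object datasetLock = new Object();

    static void createTemporaryLike(String blockId) {
        synchronized (datasetLock) {          // every writer needs this lock
            // pretend to register the replica
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread slowHolder = new Thread(() -> {
            synchronized (datasetLock) {
                try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
            }
        }, "slow-holder");
        slowHolder.start();
        Thread.sleep(100);                    // let the holder grab the lock first

        List<Thread> receivers = new ArrayList<>();
        for (int i = 0; i < 700; i++) {
            Thread t = new Thread(() -> createTemporaryLike("blk_x"), "DataXceiver-" + i);
            t.start();
            receivers.add(t);
        }
        Thread.sleep(1_000);

        long blocked = receivers.stream()
                .filter(t -> t.getState() == Thread.State.BLOCKED)
                .count();
        System.out.println("Blocked receiver threads: " + blocked);  // close to 700
    }
}
{code}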
> Huge number of BLOCKED threads rendering DataNodes useless
> ----------------------------------------------------------
>
> Key: HDFS-7212
> URL: https://issues.apache.org/jira/browse/HDFS-7212
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.4.0
> Environment: PROD
> Reporter: Istvan Szukacs
>
> There are 3000 - 8000 threads in each datanode JVM, blocking the entire VM
> and rendering the service unusable, missing heartbeats and stopping data
> access. The threads look like this:
> {code}
> 3415 (state = BLOCKED)
> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=834 (Interpreted frame)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) @bci=67, line=867 (Interpreted frame)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, line=1197 (Interpreted frame)
> - java.util.concurrent.locks.ReentrantLock$NonfairSync.lock() @bci=21, line=214 (Compiled frame)
> - java.util.concurrent.locks.ReentrantLock.lock() @bci=4, line=290 (Compiled frame)
> - org.apache.hadoop.net.unix.DomainSocketWatcher.add(org.apache.hadoop.net.unix.DomainSocket, org.apache.hadoop.net.unix.DomainSocketWatcher$Handler) @bci=4, line=286 (Interpreted frame)
> - org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(java.lang.String, org.apache.hadoop.net.unix.DomainSocket) @bci=169, line=283 (Interpreted frame)
> - org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(java.lang.String) @bci=212, line=413 (Interpreted frame)
> - org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(java.io.DataInputStream) @bci=13, line=172 (Interpreted frame)
> - org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(org.apache.hadoop.hdfs.protocol.datatransfer.Op) @bci=149, line=92 (Compiled frame)
> - org.apache.hadoop.hdfs.server.datanode.DataXceiver.run() @bci=510, line=232 (Compiled frame)
> - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> {code}
> Has anybody seen this before?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)