[ https://issues.apache.org/jira/browse/HDFS-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371926#comment-14371926 ]

Frode Halvorsen commented on HDFS-7212:
---------------------------------------

No. This is not the same. When I earlier experienced that the number of 
connections exceeded the maximum, I increased the maximum.
My issue is the same as in this bug entry.

My datanodes run fine with 70-80 threads, then suddenly one node with a lot of 
blocks just stops writing the received blocks, and the thread keeps hanging in 
the receiver. Then the threads just accumulate until I have at least 600 blocked 
threads.
I get one more line like this for each thread:
2015-03-20 20:15:56,102 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-874555352-10.34.17.40-1403595404176:blk_1120355987_46618200 src: 
/62.148.41.204:38037 dest: /62.148.41.209:50010
But it never reports the block as received, as it normally does.
Then I can go in via Mission Control, look at the thread graph, and find it 
rising massively. After 650 seconds the namenode states that the datanode is 
dead, but it actually is not. I can stop/start metrics (via MBeans), and 
sometimes the datanode just flushes (kills) all blocked threads and reconnects 
to the namenode. Many times, however, I have to restart the datanode. It then 
spends a good half hour on the step where it adds the blocks to the pool, and 
when it reconnects to the namenode, the namenode first of all cleans up the 
over-replicated blocks. The namenode, of course, stops all other processing 
when the datanode 'arrives', so any process adding files to the cluster is put 
'on hold' by the namenode.
Very often, during the cleanup of one datanode, another starts the same process 
by just starting the receive thread, and piles up a few hundred of them in 
blocked state.

My stack trace (on a blocked thread) looks like this:
DataXceiver for client  at /62.148.41.204:38919 [Receiving block 
BP-874555352-10.34.17.40-1403595404176:blk_1128803518_55065733] [51396] 
(BLOCKED)
   
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary
 line: 1226 
   
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary
 line: 114 
   org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init> line: 179 
   org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock line: 615 
   org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock line: 137 
   org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp line: 74 
   org.apache.hadoop.hdfs.server.datanode.DataXceiver.run line: 235 
   java.lang.Thread.run line: 745 

And just now, my datanode has approximately 700 of those threads. 
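
To make the failure mode concrete, below is a minimal, self-contained sketch of 
the contention pattern that the stack trace suggests. It is only an illustration, 
not the real FsDatasetImpl code: the class name, the single dataset-wide lock 
and the artificial stall are assumptions made for the example. It just shows how 
one stalled writer inside a createTemporary-style critical section leaves every 
other receiver thread BLOCKED on the same monitor.

{code}
// Illustrative sketch only -- NOT the actual HDFS DataNode code.
// Models the pattern implied by the stack trace: every incoming
// "DataXceiver" must enter one dataset-wide critical section, so a
// single stalled writer blocks all of them.
public class DatasetLockContentionSketch {

    // Hypothetical stand-in for a dataset-wide lock.
    private final Object datasetLock = new Object();

    // Stand-in for a createTemporary-style method that is guarded by
    // the dataset lock while it does (potentially slow) work.
    void createTemporary(String blockId) throws InterruptedException {
        synchronized (datasetLock) {
            // Simulate the lock holder stalling (slow disk, long scan, ...).
            // While this thread sleeps, every other caller is BLOCKED here.
            Thread.sleep(10_000);
        }
    }

    public static void main(String[] args) {
        DatasetLockContentionSketch ds = new DatasetLockContentionSketch();
        // Spawn a handful of "receiver" threads; all but the first end up
        // BLOCKED on the monitor, just like the 600-700 threads reported.
        for (int i = 0; i < 5; i++) {
            final String name = "DataXceiver-" + i;
            new Thread(() -> {
                try {
                    ds.createTemporary("blk_" + name);
                } catch (InterruptedException ignored) {
                }
            }, name).start();
        }
    }
}
{code}

Taking a thread dump of that toy program (e.g. with jstack) shows the lock 
holder sitting in the sleep and the rest of the DataXceiver-n threads BLOCKED 
on the same monitor, which is the same picture as the 700 threads above.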

> Huge number of BLOCKED threads rendering DataNodes useless
> ----------------------------------------------------------
>
>                 Key: HDFS-7212
>                 URL: https://issues.apache.org/jira/browse/HDFS-7212
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.4.0
>         Environment: PROD
>            Reporter: Istvan Szukacs
>
> There are 3000 - 8000 threads in each datanode JVM, blocking the entire VM 
> and rendering the service unusable, missing heartbeats and stopping data 
> access. The threads look like this:
> {code}
> 3415 (state = BLOCKED)
> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may 
> be imprecise)
> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=186 (Compiled frame)
> - 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() 
> @bci=1, line=834 (Interpreted frame)
> - 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node,
>  int) @bci=67, line=867 (Interpreted frame)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, 
> line=1197 (Interpreted frame)
> - java.util.concurrent.locks.ReentrantLock$NonfairSync.lock() @bci=21, 
> line=214 (Compiled frame)
> - java.util.concurrent.locks.ReentrantLock.lock() @bci=4, line=290 (Compiled 
> frame)
> - 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(org.apache.hadoop.net.unix.DomainSocket,
>  org.apache.hadoop.net.unix.DomainSocketWatcher$Handler) @bci=4, line=286 
> (Interpreted frame)
> - 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(java.lang.String,
>  org.apache.hadoop.net.unix.DomainSocket) @bci=169, line=283 (Interpreted 
> frame)
> - 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(java.lang.String)
>  @bci=212, line=413 (Interpreted frame)
> - 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(java.io.DataInputStream)
>  @bci=13, line=172 (Interpreted frame)
> - 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(org.apache.hadoop.hdfs.protocol.datatransfer.Op)
>  @bci=149, line=92 (Compiled frame)
> - org.apache.hadoop.hdfs.server.datanode.DataXceiver.run() @bci=510, line=232 
> (Compiled frame)
> - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> {code}
> Has anybody seen this before?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
