[
https://issues.apache.org/jira/browse/HADOOP-11333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Colin Patrick McCabe updated HADOOP-11333:
------------------------------------------
Resolution: Fixed
Fix Version/s: 2.7.0
Target Version/s: 2.7.0
Status: Resolved (was: Patch Available)
Committed to 2.7.
Note that this bug is likely to show up only on OSes where the pipe buffer size
is small, so it probably won't show up on Linux.
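To make the failure mode concrete, here is a minimal, self-contained sketch of the
deadlock pattern. It is not the DomainSocketWatcher code or the committed fix; the
class name, the 64-byte pipe buffer, and the sleep interval are all illustrative.
One thread writes a notification byte to a pipe while holding a lock, and the
draining thread takes the same lock before it reads; once the pipe buffer fills,
the writer blocks while still holding the lock and neither thread can proceed.
{code:java}
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only; not Hadoop code.
public class PipeDeadlockSketch {
    private static final ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) throws IOException {
        // A tiny 64-byte buffer stands in for an OS whose pipe buffer is small.
        final PipedInputStream readSide = new PipedInputStream(64);
        final PipedOutputStream writeSide = new PipedOutputStream(readSide);

        // "Watcher" thread: drains the notification pipe, but only while
        // holding the same lock the kickers use (mirrors the bug's shape).
        Thread watcher = new Thread(() -> {
            while (true) {
                lock.lock();
                try {
                    while (readSide.available() > 0) {
                        readSide.read();
                    }
                } catch (IOException e) {
                    return;
                } finally {
                    lock.unlock();
                }
                try {
                    Thread.sleep(1000); // simulates time spent polling fds
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        watcher.start();

        // "Kicker": writes one notification byte per call while holding the
        // lock. Once 64 unread bytes accumulate, write() blocks with the lock
        // held; the watcher then blocks on lock() and can never drain the
        // pipe. The program hangs here, reproducing the reported deadlock.
        for (int i = 0; i < 1000; i++) {
            lock.lock();
            try {
                writeSide.write(0);
            } finally {
                lock.unlock();
            }
        }
        System.out.println("no deadlock (pipe buffer was large enough)");
    }
}
{code}
With a large pipe buffer the writes simply never block in practice, which matches
the note above that the bug is unlikely to surface on Linux.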
> Fix deadlock in DomainSocketWatcher when the notification pipe is full
> ----------------------------------------------------------------------
>
> Key: HADOOP-11333
> URL: https://issues.apache.org/jira/browse/HADOOP-11333
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: zhaoyunjiong
> Assignee: zhaoyunjiong
> Fix For: 2.7.0
>
> Attachments: 11241021, 11241023, 11241025, HADOOP-11333-1.patch,
> HADOOP-11333.patch
>
>
> I found that some of our DataNodes hit the "exceeds the limit of concurrent
> xciever" error; the limit is 4K.
> After checking the stack traces, I suspect that
> org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by
> DomainSocketWatcher.kick, is stuck:
> {quote}
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation
> #1]" daemon prio=10 tid=0x00007f55c5576000 nid=0x385d waiting on condition
> [0x00007f558d5d4000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000740df9c90> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
> at
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> at
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
> at
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> --
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation
> #1]" daemon prio=10 tid=0x00007f7de034c800 nid=0x7b7 runnable
> [0x00007f7db06c5000]
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
> at
> org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
> at
> org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
> at
> org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
> at
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
> at
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:745)
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation
> #1]" daemon prio=10 tid=0x00007f55c5574000 nid=0x377a waiting on condition
> [0x00007f558d7d6000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000740df9cb0> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
> at
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:745)
>
> "Thread-163852" daemon prio=10 tid=0x00007f55c811c800 nid=0x6757 runnable
> [0x00007f55aef6e000]
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native
> Method)
> at
> org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
> at
> org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)