[
https://issues.apache.org/jira/browse/HADOOP-11333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226874#comment-14226874
]
Colin Patrick McCabe commented on HADOOP-11333:
-----------------------------------------------
Thanks, [~zhaoyunjiong]... I understand the problem now.
Yes, this code is in hadoop-common so we should have the JIRA there. Thanks
for moving it.
The patch looks good to me. If I understand correctly, a write should not
block while the pipe holds less data than its capacity. This patch relies only
on a capacity of 1 byte, which is well below the minimum POSIX specifies.
Just two comments:
* Can you move the {{kicked = false}} assignment to
{{NotificationHandler#handle}}? This is a non-static inner class, so it should
have access to this variable. I think it's more appropriate to put it there,
since that is the function that handles the kick.
* Let's add a JavaDoc comment to the declaration of {{boolean kicked}}. Maybe
something like:
bq. True if we have written a byte to the notification socket. We should not
write anything else to the socket until the notification handler has had a
chance to run. Otherwise, our thread might block, causing deadlock. See
HADOOP-11333 for details.
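To make the two suggestions above concrete, here is a minimal sketch of the idea. It is not the actual {{DomainSocketWatcher}} code or the attached patch: the class name {{KickSketch}}, the {{pendingBytes}} counter that stands in for the notification socket, and the {{handler}} field are all hypothetical, and the locking is an assumption made to keep the example self-contained.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative sketch only. KickSketch, pendingBytes and handler are
 * hypothetical names; this is not the real DomainSocketWatcher.
 */
public class KickSketch {
  private final ReentrantLock lock = new ReentrantLock();

  /** Stand-in for the notification socket: number of unread bytes in it. */
  private int pendingBytes = 0;

  /**
   * True if we have written a byte to the notification socket. We should not
   * write anything else to the socket until the notification handler has had
   * a chance to run. Otherwise, our thread might block, causing deadlock.
   * See HADOOP-11333 for details.
   */
  private boolean kicked = false;

  /** Non-static inner class, so it can reach the outer kicked field. */
  public class NotificationHandler {
    public void handle() {
      lock.lock();
      try {
        pendingBytes = 0; // drain whatever the kick wrote
        kicked = false;   // the kick is handled here, so re-arm it here
      } finally {
        lock.unlock();
      }
    }
  }

  public final NotificationHandler handler = new NotificationHandler();

  /** Writes at most one byte until the handler has run. */
  public void kick() {
    lock.lock();
    try {
      if (kicked) {
        return; // a byte is already pending; writing again could block
      }
      pendingBytes++; // simulated one-byte write
      kicked = true;
    } finally {
      lock.unlock();
    }
  }

  public int pendingBytes() {
    lock.lock();
    try {
      return pendingBytes;
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    KickSketch w = new KickSketch();
    w.kick();
    w.kick();
    w.kick();
    System.out.println(w.pendingBytes()); // prints 1
    w.handler.handle();
    w.kick();
    System.out.println(w.pendingBytes()); // prints 1
  }
}
```

However many times {{kick}} is called, at most one byte is outstanding until the handler runs, which is why a capacity of 1 byte is all the approach needs.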
> DomainSocketWatcher.kick stuck
> ------------------------------
>
> Key: HADOOP-11333
> URL: https://issues.apache.org/jira/browse/HADOOP-11333
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: zhaoyunjiong
> Assignee: zhaoyunjiong
> Attachments: 11241021, 11241023, 11241025, HADOOP-11333.patch
>
>
> I found that some of our DataNodes hit "exceeds the limit of concurrent
> xciever"; the limit is 4K.
> After checking the stack traces, I suspect that
> org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by
> DomainSocketWatcher.kick, is stuck:
> {quote}
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x00007f55c5576000 nid=0x385d waiting on condition [0x00007f558d5d4000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000740df9c90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
> at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
> at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> --
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x00007f7de034c800 nid=0x7b7 runnable [0x00007f7db06c5000]
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
> at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
> at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
> at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:745)
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x00007f55c5574000 nid=0x377a waiting on condition [0x00007f558d7d6000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000740df9cb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
> at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:745)
>
> "Thread-163852" daemon prio=10 tid=0x00007f55c811c800 nid=0x6757 runnable [0x00007f55aef6e000]
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
> at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
> at java.lang.Thread.run(Thread.java:745)
> {quote}