[ https://issues.apache.org/jira/browse/HDFS-8429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552095#comment-14552095 ]

zhouyingchao commented on HDFS-8429:
------------------------------------

Colin, thank you for the great comments.  In this case, I think the bottom line 
is that the death of the watcher thread should not block other threads, and the 
client side should be notified as quickly as possible so that it can fall back 
to other read paths.
I created a patch that tries to resolve the blocking. Besides that, I also 
changed the native getAndClearReadableFds method to throw an exception, as 
Colin mentioned.  Please feel free to post your thoughts and comments. Thank you.
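As a rough sketch of the fail-fast idea (simplified code with hypothetical names, 
not the actual patch): the watcher loop records its own death and wakes up every 
thread parked in add(), and add() throws instead of waiting forever on a condition 
that nobody will ever signal again.
{code}
import java.io.IOException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Simplified, hypothetical illustration (not the HDFS code itself) of the
 * fail-fast pattern: once the watcher loop dies, callers of add() must be
 * woken up and fail instead of waiting forever.
 */
class FailFastWatcher {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition processed = lock.newCondition();
  private boolean watcherAlive = true;   // guarded by lock

  private final Thread watcherThread = new Thread(() -> {
    try {
      while (true) {
        pollAndDispatch();               // may throw, e.g. on a native failure
      }
    } catch (Throwable t) {
      // Instead of silently exiting (and leaving waiters parked forever),
      // record the death of the watcher and wake everyone blocked in add().
      lock.lock();
      try {
        watcherAlive = false;
        processed.signalAll();
      } finally {
        lock.unlock();
      }
    }
  });

  FailFastWatcher() {
    watcherThread.setDaemon(true);
    watcherThread.start();
  }

  /** Callers fail fast once the watcher is gone, instead of blocking. */
  void add(int fd) throws IOException {
    lock.lock();
    try {
      while (!isHandled(fd)) {
        if (!watcherAlive) {
          throw new IOException("watcher thread terminated; cannot add fd " + fd);
        }
        processed.awaitUninterruptibly();
      }
    } finally {
      lock.unlock();
    }
  }

  // Placeholders standing in for the real poll/dispatch and bookkeeping logic.
  private void pollAndDispatch() { throw new NullPointerException("simulated failure"); }
  private boolean isHandled(int fd) { return false; }
}
{code}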

> Death of watcherThread making other local read blocked
> ------------------------------------------------------
>
>                 Key: HDFS-8429
>                 URL: https://issues.apache.org/jira/browse/HDFS-8429
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: zhouyingchao
>            Assignee: zhouyingchao
>
> In our cluster, an application hung while doing a short-circuit read of a 
> local HDFS block. By looking into the log, we found that the DataNode's 
> DomainSocketWatcher.watcherThread had exited with the following log:
> {code}
> ERROR org.apache.hadoop.net.unix.DomainSocketWatcher: 
> Thread[Thread-25,5,main] terminating on unexpected exception
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:463)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> Line 463 is the following code snippet:
> {code}
>          try {
>             for (int fd : fdSet.getAndClearReadableFds()) {
>               sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet,
>                 fd);
>             }
> {code}
> getAndClearReadableFds is a native method which mallocs an int array. 
> Since our memory is very tight, it looks like the malloc failed and a NULL 
> pointer was returned.
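> As a hypothetical, defensive sketch (not necessarily the committed fix), the 
> caller could refuse to iterate over a null array and surface a clear error 
> instead of the NPE, or the native method could throw on allocation failure:
> {code}
>             int[] readable = fdSet.getAndClearReadableFds();
>             if (readable == null) {
>               // hypothetical guard: turn a failed native allocation into a
>               // clear error rather than an NPE deep in the loop
>               throw new IOException("getAndClearReadableFds returned null"
>                   + " (native allocation may have failed)");
>             }
>             for (int fd : readable) {
>               sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet,
>                 fd);
>             }
> {code}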
> The bad thing is that other threads then blocked with stacks like this:
> {code}
> "DataXceiver for client 
> unix:/home/work/app/hdfs/c3prc-micloud/datanode/dn_socket [Waiting for 
> operation #1]" daemon prio=10 tid=0x00007f0c9c086d90 nid=0x8fc3 waiting on 
> condition [0x00007f09b9856000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007b0174808> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:323)
>         at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:322)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:403)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> IMO, we should exit the DN so that users can know that something went wrong 
> and fix it.
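> If we go that way, a minimal sketch (assuming ExitUtil.terminate is 
> appropriate here) would be to terminate from the watcher thread's catch block 
> so the failure is visible, rather than leaving reads silently blocked:
> {code}
>         } catch (Throwable t) {
>           LOG.error("watcher thread terminating on unexpected exception", t);
>           // hypothetical: make the failure fatal and visible to operators
>           ExitUtil.terminate(1, "DomainSocketWatcher thread died: " + t);
>         }
> {code}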


