[ https://issues.apache.org/jira/browse/HDFS-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853340#comment-16853340 ]

Daryn Sharp commented on HDFS-14533:
------------------------------------

On further inspection, the jammed threads were actually clients operating 
within the context of the DN's webhdfs handler.
{noformat}
"nioEventLoopGroup-3-144" #136101 prio=10 os_prio=0 tid=0x00007fb50c7d3800 
nid=0x2257 waiting on condition [0x00007fb4e59fa000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e84c3248> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at org.apache.hadoop.util.Waitable.await(Waitable.java:36)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetch(ShortCircuitCache.java:722)
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:689)
        at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:486)
        at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:367)
        at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:714)
        at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655)
        - locked <0x00000000e84bc300> (a 
org.apache.hadoop.hdfs.DFSInputStream){noformat}
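The clients are parked in {{Waitable.await}} on a condition that is never signaled.  As a rough illustration of bounding that wait, per the issue description below, here is a minimal, self-contained sketch; {{TimedWaitable}} and its method names are hypothetical and stand in for, rather than reproduce, Hadoop's actual {{org.apache.hadoop.util.Waitable}}:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Hadoop's Waitable: a value holder whose await()
// parks on a lock condition until another thread provides the value.
public class TimedWaitable<T> {
  private final Lock lock = new ReentrantLock();
  private final Condition hasValue = lock.newCondition();
  private T value;  // null until provide() is called

  /** Unbounded wait: parks forever if the signal is lost, as in the trace above. */
  public T await() throws InterruptedException {
    lock.lock();
    try {
      while (value == null) {
        hasValue.await();
      }
      return value;
    } finally {
      lock.unlock();
    }
  }

  /** Bounded wait: a missed signal surfaces as a TimeoutException instead. */
  public T await(long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    lock.lock();
    try {
      long nanos = unit.toNanos(timeout);
      while (value == null) {
        if (nanos <= 0) {
          throw new TimeoutException("timed out waiting for value");
        }
        nanos = hasValue.awaitNanos(nanos);  // returns the remaining wait budget
      }
      return value;
    } finally {
      lock.unlock();
    }
  }

  public void provide(T val) {
    lock.lock();
    try {
      value = val;
      hasValue.signalAll();
    } finally {
      lock.unlock();
    }
  }
}
{code}
With a timed variant, a {{ShortCircuitCache#fetch}}-style caller could fail the fetch (and, say, fall back to a remote read) rather than block indefinitely.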

Types of errors seen in the log:
{noformat}
[DataXceiver for client unix:/.../dn_socket [Waiting for operation #1]] WARN datanode.DataNode: Failed to shut down socket in error handler
java.nio.channels.ClosedChannelException
        at org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
        at org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:393)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:536)

[DataXceiver for client unix:/.../dn_socket [Waiting for operation #1]] ERROR datanode.DataNode: host:1004:DataXceiver error processing REQUEST_SHORT_CIRCUIT_SHM operation  src: unix:/.../dn_socket dst: <local>
java.nio.channels.ClosedChannelException
        at org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
        at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:568)
        at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
        at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
        at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:470)

[DataXceiver for client unix:/.../dn_socket [Waiting for operation #1]] ERROR datanode.DataNode: host:1004:DataXceiver error processing REQUEST_SHORT_CIRCUIT_SHM operation  src: unix:/.../dn_socket dst: <local>
java.net.SocketException: write(2) error: Broken pipe
        at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
        at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:571)
        at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
        at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
        at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:470)
{noformat}
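Both error shapes are response writes racing with a socket teardown.  A minimal sketch of the failure mode, using plain TCP sockets as a stand-in for Hadoop's {{DomainSocket}} (class name and buffer sizes are illustrative only; the exact exception text varies by OS and timing):
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo: the "server" keeps writing after the peer has closed
// its end, roughly like the DataXceiver answering a departed client.
public class ClosedPeerWriteDemo {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(0)) {
      Socket client = new Socket("localhost", server.getLocalPort());
      try (Socket accepted = server.accept()) {
        client.close();  // peer goes away before the response is sent

        OutputStream out = accepted.getOutputStream();
        try {
          // The first write usually lands in the kernel buffer; once the
          // peer's RST arrives, a later write fails (broken pipe / reset).
          for (int i = 0; i < 3; i++) {
            out.write(new byte[8192]);
            out.flush();
            Thread.sleep(100);
          }
        } catch (IOException e) {
          System.out.println("write to closed peer failed: " + e);
        }
      }
    }
  }
}
{code}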



> Datanode short circuit cache can become blocked
> -----------------------------------------------
>
>                 Key: HDFS-14533
>                 URL: https://issues.apache.org/jira/browse/HDFS-14533
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Daryn Sharp
>            Priority: Major
>
> Errors in the short circuit cache can leave clients indefinitely blocked in 
> {{ShortCircuitCache#fetch}} on a waitable's condition that will never be 
> signaled.  The condition wait should be bounded with a timeout.


