[ https://issues.apache.org/jira/browse/HDFS-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853340#comment-16853340 ]
Daryn Sharp commented on HDFS-14533:
------------------------------------
On further inspection, the jammed threads were actually clients operating
within the context of the DN's webhdfs handler.
{noformat}
"nioEventLoopGroup-3-144" #136101 prio=10 os_prio=0 tid=0x00007fb50c7d3800 nid=0x2257 waiting on condition [0x00007fb4e59fa000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000e84c3248> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.hadoop.util.Waitable.await(Waitable.java:36)
    at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetch(ShortCircuitCache.java:722)
    at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:689)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:486)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:367)
    at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:714)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655)
    - locked <0x00000000e84bc300> (a org.apache.hadoop.hdfs.DFSInputStream)
{noformat}
Types of errors seen in the DN log:
{noformat}
[DataXceiver for client unix:/.../dn_socket [Waiting for operation #1]] WARN datanode.DataNode: Failed to shut down socket in error handler
java.nio.channels.ClosedChannelException
    at org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
    at org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:393)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:536)
[DataXceiver for client unix:/.../dn_socket [Waiting for operation #1]] ERROR datanode.DataNode: host:1004:DataXceiver error processing REQUEST_SHORT_CIRCUIT_SHM operation src: unix:/.../dn_socket dst: <local>
java.nio.channels.ClosedChannelException
    at org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
    at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:568)
    at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
    at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
    at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:470)
[DataXceiver for client unix:/.../dn_socket [Waiting for operation #1]] ERROR datanode.DataNode: host:1004:DataXceiver error processing REQUEST_SHORT_CIRCUIT_SHM operation src: unix:/.../dn_socket dst: <local>
java.net.SocketException: write(2) error: Broken pipe
    at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
    at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
    at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:571)
    at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
    at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
    at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:470)
{noformat}
> Datanode short circuit cache can become blocked
> -----------------------------------------------
>
> Key: HDFS-14533
> URL: https://issues.apache.org/jira/browse/HDFS-14533
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Daryn Sharp
> Priority: Major
>
> Errors in the short circuit cache can leave clients indefinitely blocked in
> {{ShortCircuitCache#fetch}} on a waitable's condition that will never be
> signaled. The condition wait should be bounded with a timeout.
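
For illustration only, here is a minimal sketch of what a bounded condition wait could look like. BoundedWaitable is a hypothetical stand-in written for this example; it is not the org.apache.hadoop.util.Waitable class and not the patch for this issue. The method names and the use of TimeoutException are assumptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a waitable value; NOT Hadoop's Waitable. */
final class BoundedWaitable<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition ready = lock.newCondition();
    private T value; // stays null until provide() is called

    /** Publishes the value and wakes every waiter. */
    void provide(T v) {
        lock.lock();
        try {
            value = v;
            ready.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Waits for the value, giving up once the timeout has elapsed. */
    T await(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        long nanosLeft = unit.toNanos(timeout);
        lock.lock();
        try {
            while (value == null) {
                if (nanosLeft <= 0L) {
                    throw new TimeoutException("value was never provided");
                }
                // awaitNanos returns an estimate of the time remaining,
                // so spurious wakeups do not extend the total wait.
                nanosLeft = ready.awaitNanos(nanosLeft);
            }
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

With a wait shaped like this, a caller whose condition is never signaled gets a TimeoutException after the deadline and can retry or fail the read, rather than staying parked as in the thread dump above.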