[ https://issues.apache.org/jira/browse/HDFS-13828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581356#comment-16581356 ]
Amithsha edited comment on HDFS-13828 at 8/15/18 5:06 PM:
----------------------------------------------------------
Agreed that the xceiver count may be insufficient, but why only on particular
nodes? Also, it is not happening on a single node; it happens on a particular
set of nodes.
Adding the thread dump and datanode log.

"DataXceiver for client DFSClient_attempt_1526704594842_1801529_m_008193_0_1144052212_1 at /x.x.x.x:38313 [Waiting for operation #28]" #55366018 daemon prio=5 os_prio=0 tid=0x00007fcdaa0ca000 nid=0x128c runnable [0x00007fcd24485000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x0000000794fde658> (a sun.nio.ch.Util$2)
        - locked <0x0000000794fde640> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000007b072db98> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        - locked <0x00000005e9dc1de0> (a java.io.BufferedInputStream)
        at java.io.DataInputStream.readShort(DataInputStream.java:312)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
        at java.lang.Thread.run(Thread.java:745)

"DataXceiver for client DFSClient_attempt_1526704594842_1801529_m_003865_0_-1268040697_1 at /x.x.x.x:9258 [Sending block BP-1733841164-x.x.x.x-1440204182440:blk_8704233925_7644500095]" #55361352 daemon prio=5 os_prio=0 tid=0x00007fcdaa360000 nid=0xc93d runnable [0x00007fcca559e000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00000007a5ae6e30> (a sun.nio.ch.Util$2)
        - locked <0x00000007a5ae6e18> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000079d242470> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
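
To check how the xceiver threads in a dump like the one above are distributed, a rough tally per application and per thread state may help. The following is a minimal sketch, not part of the original report: the function name `summarize_xceivers` is hypothetical, and it assumes that a client name of the form `DFSClient_attempt_<timestamp>_<id>_...` maps to the YARN application `application_<timestamp>_<id>`. It tolerates thread headers that are wrapped across several lines.

```python
import re
from collections import Counter

# Header of a DataXceiver thread in a jstack dump:
# "DataXceiver for client <client> at <host:port> [<state>]"
HEADER_RE = re.compile(
    r'"DataXceiver for client\s+(?P<client>\S+)\s+at\s+\S+\s+\[(?P<state>[^\]]*)\]"'
)
# MapReduce task attempts embed a timestamp and an id in the client name
# (assumed here to identify the owning application).
ATTEMPT_RE = re.compile(r"DFSClient_attempt_(\d+)_(\d+)_")


def summarize_xceivers(dump_text):
    """Return (threads per application, threads per state) for a thread dump."""
    # Collapse all whitespace so headers wrapped over several lines match.
    flat = " ".join(dump_text.split())
    by_app = Counter()
    by_state = Counter()
    for match in HEADER_RE.finditer(flat):
        attempt = ATTEMPT_RE.search(match.group("client"))
        app = "application_%s_%s" % attempt.groups() if attempt else "other"
        # Drop the per-thread detail (operation number or block ID) so that
        # "Waiting for operation #28" and "#29" count as the same state.
        state = re.sub(r"\s+(?:#|BP-).*$", "", match.group("state"))
        by_app[app] += 1
        by_state[state] += 1
    return by_app, by_state
```

Calling `summarize_xceivers` on the text of a saved dump returns two counters; `by_app.most_common(5)` would then list the applications holding the most xceiver threads, and `by_state` shows how many threads are idle in "Waiting for operation" versus actively "Sending block".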
> DataNode breaching Xceiver Count
> --------------------------------
>
> Key: HDFS-13828
> URL: https://issues.apache.org/jira/browse/HDFS-13828
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.7.1
> Reporter: Amithsha
> Priority: Critical
>
> We observed the xceiver count breaching the limit of 4096 on a particular
> set of 5 to 8 nodes in a 900-node cluster.
> We stopped the datanode services on those nodes and let their data
> re-replicate across the cluster. Even after that, we observed the same issue
> on a new set of nodes.
> Q1: Why does this happen only on particular nodes? After decommissioning a
> node, its data should be replicated across the cluster, so why does the
> issue reappear on a different set of nodes?
> Assumptions:
> Reads of a particular block or piece of data on that node might be the
> cause, but then the issue should have been mitigated by the decommission,
> and it was not. Since those MR jobs are triggered from Hive, we suspect the
> query might be referring to the same block multiple times in different
> stages and creating this issue.
> From the thread dump:
> The datanode thread dump shows that, out of 4090+ xceiver threads created
> on that node, around 4000 belonged to multiple mappers of the same AppId,
> with state "no operation".
>
> Any suggestions on this?
>
>
>