[ 
https://issues.apache.org/jira/browse/HDFS-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436987#comment-16436987
 ] 

Karthik Palanisamy commented on HDFS-13010:
-------------------------------------------

[~gopalv] You can optionally configure the queue size via 
ipc.server.listen.queue.size (core-site.xml). By default, the connection 
backlog length is 128, which fits most environments with a fast network, but 
tune it if necessary. Also make sure there is enough buffering at the TCP 
level, tune the parameters below as needed, and confirm there are no delayed 
ACKs or packet retransmissions.
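
For illustration, a sketch of raising the backlog in core-site.xml (the value 
1024 here is just an example, not a recommendation; pick a value consistent 
with net.core.somaxconn, since the kernel caps the effective backlog at that 
limit):

{code}
<!-- core-site.xml: raise the RPC listen backlog (default 128).
     The effective backlog is also capped by net.core.somaxconn. -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>1024</value>
</property>
{code}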

net.core.rmem_max
net.core.wmem_max 
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.core.netdev_max_backlog
net.core.somaxconn
net.ipv4.tcp_sack
net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_syncookies
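
These can be set persistently via a drop-in file, e.g. under /etc/sysctl.d/ 
(the values below are illustrative only; the right numbers depend on your 
hardware, network, and workload):

{code}
# /etc/sysctl.d/99-hdfs-tuning.conf -- illustrative values, tune per cluster
net.core.somaxconn = 16000
net.core.netdev_max_backlog = 16000
net.ipv4.tcp_max_syn_backlog = 16000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_sack = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
{code}

Apply with `sysctl --system` (or `sysctl -p <file>`) and verify with 
`sysctl net.core.somaxconn`.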

In addition, the switch logs may offer some clues.

> DataNode: Listen queue is always 128
> ------------------------------------
>
>                 Key: HDFS-13010
>                 URL: https://issues.apache.org/jira/browse/HDFS-13010
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0
>            Reporter: Gopal V
>            Assignee: Ajay Kumar
>            Priority: Major
>
> DFS write-heavy workloads are failing with 
> {code}
> 18/01/11 05:02:34 INFO mapreduce.Job: Task Id : 
> attempt_1515660475578_0007_m_000387_0, Status : FAILED
> Error: java.io.IOException: Could not get block locations. Source file 
> "/tmp/tpcds-generate/10000/_temporary/1/_temporary/attempt_1515660475578_0007_m_000387_0/inventory/data-m-00387"
>  - Aborting...block==null
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1477)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> This was tracked to 
> {code}
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
>         at 
> org.apache.hadoop.hdfs.DataStreamer$StreamerStreams.<init>(DataStreamer.java:162)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.transfer(DataStreamer.java:1450)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1407)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
>         at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> {code}
> # ss -tl | grep 50010
> LISTEN     0      128        *:50010                    *:*   
> {code}
> However, the system is configured with a much higher somaxconn
> {code}
> # sysctl -a | grep somaxconn
> net.core.somaxconn = 16000
> {code}
> Yet, the SNMP counters show connections being refused with {{127 times the 
> listen queue of a socket overflowed}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
