[ https://issues.apache.org/jira/browse/HDFS-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436987#comment-16436987 ]
Karthik Palanisamy commented on HDFS-13010:
-------------------------------------------
[~gopalv] You can optionally configure the queue size via
ipc.server.listen.queue.size (core-site.xml). By default, the connection queue
length is 128, which is sufficient for most environments with a fast network,
but it may need tuning. Make sure there is enough buffering, tune the TCP
parameters below as needed, and confirm there are no delayed ACKs or packet
retransmissions:
net.core.rmem_max
net.core.wmem_max
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.core.netdev_max_backlog
net.core.somaxconn
net.ipv4.tcp_sack
net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_syncookies
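The parameters above can be set persistently via sysctl. A minimal sketch of such a tuning file follows; the values are purely illustrative (the 16000 backlog mirrors the reporter's somaxconn setting), not recommendations, and should be validated against your hardware, memory, and workload:
{code}
# /etc/sysctl.d/99-hdfs-net.conf -- example values only, verify for your environment
net.core.somaxconn = 16000
net.core.netdev_max_backlog = 16000
net.ipv4.tcp_max_syn_backlog = 16000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_sack = 1
net.ipv4.tcp_syncookies = 1
{code}
Apply with {{sysctl -p /etc/sysctl.d/99-hdfs-net.conf}} and verify with {{sysctl -a | grep somaxconn}}.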
In addition, the switch logs may offer some clues.
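For reference, a minimal sketch of the core-site.xml change mentioned above (the 16000 value is illustrative and is additionally capped by the kernel's net.core.somaxconn):
{code}
<!-- core-site.xml: raise the IPC server listen backlog (default 128) -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>16000</value>
</property>
{code}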
> DataNode: Listen queue is always 128
> ------------------------------------
>
> Key: HDFS-13010
> URL: https://issues.apache.org/jira/browse/HDFS-13010
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.0.0
> Reporter: Gopal V
> Assignee: Ajay Kumar
> Priority: Major
>
> DFS write-heavy workloads are failing with
> {code}
> 18/01/11 05:02:34 INFO mapreduce.Job: Task Id : attempt_1515660475578_0007_m_000387_0, Status : FAILED
> Error: java.io.IOException: Could not get block locations. Source file "/tmp/tpcds-generate/10000/_temporary/1/_temporary/attempt_1515660475578_0007_m_000387_0/inventory/data-m-00387" - Aborting...block==null
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1477)
> at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> This was tracked to
> {code}
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
> at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
> at org.apache.hadoop.hdfs.DataStreamer$StreamerStreams.<init>(DataStreamer.java:162)
> at org.apache.hadoop.hdfs.DataStreamer.transfer(DataStreamer.java:1450)
> at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1407)
> at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
> at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> {code}
> # ss -tl | grep 50010
> LISTEN 0 128 *:50010 *:*
> {code}
> However, the system is configured with a much higher somaxconn
> {code}
> # sysctl -a | grep somaxconn
> net.core.somaxconn = 16000
> {code}
> Yet, the SNMP counters show connections being refused with {{127 times the
> listen queue of a socket overflowed}}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)