[ https://issues.apache.org/jira/browse/HDFS-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436987#comment-16436987 ]
Karthik Palanisamy commented on HDFS-13010:
-------------------------------------------

[~gopalv] You can optionally configure the queue size via ipc.server.listen.queue.size (core-site.xml). The default connection-queue length is 128, which fits most environments with a fast network, but it may need tuning. Make sure there is enough buffering and tuning at the TCP level, and confirm there are no delayed ACKs or packet retransmissions. The relevant kernel parameters are:

net.core.rmem_max
net.core.wmem_max
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.core.netdev_max_backlog
net.core.somaxconn
net.ipv4.tcp_sack
net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_syncookies

In addition, the switch logs may hold some clues.

> DataNode: Listen queue is always 128
> ------------------------------------
>
>                 Key: HDFS-13010
>                 URL: https://issues.apache.org/jira/browse/HDFS-13010
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0
>            Reporter: Gopal V
>            Assignee: Ajay Kumar
>            Priority: Major
>
> DFS write-heavy workloads are failing with
> {code}
> 18/01/11 05:02:34 INFO mapreduce.Job: Task Id : attempt_1515660475578_0007_m_000387_0, Status : FAILED
> Error: java.io.IOException: Could not get block locations.
> Source file "/tmp/tpcds-generate/10000/_temporary/1/_temporary/attempt_1515660475578_0007_m_000387_0/inventory/data-m-00387" - Aborting...block==null
> 	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1477)
> 	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> 	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> This was tracked to
> {code}
> Caused by: java.net.ConnectException: Connection refused
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
> 	at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
> 	at org.apache.hadoop.hdfs.DataStreamer$StreamerStreams.<init>(DataStreamer.java:162)
> 	at org.apache.hadoop.hdfs.DataStreamer.transfer(DataStreamer.java:1450)
> 	at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1407)
> 	at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
> 	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
> 	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> 	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
> 	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> {code}
> # ss -tl | grep 50010
> LISTEN     0      128    *:50010    *:*
> {code}
> However, the system is configured with a much higher somaxconn
> {code}
> # sysctl -a | grep somaxconn
> net.core.somaxconn = 16000
> {code}
> Yet, the SNMP counters show connections being refused with {{127 times the listen queue of a socket overflowed}}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
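A note on the mismatch reported above: on Linux, listen(2) silently caps the backlog requested by the application at net.core.somaxconn, so the effective accept-queue length is the minimum of the two. That is why `ss -tl` still shows 128 even with somaxconn set to 16000. A minimal sketch of this rule (the effective_backlog helper is illustrative, not HDFS code):

```shell
#!/bin/sh
# Illustrative helper (assumption, not HDFS code): on Linux, the effective
# accept-queue length is min(backlog passed to listen(), net.core.somaxconn).
effective_backlog() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# HDFS-13010 scenario: the DataNode requests 128, the kernel allows 16000,
# so the queue shown in the Send-Q column of `ss -tl` is still 128.
effective_backlog 128 16000

# Diagnostics used in the report above (run on the DataNode host):
#   ss -tl | grep 50010                    # Send-Q column = effective backlog
#   netstat -s | grep -i 'listen queue'    # accept-queue overflow counter
```

Since the cap is taken per-listen() call, raising somaxconn alone cannot help once the application hardcodes a small backlog; the fix has to come from the listening process itself.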