[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
JiangHua Zhu updated HDFS-16565:
--------------------------------
Description:
The DataNode holds a large number of connections in the CLOSE_WAIT state and
does not release them.
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 20
CLOSE_WAIT 17707
ESTABLISHED 1450
TIME_WAIT 12
The number of connections in the CLOSE_WAIT state has reached about 17k and is
still growing. Listing these connections with lsof shows the following:
lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
!screenshot-1.png!
The screenshot shows that the cause is Socket#close() not being called
correctly on connections where the DataNode acts as a client to other nodes.
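A connection stays in CLOSE_WAIT when the remote side has closed it but the local side never calls close(). The following is a minimal sketch of the general remedy, not the actual DataNode code: the class and method names are hypothetical, and the loopback address and port in main are placeholders. It closes the client socket in a finally block so that it is released whether the transfer succeeds or fails.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class CloseOnFailureSketch {

    /**
     * Hypothetical stand-in for one outbound transfer attempt.
     * Returns the socket only so that callers can check isClosed().
     */
    static Socket transferOnce(InetSocketAddress target, int timeoutMs) {
        Socket sock = new Socket();
        try {
            sock.connect(target, timeoutMs);
            sock.setTcpNoDelay(true);
            sock.setSoTimeout(timeoutMs);
            // ... the block data would be streamed here ...
        } catch (IOException e) {
            System.err.println("transfer to " + target + " failed: " + e);
        } finally {
            // Always release the descriptor, on success and on failure.
            try {
                sock.close();
            } catch (IOException ignored) {
                // Nothing useful can be done if close() itself fails.
            }
        }
        return sock;
    }

    public static void main(String[] args) {
        // Placeholder target; the connect is expected to fail or finish quickly.
        Socket sock = transferOnce(new InetSocketAddress("127.0.0.1", 1), 500);
        System.out.println("closed: " + sock.isClosed());
    }
}
```

Whether the connect succeeds or throws, the socket returned by transferOnce reports isClosed() as true, which is the property that keeps descriptors from accumulating.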
was:
When DataTransfer runs, the local node connects to another DataNode through a
socket. If the connection fails, a NoRouteToHostException is thrown.
Exception information:
2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(xxxx.xxxx.xxxx.xxxx:1004, datanodeUuid=xxxx.xxxx.xxxx.xxxx, infoPort=1006 , infoSecurePort=0, ipcPort=8025, storageInfo=lv=-57;cid=xxxx.xxxx.xxxx.xxxx;nsid=961284063;c=1589290804417):Failed to transfer BP-1375239094-xxxx.xxxx.xxxx.xxxx-1589290804417:blk_-9223372035798255743_66037710 to xxxx.xxxx.xxx.xxxx:1004 got java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The code where the failure originates:
sock = newSocket();
NetUtils.connect(sock, curTarget, dnConf.socketTimeout);
sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
sock.setSoTimeout(targets.length * dnConf.socketTimeout);
When a NoRouteToHostException occurs, the block is added to the VolumeScanner,
which then starts scanning it. This should not happen, because the exception
does not indicate a real I/O error on the block.
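One way to express the distinction above is to check the exception type before deciding whether a failed transfer should trigger a scan. This is only an illustrative sketch: the class and the shouldRescanBlock helper are hypothetical and are not part of the DataNode or VolumeScanner APIs. It relies only on java.net.NoRouteToHostException being a subclass of IOException.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;

final class TransferFailureSketch {

    /**
     * Hypothetical policy: a routing failure says nothing about the local
     * replica, so it should not cause the block to be rescanned.
     */
    static boolean shouldRescanBlock(IOException e) {
        return !(e instanceof NoRouteToHostException);
    }

    public static void main(String[] args) {
        IOException network = new NoRouteToHostException("No route to host");
        IOException other = new IOException("read failed");
        System.out.println("rescan on network failure: " + shouldRescanBlock(network));
        System.out.println("rescan on other failure: " + shouldRescanBlock(other));
    }
}
```

A real policy would probably need to consider other network-level exceptions as well; the single type check here is kept deliberately narrow to match the case described.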
> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -----------------------------------------------------------------------------
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.3.0
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Attachments: screenshot-1.png