[ https://issues.apache.org/jira/browse/HDFS-15709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243755#comment-17243755 ]
Yushi Hayasaka commented on HDFS-15709:
---------------------------------------

How to reproduce (tested on Hadoop 3.3.0):

1. Create a directory for EC, set the policy, and put a file in the directory.

```
# hdfs dfs -mkdir /test
# hdfs ec -setPolicy -path /test
# hdfs ec -getPolicy -path /test
RS-6-3-1024k
# hdfs dfs -put 100mb.bin /test/100mb.bin
```

Block locations:

```
root@d4c831a69d28:/# hdfs fsck /test/100mb.bin -blocks -locations -files
Connecting to namenode via http://namenode:9870/fsck?ugi=root&blocks=1&locations=1&files=1&path=%2Ftest%2F100mb.bin
FSCK started by root (auth:SIMPLE) from /172.20.0.2 for path /test/100mb.bin at Fri Nov 27 11:44:26 UTC 2020
/test/100mb.bin 104857600 bytes, erasure-coded: policy=RS-6-3-1024k, 1 block(s):  OK
0. BP-84905024-172.19.0.2-1606471594228:blk_-9223372036854775776_1008 len=104857600 Live_repl=9
[blk_-9223372036854775776:DatanodeInfoWithStorage[172.20.0.10:9866,DS-b5226020-67f1-4155-86c1-dafccbd97b3d,DISK],
 blk_-9223372036854775775:DatanodeInfoWithStorage[172.20.0.5:9866,DS-280e68b2-03c4-4c3c-b466-86a433520691,DISK],
 blk_-9223372036854775774:DatanodeInfoWithStorage[172.20.0.8:9866,DS-99b5dc6f-067b-45b5-b912-8bb18727e34a,DISK],
 blk_-9223372036854775773:DatanodeInfoWithStorage[172.20.0.11:9866,DS-430b8ea8-06a4-446c-b6c9-23564dcb8629,DISK],
 blk_-9223372036854775772:DatanodeInfoWithStorage[172.20.0.6:9866,DS-9577f0e0-8dc5-49c5-853b-2f5d79558d39,DISK],
 blk_-9223372036854775771:DatanodeInfoWithStorage[172.20.0.3:9866,DS-f61f221d-12d1-40ca-96fb-016f5a87377b,DISK],
 blk_-9223372036854775770:DatanodeInfoWithStorage[172.20.0.4:9866,DS-c8ac5cc6-ddb6-4a32-a6e8-c1317a71d76f,DISK],
 blk_-9223372036854775769:DatanodeInfoWithStorage[172.20.0.7:9866,DS-33d680ec-2894-4030-a732-6014712ffe9e,DISK],
 blk_-9223372036854775768:DatanodeInfoWithStorage[172.20.0.9:9866,DS-0a06cd9c-2cb3-451e-84d2-47cc95c77bbd,DISK]]
# ...
```

2. Stop one of the DNs that holds a data block of the file created above, to trigger reconstruction.
In this example, we stopped the datanode whose address is 172.20.0.5.

3. Get the checksum of the file. According to the debug log, the client gets it from 172.20.0.10 through Op.BLOCK_CHECKSUM_GROUP.

```
$ HADOOP_ROOT_LOGGER=DEBUG,console hdfs dfs -checksum /test/100mb.bin
# ...
2020-11-27 11:02:24,496 DEBUG hdfs.FileChecksumHelper: got reply from DatanodeInfoWithStorage[172.20.0.10:9866,DS-b5226020-67f1-4155-86c1-dafccbd97b3d,DISK]: blockChecksum=5a9bd78c5031dd85249ac98435a2f57d, blockChecksumType=MD5CRC
/test/100mb.bin MD5-of-0MD5-of-512CRC32C 000002000000000000000000db96bc93fdf485e930ed02bb69a189b0
# ...
```

4. Enter 172.20.0.10 (hostname: 24debf6a19bf) and check its sockets using lsof. You can see several connections stuck in CLOSE_WAIT (9866 is the data-transfer port).

```
root@24debf6a19bf:/# lsof -p 346 | grep CLOSE_WAIT
java    346 root  309u  IPv4 929333  0t0  TCP 24debf6a19bf:60100->24debf6a19bf:9866 (CLOSE_WAIT)
java    346 root  479u  IPv4 929336  0t0  TCP 24debf6a19bf:34132->datanode5.docker-hadoop_default:9866 (CLOSE_WAIT)
java    346 root  483u  IPv4 929337  0t0  TCP 24debf6a19bf:42052->datanode2.docker-hadoop_default:9866 (CLOSE_WAIT)
java    346 root  484u  IPv4 929338  0t0  TCP 24debf6a19bf:46856->datanode6.docker-hadoop_default:9866 (CLOSE_WAIT)
java    346 root  485u  IPv4 930279  0t0  TCP 24debf6a19bf:46922->datanode9.docker-hadoop_default:9866 (CLOSE_WAIT)
java    346 root  486u  IPv4 930281  0t0  TCP 24debf6a19bf:39360->datanode3.docker-hadoop_default:9866 (CLOSE_WAIT)
```

> Socket file descriptor leak in StripedBlockChecksumReconstructor
> ----------------------------------------------------------------
>
>            Key: HDFS-15709
>            URL: https://issues.apache.org/jira/browse/HDFS-15709
>        Project: Hadoop HDFS
>     Issue Type: Bug
>     Components: datanode, ec, erasure-coding
>       Reporter: Yushi Hayasaka
>       Priority: Major
>
> We found a socket file descriptor leak when we tried to get the checksum of an EC file with reconstruction happening during the operation.
> The cause of the leak seems to be that StripedBlockChecksumReconstructor does not close its StripedReader. Once the reader is closed, the CLOSE_WAIT connections are gone.
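The shape of the missing cleanup can be sketched in a few lines of Java. This is a simplified, hypothetical stand-in, not the actual Hadoop code: the class names mirror StripedBlockChecksumReconstructor and StripedReader, but the real reader wraps one socket per source datanode (the sockets that show up in CLOSE_WAIT above), and the real fix lives in the HDFS reconstruction path.

```java
import java.io.Closeable;

// Hypothetical stand-in for the HDFS StripedReader. In HDFS it holds a
// connection to each source datanode; leaving it unclosed is what strands
// those sockets in CLOSE_WAIT after the remote side hangs up.
class StripedReader implements Closeable {
    private boolean closed = false;

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
}

// Hypothetical stand-in for StripedBlockChecksumReconstructor.
class StripedBlockChecksumReconstructor {
    private final StripedReader reader = new StripedReader();

    StripedReader getReader() { return reader; }

    void reconstruct() {
        try {
            // ... read from the remaining data/parity blocks and recompute
            // the checksum of the block on the stopped datanode ...
        } finally {
            // The fix: always release the reader (and thus its datanode
            // connections), even if reconstruction fails part-way through.
            reader.close();
        }
    }
}
```

With the close moved into a finally block, the reader is released on every exit path, so the datanode no longer accumulates half-closed sockets each time a checksum request triggers reconstruction.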