[ https://issues.apache.org/jira/browse/HDFS-15709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243755#comment-17243755 ]

Yushi Hayasaka commented on HDFS-15709:
---------------------------------------

How to reproduce (tested on Hadoop 3.3.0):
1. Create a directory, set an erasure coding policy on it, and put a file in it.
```
# hdfs dfs -mkdir /test
# hdfs ec -setPolicy -path /test
# hdfs ec -getPolicy -path /test
RS-6-3-1024k
# hdfs dfs -put 100mb.bin /test/100mb.bin
```

Block locations:
```
root@d4c831a69d28:/# hdfs fsck /test/100mb.bin -blocks -locations -files
Connecting to namenode via http://namenode:9870/fsck?ugi=root&blocks=1&locations=1&files=1&path=%2Ftest%2F100mb.bin
FSCK started by root (auth:SIMPLE) from /172.20.0.2 for path /test/100mb.bin at Fri Nov 27 11:44:26 UTC 2020

/test/100mb.bin 104857600 bytes, erasure-coded: policy=RS-6-3-1024k, 1 block(s): OK
0. BP-84905024-172.19.0.2-1606471594228:blk_-9223372036854775776_1008 len=104857600 Live_repl=9 [
  blk_-9223372036854775776:DatanodeInfoWithStorage[172.20.0.10:9866,DS-b5226020-67f1-4155-86c1-dafccbd97b3d,DISK],
  blk_-9223372036854775775:DatanodeInfoWithStorage[172.20.0.5:9866,DS-280e68b2-03c4-4c3c-b466-86a433520691,DISK],
  blk_-9223372036854775774:DatanodeInfoWithStorage[172.20.0.8:9866,DS-99b5dc6f-067b-45b5-b912-8bb18727e34a,DISK],
  blk_-9223372036854775773:DatanodeInfoWithStorage[172.20.0.11:9866,DS-430b8ea8-06a4-446c-b6c9-23564dcb8629,DISK],
  blk_-9223372036854775772:DatanodeInfoWithStorage[172.20.0.6:9866,DS-9577f0e0-8dc5-49c5-853b-2f5d79558d39,DISK],
  blk_-9223372036854775771:DatanodeInfoWithStorage[172.20.0.3:9866,DS-f61f221d-12d1-40ca-96fb-016f5a87377b,DISK],
  blk_-9223372036854775770:DatanodeInfoWithStorage[172.20.0.4:9866,DS-c8ac5cc6-ddb6-4a32-a6e8-c1317a71d76f,DISK],
  blk_-9223372036854775769:DatanodeInfoWithStorage[172.20.0.7:9866,DS-33d680ec-2894-4030-a732-6014712ffe9e,DISK],
  blk_-9223372036854775768:DatanodeInfoWithStorage[172.20.0.9:9866,DS-0a06cd9c-2cb3-451e-84d2-47cc95c77bbd,DISK]]
# ...
```

2. Stop one of the DataNodes holding a data block of the file created above, in
order to trigger reconstruction. In this example, we stopped the DataNode at
172.20.0.5 (one way to do this is shown below).
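
One possible way to stop that DataNode, assuming shell access to the 172.20.0.5 container; any method that takes the DataNode process down works equally well:
```
# run on the DataNode host at 172.20.0.5
hdfs --daemon stop datanode
```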

3. Get the checksum of the file. According to the debug log, the client gets it
from 172.20.0.10 via Op.BLOCK_CHECKSUM_GROUP.
```
$ HADOOP_ROOT_LOGGER=DEBUG,console hdfs dfs -checksum /test/100mb.bin
# ...
2020-11-27 11:02:24,496 DEBUG hdfs.FileChecksumHelper: got reply from DatanodeInfoWithStorage[172.20.0.10:9866,DS-b5226020-67f1-4155-86c1-dafccbd97b3d,DISK]: blockChecksum=5a9bd78c5031dd85249ac98435a2f57d, blockChecksumType=MD5CRC
/test/100mb.bin MD5-of-0MD5-of-512CRC32C 000002000000000000000000db96bc93fdf485e930ed02bb69a189b0
# ...
```

4. Log in to 172.20.0.10 (hostname: 24debf6a19bf) and check its sockets with lsof.
You can see several CLOSE_WAIT connections (9866 is the data transfer port).
```
root@24debf6a19bf:/# lsof -p 346 | grep CLOSE_WAIT
java 346 root 309u IPv4 929333 0t0 TCP 24debf6a19bf:60100->24debf6a19bf:9866 (CLOSE_WAIT)
java 346 root 479u IPv4 929336 0t0 TCP 24debf6a19bf:34132->datanode5.docker-hadoop_default:9866 (CLOSE_WAIT)
java 346 root 483u IPv4 929337 0t0 TCP 24debf6a19bf:42052->datanode2.docker-hadoop_default:9866 (CLOSE_WAIT)
java 346 root 484u IPv4 929338 0t0 TCP 24debf6a19bf:46856->datanode6.docker-hadoop_default:9866 (CLOSE_WAIT)
java 346 root 485u IPv4 930279 0t0 TCP 24debf6a19bf:46922->datanode9.docker-hadoop_default:9866 (CLOSE_WAIT)
java 346 root 486u IPv4 930281 0t0 TCP 24debf6a19bf:39360->datanode3.docker-hadoop_default:9866 (CLOSE_WAIT)
```

> Socket file descriptor leak in StripedBlockChecksumReconstructor
> ----------------------------------------------------------------
>
>                 Key: HDFS-15709
>                 URL: https://issues.apache.org/jira/browse/HDFS-15709
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ec, erasure-coding
>            Reporter: Yushi Hayasaka
>            Priority: Major
>
> We found a socket file descriptor leak when we tried to get the checksum of an
> EC file while reconstruction happened during the operation.
> The cause of the leak seems to be that StripedBlockChecksumReconstructor does
> not close its StripedReader. Once the reader is closed, the CLOSE_WAIT
> connections are gone.
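
A minimal Java sketch of the close-in-finally pattern the description above points to. The names (ChecksumReconstructorSketch, SourceReader, readStripes, reconstructAndChecksum) are hypothetical stand-ins, not the real Hadoop classes and not the actual patch:
```
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch: the names below only loosely echo the DataNode-side
// erasure coding classes and are NOT the real HDFS-15709 change.
public class ChecksumReconstructorSketch {

  /** Stands in for the striped reader, which holds one socket per source DataNode. */
  interface SourceReader extends Closeable {
    void readStripes() throws IOException;
  }

  private final SourceReader stripedReader;

  public ChecksumReconstructorSketch(SourceReader stripedReader) {
    this.stripedReader = stripedReader;
  }

  /**
   * Without the finally block, the connections opened by the reader are never
   * closed on the DataNode and linger in CLOSE_WAIT after the block-group
   * checksum has been computed.
   */
  public void reconstructAndChecksum() throws IOException {
    try {
      stripedReader.readStripes();
      // ... decode the missing internal block and fold it into the checksum ...
    } finally {
      stripedReader.close(); // release the per-source sockets
    }
  }
}
```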


