YCozy created HDFS-15367:
----------------------------
Summary: Fail to get file checksum even if there's an available
replica.
Key: HDFS-15367
URL: https://issues.apache.org/jira/browse/HDFS-15367
Project: Hadoop HDFS
Issue Type: Bug
Components: dfsclient, namenode
Affects Versions: 2.10.0
Reporter: YCozy
DFSClient can fail to get file checksum even when there's an available replica.
One way to trigger the bug is as follows:
* Start a cluster with three DNs (DN1, DN2, DN3). The default replication
factor is set to 2.
* Both DN1 and DN3 register with NN, as can be seen from NN's log (DN1 uses
port 9866 while DN3 uses port 9666):
{noformat}
2020-05-21 01:24:57,196 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:9866
2020-05-21 01:25:06,155 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:9666{noformat}
* DN1 sends its block report to NN, as can be seen from NN's log:
{noformat}
2020-05-21 01:24:57,336 INFO BlockStateChange: BLOCK* processReport 0x3ae7e5805f2e704e: from storage DS-638ee5ae-e435-4d82-ae4f-9066bc7eb850 node DatanodeRegistration(127.0.0.1:9866, datanodeUuid=b0702574-968f-4817-a660-42ec1c475606, infoPort=9864, infoSecurePort=0, ipcPort=9867, storageInfo=lv=-57;cid=CID-75860997-47d0-4957-a4e6-4edbd79d64b8;nsid=49920454;c=1590024277030), blocks: 0, hasStaleStorage: false, processing time: 3 msecs, invalidatedBlocks: 0{noformat}
* DN3 fails to send its block report to NN because of a network partition,
which we inject to make DN3's blockReport RPC fail. Accordingly, NN's log
contains no "processReport" entry for DN3.
* DFSClient uploads a file. NN chooses DN1 and DN3 to host the replicas. The
network partition on DN3 has ended by then, so the upload succeeds. This can
be verified from NN's log:
{noformat}
2020-05-21 01:25:13,644 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=127.0.0.1:9666, 127.0.0.1:9866 for /dir1/file1._COPYING_{noformat}
* Stop DN1, as can be seen from DN1's log:
{noformat}
2020-05-21 01:25:21,114 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
SHUTDOWN_MSG:{noformat}
* DFSClient tries to get the file checksum. It fails to connect to DN1 and
gives up without trying DN3, which triggers the bug:
{noformat}
20/05/21 01:25:34 INFO hdfs.DFSClient: Connecting to datanode 127.0.0.1:9866
20/05/21 01:25:34 WARN hdfs.DFSClient: src=/dir1/file1, datanodes[0]=DatanodeInfoWithStorage[127.0.0.1:9866,DS-638ee5ae-e435-4d82-ae4f-9066bc7eb850,DISK]
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.hdfs.DFSClient.connectToDN(DFSClient.java:1925)
    at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:1798)
    at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1638)
    at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1635)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1646)
    at org.apache.hadoop.fs.shell.Display$Checksum.processPath(Display.java:199)
    at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:327)
    at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:299)
    at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:281)
    at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:265)
    at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
checksum: Fail to get block MD5 for BP-2092781073-172.17.0.4-1590024277030:blk_1073741825_1001{noformat}
Since DN3 also has a replica of the file, DFSClient should try to contact DN3
to get the checksum.
To verify that DFSClient did not connect to DN3, we changed the DEBUG log
statement in DFSClient.connectToDN() to INFO level. The error messages above
show that DFSClient only tries to connect to DN1.
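The expected behavior described above (fall back to DN3 when DN1 refuses the connection) can be sketched as a simple loop over replica locations. This is only an illustration, not the DFSClient code: the class ReplicaFailoverSketch, the ChecksumSource interface, and the method checksumFromAnyReplica are hypothetical names, and the DataNodes are simulated by a lambda. It makes no claim about where the actual defect lies.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplicaFailoverSketch {

    /** Hypothetical stand-in for "connect to one DataNode and ask for the block checksum". */
    public interface ChecksumSource {
        String fetch(String datanode) throws IOException;
    }

    /**
     * Tries each replica location in order and returns the first checksum obtained.
     * An exception is thrown only after every location has failed.
     */
    public static String checksumFromAnyReplica(List<String> datanodes, ChecksumSource source)
            throws IOException {
        IOException lastFailure = null;
        for (String dn : datanodes) {
            try {
                return source.fetch(dn);
            } catch (IOException e) {
                // Remember the failure and move on to the next replica.
                lastFailure = e;
            }
        }
        throw new IOException("Fail to get block checksum from any of " + datanodes, lastFailure);
    }

    public static void main(String[] args) throws IOException {
        List<String> tried = new ArrayList<>();
        // Simulated cluster state from the report: DN1 (port 9866) is stopped,
        // DN3 (port 9666) is alive and holds a replica.
        ChecksumSource simulated = dn -> {
            tried.add(dn);
            if (dn.endsWith(":9866")) {
                throw new ConnectException("Connection refused");
            }
            return "checksum-from-" + dn;
        };
        String result = checksumFromAnyReplica(
                Arrays.asList("127.0.0.1:9866", "127.0.0.1:9666"), simulated);
        System.out.println(result + " after trying " + tried);
    }
}
```

In this simulation the first location fails with "Connection refused" and the second one answers, which is the outcome the report expects from the real client when DN3 still holds a replica.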