YCozy created HDFS-15367:
----------------------------

             Summary: Fail to get file checksum even if there's an available 
replica.
                 Key: HDFS-15367
                 URL: https://issues.apache.org/jira/browse/HDFS-15367
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsclient, namenode
    Affects Versions: 2.10.0
            Reporter: YCozy


DFSClient can fail to get file checksum even when there's an available replica. 
One possible triggering process of the bug is as follows:
 * Start a cluster with three DNs (DN1, DN2, DN3). The default replication 
factor is set to 2.
 * Both DN1 and DN3 register with NN, as can be seen from NN's log (DN1 uses 
port 9866 while DN3 uses port 9666):

{noformat}
2020-05-21 01:24:57,196 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/127.0.0.1:9866
2020-05-21 01:25:06,155 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/127.0.0.1:9666{noformat}
 * DN1 sends block report to NN, as can be seen from NN's log:

{noformat}
2020-05-21 01:24:57,336 INFO BlockStateChange: BLOCK* processReport 
0x3ae7e5805f2e704e: from storage DS-638ee5ae-e435-4d82-ae4f-9066bc7eb850 node 
DatanodeRegistration(127.0.0.1:9866, 
datanodeUuid=b0702574-968f-4817-a660-42ec1c475606, infoPort=9864, 
infoSecurePort=0, ipcPort=9867, 
storageInfo=lv=-57;cid=CID-75860997-47d0-4957-a4e6-4edbd79d64b8;nsid=49920454;c=1590024277030),
 blocks: 0, hasStaleStorage: false, processing time: 3 msecs, 
invalidatedBlocks: 0{noformat}
 * DN3 fails to send its block report to NN because of a network partition: we 
injected a partition to make DN3's blockReport RPC fail. Accordingly, NN's log 
contains no "processReport" entry for DN3.
 * DFSClient uploads a file. NN chooses DN1 and DN3 to host the replicas. The 
network partition on DN3 then heals, so the file is uploaded successfully. This 
can be verified in NN's log:

{noformat}
2020-05-21 01:25:13,644 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocate blk_1073741825_1001, replicas=127.0.0.1:9666, 127.0.0.1:9866 for 
/dir1/file1._COPYING_{noformat}
 * DN1 is stopped, as can be seen from DN1's log:

{noformat}
2020-05-21 01:25:21,114 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
SHUTDOWN_MSG:{noformat}
 * DFSClient tries to get the file checksum. It fails to connect to DN1 and 
gives up without trying any other replica. The bug is triggered.

{noformat}
20/05/21 01:25:34 INFO hdfs.DFSClient: Connecting to datanode 127.0.0.1:9866
20/05/21 01:25:34 WARN hdfs.DFSClient: src=/dir1/file1, 
datanodes[0]=DatanodeInfoWithStorage[127.0.0.1:9866,DS-638ee5ae-e435-4d82-ae4f-9066bc7eb850,DISK]
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.hdfs.DFSClient.connectToDN(DFSClient.java:1925)
        at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:1798)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1638)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1635)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1646)
        at 
org.apache.hadoop.fs.shell.Display$Checksum.processPath(Display.java:199)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:327)
        at 
org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:299)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:281)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:265)
        at 
org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
checksum: Fail to get block MD5 for 
BP-2092781073-172.17.0.4-1590024277030:blk_1073741825_1001{noformat}
Since DN3 also holds a replica of the block, DFSClient should fall back to DN3 
to get the checksum instead of giving up.
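A minimal, hypothetical sketch (not the actual DFSClient code, and the names are made up) of the failover behavior the client could apply: try each datanode that hosts a replica before reporting failure, rather than aborting on the first ConnectException.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: iterate over all replica locations instead of
// giving up after the first unreachable datanode, which is what the
// reported behavior amounts to (only DN1 at 127.0.0.1:9866 is tried).
public class ReplicaFailover {
    // connect returns Optional.empty() when a datanode is unreachable.
    public static Optional<String> checksumFromAnyReplica(
            List<String> datanodes,
            Function<String, Optional<String>> connect) {
        for (String dn : datanodes) {
            Optional<String> md5 = connect.apply(dn);
            if (md5.isPresent()) {
                return md5;  // got the block MD5 from this replica
            }
            // Unreachable: fall through and try the next replica
            // (e.g. DN3 at 127.0.0.1:9666 in this report).
        }
        return Optional.empty();  // all replicas unreachable
    }
}
```

With this loop, stopping DN1 while DN3 is alive would still yield a checksum.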

To verify that DFSClient never contacted DN3, we changed the DEBUG log 
statement in DFSClient.connectToDN() to INFO. The error messages above show 
that DFSClient only tried to connect to DN1 (127.0.0.1:9866).
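For context, the file checksum involved here (MD5MD5CRC32FileChecksum) is composed by fetching a per-block MD5 from a datanode hosting each block and then hashing the concatenation, which is why failing to reach any replica of a block fails the whole call. A simplified sketch of that final composition step (this is an illustration, not the Hadoop implementation):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Simplified sketch of the MD5-of-block-MD5s composition: each entry in
// blockMd5s would come from one reachable datanode hosting that block.
public class CompositeChecksum {
    public static byte[] md5OfBlockMd5s(byte[][] blockMd5s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (byte[] blockMd5 : blockMd5s) {
                md.update(blockMd5);  // concatenate per-block digests
            }
            return md.digest();       // file-level MD5 over the block MD5s
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required to be available", e);
        }
    }
}
```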

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
