[ https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354076#comment-16354076 ]
Kihwal Lee commented on HDFS-13111:
-----------------------------------

Here are the relevant log lines from the latest example. This node was under a heavy I/O load.

{noformat}
2018-02-06 00:00:03,413 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] INFO Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.4:57710 dest: /1.2.3.5:1004
2018-02-06 00:09:58,840 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:462ms (threshold=300ms)
2018-02-06 00:10:40,148 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:11155ms (threshold=300ms)
2018-02-06 00:10:46,053 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:1577ms (threshold=300ms)
2018-02-06 00:11:02,376 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:327ms (threshold=300ms)
2018-02-06 00:11:53,064 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.4:40532 dest: /1.2.3.5:1004
2018-02-06 00:12:09,782 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO Recover failed close BP-YYY:blk_7654321_1234567
2018-02-06 00:12:13,081 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.7:46522 dest: /1.2.3.5:1004
2018-02-06 00:12:13,081 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO Recover failed close BP-YYY:blk_7654321_1234567
2018-02-06 00:12:17,276 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] WARN Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl lockHeldTimeMs=7492 ms. Suppressed 0 lock warnings. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1556) ...
// it was recoverClose()
2018-02-06 00:12:17,276 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO Received BP-YYY:blk_7654321_1135832806836 src: /1.2.3.7:46522 dest: /1.2.3.5:1004 of size xx
2018-02-06 00:12:20,103 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO Received BP-YYY:blk_7654321_1135832805246 src: /1.2.3.4:40532 dest: /1.2.3.5:1004 of size xx
2018-02-06 00:12:38,353 [PacketResponder: BP-YYY:blk_7654321_1234567, type=LAST_IN_PIPELINE] INFO DataNode.clienttrace: src: /1.2.3.4:57710, dest: /1.2.3.5:1004, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_XXX, offset: 0, srvID: ZZZ, blockid: BP-YYY:blk_7654321_1234567, duration: looong
{noformat}

Note the client port numbers, which identify the individual writer threads. After two "successful" {{recoverClose()}} calls, the original writer comes around and also declares success. This must have caused the reported gen stamp to go backward; the replica actually on disk was the latest one. This clearly illustrates that it is wrong to time out waiting for the writer to terminate and then continue with the recovery anyway.

> Close recovery may incorrectly mark blocks corrupt
> --------------------------------------------------
>
>                 Key: HDFS-13111
>                 URL: https://issues.apache.org/jira/browse/HDFS-13111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives
> from one of the DNs. The reason is unclear but has happened multiple times
> when a DN has I/O-saturated disks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
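The race in the timeline above can be sketched as a toy model (plain Python, not actual HDFS code; the class and method names here are hypothetical, and the gen stamps are taken from the logs): each {{recoverClose()}} bumps the generation stamp the NameNode expects, so when the delayed original writer finally reports its block with the pre-recovery gen stamp, the replica looks stale/corrupt until the next full block report corrects it.

```python
# Toy model of the close-recovery race described in the comment above.
# Not HDFS code: names are hypothetical; only the gen-stamp ordering matters.

class NameNodeView:
    """Minimal stand-in for the NameNode's view of one block."""

    def __init__(self, expected_gs):
        self.expected_gs = expected_gs
        self.corrupt = False

    def recover_close(self, new_gs):
        # A successful close recovery bumps the expected generation stamp.
        self.expected_gs = new_gs

    def block_received(self, reported_gs):
        # A report carrying an older gen stamp than expected makes the
        # replica appear corrupt, even though the on-disk copy is current.
        if reported_gs < self.expected_gs:
            self.corrupt = True

nn = NameNodeView(expected_gs=1234567)      # original write: blk_7654321_1234567
nn.recover_close(1135832805246)             # first "successful" recoverClose()
nn.recover_close(1135832806836)             # second "successful" recoverClose()
nn.block_received(1234567)                  # delayed original writer reports old GS
print(nn.corrupt)                           # replica now flagged until the next FBR
```

Under this model, the fix direction the comment implies is that recovery should not proceed while the original writer may still report, i.e. timing out on writer termination is what lets the stale report land last.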