[ https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16362677#comment-16362677 ]
Kihwal Lee commented on HDFS-13111:
-----------------------------------

It looks like "trying forever" might actually be part of the problem.

{code:java}
  public Replica recoverClose(....) throws IOException {
    while (true) {
      try {
        try (AutoCloseableLock lock = datasetLock.acquire()) {
          // check replica's state
          ReplicaInfo replicaInfo = recoverCheck(b, newGS, expectedBlockLen);
          // update the replica state/gs and finalize if necessary.
          return replicaInfo;
        }
      } catch (MustStopExistingWriter e) {
        // Interrupt the existing writer and retry. There is no bound on
        // how many times this loop can go around.
        e.getReplica().stopWriter(datanode.getDnConf().getXceiverStopTimeout());
      }
    }
  }
{code}

When the I/O frees up, the original writer (normally the packet responder) and the xceiver thread doing {{recoverClose()}} can finish in non-deterministic order. If {{recoverClose()}} finishes last, everything is good. If the packet responder finishes last, as in the example above, the replica will be marked as corrupt until the next full block report. A toy simulation of this ordering is sketched below, after the quoted issue.

> Close recovery may incorrectly mark blocks corrupt
> --------------------------------------------------
>
>                 Key: HDFS-13111
>                 URL: https://issues.apache.org/jira/browse/HDFS-13111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives
> from one of the DNs. The reason is unclear, but it has happened multiple
> times when a DN has I/O-saturated disks.
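To make the ordering problem concrete, here is a minimal, self-contained sketch of the race. This is not Hadoop code: {{CloseRecoveryRace}}, the genstamp values, and the thread bodies are made-up stand-ins for the packet responder and the xceiver thread running {{recoverClose()}}. All it demonstrates is that whichever thread publishes its state last wins.

{code:java}
// Toy simulation of the close-recovery race. NOT Hadoop code: the class,
// values, and thread bodies are hypothetical stand-ins.
import java.util.concurrent.atomic.AtomicLong;

public class CloseRecoveryRace {
  // Stand-in for the replica state the DataNode ends up reporting to the NN.
  static final AtomicLong reportedGenStamp = new AtomicLong(1000);

  public static void main(String[] args) throws InterruptedException {
    final long newGS = 1001; // genstamp assigned by close recovery

    // Stand-in for the packet responder: stuck on saturated I/O, and on its
    // way out it still publishes the replica with the OLD genstamp.
    Thread packetResponder = new Thread(() -> {
      try {
        Thread.sleep(50); // pretend the disk finally responded
      } catch (InterruptedException ignored) {
        // stopWriter() interrupted us, but we still run our exit path.
      }
      reportedGenStamp.set(1000); // stale state, possibly published last
    });
    packetResponder.start();

    // Stand-in for the xceiver thread in recoverClose(): interrupt the
    // writer (~ stopWriter(timeout)) and publish the recovered genstamp.
    packetResponder.interrupt();
    reportedGenStamp.set(newGS);

    packetResponder.join();
    long gs = reportedGenStamp.get();
    // If the packet responder published last, the stale genstamp survives,
    // which is the state the NN would flag as corrupt until the next FBR.
    System.out.println("reported genstamp = " + gs
        + (gs == newGS ? " (ok)" : " (marked corrupt)"));
  }
}
{code}

Run it a few times: the two {{set()}} calls race, and when the stand-in packet responder publishes last the stale genstamp wins, which corresponds to the replica staying marked corrupt until the next full block report.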