Chen Liang created HDFS-14941:
---------------------------------

             Summary: Potential editlog race condition can cause corrupted file
                 Key: HDFS-14941
                 URL: https://issues.apache.org/jira/browse/HDFS-14941
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: Chen Liang


Recently we encountered an issue where, after a failover, the NameNode complained about corrupted files/missing blocks. The blocks did recover after full block reports, so the blocks were not actually missing. After further investigation, we believe this is what happened:

First of all, the Standby NameNode (SbN) may receive block reports before the corresponding edits have been tailed. In that case the SbN postpones processing the reported blocks, handled by the guarding logic below:
{code:java}
      // If the Standby has not yet tailed the edits that produced this
      // generation stamp, postpone the block instead of processing it now.
      if (shouldPostponeBlocksFromFuture &&
          namesystem.isGenStampInFuture(iblk)) {
        queueReportedBlock(storageInfo, iblk, reportedState,
            QUEUE_REASON_FUTURE_GENSTAMP);
        continue;
      }
{code}
Basically, if a reported block has a future generation stamp, the block gets queued and is only processed later.
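
In other words, the Standby compares the reported generation stamp against the newest one it has learned from tailed edits. A minimal sketch of that predicate (simplified; the real check lives in {{BlockIdManager}} and also special-cases legacy blocks):
{code:java}
// Simplified: a reported block is "from the future" when its generation
// stamp is newer than the latest genstamp the Standby has tailed.
static boolean isGenStampInFuture(long reportedGenStamp, long tailedGenStamp) {
  return reportedGenStamp > tailedGenStamp;
}
{code}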

However, in {{FSNamesystem#storeAllocatedBlock}}, we have the following code:
{code:java}
      // allocate new block, record block locations in INode.
      newBlock = createNewBlock();   // logs OP_SET_GENSTAMP_V2
      INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
      saveAllocatedBlock(src, inodesInPath, newBlock, targets);

      persistNewBlock(src, pendingFile);   // logs OP_ADD_BLOCK
      offset = pendingFile.computeFileSize();
{code}
The line {{newBlock = createNewBlock();}} logs an edit entry {{OP_SET_GENSTAMP_V2}}, which, once tailed, bumps the generation stamp on the Standby, while the following line {{persistNewBlock(src, pendingFile);}} logs another edit entry {{OP_ADD_BLOCK}}, which, once tailed, actually adds the block on the Standby.

The race condition is this: imagine the Standby has just processed {{OP_SET_GENSTAMP_V2}} but not yet {{OP_ADD_BLOCK}} (they may happen to fall in different edit segments). Now a block report with the new generation stamp comes in, as the toy model below illustrates.
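
A self-contained toy model of the interleaving (illustration only, not HDFS code; all names are made up):
{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/**
 * Toy model of the race. The Active writes two edits per block allocation
 * (genstamp bump, then block add); the Standby tails them in batches. A
 * block report arriving between the two batches slips past the guard.
 */
public class GenstampRaceDemo {
  // Standby state.
  static long standbyGenStamp = 1000;            // advanced by OP_SET_GENSTAMP_V2
  static Set<Long> blocksMap = new HashSet<>();  // populated by OP_ADD_BLOCK
  static Queue<Long> postponed = new ArrayDeque<>();

  // Standby-side report handling, mirroring the guard quoted above.
  static void processReportedBlock(long blockId, long genStamp) {
    if (genStamp > standbyGenStamp) {
      postponed.add(blockId);                    // future genstamp: queue it
      System.out.println("postponed block " + blockId);
    } else if (!blocksMap.contains(blockId)) {
      // Guard passed but the block is unknown, so it gets invalidated.
      System.out.println("BLOCK* addBlock: block " + blockId
          + " does not belong to any file");
    } else {
      System.out.println("accepted block " + blockId);
    }
  }

  public static void main(String[] args) {
    long blockId = 42;
    long newGenStamp = 1001;

    // Tailing batch 1: only OP_SET_GENSTAMP_V2 has been applied.
    standbyGenStamp = newGenStamp;

    // A DN report with the new genstamp arrives between the batches.
    processReportedBlock(blockId, newGenStamp);  // wrongly invalidated

    // Tailing batch 2: OP_ADD_BLOCK arrives too late, the report is lost.
    blocksMap.add(blockId);
  }
}
{code}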

Since the genstamp bump has already been processed, the reported block is no longer considered a future block, so the guarding logic passes. But the block has not actually been added to the block map, because the second edit has yet to be tailed. The block therefore gets added to the invalidate list, and we saw messages like:
{code:java}
BLOCK* addBlock: block XXX on node XXX size XXX does not belong to any file
{code}
Even worse, since this IBR is effectively lost, the NameNode has no information about the block until the next full block report. So after a failover, the NN marks it as corrupt.

This issue won't happen, though, if both edit entries get tailed together, since then no IBR processing can happen in between. But in our case, we set the edit tailing interval very low (to allow reads from the Standby), so under high workload there is a much higher chance that the two entries are tailed separately, causing the issue. A sketch of the relevant tuning follows.
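
For reference, this is roughly the kind of tuning involved; a sketch assuming the standard tail-edits keys in recent Hadoop releases ({{dfs.ha.tail-edits.in-progress}}, {{dfs.ha.tail-edits.period}}):
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class TailEditsTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // In-progress edit tailing (used for reads from the Standby) plus a
    // near-zero tail period means edits are applied in many small batches,
    // raising the odds that OP_SET_GENSTAMP_V2 and OP_ADD_BLOCK land in
    // different batches.
    conf.setBoolean("dfs.ha.tail-edits.in-progress", true);
    conf.setTimeDuration("dfs.ha.tail-edits.period", 0, TimeUnit.MILLISECONDS);
  }
}
{code}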


