[ 
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-14941:
---------------------------------------
    Attachment: HDFS-14941.001.patch

> Potential editlog race condition can cause corrupted file
> ---------------------------------------------------------
>
>                 Key: HDFS-14941
>                 URL: https://issues.apache.org/jira/browse/HDFS-14941
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>            Priority: Major
>              Labels: ha
>         Attachments: HDFS-14941.001.patch
>
>
> Recently we encountered an issue that, after a failover, NameNode complains 
> corrupted file/missing blocks. The blocks did recover after full block 
> reports, so the blocks are not actually missing. After further investigation, 
> we believe this is what happened:
> First of all, on SbN, it is possible that it receives block reports before 
> corresponding edit tailing happened. In which case SbN postpones processing 
> the DN block report, handled by the guarding logic below:
> {code:java}
>       if (shouldPostponeBlocksFromFuture &&
>           namesystem.isGenStampInFuture(iblk)) {
>         queueReportedBlock(storageInfo, iblk, reportedState,
>             QUEUE_REASON_FUTURE_GENSTAMP);
>         continue;
>       }
> {code}
> Basically if reported block has a future generation stamp, the DN report gets 
> requeued.
> However, in {{FSNamesystem#storeAllocatedBlock}}, we have the following code:
> {code:java}
>       // allocate new block, record block locations in INode.
>       newBlock = createNewBlock();
>       INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
>       saveAllocatedBlock(src, inodesInPath, newBlock, targets);
>       persistNewBlock(src, pendingFile);
>       offset = pendingFile.computeFileSize();
> {code}
> The line
>  {{newBlock = createNewBlock();}}
>  Would log an edit entry {{OP_SET_GENSTAMP_V2}} to bump generation stamp on 
> Standby
>  while the following line
>  {{persistNewBlock(src, pendingFile);}}
>  would log another edit entry {{OP_ADD_BLOCK}} to actually add the block on 
> Standby.
> Then the race condition is that, imagine Standby has just processed 
> {{OP_SET_GENSTAMP_V2}}, but not yet {{OP_ADD_BLOCK}} (if they just happen to 
> be in different setment). Now a block report with new generation stamp comes 
> in.
> Since the genstamp bump has already been processed, the reported block may 
> not be considered as future block. So the guarding logic passes. But 
> actually, the block hasn't been added to blockmap, because the second edit is 
> yet to be tailed. So, the block then gets added to invalidate block list and 
> we saw messages like:
> {code:java}
> BLOCK* addBlock: block XXX on node XXX size XXX does not belong to any file
> {code}
> Even worse, since this IBR is effectively lost, the NameNode has no 
> information about this block, until the next full block report. So after a 
> failover, the NN marks it as corrupt.
> This issue won't happen though, if both of the edit entries get tailed all 
> together, so no IBR processing can happen in between. But in our case, we set 
> edit tailing interval to super low (to allow Standby read), so when under 
> high workload, there is a much much higher chance that the two entries are 
> tailed separately, causing the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to