[
https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Shvachko updated HDFS-14941:
---------------------------------------
Status: Patch Available (was: Open)
Attached a patch that fixes the problem and adds a unit test to
reproduce the race condition. This is based on [~vagarychen]'s original patch.
> Potential editlog race condition can cause corrupted file
> ---------------------------------------------------------
>
> Key: HDFS-14941
> URL: https://issues.apache.org/jira/browse/HDFS-14941
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Chen Liang
> Assignee: Chen Liang
> Priority: Major
> Labels: ha
> Attachments: HDFS-14941.001.patch
>
>
> Recently we encountered an issue where, after a failover, the NameNode
> complained about corrupted files/missing blocks. The blocks did recover
> after full block reports, so they were not actually missing. After further
> investigation, we believe this is what happened:
> First of all, the SbN may receive a block report before the corresponding
> edits have been tailed. In that case the SbN postpones processing the DN
> block report, which is handled by the guarding logic below:
> {code:java}
> if (shouldPostponeBlocksFromFuture &&
>     namesystem.isGenStampInFuture(iblk)) {
>   queueReportedBlock(storageInfo, iblk, reportedState,
>       QUEUE_REASON_FUTURE_GENSTAMP);
>   continue;
> }
> {code}
> Basically, if a reported block has a future generation stamp, the DN report
> gets requeued.
> However, in {{FSNamesystem#storeAllocatedBlock}}, we have the following code:
> {code:java}
> // allocate new block, record block locations in INode.
> newBlock = createNewBlock();
> INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
> saveAllocatedBlock(src, inodesInPath, newBlock, targets);
> persistNewBlock(src, pendingFile);
> offset = pendingFile.computeFileSize();
> {code}
> The line
> {{newBlock = createNewBlock();}}
> logs an edit entry {{OP_SET_GENSTAMP_V2}} to bump the generation stamp on
> the Standby, while the later line
> {{persistNewBlock(src, pendingFile);}}
> logs another edit entry {{OP_ADD_BLOCK}} to actually add the block on the
> Standby.
> The race condition is then as follows. Suppose the Standby has just
> processed {{OP_SET_GENSTAMP_V2}}, but not yet {{OP_ADD_BLOCK}} (for example,
> if the two entries happen to be in different segments). Now a block report
> with the new generation stamp comes in.
> Since the genstamp bump has already been processed, the reported block is no
> longer considered a future block, so it passes the guarding logic. But the
> block has not actually been added to the blockmap yet, because the second
> edit is yet to be tailed. So the block gets added to the invalidate block
> list, and we saw messages like:
> {code:java}
> BLOCK* addBlock: block XXX on node XXX size XXX does not belong to any file
> {code}
> Even worse, since this IBR (incremental block report) is effectively lost,
> the NameNode has no information about this block until the next full block
> report. So after a failover, the NN marks it as corrupt.
> This issue does not occur if both edit entries are tailed together, since
> no IBR processing can happen in between. In our case, however, we set the
> edit tailing interval very low (to allow Standby reads), so under high
> workload there is a much higher chance that the two entries are tailed
> separately, causing the issue.
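
To make the interleaving in the description concrete, here is a minimal, self-contained Java sketch. It is a toy model and not NameNode code: the class StandbyRaceSketch, its methods, and the Outcome enum are hypothetical names invented for illustration, and a block is identified by its generation stamp alone for brevity. It only shows why a report that arrives between the two edits slips past the future-genstamp guard and is then treated as not belonging to any file.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the Standby state. All names are hypothetical, not HDFS APIs.
class StandbyRaceSketch {
    enum Outcome { QUEUED, ACCEPTED, INVALIDATED }

    // Highest generation stamp applied from tailed edits so far.
    private long genStamp = 1000L;
    // Blocks known to the Standby, keyed by generation stamp (simplification).
    private final Set<Long> blocksMap = new HashSet<>();

    // Stands in for applying OP_SET_GENSTAMP_V2.
    void applySetGenStamp(long gs) {
        genStamp = Math.max(genStamp, gs);
    }

    // Stands in for applying OP_ADD_BLOCK.
    void applyAddBlock(long gs) {
        blocksMap.add(gs);
    }

    // Stands in for processing one reported block from a DN.
    Outcome processReportedBlock(long reportedGs) {
        if (reportedGs > genStamp) {
            return Outcome.QUEUED;       // future genstamp: postpone the report
        }
        if (!blocksMap.contains(reportedGs)) {
            return Outcome.INVALIDATED;  // "does not belong to any file"
        }
        return Outcome.ACCEPTED;
    }

    // Report arrives before either edit has been tailed.
    static Outcome reportBeforeEdits() {
        StandbyRaceSketch sbn = new StandbyRaceSketch();
        return sbn.processReportedBlock(1001L);
    }

    // Report arrives after the genstamp bump but before the block is added.
    static Outcome reportBetweenEdits() {
        StandbyRaceSketch sbn = new StandbyRaceSketch();
        sbn.applySetGenStamp(1001L);
        return sbn.processReportedBlock(1001L);
    }

    // Both edits are tailed together before the report is processed.
    static Outcome reportAfterBothEdits() {
        StandbyRaceSketch sbn = new StandbyRaceSketch();
        sbn.applySetGenStamp(1001L);
        sbn.applyAddBlock(1001L);
        return sbn.processReportedBlock(1001L);
    }

    public static void main(String[] args) {
        System.out.println("before edits:  " + reportBeforeEdits());
        System.out.println("between edits: " + reportBetweenEdits());
        System.out.println("after both:    " + reportAfterBothEdits());
    }
}
```

In this model only the "between edits" case loses the report. The other two orderings either postpone it or accept it, which matches the observation that the problem disappears when both entries are tailed together.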