[
https://issues.apache.org/jira/browse/HDFS-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jing Zhao updated HDFS-5672:
----------------------------
Attachment: HDFS-5672.000.patch
Uploaded a patch to fix the failure.
We can consistently reproduce the issue with this change in
TestHASafeMode#testSafeBlockTracking:
{code}
     } finally {
+      cluster.shutdownNameNode(1);
       for (FSDataOutputStream stm : stms) {
         IOUtils.closeStream(stm);
       }
     }
{code}
And the fix is just one line in BlockManager#processReportedBlock:
{code}
     if (isBlockUnderConstruction(storedBlock, ucState, reportedState)) {
-      toUC.add(new StatefulBlockInfo(
-          (BlockInfoUnderConstruction)storedBlock, block, reportedState));
+      toUC.add(new StatefulBlockInfo((BlockInfoUnderConstruction) storedBlock,
+          new Block(block), reportedState));
       return storedBlock;
     }
{code}
The root cause is that when BlockManager#reportDiff calls processReportedBlock
for each reported block, the block parameter is always the same Block object,
which BlockReportIterator reuses across iterations. As a result, every entry
added to the toUC list references that one object, so the list ends up holding
incorrect information. That wrong information is later recorded as
ReplicaUnderConstruction in the corresponding BlockInfo object. When the
corresponding file is closed, the NN checks the replicas for the block and marks
them as stale if it finds an inconsistency in the generation stamp. This
ultimately affects the safe block count calculation.
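The aliasing problem can be shown outside HDFS with a minimal sketch. The
MutableBlock class and the REPORTED data below are hypothetical stand-ins, not
the real Block or BlockReportIterator classes. The sketch only shows why
storing a reference to a reused object loses information, and why a defensive
copy (as new Block(block) does in the patch) preserves it:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockAliasingDemo {
    /** Hypothetical stand-in for a mutable block; not the real HDFS Block. */
    static final class MutableBlock {
        long id;
        long genStamp;

        MutableBlock() {}

        /** Copy constructor: snapshots the current state of another block. */
        MutableBlock(MutableBlock other) {
            this.id = other.id;
            this.genStamp = other.genStamp;
        }

        void set(long id, long genStamp) {
            this.id = id;
            this.genStamp = genStamp;
        }
    }

    /** Made-up report contents: {blockId, generationStamp} pairs. */
    static final long[][] REPORTED = { {1, 1001}, {2, 1002}, {3, 1003} };

    /**
     * Walks the report with a single reused block object, the way an iterator
     * that recycles its current element would, and collects one entry per
     * reported block. With copy == false the list stores the shared reference;
     * with copy == true it stores a snapshot of each block.
     */
    public static List<Long> collectIds(boolean copy) {
        MutableBlock reused = new MutableBlock();
        List<MutableBlock> collected = new ArrayList<>();
        for (long[] r : REPORTED) {
            reused.set(r[0], r[1]); // overwrites the state seen by earlier entries
            collected.add(copy ? new MutableBlock(reused) : reused);
        }
        List<Long> ids = new ArrayList<>();
        for (MutableBlock b : collected) {
            ids.add(b.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("aliased: " + collectIds(false)); // prints [3, 3, 3]
        System.out.println("copied:  " + collectIds(true));  // prints [1, 2, 3]
    }
}
```

Without the copy, every collected entry reports the last block seen, which
mirrors how the toUC entries end up describing the wrong block.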
In the unit test, when the standby NN restarts, if all the DNs have pending IBRs
for it, the SBN processes those IBRs before processing the first full block
report. The SBN then calls processReport, instead of processFirstBlockReport,
to process the full block reports from all the DNs. The bug above is therefore
hit 3 times, and the safe block count is never incremented for the
corresponding blocks.
> TestHASafeMode#testSafeBlockTracking fails in trunk
> ---------------------------------------------------
>
> Key: HDFS-5672
> URL: https://issues.apache.org/jira/browse/HDFS-5672
> Project: Hadoop HDFS
> Issue Type: Test
> Affects Versions: 2.4.0
> Reporter: Ted Yu
> Assignee: Jing Zhao
> Attachments: HDFS-5672.000.patch
>
>
> From build #1614:
> {code}
> TestHASafeMode.testSafeBlockTracking:623->assertSafeMode:488 Bad safemode
> status: 'Safe mode is ON. The reported blocks 3 needs additional 7 blocks to
> reach the threshold 0.9990 of total blocks 10.
> Safe mode will be turned off automatically'
> {code}