[ 
https://issues.apache.org/jira/browse/HDFS-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-5672:
----------------------------

    Attachment: HDFS-5672.000.patch

Uploaded a patch to fix the issue.

We can consistently reproduce the issue with this change in 
TestHASafeMode#testSafeBlockTracking:
{code}
     } finally {
+     cluster.shutdownNameNode(1);
       for (FSDataOutputStream stm : stms) {
         IOUtils.closeStream(stm);
       }
    }
{code}

And the fix is just one line in BlockManager#processReportedBlock:
{code}
     if (isBlockUnderConstruction(storedBlock, ucState, reportedState)) {
-      toUC.add(new StatefulBlockInfo(
-          (BlockInfoUnderConstruction)storedBlock, block, reportedState));
+      toUC.add(new StatefulBlockInfo((BlockInfoUnderConstruction) storedBlock,
+          new Block(block), reportedState));
       return storedBlock;
     }
{code}

The issue is that when BlockManager#reportDiff iteratively calls 
processReportedBlock to process reported blocks, the block parameter passed to 
processReportedBlock is always the same Block object, which BlockReportIterator 
reuses. As a result, every entry added to the toUC list refers to that one 
object, so the list ends up containing incorrect information. This wrong 
information is later recorded as ReplicaUnderConstruction in the corresponding 
BlockInfo object. When the corresponding file is closed, the NN checks the 
replicas of the block and marks them as stale if it finds an inconsistency in 
the generation stamp. This ultimately affects the safe block count calculation.
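
To illustrate the aliasing problem in isolation, here is a minimal, 
self-contained sketch. It does not use the real Hadoop classes: Block, 
ReusingIterator and collect below are hypothetical stand-ins that only mimic 
an iterator handing out one reused mutable object, and a caller that either 
stores the reference directly or stores a copy (as the patch does with 
new Block(block)).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class SharedBlockAliasing {
    /** Hypothetical stand-in for a mutable block (id + generation stamp). */
    static final class Block {
        long id;
        long genStamp;

        Block(long id, long genStamp) {
            this.id = id;
            this.genStamp = genStamp;
        }

        /** Copy constructor: snapshots the current values of another block. */
        Block(Block other) {
            this(other.id, other.genStamp);
        }

        void set(long id, long genStamp) {
            this.id = id;
            this.genStamp = genStamp;
        }
    }

    /** Iterator that overwrites and returns the same Block on every call. */
    static final class ReusingIterator implements Iterator<Block> {
        private final long[][] data;
        private final Block shared = new Block(0, 0);
        private int pos = 0;

        ReusingIterator(long[][] data) {
            this.data = data;
        }

        @Override
        public boolean hasNext() {
            return pos < data.length;
        }

        @Override
        public Block next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            shared.set(data[pos][0], data[pos][1]);
            pos++;
            return shared; // same object every time
        }
    }

    /** Collects the reported blocks, either by reference or by copy. */
    static List<Block> collect(long[][] report, boolean copy) {
        List<Block> out = new ArrayList<>();
        Iterator<Block> it = new ReusingIterator(report);
        while (it.hasNext()) {
            Block b = it.next();
            out.add(copy ? new Block(b) : b);
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] report = {{1, 1001}, {2, 1002}, {3, 1003}};
        // Without copying, all three entries show id=3 genStamp=1003.
        for (Block b : collect(report, false)) {
            System.out.println("aliased: id=" + b.id + " genStamp=" + b.genStamp);
        }
        // With copying, each entry keeps its own id and generation stamp.
        for (Block b : collect(report, true)) {
            System.out.println("copied:  id=" + b.id + " genStamp=" + b.genStamp);
        }
    }
}
```

In this sketch, the entries collected by reference all end up with the values 
of the last reported block, which is analogous to how stale or mismatched 
generation stamps could end up in the toUC list.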

In the unit test, when the standby NN (SBN) restarts, if all the DNs have 
pending IBRs for it, the SBN will first process the IBRs before processing the 
first full block report. The SBN will then call processReport, instead of 
processFirstBlockReport, to process the full block reports from all the DNs. In 
this way, the above bug is hit 3 times and the safe block count is not 
increased for the corresponding blocks.

> TestHASafeMode#testSafeBlockTracking fails in trunk
> ---------------------------------------------------
>
>                 Key: HDFS-5672
>                 URL: https://issues.apache.org/jira/browse/HDFS-5672
>             Project: Hadoop HDFS
>          Issue Type: Test
>    Affects Versions: 2.4.0
>            Reporter: Ted Yu
>            Assignee: Jing Zhao
>         Attachments: HDFS-5672.000.patch
>
>
> From build #1614:
> {code}
>  TestHASafeMode.testSafeBlockTracking:623->assertSafeMode:488 Bad safemode 
> status: 'Safe mode is ON. The reported blocks 3 needs additional 7 blocks to 
> reach the threshold 0.9990 of total blocks 10.
> Safe mode will be turned off automatically'
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
