[ 
https://issues.apache.org/jira/browse/HDFS-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920863#comment-16920863
 ] 

hemanthboyina commented on HDFS-11616:
--------------------------------------

{code:java}
else { // COMPLETE block, same genstamp
  if (reportedState == ReplicaState.RBW) {
    .....
    LOG.info("Received an RBW replica for {} on {}: ignoring it, since "
        + "it is complete with the same genstamp", storedBlock, dn);
    return null;
  } else {
    return new BlockToMarkCorrupt(new Block(reported), storedBlock,
        "reported replica has invalid state " + reportedState,
        Reason.INVALID_STATE);
  }
{code}
Here we add the replica to the corrupt map with reason INVALID_STATE,
but when removing a replica from the corrupt map we only check for reason GENSTAMP_MISMATCH.
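
To make the mismatch concrete, here is a minimal, self-contained sketch. It is a toy model, not the actual BlockManager or CorruptReplicasMap code: the class CorruptMapSketch, its methods, and the string replica keys are all hypothetical and exist only for illustration. The second call shows one possible direction (also clearing INVALID_STATE entries) and is not a proposed patch.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Toy model of the behaviour described above. All names here are
// hypothetical; this is NOT the real HDFS namenode API.
class CorruptMapSketch {
  enum Reason { GENSTAMP_MISMATCH, INVALID_STATE }

  // replica key (e.g. "blk_1@dnA") -> reason it was marked corrupt
  private final Map<String, Reason> corrupt = new HashMap<>();

  void markCorrupt(String replica, Reason reason) {
    corrupt.put(replica, reason);
  }

  // Models the removal path: the entry is dropped only when the stored
  // reason is one of the reasons the caller is willing to clear.
  boolean removeIfReasonIn(String replica, EnumSet<Reason> clearable) {
    Reason stored = corrupt.get(replica);
    if (stored == null || !clearable.contains(stored)) {
      return false;
    }
    corrupt.remove(replica);
    return true;
  }

  boolean isCorrupt(String replica) {
    return corrupt.containsKey(replica);
  }

  public static void main(String[] args) {
    CorruptMapSketch map = new CorruptMapSketch();
    map.markCorrupt("blk_1@dnA", Reason.INVALID_STATE);

    // Behaviour as described: only GENSTAMP_MISMATCH entries are cleared,
    // so a later good report leaves the INVALID_STATE entry in place.
    boolean removed = map.removeIfReasonIn(
        "blk_1@dnA", EnumSet.of(Reason.GENSTAMP_MISMATCH));
    System.out.println("removed=" + removed
        + " stillCorrupt=" + map.isCorrupt("blk_1@dnA"));

    // Illustrative alternative: treat INVALID_STATE as clearable too.
    removed = map.removeIfReasonIn(
        "blk_1@dnA",
        EnumSet.of(Reason.GENSTAMP_MISMATCH, Reason.INVALID_STATE));
    System.out.println("removed=" + removed
        + " stillCorrupt=" + map.isCorrupt("blk_1@dnA"));
  }
}
```

In this model the first call prints removed=false stillCorrupt=true, which mirrors the stuck "missing block" in the report, and the second prints removed=true stillCorrupt=false.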

The bug exists. Any suggestions, [~shahrs87] [~jojochuang]?

> Namenode doesn't mark the block as non-corrupt if the reason for corruption 
> was INVALID_STATE
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11616
>                 URL: https://issues.apache.org/jira/browse/HDFS-11616
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.3
>            Reporter: Rushabh S Shah
>            Priority: Major
>
> Due to a power failure event, we hit HDFS-5042.
> We lost many racks across the cluster.
> There were a couple of missing blocks.
> For a given missing block, the following is the output of fsck.
> {noformat}
> [hdfs@XXX rushabhs]$ hdfs fsck -blockId blk_8566436445
> Connecting to namenode via 
> http://nn1:50070/fsck?ugi=hdfs&blockId=blk_8566436445+&path=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from XXX at Mon Apr 03 16:22:48 UTC 
> 2017
> Block Id: blk_8566436445
> Block belongs to: <file>
> No. of Expected Replica: 3
> No. of live Replica: 0
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 3
> Block replica on datanode/rack: datanodeA is CORRUPT   ReasonCode: 
> INVALID_STATE
> Block replica on datanode/rack: datanodeB is CORRUPT   ReasonCode: 
> INVALID_STATE
> Block replica on datanode/rack: datanodeC is CORRUPT   ReasonCode: 
> INVALID_STATE
> {noformat}
> After the power event, when we restarted the datanode, the blocks were in the
> rbw directory.
> When a full block report is sent to the namenode, all the blocks from the rbw
> directory get converted into RWR state, and the namenode marked them as
> corrupt with reason Reason.INVALID_STATE.
> After some time (in this case after 31 hours), when I went to recover the
> missing blocks, I noticed the following things.
> All the datanodes had their copy of the block in the rbw directory, but the
> file was complete according to the namenode.
> All the replicas had the right size and correct genstamp, and the {{hdfs debug
> verify}} command also succeeded.
> I went to dnA and moved the block from the rbw directory to the finalized
> directory.
> I restarted the datanode (making sure the replicas file was not present
> during startup).
> I forced an FBR and made sure the datanode block reported to the namenode.
> After waiting for some time, that block was still missing.
> I expected the missing block to go away since the replica is in the FINALIZED
> directory.
> On investigating more, I found out that the namenode will remove the replica
> from the corrupt map only if the reason for corruption was
> {{GENSTAMP_MISMATCH}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
