[ 
https://issues.apache.org/jira/browse/HDFS-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11616:
------------------------------
    Target Version/s: 2.8.3  (was: 2.8.1)

> Namenode doesn't mark the block as non-corrupt if the reason for corruption 
> was INVALID_STATE
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11616
>                 URL: https://issues.apache.org/jira/browse/HDFS-11616
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.3
>            Reporter: Rushabh S Shah
>
> Due to a power failure event, we hit HDFS-5042.
> We lost many racks across the cluster.
> There were a couple of missing blocks.
> For a given missing block, the following is the output of fsck.
> {noformat}
> [hdfs@XXX rushabhs]$ hdfs fsck -blockId blk_8566436445
> Connecting to namenode via http://nn1:50070/fsck?ugi=hdfs&blockId=blk_8566436445+&path=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from XXX at Mon Apr 03 16:22:48 UTC 2017
> Block Id: blk_8566436445
> Block belongs to: <file>
> No. of Expected Replica: 3
> No. of live Replica: 0
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 3
> Block replica on datanode/rack: datanodeA is CORRUPT   ReasonCode: INVALID_STATE
> Block replica on datanode/rack: datanodeB is CORRUPT   ReasonCode: INVALID_STATE
> Block replica on datanode/rack: datanodeC is CORRUPT   ReasonCode: INVALID_STATE
> {noformat}
> After the power event, when we restarted the datanodes, the blocks were in the 
> rbw directory.
> When a full block report is sent to the namenode, all the blocks in the rbw 
> directory are converted to the RWR state, and the namenode marked them as 
> corrupt with reason Reason.INVALID_STATE.
> After some time (in this case after 31 hours), when I went to recover the 
> missing blocks, I noticed the following.
> All the datanodes had their copy of the block in the rbw directory, but the 
> file was complete according to the namenode.
> All the replicas had the right size and the correct genstamp, and the 
> {{hdfs debug verify}} command also succeeded.
> I went to dnA and moved the block from the rbw directory to the finalized 
> directory.
> I restarted the datanode (making sure the replicas file was not present during 
> startup).
> I forced an FBR and made sure the datanode sent its block report to the 
> namenode.
> After waiting for some time, the block was still missing.
> I expected the missing block to go away since the replica was now in the 
> finalized directory.
> On investigating further, I found that the namenode removes a replica from the 
> corrupt map only if the reason for corruption was {{GENSTAMP_MISMATCH}}.
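
The behavior described in the last sentence can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not the actual namenode code: the class `CorruptReplicaSketch`, its methods, and the two-value `Reason` enum are hypothetical names introduced for illustration. Only the reason names `GENSTAMP_MISMATCH` and `INVALID_STATE` come from the report.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical, simplified model of the bookkeeping described in the report.
 * It is NOT the real HDFS implementation; all names here are illustrative.
 */
final class CorruptReplicaSketch {

    /** Only the two reasons mentioned in the report are modeled. */
    enum Reason { GENSTAMP_MISMATCH, INVALID_STATE }

    /** Corrupt map for a single block: datanode name -> corruption reason. */
    private final Map<String, Reason> corruptReplicas = new HashMap<>();

    /** Record that the replica on the given datanode is corrupt. */
    void markCorrupt(String datanode, Reason reason) {
        corruptReplicas.put(datanode, reason);
    }

    boolean isCorrupt(String datanode) {
        return corruptReplicas.containsKey(datanode);
    }

    /**
     * Called when a datanode reports a finalized replica of the block.
     * Mirrors the reported behavior: the entry is cleared only when the
     * recorded reason is GENSTAMP_MISMATCH, so an INVALID_STATE entry
     * stays in the map even though the replica is now finalized.
     */
    void onFinalizedReplicaReported(String datanode) {
        if (corruptReplicas.get(datanode) == Reason.GENSTAMP_MISMATCH) {
            corruptReplicas.remove(datanode);
        }
    }

    public static void main(String[] args) {
        CorruptReplicaSketch block = new CorruptReplicaSketch();
        block.markCorrupt("dnA", Reason.INVALID_STATE);
        block.markCorrupt("dnB", Reason.GENSTAMP_MISMATCH);

        // Both datanodes later report a finalized replica.
        block.onFinalizedReplicaReported("dnA");
        block.onFinalizedReplicaReported("dnB");

        System.out.println("dnA still corrupt: " + block.isCorrupt("dnA"));
        System.out.println("dnB still corrupt: " + block.isCorrupt("dnB"));
    }
}
```

In this model, dnA corresponds to the datanode in the report: its replica was moved to the finalized directory and reported again, yet its entry stays in the corrupt map because the recorded reason was `INVALID_STATE`.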



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

