[
https://issues.apache.org/jira/browse/HDFS-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junping Du updated HDFS-11616:
------------------------------
Target Version/s: 2.8.3 (was: 2.8.1)
> Namenode doesn't mark the block as non-corrupt if the reason for corruption
> was INVALID_STATE
> ---------------------------------------------------------------------------------------------
>
> Key: HDFS-11616
> URL: https://issues.apache.org/jira/browse/HDFS-11616
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.7.3
> Reporter: Rushabh S Shah
>
> Due to a power failure event, we hit HDFS-5042.
> We lost many racks across the cluster.
> There were a couple of missing blocks.
> For a given missing block, the following is the output of fsck.
> {noformat}
> [hdfs@XXX rushabhs]$ hdfs fsck -blockId blk_8566436445
> Connecting to namenode via http://nn1:50070/fsck?ugi=hdfs&blockId=blk_8566436445+&path=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from XXX at Mon Apr 03 16:22:48 UTC 2017
> Block Id: blk_8566436445
> Block belongs to: <file>
> No. of Expected Replica: 3
> No. of live Replica: 0
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 3
> Block replica on datanode/rack: datanodeA is CORRUPT ReasonCode: INVALID_STATE
> Block replica on datanode/rack: datanodeB is CORRUPT ReasonCode: INVALID_STATE
> Block replica on datanode/rack: datanodeC is CORRUPT ReasonCode: INVALID_STATE
> {noformat}
> After the power event, when we restarted the datanode, the blocks were in the
> rbw directory.
> When a full block report is sent to the namenode, all the blocks from the rbw
> directory get converted into RWR state, and the namenode marked them as
> corrupt with reason Reason.INVALID_STATE.
> After some time (in this case, 31 hours), when I went to recover the missing
> blocks, I noticed the following things.
> All the datanodes had their copy of the block in the rbw directory, but the
> file was complete according to the namenode.
> All the replicas had the right size and correct genstamp, and the {{hdfs debug
> verify}} command also succeeded.
> I went to dnA and moved the block from the rbw directory to the finalized
> directory.
> I restarted the datanode (making sure the replicas file was not present during
> startup).
> I forced an FBR and made sure the datanode block reported to the namenode.
> After waiting for some time, that block was still missing.
> I expected the missing block to go away since the replica is in the FINALIZED
> directory.
> On investigating more, I found out that the namenode will remove the replica
> from the corrupt map only if the reason for corruption was
> {{GENSTAMP_MISMATCH}}.
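
The behaviour described in the report can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual namenode code: the class CorruptReplicaModel, its methods, and the clearableReasons parameter are hypothetical names introduced for illustration. Only the reason names GENSTAMP_MISMATCH and INVALID_STATE and the replica states come from the report. The model assumes that a FINALIZED report already carries the right size and genstamp.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, not the real HDFS namenode classes.
class CorruptReplicaModel {
    enum Reason { GENSTAMP_MISMATCH, INVALID_STATE }
    enum ReplicaState { FINALIZED, RBW, RWR }

    // Datanode name -> reason its replica of one block was marked corrupt.
    private final Map<String, Reason> corruptReplicas = new HashMap<>();

    // Reasons that a later FINALIZED report is allowed to clear (assumption:
    // this set is how the model expresses the condition described above).
    private final EnumSet<Reason> clearableReasons;

    CorruptReplicaModel(EnumSet<Reason> clearableReasons) {
        this.clearableReasons = EnumSet.copyOf(clearableReasons);
    }

    // Process one reported replica of a block whose file is already complete.
    void processReport(String datanode, ReplicaState state) {
        if (state == ReplicaState.RWR) {
            // An RWR replica of a complete file is marked corrupt.
            corruptReplicas.put(datanode, Reason.INVALID_STATE);
        } else if (state == ReplicaState.FINALIZED) {
            Reason previous = corruptReplicas.get(datanode);
            // The corrupt entry is removed only for clearable reasons.
            if (previous != null && clearableReasons.contains(previous)) {
                corruptReplicas.remove(datanode);
            }
        }
    }

    boolean isCorrupt(String datanode) {
        return corruptReplicas.containsKey(datanode);
    }
}
```

In this model, constructing it with EnumSet.of(Reason.GENSTAMP_MISMATCH), reporting datanodeA as RWR and then as FINALIZED leaves isCorrupt("datanodeA") true, which matches the observation that the block stayed missing. Constructing it with EnumSet.allOf(Reason.class) clears the entry on the FINALIZED report, which is the outcome the reporter expected.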