[
https://issues.apache.org/jira/browse/HDFS-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920863#comment-16920863
]
hemanthboyina commented on HDFS-11616:
--------------------------------------
{code:java}
else { // COMPLETE block, same genstamp
  if (reportedState == ReplicaState.RBW) {
    .....
    LOG.info("Received an RBW replica for {} on {}: ignoring it, since "
        + "it is complete with the same genstamp", storedBlock, dn);
    return null;
  } else {
    return new BlockToMarkCorrupt(new Block(reported), storedBlock,
        "reported replica has invalid state " + reportedState,
        Reason.INVALID_STATE);
  }
}
{code}
We add the replica to the corrupt map with reason INVALID_STATE,
but when removing it from the corrupt map we only check for reason GENSTAMP_MISMATCH.
The bug still exists. Any suggestions, [~shahrs87] [~jojochuang]?
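To make the asymmetry concrete, here is a minimal, self-contained sketch. It is not the real BlockManager or CorruptReplicasMap code: the class CorruptMapSketch, its methods, and the string keys are hypothetical, and the Reason enum lists only the two values mentioned in this thread. The second call shows one possible direction (also accepting INVALID_STATE on a good report). Whether that is safe in the real namenode is an open question.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of a corrupt-replica map; not actual HDFS code. */
public class CorruptMapSketch {
  /** Only the two reasons discussed in this thread; the real enum has more. */
  public enum Reason { GENSTAMP_MISMATCH, INVALID_STATE }

  // key: "blockId@datanode", value: why the replica was marked corrupt
  private final Map<String, Reason> corrupt = new HashMap<>();

  public void markCorrupt(String block, String dn, Reason reason) {
    corrupt.put(block + "@" + dn, reason);
  }

  public boolean isCorrupt(String block, String dn) {
    return corrupt.containsKey(block + "@" + dn);
  }

  /**
   * Models a later report of a good replica from the same datanode.
   * The entry is removed only if its stored reason is in {@code clearable}.
   */
  public boolean clearOnGoodReport(String block, String dn, Set<Reason> clearable) {
    String key = block + "@" + dn;
    Reason stored = corrupt.get(key);
    if (stored != null && clearable.contains(stored)) {
      corrupt.remove(key);
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    CorruptMapSketch map = new CorruptMapSketch();
    map.markCorrupt("blk_8566436445", "datanodeA", Reason.INVALID_STATE);

    // Behaviour described in this issue: only GENSTAMP_MISMATCH is checked.
    boolean cleared = map.clearOnGoodReport("blk_8566436445", "datanodeA",
        EnumSet.of(Reason.GENSTAMP_MISMATCH));
    System.out.println("cleared=" + cleared
        + " stillCorrupt=" + map.isCorrupt("blk_8566436445", "datanodeA"));

    // Assumption: one possible fix is to accept INVALID_STATE as well.
    cleared = map.clearOnGoodReport("blk_8566436445", "datanodeA",
        EnumSet.of(Reason.GENSTAMP_MISMATCH, Reason.INVALID_STATE));
    System.out.println("cleared=" + cleared
        + " stillCorrupt=" + map.isCorrupt("blk_8566436445", "datanodeA"));
  }
}
```

With the first reason set, the INVALID_STATE entry stays in the map even after a good report, which matches the symptom in the description (the block stays missing after the FBR).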
> Namenode doesn't mark the block as non-corrupt if the reason for corruption
> was INVALID_STATE
> ---------------------------------------------------------------------------------------------
>
> Key: HDFS-11616
> URL: https://issues.apache.org/jira/browse/HDFS-11616
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.7.3
> Reporter: Rushabh S Shah
> Priority: Major
>
> Due to a power failure event, we hit HDFS-5042.
> We lost many racks across the cluster.
> There were a couple of missing blocks.
> For a given missing block, the following is the output of fsck.
> {noformat}
> [hdfs@XXX rushabhs]$ hdfs fsck -blockId blk_8566436445
> Connecting to namenode via
> http://nn1:50070/fsck?ugi=hdfs&blockId=blk_8566436445+&path=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from XXX at Mon Apr 03 16:22:48 UTC
> 2017
> Block Id: blk_8566436445
> Block belongs to: <file>
> No. of Expected Replica: 3
> No. of live Replica: 0
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 3
> Block replica on datanode/rack: datanodeA is CORRUPT ReasonCode:
> INVALID_STATE
> Block replica on datanode/rack: datanodeB is CORRUPT ReasonCode:
> INVALID_STATE
> Block replica on datanode/rack: datanodeC is CORRUPT ReasonCode:
> INVALID_STATE
> {noformat}
> After the power event, when we restarted the datanode, the blocks were in the rbw
> directory.
> When a full block report is sent to the namenode, all the blocks from the rbw
> directory get converted into RWR state, and the namenode marked them as corrupt
> with reason Reason.INVALID_STATE.
> After some time (in this case after 31 hours), when I went to recover the missing
> blocks, I noticed the following things.
> All the datanodes had their copy of the block in the rbw directory, but the file
> was complete according to the namenode.
> All the replicas had the right size and correct genstamp, and the {{hdfs debug
> verify}} command also succeeded.
> I went to dnA and moved the block from the rbw directory to the finalized directory.
> I restarted the datanode (making sure the replicas file was not present during
> startup).
> I forced an FBR and made sure the datanode block reported to the namenode.
> After waiting for some time, the block was still missing.
> I expected the missing block to go away since the replica is in the FINALIZED
> directory.
> On investigating more, I found out that the namenode will remove the replica from
> the corrupt map only if the reason for corruption was {{GENSTAMP_MISMATCH}}.