[
https://issues.apache.org/jira/browse/HDFS-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649466#comment-13649466
]
Todd Lipcon commented on HDFS-4799:
-----------------------------------
The issue is the following:
When the block report containing the last "good" replica comes in, the number
of good live replicas reaches the target replication count, so the NameNode
calls {{invalidateCorruptReplicas}}:
{code}
if ((corruptReplicasCount > 0) && (numLiveReplicas >= fileReplication))
invalidateCorruptReplicas(storedBlock);
{code}
This function then (correctly) calls {{invalidateBlock}} on each corrupt
replica. However, because this is the first block report since the NameNode
became active, the reporting node is still considered to have "stale block
information", so the invalidation is postponed:
{code}
2013-04-30 17:21:40,945 INFO BlockStateChange: BLOCK* invalidateBlock:
blk_-XXX_5512300(same as stored) on XXX:1004
2013-04-30 17:21:40,945 INFO BlockStateChange: BLOCK* invalidateBlocks:
postponing invalidation of blk_XXX_5512300(same as stored) on XXX:1004 because
1 replica(s) are located on nodes with potentially out-of-date block reports
{code}
However, this code path in {{invalidateCorruptReplicas}} still runs:
{code}
// Remove the block from corruptReplicasMap
if (!gotException)
corruptReplicas.removeFromCorruptReplicasMap(blk);
{code}
So, after this function, all of the replicas of the block are considered live,
rather than corrupt.
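The failure mode can be reduced to a few lines. The sketch below is a toy model, not the real BlockManager code (the names, data structures, and the string block/node identifiers are all simplified for illustration): a postponed invalidation does not throw, so {{gotException}} stays false and the corrupt marking is dropped even though the bad replica was never queued for deletion.

```java
import java.util.*;

// Toy model (not HDFS code) of the invalidateCorruptReplicas flow:
// a "stale" node causes invalidateBlock to postpone (no exception thrown),
// yet the block is still dropped from the corrupt-replicas map.
public class CorruptMapBugSketch {
    // corruptReplicasMap: block id -> nodes holding a corrupt replica
    static Map<String, Set<String>> corruptReplicas = new HashMap<>();
    static Set<String> staleNodes = new HashSet<>();
    static List<String> postponed = new ArrayList<>();

    // Returns true if the replica was actually queued for deletion,
    // false if invalidation was postponed because the node is stale.
    static boolean invalidateBlock(String block, String node) {
        if (staleNodes.contains(node)) {
            postponed.add(block + "@" + node); // postponed -- but no exception
            return false;
        }
        return true;
    }

    static void invalidateCorruptReplicas(String block) {
        boolean gotException = false;
        for (String node : corruptReplicas.get(block)) {
            try {
                invalidateBlock(block, node);
            } catch (RuntimeException e) {
                gotException = true;
            }
        }
        // Bug: runs even though every invalidation was merely postponed.
        if (!gotException)
            corruptReplicas.remove(block);
    }

    public static void main(String[] args) {
        staleNodes.add("dn1");
        corruptReplicas.put("blk_1", new HashSet<>(Set.of("dn1")));
        invalidateCorruptReplicas("blk_1");
        // The corrupt marking is gone, but nothing was actually invalidated.
        System.out.println("corrupt map: " + corruptReplicas); // prints {}
        System.out.println("postponed:   " + postponed); // prints [blk_1@dn1]
    }
}
```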
Then, at the end of block report processing, it marks this node as no longer
stale, re-processes the postponed invalidations, and sees that the block is
over-replicated (6 replicas instead of 3). Since nothing is in
{{corruptReplicasMap}} anymore, {{chooseExcessReplicates}} may mistakenly
choose all of the good replicas for removal instead of the bad replicas.
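To see why the empty map matters, here is a hypothetical excess-replica picker; it is illustrative only and is not the real {{chooseExcessReplicates}} logic or signature. With the corrupt replicas still recorded, only the stale-genstamp replicas are chosen for removal; with the map emptied prematurely, the good replicas become deletion candidates.

```java
import java.util.*;

public class ExcessReplicaSketch {
    // Hypothetical picker: delete replicas until only `replication` remain,
    // preferring replicas still marked corrupt. Names are illustrative.
    static List<String> chooseExcess(List<String> replicas,
                                     Set<String> corrupt, int replication) {
        List<String> excess = new ArrayList<>();
        List<String> live = new ArrayList<>(replicas);
        live.removeAll(corrupt);       // corrupt replicas don't count as live
        excess.addAll(corrupt);        // delete known-corrupt replicas first
        while (live.size() > replication)
            excess.add(live.remove(live.size() - 1));
        return excess;
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("old1", "old2", "old3",
                                        "new1", "new2", "new3");
        // With the stale-genstamp replicas correctly marked corrupt,
        // exactly the three old replicas are chosen for removal.
        System.out.println(chooseExcess(replicas,
                Set.of("old1", "old2", "old3"), 3));
        // After the premature removeFromCorruptReplicasMap, the corrupt set
        // is empty, so the picker removes good replicas instead.
        System.out.println(chooseExcess(replicas, Set.of(), 3));
        // prints [new3, new2, new1]
    }
}
```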
I'm working on a unit test and fix:
- {{invalidateCorruptReplicas}} should not remove the replicas from the
corruptReplicasMap unless the replicas were also removed from the blocks map.
- separately, I think the block-reporting node should not be considered
stale during its own block report -- i.e. it's incorrect to postpone blocks in
the above context -- though I will target that as a separate patch.
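A minimal sketch of the first fix, assuming {{invalidateBlock}} is changed to report whether the replica was really removed from the blocks map (that return-value contract is an assumption of this sketch, not the current method signature):

```java
import java.util.*;

public class CorruptMapFixSketch {
    static Map<String, Set<String>> corruptReplicas = new HashMap<>();
    static Set<String> staleNodes = new HashSet<>();

    // Assumed contract: returns true only if the replica was actually
    // scheduled for deletion (and so removed from the blocks map);
    // returns false when invalidation is postponed for a stale node.
    static boolean invalidateBlock(String block, String node) {
        return !staleNodes.contains(node);
    }

    static void invalidateCorruptReplicas(String block) {
        boolean removedFromBlocksMap = true;
        for (String node : corruptReplicas.get(block))
            removedFromBlocksMap &= invalidateBlock(block, node);
        // Fix: keep the corrupt marking while any invalidation is
        // postponed, so later processing still knows which replicas are bad.
        if (removedFromBlocksMap)
            corruptReplicas.remove(block);
    }

    public static void main(String[] args) {
        staleNodes.add("dn1");
        corruptReplicas.put("blk_1", new HashSet<>(Set.of("dn1")));
        invalidateCorruptReplicas("blk_1");
        // The block stays in the corrupt map until truly invalidated.
        System.out.println(corruptReplicas.containsKey("blk_1")); // prints true
    }
}
```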
> Corrupt replica can be prematurely removed from corruptReplicas map
> -------------------------------------------------------------------
>
> Key: HDFS-4799
> URL: https://issues.apache.org/jira/browse/HDFS-4799
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.4-alpha
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
>
> We saw the following sequence of events in a cluster result in losing the
> most recent genstamp of a block:
> - client is writing to a pipeline of 3
> - nodes in the pipeline failed over some period of time, leaving 3
> old-genstamp replicas on the original three nodes and recruiting 3 new
> replicas with a later genstamp.
> -- so, we have 6 total replicas in the cluster, three with old genstamps on
> downed nodes, and 3 with the latest genstamp
> - cluster reboots, and the nodes with old genstamps blockReport first. The
> replicas are correctly added to the corrupt replicas map since they have a
> too-old genstamp
> - the nodes with the new genstamp block report. When the latest one block
> reports, chooseExcessReplicates is called and incorrectly decides to remove
> the three good replicas, leaving only the old-genstamp replicas.