[ https://issues.apache.org/jira/browse/HDFS-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang reassigned HDFS-7150:
-------------------------------------
Assignee: Wei-Chiu Chuang
> MissingBlocks > 0 when all replicas are on decomm-in-progress nodes
> -------------------------------------------------------------------
>
> Key: HDFS-7150
> URL: https://issues.apache.org/jira/browse/HDFS-7150
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Wei-Chiu Chuang
>
> Our clusters recently raised a false alert: the NN metric MissingBlocks > 0
> even though all replicas of the affected blocks were on decomm-in-progress
> nodes. Normally, a block whose replicas are only on decomm-in-progress nodes
> is not counted as missing. It turns out this can happen if decomm-in-progress
> nodes lose heartbeat and then reconnect to the NN. The scenario is as
> follows.
> 1. Kick off decomm on several nodes across different racks.
> 2. The NN loses heartbeat from 3 decomm-in-progress nodes at around the same
> time. BM's neededReplications is updated as part of the BM.removeStoredBlock
> process. If all 3 replicas of block A happen to be on these 3 nodes, block A
> is moved to BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. So at
> this point, block A is counted as missing.
> 3. These 3 nodes reconnect to the NN. However, block A remains in BM's
> neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue until it is replicated
> to other live nodes.
> HDFS-7128 will mitigate the issue by making decommission faster, but it is
> better to fix the correctness problem itself. When decomm-in-progress nodes
> reconnect to the NN, the affected blocks should be moved out of BM's
> neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. This will also give
> replication of these blocks higher priority.
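
The scenario and the proposed fix can be illustrated with a small toy model. This is only a sketch, not the actual BlockManager code: the class DecommQueueSketch, its methods, and the `requeue` flag are hypothetical names made up for illustration, and the queue constants are illustrative values that only mirror the queue names mentioned above. It assumes every replica of the block lives on a decomm-in-progress node.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the reported behavior; hypothetical names, not the Hadoop API. */
public class DecommQueueSketch {
    // Illustrative queue levels (assumed values, named after the queues in the report).
    static final int QUEUE_HIGHEST_PRIORITY = 0;
    static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

    // block -> queue level the block currently sits in
    private final Map<String, Integer> queueOf = new HashMap<>();
    // block -> decomm-in-progress nodes currently known to hold a replica
    private final Map<String, Set<String>> decommReplicas = new HashMap<>();

    // No known replica at all: corrupt/missing queue. Otherwise: highest priority.
    private static int priorityFor(int decommReplicaCount) {
        return decommReplicaCount == 0 ? QUEUE_WITH_CORRUPT_BLOCKS : QUEUE_HIGHEST_PRIORITY;
    }

    private Set<String> replicasOf(String block) {
        return decommReplicas.computeIfAbsent(block, b -> new HashSet<>());
    }

    // Initial block report from a decomm-in-progress node.
    void addReplica(String block, String node) {
        Set<String> nodes = replicasOf(block);
        nodes.add(node);
        queueOf.put(block, priorityFor(nodes.size()));
    }

    // Step 2: heartbeat loss removes the replica and re-evaluates the queue.
    void heartbeatLost(String block, String node) {
        Set<String> nodes = replicasOf(block);
        nodes.remove(node);
        queueOf.put(block, priorityFor(nodes.size()));
    }

    // Step 3: the node reconnects. With requeue == false the block stays where
    // it was (the reported behavior); with requeue == true it is re-evaluated
    // (the proposed fix).
    void reconnect(String block, String node, boolean requeue) {
        Set<String> nodes = replicasOf(block);
        nodes.add(node);
        if (requeue) {
            queueOf.put(block, priorityFor(nodes.size()));
        }
    }

    // Models the MissingBlocks metric: blocks sitting in the corrupt queue.
    long missingBlocks() {
        return queueOf.values().stream()
                .filter(q -> q == QUEUE_WITH_CORRUPT_BLOCKS)
                .count();
    }

    // Runs steps 1-3 for block A with all replicas on three decomm nodes.
    public static long missingAfterReconnect(boolean requeue) {
        DecommQueueSketch bm = new DecommQueueSketch();
        String[] nodes = {"dn1", "dn2", "dn3"};
        for (String n : nodes) bm.addReplica("blockA", n);
        for (String n : nodes) bm.heartbeatLost("blockA", n);
        for (String n : nodes) bm.reconnect("blockA", n, requeue);
        return bm.missingBlocks();
    }

    public static void main(String[] args) {
        System.out.println("without re-queue: " + missingAfterReconnect(false)); // 1
        System.out.println("with re-queue:    " + missingAfterReconnect(true));  // 0
    }
}
```

In this model, the block is still reported as missing after all three nodes are back unless the reconnect path re-evaluates its queue, which is the change the description asks for.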
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)