[ 
https://issues.apache.org/jira/browse/HDFS-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-7150:
-------------------------------------

    Assignee: Wei-Chiu Chuang

> MissingBlocks > 0 when all replicas are on decomm-in-progress nodes
> -------------------------------------------------------------------
>
>                 Key: HDFS-7150
>                 URL: https://issues.apache.org/jira/browse/HDFS-7150
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Wei-Chiu Chuang
>
> Our clusters recently raised a false alert: the NN metric MissingBlocks was 
> greater than 0 even though all replicas of the affected blocks were on 
> decomm-in-progress nodes. Normally, when a block has replicas only on 
> decomm-in-progress nodes, it is not counted as missing. It turns out this 
> can happen if decomm-in-progress nodes lose their heartbeat and then 
> reconnect to the NN. The scenario is as follows.
> 1. Kick off decomm on several nodes across different racks.
> 2. The NN loses the heartbeat from 3 decomm-in-progress nodes around the 
> same time. BM's neededReplications is updated as part of the 
> BM.removeStoredBlock process. If block A's 3 replicas happen to be on these 
> 3 nodes, block A is moved to BM's 
> neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. So at this point, 
> block A is counted as missing.
> 3. These 3 nodes reconnect to the NN. However, block A remains in BM's 
> neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue until it is replicated 
> to other live nodes.
> HDFS-7128 will mitigate the issue by making decommission faster, but it is 
> better to fix the correctness issue. When decomm-in-progress nodes reconnect 
> to the NN, their blocks should be moved out of BM's 
> neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. This will also give 
> replication of these blocks higher priority.
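
The scenario and the proposed fix can be illustrated with a toy model. This is a minimal sketch, not Hadoop code: ToyBlockManager, its methods, the requeue_on_reconnect switch, and the QUEUE_NEEDS_REPLICATION name are all hypothetical stand-ins chosen for illustration. Only the QUEUE_WITH_CORRUPT_BLOCKS name comes from the description above. The model assumes a block counts as missing exactly when it sits in the corrupt queue.

```python
# Toy model of the reported scenario; NOT the real BlockManager.
QUEUE_NEEDS_REPLICATION = "needs_replication"      # hypothetical name
QUEUE_WITH_CORRUPT_BLOCKS = "with_corrupt_blocks"  # name taken from the report


class ToyBlockManager:
    """Hypothetical stand-in for the NN's BlockManager (BM)."""

    def __init__(self, requeue_on_reconnect):
        # Assumption: this flag models the fix proposed in the report.
        self.requeue_on_reconnect = requeue_on_reconnect
        self.replicas = {}  # block -> set of nodes the NN knows hold a replica
        self.queue = {}     # block -> queue the block currently sits in

    def _classify(self, block):
        # A block with no known replica lands in the corrupt queue.
        if self.replicas[block]:
            return QUEUE_NEEDS_REPLICATION
        return QUEUE_WITH_CORRUPT_BLOCKS

    def start_decommission(self, block, nodes):
        # Step 1: all replicas of the block are on decomm-in-progress nodes.
        self.replicas[block] = set(nodes)
        self.queue[block] = self._classify(block)

    def heartbeat_lost(self, node):
        # Step 2: losing a node removes its replicas and re-queues the block.
        for block, holders in self.replicas.items():
            if node in holders:
                holders.discard(node)
                self.queue[block] = self._classify(block)

    def node_reconnected(self, node, blocks):
        # Step 3: the replicas are known again, but without the fix the
        # block is not moved out of the corrupt queue.
        for block in blocks:
            self.replicas.setdefault(block, set()).add(node)
            if self.requeue_on_reconnect:
                self.queue[block] = self._classify(block)

    def missing_blocks(self):
        return sum(1 for q in self.queue.values()
                   if q == QUEUE_WITH_CORRUPT_BLOCKS)


def run_scenario(requeue_on_reconnect):
    """Return (MissingBlocks after heartbeat loss, MissingBlocks after reconnect)."""
    bm = ToyBlockManager(requeue_on_reconnect)
    nodes = ["dn1", "dn2", "dn3"]
    bm.start_decommission("A", nodes)
    for node in nodes:
        bm.heartbeat_lost(node)
    after_loss = bm.missing_blocks()
    for node in nodes:
        bm.node_reconnected(node, ["A"])
    return after_loss, bm.missing_blocks()
```

In this model, run_scenario(False) leaves block A counted as missing after the nodes reconnect, while run_scenario(True) clears it as soon as the first replica is reported again.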



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
