[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244876#comment-15244876
 ] 

Konstantin Shvachko commented on HDFS-10301:
--------------------------------------------

More details.
# My DataNode has 6 storages. It sends a block report, times out, and then 
sends the same block report five more times with different blockReportIds.
# The NameNode starts executing all six reports around the same time and 
interleaves them; that is, it processes the first storage of the second 
report (BR2) before it processes the last storage of the first report (BR1). 
(Color-coded logs are coming.)
# While processing storages from BR2, the NameNode changes the 
lastBlockReportId field to the id of BR2. This messes with processing storages 
from BR1 that have not been processed yet. Namely, these storages are 
considered zombies, and all replicas are removed from those storages along 
with the storages themselves.
# The storage is then reconstructed by the NameNode when it receives a 
heartbeat from the DataNode, but it is marked as "stale", and its replicas 
will not be reconstructed until the next block report, which in my case is a 
few hours later.
# I noticed missing blocks because several DataNodes exhibited the same 
behavior and all replicas of the same block were lost.
# The replicas eventually reappeared (several hours later), because DataNodes 
do not physically remove the replicas and report them in the next block report.
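
To make steps 2 and 3 concrete, here is a toy Java model of the race. It is a 
sketch under stated assumptions, not the NameNode code: the class 
ZombieStorageSketch, its methods, and the storage names s1 and s2 are all 
hypothetical, and it assumes only what is described above, namely that each 
storage is tagged with the id of the report that last touched it and that 
storages whose tag differs from the finishing report's id are treated as 
zombies.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the race; NOT the actual NameNode code. All names are hypothetical. */
public class ZombieStorageSketch {

    /** Stand-in for the NameNode's view of one DataNode. */
    static class DataNodeState {
        // storage id -> id of the block report that last touched this storage
        final Map<String, Long> lastReportIdByStorage = new LinkedHashMap<>();

        /** Process one storage of a report: tag the storage with that report's id. */
        void processStorage(long reportId, String storageId) {
            lastReportIdByStorage.put(storageId, reportId);
        }

        /** After the last storage of a report: drop storages tagged with another id. */
        List<String> finishReport(long reportId) {
            List<String> removed = new ArrayList<>();
            Iterator<Map.Entry<String, Long>> it =
                    lastReportIdByStorage.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() != reportId) { // looks like a zombie storage
                    removed.add(e.getKey());
                    it.remove();
                }
            }
            return removed;
        }
    }

    /** Runs BR1 and BR2 over storages s1 and s2; returns storages removed as zombies. */
    static List<String> simulate(boolean interleaved) {
        DataNodeState dn = new DataNodeState();
        List<String> removed = new ArrayList<>();
        if (!interleaved) {
            // BR1 runs to completion, then BR2 does.
            dn.processStorage(1, "s1");
            dn.processStorage(1, "s2");
            removed.addAll(dn.finishReport(1));
            dn.processStorage(2, "s1");
            dn.processStorage(2, "s2");
            removed.addAll(dn.finishReport(2));
        } else {
            // The first storage of BR2 is processed before the last storage of BR1.
            dn.processStorage(1, "s1");
            dn.processStorage(2, "s1"); // s1 is now tagged with BR2's id
            dn.processStorage(1, "s2");
            removed.addAll(dn.finishReport(1)); // s1's tag is 2, not 1: removed
            dn.processStorage(2, "s2");
            removed.addAll(dn.finishReport(2));
        }
        return removed;
    }

    public static void main(String[] args) {
        System.out.println("sequential:  " + simulate(false)); // prints "sequential:  []"
        System.out.println("interleaved: " + simulate(true));  // prints "interleaved: [s1]"
    }
}
```

In this model a live storage is dropped only in the interleaved run, which 
mirrors the false zombie detection described above.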

The behavior was introduced by HDFS-7960 as part of the hot-swap feature. I 
did not do a hot-swap, and did not fail over the NameNode.

> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>
> When the NameNode is busy, a DataNode can time out sending a block report. 
> Then it sends the block report again. The NameNode, while processing these 
> two reports at the same time, can interleave processing storages from 
> different reports. This screws up the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
