[
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244876#comment-15244876
]
Konstantin Shvachko commented on HDFS-10301:
--------------------------------------------
More details.
# My DataNode has 6 storages. It sends a block report, times out, and then
sends the same block report five more times with different blockReportIds.
# The NameNode starts executing all six reports around the same time and
interleaves them; that is, it processes the first storage of BR2 before it
processes the last storage of BR1. (Color-coded logs are coming.)
# While processing storages from BR2, the NameNode changes the lastBlockReportId
field to the id of BR2. This interferes with the storages from BR1 that have
not been processed yet: these storages are considered zombies, and all replicas
are removed from those storages along with the storage itself.
# The storage is then reconstructed by the NameNode when it receives a
heartbeat from the DataNode, but it is marked as "stale", and the replicas
will not be reconstructed until the next block report, which in my case is a
few hours later.
# I noticed missing blocks because several DataNodes exhibited the same
behavior and all replicas of the same block were lost.
# The replicas eventually reappeared (several hours later), because DataNodes
do not physically remove the replicas and report them in the next block report.
The behavior was introduced by HDFS-7960 as part of the hot-swap feature. I did
not do a hot-swap, and did not fail over the NameNode.
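To make the interleaving in the steps above concrete, here is a minimal toy
model. It is a sketch under assumptions, not the actual NameNode code: the
class ZombieStorageSketch, its methods, and the storage names S1..S3 are all
hypothetical. It assumes only that each storage is stamped with the id of the
report that last touched it, and that when a report's last storage is
processed, every storage carrying a different id is flagged as a zombie.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: names and logic are illustrative, not the real NameNode code.
class ZombieStorageSketch {
    // storageId -> id of the block report that last touched the storage
    private final Map<String, Long> lastBlockReportId = new LinkedHashMap<>();

    ZombieStorageSketch(String... storageIds) {
        for (String id : storageIds) {
            lastBlockReportId.put(id, 0L);
        }
    }

    // Process one storage of a report. On the report's last storage, return
    // every storage whose stamp differs from this report's id ("zombies").
    List<String> process(long reportId, String storageId, boolean lastStorage) {
        lastBlockReportId.put(storageId, reportId);
        List<String> zombies = new ArrayList<>();
        if (lastStorage) {
            for (Map.Entry<String, Long> e : lastBlockReportId.entrySet()) {
                if (e.getValue() != reportId) {
                    zombies.add(e.getKey());
                }
            }
        }
        return zombies;
    }

    // BR1 runs to completion, then BR2 runs to completion.
    static List<String> sequential() {
        ZombieStorageSketch nn = new ZombieStorageSketch("S1", "S2", "S3");
        List<String> zombies = new ArrayList<>();
        for (long br = 1; br <= 2; br++) {
            nn.process(br, "S1", false);
            nn.process(br, "S2", false);
            zombies.addAll(nn.process(br, "S3", true));
        }
        return zombies;
    }

    // BR2 touches S1 while BR1 is still in the middle of its storages.
    static List<String> interleaved() {
        ZombieStorageSketch nn = new ZombieStorageSketch("S1", "S2", "S3");
        nn.process(1, "S1", false);
        nn.process(2, "S1", false); // the retried report overwrites S1's stamp
        nn.process(1, "S2", false);
        return nn.process(1, "S3", true); // BR1 finishes and checks the stamps
    }

    public static void main(String[] args) {
        System.out.println("sequential zombies: " + sequential());
        System.out.println("interleaved zombies: " + interleaved());
    }
}
```

In this model the sequential run flags nothing, while the interleaved run
flags S1 even though the storage is healthy and was reported by both BR1 and
BR2. The real code path differs in detail, but the failure has the same shape.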
> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Priority: Critical
>
> When the NameNode is busy, a DataNode can time out sending a block report.
> Then it sends the block report again. The NameNode, while processing these two
> reports at the same time, can interleave processing storages from different reports.
> This screws up the blockReportId field, which makes NameNode think that some
> storages are zombie. Replicas from zombie storages are immediately removed,
> causing missing blocks.