[
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427387#comment-15427387
]
Konstantin Shvachko commented on HDFS-10301:
--------------------------------------------
Took some time to look into heartbeat processing and to consult with Vinitha.
So heartbeats currently have logic to remove failed storages reported by DNs
via {{VolumeFailureSummary}}. This happens in three steps (see the sketch after
this list):
# If a DN reports a failed volume in a heartbeat (HDFS-7604), the NN marks the
corresponding {{DatanodeStorageInfo}} as FAILED. See
{{DatanodeDescriptor.updateFailedStorage()}}.
# When the {{HeartbeatManager.Monitor}} kicks in, it checks the FAILED flag on
the storage and does {{removeBlocksAssociatedTo(failedStorage)}}, but it does
not remove the storage itself (HDFS-7208).
# On the next heartbeat the DN will not report the storage that was previously
reported as failed. This triggers the NN to prune the storage in
{{DatanodeDescriptor.pruneStorageMap()}} because it no longer contains replicas
(HDFS-7596).
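To make the three steps concrete, here is a self-contained sketch of the flow.
The class and method names below are simplified stand-ins chosen for
illustration, not the actual NameNode types:
{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-ins for DatanodeStorageInfo / DatanodeDescriptor.
class StorageInfoSketch {
  enum State { NORMAL, FAILED }
  final String storageId;
  State state = State.NORMAL;
  int numBlocks;
  StorageInfoSketch(String id, int blocks) { storageId = id; numBlocks = blocks; }
}

class DatanodeDescriptorSketch {
  final Map<String, StorageInfoSketch> storageMap = new HashMap<>();

  // Step 1: a heartbeat carries a VolumeFailureSummary; the NN only flags the
  // corresponding storage as FAILED.
  void updateFailedStorage(String failedStorageId) {
    StorageInfoSketch s = storageMap.get(failedStorageId);
    if (s != null) {
      s.state = StorageInfoSketch.State.FAILED;
    }
  }

  // Step 2: the background monitor later removes replicas of FAILED storages,
  // but leaves the storage entry itself in the map.
  void removeBlocksAssociatedToFailedStorages() {
    for (StorageInfoSketch s : storageMap.values()) {
      if (s.state == StorageInfoSketch.State.FAILED) {
        s.numBlocks = 0; // stand-in for removeBlocksAssociatedTo(failedStorage)
      }
    }
  }

  // Step 3: on the next heartbeat the failed storage is no longer reported;
  // unreported storages with no replicas are pruned from the map.
  void pruneStorageMap(Set<String> reportedStorageIds) {
    Iterator<Map.Entry<String, StorageInfoSketch>> it = storageMap.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, StorageInfoSketch> e = it.next();
      if (!reportedStorageIds.contains(e.getKey()) && e.getValue().numBlocks == 0) {
        it.remove();
      }
    }
  }
}
{code}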
Essentially we already have a dual mechanism for deleting storages: one through
heartbeats, another via block reports. So we can remove the redundancy.
[~daryn]'s idea simplifies a lot of code, does not require changes in any RPCs,
is fully backward compatible, and eliminates the notion of zombie storage,
which solves the interleaving report problem. I think we should go for it.
Initially I was concerned about removing storages in heartbeats, but
# We already do it anyway
# All heartbeats hold the FSN read lock, whether they report failed storages or
not. Scanning the storages takes a lock on the corresponding
{{DatanodeDescriptor.storageMap}}, which is fine-grained (see the locking
sketch after this list).
# Storages are not actually removed in a heartbeat, only flagged as FAILED. The
replica removal is performed by the background Monitor.
# If we decide to implement lock-less heartbeats, we can move the storage
reporting logic into a separate RPC sent periodically by DNs, independently of
and less frequently than regular heartbeats.
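A rough, self-contained illustration of points 2 and 3 above, with hypothetical
simplified types standing in for the FSNamesystem lock and
{{DatanodeDescriptor.storageMap}}:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of why flagging storages in a heartbeat is cheap: the heartbeat path only
// takes the namesystem read lock plus a fine-grained lock on the per-datanode storage
// map, and merely sets a flag; the expensive work happens later in the monitor.
class HeartbeatLockingSketch {
  private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
  private final Map<String, Boolean> storageMap = new HashMap<>(); // storageId -> failed?

  void handleHeartbeat(String[] failedStorageIds) {
    fsnLock.readLock().lock();        // every heartbeat holds the FSN read lock
    try {
      synchronized (storageMap) {     // fine-grained, per-datanode lock
        for (String id : failedStorageIds) {
          storageMap.put(id, Boolean.TRUE); // only flag the storage as FAILED here
        }
      }
    } finally {
      fsnLock.readLock().unlock();
    }
  }

  void monitorPass() {
    // The background monitor, not the heartbeat, does the heavy work (removing
    // replicas, here reduced to dropping the flagged entries).
    fsnLock.writeLock().lock();
    try {
      synchronized (storageMap) {
        storageMap.entrySet().removeIf(e -> Boolean.TRUE.equals(e.getValue()));
      }
    } finally {
      fsnLock.writeLock().unlock();
    }
  }
}
{code}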
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Assignee: Vinitha Reddy Gankidi
> Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. Then
> it sends the block report again. The NameNode, while processing these two
> reports at the same time, can interleave processing of storages from different
> reports. This corrupts the blockReportId field, which makes the NameNode think
> that some storages are zombies. Replicas from zombie storages are immediately
> removed, causing missing blocks.
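A minimal sketch of the failure mode described above, assuming a simplified
model of the per-storage blockReportId bookkeeping (hypothetical names, not the
actual BlockManager logic):
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each storage remembers the id of the last block report that covered it; after a
// report, storages carrying a different id are treated as zombies.
class ZombieCheckSketch {
  private final Map<String, Long> lastReportIdPerStorage = new HashMap<>();

  void processStorageReport(String storageId, long blockReportId) {
    lastReportIdPerStorage.put(storageId, blockReportId);
  }

  // Storages that look like zombies after finishing a report with the given id.
  List<String> zombiesAfter(long blockReportId) {
    List<String> zombies = new ArrayList<>();
    for (Map.Entry<String, Long> e : lastReportIdPerStorage.entrySet()) {
      if (e.getValue() != blockReportId) {
        zombies.add(e.getKey());
      }
    }
    return zombies;
  }

  public static void main(String[] args) {
    ZombieCheckSketch nn = new ZombieCheckSketch();
    // A DN times out sending report #1 and retransmits it as report #2.
    // If the NameNode interleaves the two, a live storage is left with a stale id:
    nn.processStorageReport("storageA", 1L); // processed from the first (timed-out) report
    nn.processStorageReport("storageB", 2L); // processed from the retransmission
    System.out.println(nn.zombiesAfter(2L)); // falsely reports [storageA] as a zombie
  }
}
{code}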