[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299313#comment-15299313 ]
Konstantin Shvachko commented on HDFS-10301:
--------------------------------------------
Hey Colin, let's decide on the way to move forward. I do not see a point in
making this change in two steps.
* Your changes will essentially be completely removed by Vinitha's patch.
* I do not see her patch introducing incompatible changes. So it can and should
be backported through to branch 2.6.
A thorough review is needed and will be quite helpful. I think the [004
patch|https://issues.apache.org/jira/secure/attachment/12805798/HDFS-10301.004.patch]
covers:
* the upgrade case, that is, it works consistently for block reports from both
old (pre-patch) and new (patched) DataNodes
* the case when the entire block report is sent in a single RPC
* the case when block reports are split into multiple RPCs
* the leases
So apart from the failed test I do not see any issues. It would be good if you
could take a fresh look and see whether any corner cases were missed.
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch,
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out while sending a block
> report. It then sends the block report again. While processing these two
> reports at the same time, the NameNode can interleave the processing of
> storages from different reports. This screws up the blockReportId field,
> which makes the NameNode think that some storages are zombies. Replicas from
> zombie storages are immediately removed, causing missing blocks.
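
The interleaving described in the issue summary can be illustrated with a small Java sketch. This is a deliberately simplified model, not the actual NameNode code: the class `ZombieStorageSketch`, its methods, and the storage names `s1` and `s2` are hypothetical, and the pruning rule (drop any storage whose last recorded report ID differs from the report that just finished) is an assumption based on the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the bookkeeping described in the issue.
// It is NOT the real HDFS implementation.
class ZombieStorageSketch {
    // storage ID -> ID of the last block report that touched this storage
    private final Map<String, Long> lastReportId = new LinkedHashMap<>();

    ZombieStorageSketch(String... storageIds) {
        for (String id : storageIds) {
            lastReportId.put(id, 0L);
        }
    }

    // Process one storage's part of a block report. When it is the last
    // storage of that report, prune every storage whose recorded report ID
    // does not match (assumed zombie rule). Returns the storages removed.
    List<String> processStorage(long reportId, String storageId, boolean lastInReport) {
        lastReportId.put(storageId, reportId);
        List<String> zombies = new ArrayList<>();
        if (lastInReport) {
            for (Map.Entry<String, Long> e : lastReportId.entrySet()) {
                if (e.getValue() != reportId) {
                    zombies.add(e.getKey());
                }
            }
            lastReportId.keySet().removeAll(zombies);
        }
        return zombies;
    }

    // Original report (ID 1) and its retransmission (ID 2) are interleaved.
    static List<String> interleavedRun() {
        ZombieStorageSketch nn = new ZombieStorageSketch("s1", "s2");
        List<String> removed = new ArrayList<>();
        removed.addAll(nn.processStorage(1L, "s1", false)); // original, s1
        removed.addAll(nn.processStorage(2L, "s1", false)); // retransmission, s1
        removed.addAll(nn.processStorage(1L, "s2", true));  // original finishes
        return removed;
    }

    // The same two reports processed one after the other.
    static List<String> inOrderRun() {
        ZombieStorageSketch nn = new ZombieStorageSketch("s1", "s2");
        List<String> removed = new ArrayList<>();
        removed.addAll(nn.processStorage(1L, "s1", false));
        removed.addAll(nn.processStorage(1L, "s2", true));
        removed.addAll(nn.processStorage(2L, "s1", false));
        removed.addAll(nn.processStorage(2L, "s2", true));
        return removed;
    }

    public static void main(String[] args) {
        System.out.println("interleaved: removed " + interleavedRun());
        System.out.println("in order:    removed " + inOrderRun());
    }
}
```

In this model the interleaved run removes `s1` even though it is a live storage, because the retransmission overwrote its report ID before the original report finished. The in-order run removes nothing.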