[
https://issues.apache.org/jira/browse/HDFS-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488914#comment-13488914
]
Todd Lipcon commented on HDFS-4114:
-----------------------------------
Konstantin: could you please elaborate on how you use the BackupNode?
As discussed in the thread, it's difficult to see how it's usable in its
current state, and there has been no work in Apache to move it forward.
Here are the issues I see with the BackupNode:
- It doesn't provide a hot standby, since it doesn't get any block information.
I've seen your prototype using a "load duplicator", but that software is not
available in Apache, and I don't think it would correctly handle the majority
of the corner cases we had to solve during HDFS-1623 development.
- Even with the above addressed, there is no functionality to "promote" a
backup node to active, so it doesn't provide HA at all.
- Because it uses RPC to transfer edits, it ties the availability and response
time of the Active to the availability and response time of the Backup. Up
until recently (HDFS-3126) there was no RPC timeout configured at all on the
backup stream, so if the backup lost its network connection or otherwise froze,
the active would freeze for several minutes if not indefinitely. Thus it
actually _reduces_ availability in all currently released branches.
After adding the timeout, there is now the possibility that the active and
backup are not synchronized. Without external synchronization there is no way
to know whether the two nodes are synchronized, and thus even if we _had_ a way
to promote the backup, there'd be no safe way to do so automatically without
risking rollback of the namespace. So the backup cannot be used for automatic
failover in its current form without substantial design changes.
- Even if you are using the BN in an older version or a private fork, it is
clear that you aren't maintaining it in current releases. The BackupNode tests
were failing for many months earlier this year with no one stepping up to fix
them. Other contributors have had to step in and maintain the code, e.g. with
fixes like HDFS-2666, HDFS-2764, HDFS-3625.
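
To make the synchronization point in the RPC item above concrete, here is a
minimal toy model in Python. It is not HDFS code: the `Active` and `Backup`
classes, the `journal` and `write_edit` methods, and the integer transaction
ids are all hypothetical, and the RPC timeout is simulated by raising
`TimeoutError` rather than by a real network call. It only illustrates how,
once the active gives up on the backup stream after a timeout, the two logs
can diverge without the backup knowing.

```python
class Backup:
    """Hypothetical stand-in for a backup node receiving edits over RPC."""

    def __init__(self):
        self.applied = []      # transaction ids the backup has applied
        self.reachable = True  # set to False to simulate a frozen backup

    def journal(self, txid):
        if not self.reachable:
            # Stands in for an RPC call that hits its timeout.
            raise TimeoutError("backup did not respond")
        self.applied.append(txid)


class Active:
    """Hypothetical active node: logs locally, then streams to the backup."""

    def __init__(self, backup):
        self.log = []
        self.backup = backup
        self.backup_in_sync = True

    def write_edit(self, txid):
        self.log.append(txid)
        if self.backup_in_sync:
            try:
                self.backup.journal(txid)
            except TimeoutError:
                # Assumption of this sketch: on timeout the active drops
                # the backup stream and keeps serving clients.
                self.backup_in_sync = False


backup = Backup()
active = Active(backup)
active.write_edit(1)
active.write_edit(2)
backup.reachable = False   # backup freezes or loses its network connection
active.write_edit(3)
active.write_edit(4)
backup.reachable = True    # backup recovers, but its stream was dropped
active.write_edit(5)

# Edits present on the active but never applied by the backup.
missing = [t for t in active.log if t not in backup.applied]
print(missing)
```

In this model the backup ends up holding only transactions 1 and 2, and
nothing in its own state tells it that 3 through 5 exist. Promoting it at
that point would roll the namespace back, which is the risk described above.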
So, to summarize, please justify your -1 with an explanation of how you are
using the BackupNode to provide some feature which is not already more mature
and production-ready elsewhere in Hadoop 2.x.
> Remove the CheckpointNode
> -------------------------
>
> Key: HDFS-4114
> URL: https://issues.apache.org/jira/browse/HDFS-4114
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Eli Collins
> Assignee: Eli Collins
>
> Per the thread on hdfs-dev@ (http://s.apache.org/tMT) let's remove the
> BackupNode and CheckpointNode.