[ 
https://issues.apache.org/jira/browse/HDFS-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488914#comment-13488914
 ] 

Todd Lipcon commented on HDFS-4114:
-----------------------------------

Konstantin: could you please elaborate on how you use the BackupNode?

As discussed in the thread, it's difficult to see how it's usable in its 
current state, and there has been no work in Apache to move it forward.

Here are the issues I see with the BackupNode:

- It doesn't provide a hot standby, since it doesn't get any block information. 
I've seen your prototype using a "load duplicator", but that software is not 
available in Apache, and I don't think it would correctly handle the majority 
of the corner cases we had to solve during HDFS-1623 development.

- Even with the above addressed, there is no functionality to "promote" a 
backup node to active, so it doesn't provide HA at all.

- Because it uses RPC to transfer edits, it ties the availability and response 
time of the Active to the availability and response time of the Backup. Up 
until recently (HDFS-3126) there was no RPC timeout configured at all on the 
backup stream, so if the backup lost its network connection or otherwise froze, 
the active would freeze for several minutes if not indefinitely. Thus it 
actually _reduces_ availability in all currently released branches.

After adding the timeout, there is now the possibility that the active and 
backup are not synchronized. Without external synchronization there is no way 
to know whether the two nodes are synchronized, and thus even if we _had_ a way 
to promote the backup, there'd be no safe way to do so automatically without 
risking rollback of the namespace. So the backup cannot be used for automatic 
failover in its current form without substantial design changes.

- Even if you are using the BN in an older version or a private fork, it is 
clear that you aren't maintaining it in current releases. The BackupNode tests 
were failing for many months earlier this year with no one stepping up to fix 
them. Other contributors have had to step in and maintain the code, e.g. with 
fixes like HDFS-2666, HDFS-2764, HDFS-3625.
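
To make the third point above concrete, here is a minimal sketch of the 
trade-off. It is a toy simulation, not HDFS code: BackupStub, ActiveStub, 
journal, log_edit and rpc_timeout_s are all hypothetical names chosen for 
illustration. With rpc_timeout_s=None the active blocks for as long as the 
backup is hung; with a timeout set, the active keeps going but drops the 
backup stream, and the two transaction ids silently diverge.

```python
import concurrent.futures
import threading


class BackupStub:
    """Hypothetical stand-in for a backup that applies edits sent by RPC."""

    def __init__(self):
        self.last_applied_txid = 0
        self.frozen = threading.Event()   # set => backup stops responding
        self.release = threading.Event()  # lets a hung call return so the demo can exit

    def journal(self, txid):
        if self.frozen.is_set():
            # Simulate a hung or partitioned backup: never apply, just block.
            self.release.wait()
            return False
        self.last_applied_txid = txid
        return True


class ActiveStub:
    """Hypothetical active that forwards every edit to the backup synchronously."""

    def __init__(self, backup, rpc_timeout_s):
        self.backup = backup
        self.rpc_timeout_s = rpc_timeout_s  # None models "no RPC timeout at all"
        self.last_written_txid = 0
        self.backup_stream_alive = True
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def log_edit(self):
        txid = self.last_written_txid + 1
        if self.backup_stream_alive:
            future = self._pool.submit(self.backup.journal, txid)
            try:
                # The active's write path waits here on the backup.
                future.result(timeout=self.rpc_timeout_s)
            except concurrent.futures.TimeoutError:
                # Timeout: stop waiting and drop the stream. From here on the
                # active has no idea how far behind the backup is.
                self.backup_stream_alive = False
        self.last_written_txid = txid
        return txid

    def close(self):
        self.backup.release.set()
        self._pool.shutdown(wait=True)


def demo():
    backup = BackupStub()
    active = ActiveStub(backup, rpc_timeout_s=0.2)
    active.log_edit()
    active.log_edit()      # both sides are at txid 2
    backup.frozen.set()    # backup hangs
    active.log_edit()      # waits 0.2s, then drops the backup stream
    active.log_edit()      # proceeds without the backup
    result = (active.last_written_txid,
              backup.last_applied_txid,
              active.backup_stream_alive)
    active.close()
    return result
```

Calling demo() returns the active's last txid, the backup's last txid, and 
whether the stream is still attached. The gap between the first two values is 
exactly what an automatic promotion would roll back, and neither stub can 
detect that gap without some external source of truth.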

So, to summarize, please justify your -1 with an explanation of how you are 
using the BackupNode to provide some feature which is not already more mature 
and production-ready elsewhere in Hadoop 2.x.
                
> Remove the CheckpointNode
> -------------------------
>
>                 Key: HDFS-4114
>                 URL: https://issues.apache.org/jira/browse/HDFS-4114
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>
> Per the thread on hdfs-dev@ (http://s.apache.org/tMT) let's remove the 
> BackupNode and CheckpointNode.

