[ 
https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243633#comment-14243633
 ] 

Jesse Yates commented on HDFS-6440:
-----------------------------------

bq. Does this mean that there might be multiple SNNs marking themselves as 
'primary checkpointer' during the same time period, since it is determined by 
SNN itself

Yes, that is a possibility, which I was getting at with my comment about the 
primary checkpointer "ping-ponging". The images would have small deltas, but 
the ANN would be kept up to date. As the updates slow down, one of the 
checkpointers would eventually win. However, either (a) we haven't seen this 
show up on any of our clusters, or (b) it has happened and we have never 
noticed any service issues because of it.

bq. Would it be reasonable to also let ANN to reject fsimage upload request?

Sure, it's possible. My concern was around ensuring that the ANN had the most 
up-to-date checkpoint and letting the SNNs sort themselves out. Rejecting 
uploads seems a bit more intrusive in the code, since you also need to 
differentiate the source: you don't want to reject an update from the primary 
checkpointer if it occurs just because of the time elapsed. I'd say it's worth 
looking into in a follow-up JIRA though; this is already a pretty large change.
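
To make the "differentiate the source" concern concrete, here is a minimal 
sketch of what an ANN-side acceptance check might look like. It is purely 
illustrative and not taken from the patch: the class ImageUploadPolicy, its 
methods, and the minTxnDelta threshold are all hypothetical names, and the 
rule for non-primary uploads (require a minimum transaction delta) is an 
assumption made only for this example.

```java
// Hypothetical sketch only: none of these names exist in HDFS or in the patch.
final class ImageUploadPolicy {
    // Transaction id of the newest image the active NameNode currently holds.
    private long latestTxId;
    // Assumed threshold: a non-primary upload must be at least this many
    // transactions ahead of the current image to be accepted.
    private final long minTxnDelta;

    ImageUploadPolicy(long latestTxId, long minTxnDelta) {
        this.latestTxId = latestTxId;
        this.minTxnDelta = minTxnDelta;
    }

    boolean shouldAccept(long uploadTxId, boolean fromPrimaryCheckpointer) {
        // Never replace the current image with an older or identical one.
        if (uploadTxId <= latestTxId) {
            return false;
        }
        // The primary checkpointer may upload because the checkpoint period
        // elapsed, so a small delta is not a reason to reject it.
        if (fromPrimaryCheckpointer) {
            return true;
        }
        // Assumption: other standbys are accepted only when clearly ahead.
        return uploadTxId - latestTxId >= minTxnDelta;
    }

    // Call after an accepted image has been stored.
    void recordAccepted(long uploadTxId) {
        latestTxId = uploadTxId;
    }
}
```

With the current image at txid 1000 and a threshold of 500, a primary upload 
at txid 1010 would be accepted while the same upload from another standby 
would be rejected. The extra flag is the intrusive part: the ANN has to know 
which SNN currently considers itself the primary checkpointer.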

> Support more than 2 NameNodes
> -----------------------------
>
>                 Key: HDFS-6440
>                 URL: https://issues.apache.org/jira/browse/HDFS-6440
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: auto-failover, ha, namenode
>    Affects Versions: 2.4.0
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>         Attachments: Multiple-Standby-NameNodes_V1.pdf, 
> hdfs-6440-cdh-4.5-full.patch, hdfs-multiple-snn-trunk-v0.patch
>
>
> Most of the work is already done to support more than 2 NameNodes (one 
> active, one standby). This would be the last bit to support running multiple 
> _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some 
> complexity around managing the checkpointing, and updating a whole lot of 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
