[ 
https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837721#comment-16837721
 ] 

Chen Liang commented on HDFS-12979:
-----------------------------------

Thanks [~zero45]. 

{{Is it valuable for all the Standbys to produce checkpoint images?}}
In my opinion: first, this lets us tolerate more (standby) node failures; 
second, even if we wanted only one Standby to do checkpointing, that would not 
be any easier, because we would then need a way to decide which Standby creates 
the checkpoints. That would likely involve the ANN or even ZKFC, since Standbys 
don't talk to each other and can't reach consensus by themselves.

Posted the v004 patch. A couple of key points for reviewers:
1. Each Standby keeps track of which fsImage receivers it has uploaded to; for 
the other receivers it only checks once in a while. This is essentially a 
per-receiver primary, carrying over the same idea from the v003 patch.
2. {{StandbyCheckpointer#doCheckpoint}} no longer stops at the very first 
success, nor does it break out of the loop on the very first Exception. Instead 
it loops through all NNs and tries to upload to each one regardless. Previously 
the loop exited on the first upload success/exception, but now that there is 
more than one fsImage receiver, uploads to the NNs after the first 
success/exception should still be attempted.
3. On the receiver side, {{ImageServlet}} adds logic to send back an error 
message if an upload request turns out to be unnecessary (i.e. the interval has 
not elapsed yet, or the delta is too small).
4. I still haven't renamed {{activeNNAddresses}}, to keep this patch easy to 
review. I will rename it in a later patch.
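
To make points 1-3 easier to picture, here is a rough, self-contained sketch of 
the control flow. It is not code from the patch: the class 
{{CheckpointUploadSketch}}, the {{Receiver}} interface, {{uploadToAll}}, 
{{shouldAccept}} and all thresholds are hypothetical names and values chosen 
for illustration. In particular, accepting an upload when either threshold is 
met is only one reading of point 3.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class CheckpointUploadSketch {

    /** Hypothetical stand-in for one remote fsImage receiver (ANN or Observer). */
    @FunctionalInterface
    interface Receiver {
        /** Returns true if the image was accepted, false if the receiver declined it. */
        boolean upload(long txId) throws Exception;
    }

    /**
     * Receiver-side check (point 3), under the assumption that an upload is
     * unnecessary only when BOTH the time interval and the txid delta are
     * below their thresholds.
     */
    static boolean shouldAccept(long nowMs, long lastAcceptMs, long minIntervalMs,
                                long txId, long lastTxId, long minTxDelta) {
        boolean intervalElapsed = nowMs - lastAcceptMs >= minIntervalMs;
        boolean deltaLargeEnough = txId - lastTxId >= minTxDelta;
        return intervalElapsed || deltaLargeEnough;
    }

    /**
     * Sender-side loop (point 2): try every receiver, never stopping on the
     * first success or the first exception. The returned map plays the role
     * of the per-receiver "primary" state from point 1.
     */
    static Map<String, Boolean> uploadToAll(Map<String, Receiver> receivers, long txId) {
        Map<String, Boolean> uploaded = new LinkedHashMap<>();
        for (Map.Entry<String, Receiver> e : receivers.entrySet()) {
            try {
                uploaded.put(e.getKey(), e.getValue().upload(txId));
            } catch (Exception ex) {
                // Record the failure and move on to the next receiver.
                uploaded.put(e.getKey(), false);
            }
        }
        return uploaded;
    }

    /** Three fake receivers: one throws, one declines, one accepts. */
    static Map<String, Boolean> demo() {
        Map<String, Receiver> receivers = new LinkedHashMap<>();
        receivers.put("nn1", txId -> { throw new IOException("connection refused"); });
        // Last image accepted 1s ago at txid 990; thresholds are 60s / 100 txns.
        receivers.put("nn2", txId -> shouldAccept(10_000L, 9_000L, 60_000L, txId, 990L, 100L));
        receivers.put("nn3", txId -> true);
        return uploadToAll(receivers, 1000L);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the exception from {{nn1}} and the rejection from {{nn2}} do not 
prevent the upload to {{nn3}} from being attempted, which is the behavior 
change described in point 2.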

> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> ----------------------------------------------------------------------
>
>                 Key: HDFS-12979
>                 URL: https://issues.apache.org/jira/browse/HDFS-12979
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs
>            Reporter: Konstantin Shvachko
>            Assignee: Chen Liang
>            Priority: Major
>         Attachments: HDFS-12979.001.patch, HDFS-12979.002.patch, 
> HDFS-12979.003.patch, HDFS-12979.004.patch
>
>
> ObserverNode does not create checkpoints, so its fsimage file can get very 
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy the 
> latest fsimage to ObserverNode(s) along with the ANN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

