[ https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837721#comment-16837721 ]
Chen Liang commented on HDFS-12979:
-----------------------------------
Thanks [~zero45].
{{Is it valuable for all the Standbys to produce checkpoint images?}}
In my opinion, first, this tolerates more (Standby) node failures. Second, even
if we wanted only one Standby to do checkpointing, that would not be any easier,
because we would then need a way to determine which node creates the
checkpoints. That would likely involve the ANN or even ZKFC, since Standbys
don't talk to each other and can't reach consensus by themselves.
Posted the v004 patch. A couple of key points for reviewers:
1. Each Standby keeps track of which fsImage receivers it has uploaded to; for
the other receivers it only checks once in a while. This is essentially a
per-receiver primary, carrying over the same idea from the v003 patch.
2. {{StandbyCheckpointer#doCheckpoint}} no longer stops at the very first
successful upload, nor does it break the loop on the very first Exception.
Instead it loops through all NNs and tries to upload regardless. Previously the
loop exited on the first upload success or exception, but now that there is
more than one fsImage receiver, uploads to the remaining NNs should still be
attempted.
3. On the receiver side, {{ImageServlet}} adds logic to send back an error
message if an upload request turns out to be unnecessary (i.e. the interval has
not elapsed yet, or the delta is too small).
4. I still haven't renamed {{activeNNAddresses}}, to keep the patch easy to
review. Will change it in later patches.
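
To make points 1 and 2 concrete, here is a rough, self-contained sketch of the upload loop as I read it. It is not the code in the patch: the class {{CheckpointUploadSketch}}, the {{Receiver}} interface, the {{Result}} values and the {{recheckIntervalMs}} parameter are all hypothetical names made up for illustration, and the rule "primary for a receiver means the last upload to it was accepted" is an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the actual StandbyCheckpointer code from the patch.
class CheckpointUploadSketch {

  /** Illustrative outcomes of one upload attempt. */
  enum Result { SUCCESS, REJECTED, FAILED }

  /** Stand-in for a remote NameNode that accepts fsImage uploads. */
  interface Receiver {
    Result upload(long txId) throws Exception;
  }

  /** What this Standby remembers about one receiver. */
  private static final class State {
    boolean primary;     // assumed rule: last upload to this receiver was accepted
    boolean attempted;   // at least one attempt has been made
    long lastAttemptMs;  // time of the most recent attempt
  }

  private final Map<String, State> states = new HashMap<>();
  private final long recheckIntervalMs;

  CheckpointUploadSketch(long recheckIntervalMs) {
    this.recheckIntervalMs = recheckIntervalMs;
  }

  /** Visits every receiver; neither a success nor an exception ends the loop. */
  Map<String, Result> uploadToAll(Map<String, Receiver> receivers,
                                  long txId, long nowMs) {
    Map<String, Result> results = new LinkedHashMap<>();
    for (Map.Entry<String, Receiver> e : receivers.entrySet()) {
      State s = states.computeIfAbsent(e.getKey(), k -> new State());
      boolean recentlyTried =
          s.attempted && nowMs - s.lastAttemptMs < recheckIntervalMs;
      if (!s.primary && recentlyTried) {
        continue; // not primary for this receiver: only probe once in a while
      }
      Result result;
      try {
        result = e.getValue().upload(txId);
      } catch (Exception ex) {
        result = Result.FAILED; // record the failure and keep looping
      }
      s.attempted = true;
      s.lastAttemptMs = nowMs;
      s.primary = (result == Result.SUCCESS);
      results.put(e.getKey(), result);
    }
    return results;
  }

  public static void main(String[] args) {
    Map<String, Receiver> receivers = new LinkedHashMap<>();
    receivers.put("ann", txId -> Result.SUCCESS);
    receivers.put("observer1", txId -> {
      throw new java.io.IOException("unreachable");
    });
    receivers.put("observer2", txId -> Result.REJECTED);

    CheckpointUploadSketch standby = new CheckpointUploadSketch(60_000L);
    System.out.println(standby.uploadToAll(receivers, 100L, 0L));
    System.out.println(standby.uploadToAll(receivers, 200L, 1_000L));
  }
}
```

In this sketch the first round attempts all three receivers even though the second one throws, and the second round (one second later) only uploads to the receiver that accepted the previous image. The other two are skipped until the recheck interval has passed.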
> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> ----------------------------------------------------------------------
>
> Key: HDFS-12979
> URL: https://issues.apache.org/jira/browse/HDFS-12979
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs
> Reporter: Konstantin Shvachko
> Assignee: Chen Liang
> Priority: Major
> Attachments: HDFS-12979.001.patch, HDFS-12979.002.patch,
> HDFS-12979.003.patch, HDFS-12979.004.patch
>
>
> ObserverNode does not create checkpoints. So its fsimage file can get very
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy the
> latest fsimage to ObserverNode(s) along with ANN.