[
https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837527#comment-16837527
]
Chen Liang commented on HDFS-12979:
-----------------------------------
Thanks for the review, [~zero45]! Adding new configs seems likely to get messy
when there are multiple SbNs and Observers, and when a standby fails, it can be
difficult to let another standby take over. Actually, it looks like we need to
revisit things a little more. I had some offline discussion with [~shv] and
[~ga], and it seems that even without Standby read, when there are multiple
SbNs, the image uploading process still has an issue. Specifically, it does not
seem to actually honor the checkpoint period setting.
For example, say we set the checkpoint period to 6 hours and we have two SbNs,
S1 and S2, but S2 gets started 3 hours after S1. Then we seem to be in the
situation where S1 uploads an image at the 6th hour, S2 uploads an image at the
9th hour, then S1 uploads at the 12th hour, S2 at the 15th hour, etc. As a
result, the Active sees an image every 3 hours, while admins are most likely
expecting one image every 6 hours given the configuration. When there are more
Standby nodes, images get uploaded even more often. Plus, we have the
"primaryCheckpointer" logic, where a non-primary checkpointer standby goes into
longer sleeps. Effectively, this means that even if S2 starts at the same time
as S1, this time difference can still arise because S2 is non-primary.
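To make the schedule above concrete, here is a small sketch that replays it. It
is only an illustration, not NameNode code: the function name upload_times and
its parameters are made up, and it assumes each standby simply uploads once per
period counted from its own start time.

```python
def upload_times(start_offsets_h, period_h, horizon_h):
    """Return sorted (hour, node) upload events.

    Assumption: each standby uploads every period_h hours, counted from its
    own start offset, and knows nothing about the other standbys.
    """
    events = []
    for node, start in start_offsets_h.items():
        t = start + period_h
        while t <= horizon_h:
            events.append((t, node))
            t += period_h
    return sorted(events)


# S2 starts 3 hours after S1; both use a 6 hour checkpoint period.
events = upload_times({"S1": 0, "S2": 3}, period_h=6, horizon_h=24)

# Gap between consecutive images as seen by the Active.
gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]

print(events)
print(gaps)
```

Under these assumptions every gap comes out as 3 hours, half of the configured
period, which matches the behavior described above.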
I think ultimately this is because there is no way for the fsImage sender side
(standby nodes) to actually know whether it is a good time to upload, so it
only makes guesses based on the success/failure of uploading the fsImage, and
the receiver side (active and observer) does not enforce the checkpoint period.
I think the receiver side should do more checking to decide whether it should
accept an image. I will post another patch and probably a short doc with more
detail.
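As a rough idea of what a receiver-side check could look like, here is a
minimal sketch. It is not the patch itself: the class ImageReceiver and its
methods are hypothetical, and it assumes the only criterion is the time elapsed
since the last accepted image.

```python
class ImageReceiver:
    """Hypothetical receiver (active or observer) that enforces the period."""

    def __init__(self, checkpoint_period_h):
        self.checkpoint_period_h = checkpoint_period_h
        self.last_accepted_h = None  # hour at which the last image was accepted

    def offer(self, now_h):
        """Return True and record the time if the image is accepted."""
        too_soon = (
            self.last_accepted_h is not None
            and now_h - self.last_accepted_h < self.checkpoint_period_h
        )
        if too_soon:
            return False
        self.last_accepted_h = now_h
        return True


# Replay the S1/S2 uploads from the example against a 6 hour period.
receiver = ImageReceiver(checkpoint_period_h=6)
results = [(hour, receiver.offer(hour)) for hour in (6, 9, 12, 15)]
print(results)
```

With this check the uploads at the 9th and 15th hour are rejected, so the
receiver accepts one image per 6 hours regardless of how many standbys are
sending.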
> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> ----------------------------------------------------------------------
>
> Key: HDFS-12979
> URL: https://issues.apache.org/jira/browse/HDFS-12979
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs
> Reporter: Konstantin Shvachko
> Assignee: Chen Liang
> Priority: Major
> Attachments: HDFS-12979.001.patch, HDFS-12979.002.patch,
> HDFS-12979.003.patch
>
>
> ObserverNode does not create checkpoints, so its fsimage file can get very
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy the
> latest fsimage to ObserverNode(s) along with ANN.