[
https://issues.apache.org/jira/browse/HDFS-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140367#comment-15140367
]
Guocui Mi commented on HDFS-9787:
---------------------------------
>>> this would imply that the non-primary SNN never sends a checkpoint after
>>> the first time?
It is true according to my observation.
I am adding a unit test to cover this scenario. Two other scenarios have
triggered the problem in our cluster:
1) The primary checkpointer fails to upload the fsimage because the ANN is
temporarily unavailable.
2) All NNs are restarted at the same time.
I am afraid the proposal you shared will not work, wherever lastCheckpointTime
is set:
1) Setting lastCheckpointTime before the following code in doCheckpoint() is
no different from setting it after each loop iteration.
2) Setting it after the following code in doCheckpoint() makes a non-primary
SNN run checkpoints one after another, because lastCheckpointTime is not
updated when the method returns early.
if(!sendCheckpoint){ return; }
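
To make the two placements concrete, here is a toy simulation of a non-primary
SNN. It is only a sketch under stated assumptions, not the real
StandbyCheckpointer code: the class name, the fixed polling interval, and the
simplified time-only trigger are all hypothetical, and the real checkpointer
also considers the transaction count.

```java
final class CheckpointPlacementSketch {

    /**
     * Simulates a non-primary SNN and returns {checkpoints built, images uploaded}.
     * All parameters are in seconds. Hypothetical model, not Hadoop code.
     */
    static long[] simulate(boolean updateBeforeEarlyReturn, long periodSecs,
                           long quietSecs, long tickSecs, long durationSecs) {
        long lastCheckpointTime = 0;
        long checkpoints = 0;
        long uploads = 0;
        for (long now = tickSecs; now <= durationSecs; now += tickSecs) {
            long secsSinceLast = now - lastCheckpointTime;
            if (secsSinceLast < periodSecs) {
                continue; // checkpoint period not reached yet
            }
            // isPrimaryCheckPointer is false, so only the quiet period can allow an upload.
            boolean sendCheckpoint = secsSinceLast >= quietSecs;
            checkpoints++; // the checkpoint is built either way
            if (updateBeforeEarlyReturn) {
                lastCheckpointTime = now; // option 1: update before the early return
            }
            if (!sendCheckpoint) {
                continue; // stands in for: if(!sendCheckpoint){ return; }
            }
            uploads++;
            lastCheckpointTime = now; // option 2: update only after the early return
        }
        return new long[] {checkpoints, uploads};
    }

    public static void main(String[] args) {
        // Assumed values: 1 h checkpoint period, 1.5 h quiet period, 10 min polling, 6 h run.
        long[] before = simulate(true, 3600, 5400, 600, 21600);
        long[] after = simulate(false, 3600, 5400, 600, 21600);
        System.out.println("update before return: checkpoints=" + before[0] + " uploads=" + before[1]);
        System.out.println("update after return:  checkpoints=" + after[0] + " uploads=" + after[1]);
    }
}
```

In this model, option 1 never uploads, and option 2 rebuilds a checkpoint on
every poll between the checkpoint period and the quiet period.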
> SNNs stop uploading FSImage to ANN once isPrimaryCheckPointer changed to
> false.
> -------------------------------------------------------------------------------
>
> Key: HDFS-9787
> URL: https://issues.apache.org/jira/browse/HDFS-9787
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 3.0.0
> Reporter: Guocui Mi
> Assignee: Guocui Mi
> Attachments: HDFS-9786-v000.patch
>
>
> SNNs stop uploading FSImage to ANN once isPrimaryCheckPointer becomes false.
> Here is the logic that decides whether to upload the FSImage.
> In StandbyCheckpointer.java
> boolean sendRequest = isPrimaryCheckPointer || secsSinceLast >=
> checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> sendRequest is always false when isPrimaryCheckPointer is false, because
> secsSinceLast (~checkpointPeriod) >= checkpointConf.getQuietPeriod()
> (checkpointPeriod * this.quietMultiplier, default value 1.5) always evaluates
> to false.
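
The arithmetic in the description can be checked with a small standalone
sketch. It copies the quoted condition into a hypothetical helper class. The
class, the method names, and the 3600-second checkpoint period are assumptions
chosen for illustration, and only the 1.5 multiplier comes from the
description.

```java
final class QuietPeriodSketch {

    /** Quiet period as described above: checkpointPeriod * quietMultiplier. */
    static long quietPeriod(long checkpointPeriodSecs, double quietMultiplier) {
        return (long) (checkpointPeriodSecs * quietMultiplier);
    }

    /** Same boolean expression as the quoted StandbyCheckpointer snippet. */
    static boolean sendRequest(boolean isPrimaryCheckPointer, long secsSinceLast,
                               long quietPeriodSecs) {
        return isPrimaryCheckPointer || secsSinceLast >= quietPeriodSecs;
    }

    public static void main(String[] args) {
        long checkpointPeriod = 3600; // assumed value, in seconds
        long quiet = quietPeriod(checkpointPeriod, 1.5);
        // secsSinceLast is roughly one checkpoint period when the check runs.
        System.out.println("quiet period = " + quiet);
        System.out.println("non-primary sends: " + sendRequest(false, checkpointPeriod, quiet));
        System.out.println("primary sends:     " + sendRequest(true, checkpointPeriod, quiet));
    }
}
```

With secsSinceLast at about one checkpoint period and a multiplier above 1,
the right-hand comparison cannot succeed, so only a primary checkpointer
sends.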
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)