[ 
https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838670#comment-16838670
 ] 

Erik Krogen commented on HDFS-12979:
------------------------------------

I like the idea, [~vagarychen]. I agree that it makes sense for every Standby 
NN to continue to produce checkpoint images, and for them to dynamically assign 
themselves other NNs to checkpoint to as opposed to keeping it in 
configuration. Besides Plamen's map-of-URL concern, I have some comments:
* For the logic in the future creation loop within {{doCheckpoint}} (L232 - 
L256), it seems to me that it always compares the period to 
{{getQuietPeriod()}}. Shouldn't it compare to a shorter period if this SbNN is 
the primary for that node? Maybe I am missing something in the logic.
* In that same loop, why do we need 
{{checkpointReceivers.containsKey(activeNNAddress)}}? Aren't we guaranteed that 
all of {{activeNNAddresses}} show up in this map due to the logic at L99 - L101?
* I'm not sure if we should swallow an {{InterruptedException}} (L274) -- 
wouldn't this indicate that the checkpointing process has been interrupted and 
we should exit immediately?
* For your {{TODO}} at L277, can we use {{MultipleIOException}} ?
* In L237-238, can we use {{TimeUnit.MILLISECONDS.toSeconds(...)}} instead of 
dividing by a constant to make it obvious what the input/output units are?
* I need to look more deeply before I know if this makes sense, but in 
{{StandbyCheckpointer}} we have {{checkpointConf}}, is there a way to use the 
same logic for the conf accesses within {{ImageServlet}} ?
* For the time period check within {{ImageServlet}}, why do we convert 
everything to seconds? If we use {{TimeUnit.MILLISECONDS}} for 
{{checkpointPeriod}}, then we shouldn't have to do any unit conversion.
* For the error message alongside the {{SC_CONFLICT}}, can we make it include 
information about the checkpoint/last txid/time?
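
To make the {{TimeUnit}} and {{InterruptedException}} points above concrete, here is a minimal sketch using only the JDK. It is not taken from the patch: the class {{CheckpointSketch}} and the methods {{secondsSince}} and {{awaitUpload}} are hypothetical names, and I am assuming the upload is waited on through a {{Future}}.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; none of these names come from the patch.
final class CheckpointSketch {

  // Explicit unit conversion instead of dividing by a constant:
  // the input is clearly milliseconds and the output clearly seconds.
  static long secondsSince(long lastUploadMillis, long nowMillis) {
    return TimeUnit.MILLISECONDS.toSeconds(nowMillis - lastUploadMillis);
  }

  // Waits for a single upload. An interrupt is not swallowed: the flag is
  // restored and the wait is aborted by rethrowing as an IOException subtype.
  static void awaitUpload(Future<?> upload) throws IOException {
    try {
      upload.get();
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // keep the interrupt visible to callers
      InterruptedIOException iioe =
          new InterruptedIOException("Interrupted while waiting for image upload");
      iioe.initCause(ie);
      throw iioe;
    } catch (ExecutionException ee) {
      throw new IOException("Image upload failed", ee.getCause());
    }
  }

  public static void main(String[] args) throws Exception {
    // 5500 ms elapsed; toSeconds truncates toward zero.
    System.out.println(secondsSince(1_000L, 6_500L));

    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      awaitUpload(pool.submit(() -> { })); // a trivially successful "upload"
    } finally {
      pool.shutdown();
    }
  }
}
```

Whether rethrowing is right depends on how the caller of {{doCheckpoint}} treats the exception. The aim here is only to show the interrupt propagating instead of being dropped.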

> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> ----------------------------------------------------------------------
>
>                 Key: HDFS-12979
>                 URL: https://issues.apache.org/jira/browse/HDFS-12979
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs
>            Reporter: Konstantin Shvachko
>            Assignee: Chen Liang
>            Priority: Major
>         Attachments: HDFS-12979.001.patch, HDFS-12979.002.patch, 
> HDFS-12979.003.patch, HDFS-12979.004.patch
>
>
> ObserverNode does not create checkpoints, so its fsimage file can get very 
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy the 
> latest fsimage to ObserverNode(s) along with ANN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
