[jira] [Commented] (HDFS-12979) StandbyNode should upload FsImage to ObserverNode after checkpointing.

Chen Liang (JIRA) Mon, 06 May 2019 10:32:16 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834030#comment-16834030
 ]


Chen Liang commented on HDFS-12979:
-----------------------------------

Actually, another related issue is the {{isPrimaryCheckPointer}} boolean check. 
Currently this gets set to true as soon as there is one successful upload. The 
idea was that when multiple Standbys are uploading, only one should successfully 
upload the image and set this to true; all the others keep it false.

But this is based on the assumption that there is only one image receiver. With 
Observer, we have multiple NNs accepting the image, meaning there can be multiple 
places leading to a successful upload. For example, say sbn1 uploads to the 
Active and sets {{isPrimaryCheckPointer}} to true, while sbn2 uploads to an 
Observer and also sets {{isPrimaryCheckPointer}} to true. Then we have two 
"primary" checkpointers.
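
To make the scenario concrete, here is a minimal, self-contained sketch. It is 
not the actual StandbyCheckpointer code: the class, enum and method names are 
hypothetical, and the rule is reduced to "any successful upload makes the 
uploader primary", as described above.

```java
import java.util.List;

public class PrimaryCheckpointerSketch {
    /** Hypothetical, simplified outcome of one image upload to one receiver. */
    enum Result { SUCCESS, REJECTED }

    /**
     * Simplified version of the current rule: the uploader considers itself
     * the primary checkpointer if at least one upload succeeded, regardless
     * of which kind of NN accepted the image.
     */
    static boolean isPrimaryCheckPointer(List<Result> uploadResults) {
        return uploadResults.contains(Result.SUCCESS);
    }

    public static void main(String[] args) {
        // Results are listed as [upload to Active, upload to Observer].
        List<Result> sbn1 = List.of(Result.SUCCESS, Result.REJECTED);
        List<Result> sbn2 = List.of(Result.REJECTED, Result.SUCCESS);

        // Both Standbys end up believing they are the primary checkpointer.
        System.out.println("sbn1 primary: " + isPrimaryCheckPointer(sbn1));
        System.out.println("sbn2 primary: " + isPrimaryCheckPointer(sbn2));
    }
}
```

Both lines print true, which is exactly the "two primaries" situation.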

It seems that even if both sbn1 and sbn2 see the other NNs in the same order, 
this is still an issue. E.g. sbn1 uploads to the ANN, succeeds, and sets 
{{isPrimaryCheckPointer}} to true; sbn2's upload to the ANN fails in the 
meantime, but if sbn2 proceeds to the Observer before sbn1 (if the loop breaking 
gets removed), its upload there would still succeed and it would also set 
{{isPrimaryCheckPointer}} to true.

The root cause seems to be that there is no way to distinguish between a 
successful upload to the ANN and one to an Observer. If we could tell them 
apart, we could easily identify that only the Standby that successfully 
uploaded to the ANN is the primary.

I think the best way might be to introduce another HttpServlet response code in 
{{TransferResult}}, so that an Observer returns a different HTTP code on a 
successful upload to it. This should not cause issues for rolling upgrade, 
because we should only switch an SbN to Observer after the cluster has already 
been upgraded, by which time the new HTTP code is being sent.
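
A rough sketch of what this could look like follows. It is only an 
illustration under assumptions: {{UploadResult}} stands in for 
{{TransferResult}}, {{OBSERVER_SUCCESS}} is a hypothetical new value, and 202 
is a placeholder since no concrete code has been chosen.

```java
public class ObserverResultSketch {
    /** Hypothetical stand-in for TransferResult; codes are plain ints here. */
    enum UploadResult {
        SUCCESS(200),           // image accepted by the Active
        OBSERVER_SUCCESS(202),  // hypothetical: image accepted by an Observer (placeholder code)
        FAILURE(-1);            // anything else

        final int code;

        UploadResult(int code) {
            this.code = code;
        }

        /** Maps an HTTP response code to a result; unknown codes become FAILURE. */
        static UploadResult fromCode(int code) {
            for (UploadResult r : values()) {
                if (r.code == code) {
                    return r;
                }
            }
            return FAILURE;
        }
    }

    /** Only a successful upload to the Active makes the uploader primary. */
    static boolean isPrimaryCheckPointer(int... responseCodes) {
        for (int code : responseCodes) {
            if (UploadResult.fromCode(code) == UploadResult.SUCCESS) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // sbn1: Active accepted (200), Observer rejected (409).
        System.out.println("sbn1 primary: " + isPrimaryCheckPointer(200, 409));
        // sbn2: Active rejected (409), Observer accepted (202).
        System.out.println("sbn2 primary: " + isPrimaryCheckPointer(409, 202));
    }
}
```

With the two outcomes kept apart, only sbn1 is treated as primary here, while 
sbn2's successful upload to the Observer no longer flips the flag.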

Comments are welcome!

> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> ----------------------------------------------------------------------
>
>                 Key: HDFS-12979
>                 URL: https://issues.apache.org/jira/browse/HDFS-12979
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs
>            Reporter: Konstantin Shvachko
>            Assignee: Chen Liang
>            Priority: Major
>         Attachments: HDFS-12979.001.patch
>
>
> ObserverNode does not create checkpoints. So its fsimage file can get very 
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy the 
> latest fsimage to ObserverNode(s) along with ANN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

