Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Anders Widell Mon, 22 Feb 2016 03:15:08 -0800

Hi!

Please see my comments inline, marked [AndersW2].


regards,
Anders Widell

On 02/22/2016 03:18 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comment below.
>
> BTW, have you finished the review and test?
>
> Best regards,
>
> Nhat Pham
>
> *From:*A V Mahesh [mailto:[email protected]]
> *Sent:* Friday, February 19, 2016 2:28 PM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' 
> <[email protected]>; 'Minh Chau H' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' 
> <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
> Hi Nhat Pham,
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
>     Could you please give more detailed information about steps to
>     reproduce the problem below? Thanks.
>
>
> Don't treat this as a specific bug; we need to look at the issue from 
> the point of view of a CLM-integrated service. Considering Anders 
> Widell's explanation of CLM application behavior during the headless 
> state, we need to reintegrate CPND with CLM (before this headless state 
> feature, CPND could never exist in the absence of CLMD, but now it can).
>
> This should be consistent across all services that are integrated 
> with CLM (you may need some changes in CLM as well).
>
> */[Nhat Pham] I think CLM should return /*SA_AIS_ERR_TRY_AGAIN in this 
> case.
>
> @Anders: What do you think?
>
[AndersW2] Is it saClmInitialize_4() that returns SA_AIS_ERR_UNAVAILABLE 
when it cannot reach the CLM server? Then yes, I agree that it should 
return SA_AIS_ERR_TRY_AGAIN instead.
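A minimal sketch of how a caller such as CPND could cope with that, assuming the CLM initialize call did return SA_AIS_ERR_TRY_AGAIN while the CLM server is unreachable. The helper init_with_retry(), the callback types and the fake_clm_init() stub are hypothetical names used only for illustration, not OpenSAF code; the two numeric values are the standard SaAisErrorT codes. The attempt limit is there because AMF will reboot the node if CPND blocks for too long.

```c
#include <stddef.h>

/* Standard SaAisErrorT values, redefined here so the sketch is self-contained. */
enum { SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6 };

/* Hypothetical callback types: the init call to retry and a pause function
 * (in a real daemon the pause would wrap something like nanosleep()). */
typedef int (*init_fn)(void *ctx);
typedef void (*wait_fn)(unsigned delay_ms);

/* Call init() until it returns something other than TRY_AGAIN, or until
 * max_attempts calls have been made. Returns the last return code seen. */
static int init_with_retry(init_fn init, void *ctx, unsigned max_attempts,
                           wait_fn wait, unsigned delay_ms)
{
    int rc = SA_AIS_ERR_TRY_AGAIN;

    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        rc = init(ctx);
        if (rc != SA_AIS_ERR_TRY_AGAIN)
            return rc;                 /* success or a permanent error */
        if (wait != NULL && attempt + 1 < max_attempts)
            wait(delay_ms);            /* back off before the next attempt */
    }
    return rc;                         /* still TRY_AGAIN: caller decides */
}

/* Illustration-only stand-in for the CLM initialize call: answers
 * TRY_AGAIN until the counter behind ctx reaches zero. */
static int fake_clm_init(void *ctx)
{
    unsigned *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return SA_AIS_ERR_TRY_AGAIN;
    }
    return SA_AIS_OK;
}
```

With this shape the caller can keep the total retry time below the AMF registration timer and treat a final SA_AIS_ERR_TRY_AGAIN as "still headless" rather than as a fatal init failure.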
>
>
> To start with, let us consider the case where CPND is restarted on a 
> PL during the headless state while an application is running on that PL.
>
> */[Nhat Pham] Regarding the CPND as a CLM application, I’m not sure what 
> it can do in this case. If it restarts, it is monitored by AMF./*
>
> */If it blocks for too long, AMF will also trigger a node reboot./*
>
> */In my test case, the CPND gets blocked by CLM. It doesn’t return from 
> saClmInitialize. How do you get the “ER cpnd clm init failed with 
> return value:31”?/*
>
> */Following is the cpnd trace./*
>
> Feb 22  8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
>
> Feb 22  8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
>
> Feb 22  8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
>
> Feb 22  8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
>
> Feb 22  8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
>
> Feb 22  8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
>
> Feb 22  8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
>
> Feb 22  8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init
>
> -AVM
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
>     Hi Mahesh,
>
>     Could you please give more detailed information about steps to
>     reproduce the problem below? Thanks.
>
>     Best regards,
>
>     Nhat Pham
>
>     *From:* A V Mahesh [mailto:[email protected]]
>     *Sent:* Friday, February 19, 2016 1:06 PM
>     *To:* Anders Widell <[email protected]>
>     <mailto:[email protected]>; Nhat Pham
>     <[email protected]> <mailto:[email protected]>;
>     'Minh Chau H' <[email protected]>
>     <mailto:[email protected]>
>     *Cc:* [email protected]
>     <mailto:[email protected]>; 'Beatriz Brandao'
>     <[email protected]> <mailto:[email protected]>
>     *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
>     preserving and recovering checkpoint replicas during headless
>     state V2 [#1621]
>
>     Hi Anders Widell,
>     Thanks for the detailed explanation about CLM during the headless state.
>
>     Hi Nhat Pham,
>
>     Comment 3:
>     Please see below the problem I anticipated; I am now seeing it
>     in the absence of CLMD (during the headless state), so CPND/CLMA
>     now need to address the case below. Currently cpnd clm init fails
>     with return value SA_AIS_ERR_UNAVAILABLE, but it should be
>     SA_AIS_ERR_TRY_AGAIN.
>
>     ==================================================
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE->
>     IMM_NODE_FULLY_AVAILABLE 17418
>     Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
>     Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42
>     (MsgQueueService132111) <108, 2040f>
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43
>     (MsgQueueService131855) <0, 2030f>
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44
>     (safLogService) <0, 2010f>
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE:
>     IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45
>     (safClmService) <0, 2010f>
>     *Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed
>     with return value:31
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed*
>     Feb 19 11:18:28 PL-4 osafclmna[5432]: NO
>     safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
>     Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
>     Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to
>     NCSMDS_NEW_ACTIVE
>     Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
>     Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
>     Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
>     Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
>     Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46
>     (safAmfService) <0, 2010f>
>     Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO
>     'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart
>     probation timer expired
>     Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of
>     'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
>     Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component
>     registration timer expired
>     Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA
>     'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence
>     State RESTARTING => INSTANTIATION_FAILED
>     Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover
>     trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed
>     component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>     Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER
>     'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
>     Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId =
>     132111 EE Name = , Reason: NCS component Instantiation failed,
>     OwnNodeId = 132111, SupervisionTime = 60
>     Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
>     Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md
>     devices.
>     ==================================================
>
>     -AVM
>
>     On 2/15/2016 5:11 PM, Anders Widell wrote:
>
>         Hi!
>
>         Please find my answer inline, marked [AndersW].
>
>         regards,
>         Anders Widell
>
>         On 02/15/2016 10:38 AM, Nhat Pham wrote:
>
>             Hi Mahesh,
>
>             It's good. Thank you. :)
>
>             [AVM] Upon rejoining of the SCs, the replica should be
>             re-created regardless of whether another application has
>             opened it on PL4.
>             (Note: this comment is based on your explanation; I have
>             not yet reviewed/tested it. Currently I am struggling with
>             the SCs not rejoining after the headless state. I can
>             provide more on this once I complete my review/testing.)
>
>             [Nhat] To make cloud resilience work, you need the
>             patches from other services (log, amf, clm, ntf).
>             @Minh: I heard that you created a tar file which includes
>             all patches. Could you please send it to Mahesh? Thanks
>
>             [AVM] I understand that. Before I comment more on this,
>             please allow me to understand it better; I am still not
>             very clear about the headless design in detail.
>             For example, cluster membership of the PLs during the
>             headless state: in the absence of the SCs (CLMD), are the
>             PLs considered cluster nodes or not (cluster membership)?
>
>             [Nhat] I don't know much about this.
>             @Anders: Could you please comment on this? Thanks
>
>         [AndersW] First of all, keep in mind that the "headless" state
>         should ideally not last a very long time. Once we have the
>         spare SC feature in place (ticket [#79]), a new SC should
>         become active within a matter of a few seconds after we have
>         lost both the active and the standby SC.
>
>         I think you should view the state of the cluster in the
>         headless state in the same way as you view the state of the
>         cluster during a failover between the active and the standby
>         SC. Imagine that the active SC dies. It takes the standby SC
>         1.5 seconds to detect the failure of the active SC (this is
>         due to the TIPC timeout). If you have configured the
>         PROMOTE_ACTIVE_TIMER, there is an additional delay before the
>         standby takes over as active. What is the state of the cluster
>         during the time after the active SC failed and before the
>         standby takes over?
>
>         The state of the cluster while it is headless is very similar.
>         The difference is that this state may last a little bit longer
>         (though not more than a few seconds, until one of the spare
>         SCs becomes active). Another difference is that we may have
>         lost some state. With a "perfect" implementation of the
>         headless feature we should not lose any state at all, but with
>         the current set of patches we do lose state.
>
>         So specifically if we talk about cluster membership and ask
>         the question: is a particular PL a member of the cluster or
>         not during the headless state? Well, if you ask CLM about this
>         during the headless state, then you will not know - because
>         CLM doesn't provide any service during the headless state. If
>         you keep retrying your query to CLM, you will eventually get an
>         answer - but you will not get this answer until there is an
>         active SC again and we have exited the headless state. When
>         viewed in this way, the answer to the question about a node's
>         membership is undefined during the headless state, since CLM
>         will not provide you with any answer until there is an active SC.
>
>         However, if you asked CLM about the node's cluster membership
>         status before the cluster went headless, you probably saved a
>         cached copy of the cluster membership state. Maybe you also
>         installed a CLM track callback and intend to update this
>         cached copy every time the cluster membership status changes.
>         The question then is: can you continue using this cached copy
>         of the cluster membership state during the headless state? The
>         answer is YES: since CLM doesn't provide any service during
>         the headless state, it also means that the cluster membership
>         view cannot change during this time. Nodes can of course
>         reboot or die, but CLM will not notice and hence the cluster
>         view will not be updated. You can argue that this is bad
>         because the cluster view doesn't reflect reality, but notice
>         that this will always be the case. We can never propagate
>         information instantaneously, and detection of node failures
>         will take 1.5 seconds due to the TIPC timeout. You can never
>         be sure that a node is alive at this very moment just because
>         CLM tells you that it is a member of the cluster. If we are
>         unfortunate enough to lose both system controller nodes
>         simultaneously, updates to the cluster membership view will be
>         delayed a few seconds longer than usual.
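The cached-copy idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: membership_cache_t, cache_apply() and cache_is_member() are hypothetical names, and cache_apply() stands for whatever the application does inside its CLM track callback. While the cluster is headless no callback arrives, so lookups simply keep returning the last known view.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 32   /* arbitrary capacity for the sketch */

typedef struct {
    uint32_t node_id;
    bool     is_member;
} node_state_t;

/* Last cluster view the application has seen. */
typedef struct {
    node_state_t nodes[MAX_NODES];
    size_t       count;
    uint64_t     view_number;
} membership_cache_t;

/* Record one membership change; intended to be called from the track
 * callback. Returns false only if the cache is full. */
static bool cache_apply(membership_cache_t *c, uint32_t node_id,
                        bool is_member, uint64_t view_number)
{
    for (size_t i = 0; i < c->count; i++) {
        if (c->nodes[i].node_id == node_id) {
            c->nodes[i].is_member = is_member;
            c->view_number = view_number;
            return true;
        }
    }
    if (c->count == MAX_NODES)
        return false;
    c->nodes[c->count].node_id = node_id;
    c->nodes[c->count].is_member = is_member;
    c->count++;
    c->view_number = view_number;
    return true;
}

/* Answer from the cached view; unknown nodes count as non-members.
 * During the headless state this is the most recent answer available. */
static bool cache_is_member(const membership_cache_t *c, uint32_t node_id)
{
    for (size_t i = 0; i < c->count; i++) {
        if (c->nodes[i].node_id == node_id)
            return c->nodes[i].is_member;
    }
    return false;
}
```

The view number is kept so that a consumer can tell whether the cache has changed since it last looked; it plays no role in the lookup itself.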
>
>
>             Best regards,
>             Nhat Pham
>
>             -----Original Message-----
>             From: A V Mahesh [mailto:[email protected]]
>             Sent: Monday, February 15, 2016 11:19 AM
>             To: Nhat Pham <[email protected]>
>             <mailto:[email protected]>;
>             [email protected]
>             <mailto:[email protected]>
>             Cc: [email protected]
>             <mailto:[email protected]>; 'Beatriz
>             Brandao'
>             <[email protected]>
>             <mailto:[email protected]>
>             Subject: Re: [PATCH 0 of 1] Review Request for cpsv:
>             Support preserving and
>             recovering checkpoint replicas during headless state V2
>             [#1621]
>
>             Hi Nhat Pham,
>
>             How was your holiday?
>
>             Please find my comments below
>
>             On 2/15/2016 8:43 AM, Nhat Pham wrote:
>
>                 Hi Mahesh,
>
>                 For the comment 1, the patch will be updated accordingly.
>
>             [AVM] Please hold; I will provide more comments this week,
>             so we can have a consolidated V3.
>
>                 For the comment 2, I think the CKPT service will not
>                 be backward compatible if the scAbsenceAllowed is true.
>                 The client can't create a non-collocated checkpoint on SCs.
>
>                 Furthermore, this solution only protects the CKPT
>                 service from the case "the non-collocated checkpoint is
>                 created on an SC"; there are still cases where the
>                 replicas are completely lost. For example:
>
>                 - The non-collocated checkpoint is created on a PL. The
>                 PL reboots. Both replicas are now located on SCs. Then
>                 the headless state happens. All replicas are lost.
>                 - The non-collocated checkpoint has its active replica
>                 located on a PL, and this PL restarts during the
>                 headless state.
>                 - The non-collocated checkpoint is created on PL3.
>                 This checkpoint is also opened on PL4. Then the SCs and
>                 PL3 reboot.
>
>             [AVM] Upon rejoining of the SCs, the replica should be
>             re-created regardless of whether another application has
>             opened it on PL4.
>             (Note: this comment is based on your explanation; I have
>             not yet reviewed/tested it. Currently I am struggling with
>             the SCs not rejoining after the headless state. I can
>             provide more on this once I complete my review/testing.)
>
>                 In this case, all replicas are lost and the client has
>                 to create it again.
>
>                 In case multiple nodes (including SCs) reboot, losing
>                 replicas is unavoidable. The patch is meant to recover
>                 the checkpoints in the cases where that is possible.
>                 What do you think?
>
>             [AVM] I understand that. Before I comment more on this,
>             please allow me to understand it better; I am still not
>             very clear about the headless design in detail.
>
>             For example, cluster membership of the PLs during the
>             headless state: in the absence of the SCs (CLMD), are the
>             PLs considered cluster nodes or not (cluster membership)?
>
>             - If they are not considered cluster nodes, the Checkpoint
>               Service API should leverage the SA Forum Cluster
>               Membership Service, and the APIs can fail with
>               SA_AIS_ERR_UNAVAILABLE.
>
>             - If they are considered cluster nodes, we need to follow
>               all the rules defined in the SAI-AIS-CKPT-B.02.02
>               specification.
>
>             So give me some more time to review it completely, so that
>             we can have a consolidated patch V3.
>
>             -AVM
>
>                 Best regards,
>                 Nhat Pham
>
>                 -----Original Message-----
>                 From: A V Mahesh [mailto:[email protected]]
>                 Sent: Friday, February 12, 2016 11:10 AM
>                 To: Nhat Pham <[email protected]>
>                 <mailto:[email protected]>;
>                 [email protected]
>                 <mailto:[email protected]>
>                 Cc: [email protected]
>                 <mailto:[email protected]>; Beatriz
>                 Brandao
>                 <[email protected]>
>                 <mailto:[email protected]>
>                 Subject: Re: [PATCH 0 of 1] Review Request for cpsv:
>                 Support
>                 preserving and recovering checkpoint replicas during
>                 headless state V2
>                 [#1621]
>
>
>                 Comment 2:
>
>                 After incorporating Comment 1, all the limitations
>                 should be prevented when the Hydra configuration is
>                 enabled in IMM.
>
>                 For example: if an application tries to create a
>                 non-collocated checkpoint whose active replica would
>                 be generated/located on an SC, then, regardless of
>                 whether the heads (SCs) exist or not, the call should
>                 return SA_AIS_ERR_NOT_SUPPORTED.
>
>                 In other words, reject the creation up front, rather
>                 than allowing a non-collocated checkpoint to be
>                 created while the heads (SCs) exist and then having it
>                 become unrecoverable after the heads (SCs) rejoin.
>
>                 ======================================================
>
>                     Limitation: The CKPT service doesn't support
>                     recovering checkpoints in the following cases:
>                     . The checkpoint which is unlinked before headless.
>                     . The non-collocated checkpoint has active replica
>                       locating on SC.
>                     . The non-collocated checkpoint has active replica
>                       locating on a PL and this PL restarts during
>                       headless state.
>                     In these cases, the checkpoint replica is destroyed.
>                     The fault code SA_AIS_ERR_BAD_HANDLE is returned when
>                     the client accesses the checkpoint in these cases.
>                     The client must re-open the checkpoint.
>
>                 ======================================================
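From the client's side, the limitation quoted above amounts to "re-open on SA_AIS_ERR_BAD_HANDLE". A minimal sketch of that pattern follows; access_with_reopen() and the fake_* stubs are hypothetical and only model the control flow, they are not CKPT API calls. Note that re-opening gives the client a fresh checkpoint, so any data held in the destroyed replica has to be written again by the application.

```c
/* Standard SaAisErrorT values, redefined so the sketch is self-contained. */
enum { SA_AIS_OK = 1, SA_AIS_ERR_BAD_HANDLE = 9 };

/* Hypothetical callback type for "access the checkpoint" and "re-open it". */
typedef int (*ckpt_call_fn)(void *ctx);

/* Try the access once; if the handle went stale because the replica was
 * destroyed, re-open the checkpoint and retry the access one time. */
static int access_with_reopen(ckpt_call_fn access, ckpt_call_fn reopen,
                              void *ctx)
{
    int rc = access(ctx);

    if (rc != SA_AIS_ERR_BAD_HANDLE)
        return rc;
    rc = reopen(ctx);
    if (rc != SA_AIS_OK)
        return rc;
    return access(ctx);
}

/* Illustration-only client state and stubs. */
typedef struct {
    int handle_valid;
    int reopen_count;
} fake_client_t;

static int fake_access(void *ctx)
{
    fake_client_t *c = ctx;
    return c->handle_valid ? SA_AIS_OK : SA_AIS_ERR_BAD_HANDLE;
}

static int fake_reopen(void *ctx)
{
    fake_client_t *c = ctx;
    c->handle_valid = 1;
    c->reopen_count++;
    return SA_AIS_OK;
}
```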
>
>                 -AVM
>
>
>                 On 2/11/2016 12:52 PM, A V Mahesh wrote:
>
>                     Hi,
>
>                     I just started reviewing the patch; I will give
>                     comments as soon as I come across any, to save
>                     some time.
>
>                     Comment 1:
>                     This functionality should be guarded by a check
>                     that the Hydra configuration is enabled in IMM:
>                     attrName =
>                     const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
>                     Please see how the LOG/AMF services implemented
>                     it, as an example.
>
>                     -AVM
>
>
>                     On 1/29/2016 1:02 PM, Nhat Pham wrote:
>
>                         Hi Mahesh,
>
>                         As described in the README, the CKPT service
>                         returns
>                         SA_AIS_ERR_TRY_AGAIN fault code in this case.
>                         I guess it's same for other services.
>
>                         @Anders: Could you please confirm this?
>
>                         Best regards,
>                         Nhat Pham
>
>                         -----Original Message-----
>                         From: A V Mahesh [mailto:[email protected]]
>                         Sent: Friday, January 29, 2016 2:11 PM
>                         To: Nhat Pham <[email protected]>
>                         <mailto:[email protected]>;
>                         [email protected]
>                         <mailto:[email protected]>
>                         Cc: [email protected]
>                         <mailto:[email protected]>
>                         Subject: Re: [PATCH 0 of 1] Review Request for
>                         cpsv: Support
>                         preserving and recovering checkpoint replicas
>                         during headless state
>                         V2 [#1621]
>
>                         Hi,
>
>                         On 1/29/2016 11:45 AM, Nhat Pham wrote:
>
>                                   -  The behavior of application will
>                             be consistent with other
>                             saf services like imm/amf behavior  during
>                             headless state.
>                             [Nhat] I'm not clear what you mean about
>                             "consistent"?
>
>                         In the absence of the Directors (SCs), what
>                         return values are expected from the SAF APIs
>                         (of all services) that are not in a position to
>                         provide service at that moment?
>
>                         I think all services should return the same SAF
>                         errors. I think we currently don't have that;
>                         maybe Anders Widell will help us.
>
>                         -AVM
>
>
>                         On 1/29/2016 11:45 AM, Nhat Pham wrote:
>
>                             Hi Mahesh,
>
>                             Please see the attachment for the README.
>                             Let me know if there is
>                             any more information required.
>
>                             Regarding your comments:
>                                   -  during headless state
>                             applications may behave like during
>                             CPND restart case
>                             [Nhat] Headless state and CPND restart are
>                             different events. Thus, the behavior is
>                             different.
>                             Headless state is a case where both SCs go
>                             down.
>
>                                   -  The behavior of application will
>                             be consistent with other
>                             saf services like imm/amf behavior  during
>                             headless state.
>                             [Nhat] I'm not clear what you mean about
>                             "consistent"?
>
>                             Best regards,
>                             Nhat Pham
>
>                             -----Original Message-----
>                             From: A V Mahesh
>                             [mailto:[email protected]]
>                             Sent: Friday, January 29, 2016 11:12 AM
>                             To: Nhat Pham <[email protected]>
>                             <mailto:[email protected]>;
>                             [email protected]
>                             <mailto:[email protected]>
>                             Cc: [email protected]
>                             <mailto:[email protected]>
>                             Subject: Re: [PATCH 0 of 1] Review Request
>                             for cpsv: Support
>                             preserving and recovering checkpoint
>                             replicas during headless state
>                             V2 [#1621]
>
>                             Hi Nhat Pham,
>
>                             I started reviewing this patch, so can you
>                             please provide a README file with the scope
>                             and limitations? That will help to define
>                             the testing/reviewing scope.
>
>                             Following are the minimum things we can keep
>                             in mind while reviewing/accepting the patch:
>
>                             - Not affecting existing functionality
>                                   -  During headless state,
>                                      applications may behave like
>                                      during the CPND restart case
>                                   -  The minimum functionality of the
>                                      application works
>                                   -  The behavior of the application
>                                      will be consistent with other saf
>                                      services like imm/amf behavior
>                                      during headless state.
>
>                             So please provide additional details in the
>                             README if there is any deviation from the
>                             above, so that users know about the
>                             limitations/deviations.
>
>                             -AVM
>
>                             On 1/4/2016 3:15 PM, Nhat Pham wrote:
>
>                                 Summary: cpsv: Support preserving and
>                                 recovering checkpoint replicas during
>                                 headless state [#1621]
>                                 Review request for Trac Ticket(s): #1621
>                                 Peer Reviewer(s): [email protected]
>                                 <mailto:[email protected]>;
>                                 [email protected]
>                                 <mailto:[email protected]>
>                                 Pull request to: [email protected]
>                                 <mailto:[email protected]>
>                                 Affected branch(es): default
>                                 Development branch: default
>
>                                 --------------------------------
>                                 Impacted area       Impact y/n
>                                 --------------------------------
>                                       Docs                    n
>                                       Build system            n
>                                       RPM/packaging           n
>                                       Configuration files     n
>                                       Startup scripts         n
>                                       SAF services            y
>                                       OpenSAF services        n
>                                       Core libraries          n
>                                       Samples                 n
>                                       Tests                   n
>                                       Other                   n
>
>
>                                 Comments (indicate scope for each "y"
>                                 above):
>                                 ---------------------------------------------
>
>
>                                 changeset
>                                 faec4a4445a4c23e8f630857b19aabb43b5af18d
>                                 Author:    Nhat Pham
>                                 <[email protected]>
>                                 <mailto:[email protected]>
>                                 Date:    Mon, 04 Jan 2016 16:34:33 +0700
>
>                                       cpsv: Support preserving and recovering
>                                       checkpoint replicas during headless state
>                                       [#1621]
>
>                                       Background:
>                                       ----------
>                                       This enhancement supports preserving
>                                       checkpoint replicas in case both SCs are
>                                       down (headless state) and recovering
>                                       replicas in case one of the SCs comes up
>                                       again. If both SCs go down, checkpoint
>                                       replicas on surviving nodes still remain.
>                                       When an SC is available again, surviving
>                                       replicas are automatically registered to
>                                       the SC checkpoint database. Content in
>                                       surviving replicas is intact and
>                                       synchronized to new replicas.
>
>                                       When no SC is available, client API calls
>                                       that change the checkpoint configuration,
>                                       which requires SC communication, are
>                                       rejected. Client API calls that read and
>                                       write existing checkpoint replicas still
>                                       work.
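The split described in this paragraph can be pictured as a small admission check in front of each client request. The sketch below is an assumption-laden illustration: the ckpt_op_t list and the classification in op_requires_sc() are hypothetical, and SA_AIS_ERR_TRY_AGAIN is used as the rejection code because that is what the README is said to specify elsewhere in this thread.

```c
#include <stdbool.h>

/* Standard SaAisErrorT values, redefined so the sketch is self-contained. */
enum { SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6 };

/* Hypothetical, incomplete list of client operations. */
typedef enum {
    CKPT_OP_OPEN_CREATE,    /* changes checkpoint configuration */
    CKPT_OP_UNLINK,         /* changes checkpoint configuration */
    CKPT_OP_SECTION_READ,   /* touches an existing replica only */
    CKPT_OP_SECTION_WRITE   /* touches an existing replica only */
} ckpt_op_t;

/* Assumed classification: which operations need to talk to an SC. */
static bool op_requires_sc(ckpt_op_t op)
{
    switch (op) {
    case CKPT_OP_OPEN_CREATE:
    case CKPT_OP_UNLINK:
        return true;
    case CKPT_OP_SECTION_READ:
    case CKPT_OP_SECTION_WRITE:
        return false;
    }
    return true;   /* unknown operations are treated conservatively */
}

/* Reject SC-dependent operations while headless; let the rest through. */
static int admit_op(ckpt_op_t op, bool sc_available)
{
    if (!sc_available && op_requires_sc(op))
        return SA_AIS_ERR_TRY_AGAIN;
    return SA_AIS_OK;
}
```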
>
>                                 Limitation: The CKPT service does not support recovering
>                                 checkpoints in the following cases:
>                                  - A checkpoint that was unlinked before the headless state.
>                                  - A non-collocated checkpoint whose active replica is
>                                    located on an SC.
>                                  - A non-collocated checkpoint whose active replica is
>                                    located on a PL, where this PL restarts during the
>                                    headless state.
>                                 In these cases, the checkpoint replica is destroyed. The
>                                 fault code SA_AIS_ERR_BAD_HANDLE is returned when the client
>                                 accesses the checkpoint. The client must re-open the
>                                 checkpoint.
>
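The three limitation cases above can be read as a simple predicate over the state of a checkpoint. The following is only a minimal sketch in C, taking the three cases literally. The struct `demo_ckpt_state`, its fields, and the function `demo_ckpt_is_recoverable` are hypothetical names for illustration and are not the real CPND data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of one checkpoint at the moment both SCs go down.
 * Illustrative only; these are not the real CPND data structures. */
struct demo_ckpt_state {
    bool unlinked_before_headless; /* unlinked before the headless state began */
    bool collocated;               /* created as a collocated checkpoint */
    bool active_replica_on_sc;     /* active replica was hosted on an SC */
    bool active_pl_restarted;      /* PL hosting the active replica restarted while headless */
};

/* Returns true if the checkpoint survives the headless state according to
 * the three limitation cases listed in the review request. */
static bool demo_ckpt_is_recoverable(const struct demo_ckpt_state *c)
{
    assert(c != NULL);
    if (c->unlinked_before_headless)
        return false;                 /* case 1: unlinked before headless */
    if (!c->collocated && c->active_replica_on_sc)
        return false;                 /* case 2: non-collocated, active replica on SC */
    if (!c->collocated && c->active_pl_restarted)
        return false;                 /* case 3: non-collocated, hosting PL restarted */
    return true;
}
```

A checkpoint for which this predicate is false would be the kind the text describes as destroyed, with later client access returning SA_AIS_ERR_BAD_HANDLE.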
>                                 While in headless state, accessing checkpoint replicas does
>                                 not work if the node that hosts the active replica goes
>                                 down. Access works again when an SC becomes available.
>
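For the limitation cases where SA_AIS_ERR_BAD_HANDLE is returned, the client-side reaction ("the client must re-open the checkpoint") could look roughly like the sketch below. It is an illustration only: `demo_ckpt`, `demo_write` and `demo_reopen` are hypothetical stand-ins for a real checkpoint handle and the real open/write calls, and the two error codes are local copies of the SaAisErrorT values. Note that re-opening only gives the client a usable handle again; the content of the destroyed replica is not restored by this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the two SaAisErrorT codes used here. */
typedef enum {
    DEMO_SA_AIS_OK = 1,
    DEMO_SA_AIS_ERR_BAD_HANDLE = 9
} demo_rc_t;

/* Fake checkpoint handle, used only to exercise the control flow. */
struct demo_ckpt {
    bool handle_valid;  /* false once the replica was destroyed while headless */
    int reopen_count;   /* how many times the client re-opened the checkpoint */
    int write_count;    /* how many writes succeeded */
};

/* Stand-in for a checkpoint write: fails on a stale handle. */
static demo_rc_t demo_write(struct demo_ckpt *c)
{
    if (!c->handle_valid)
        return DEMO_SA_AIS_ERR_BAD_HANDLE;
    c->write_count++;
    return DEMO_SA_AIS_OK;
}

/* Stand-in for re-opening the checkpoint: yields a valid handle again. */
static demo_rc_t demo_reopen(struct demo_ckpt *c)
{
    c->handle_valid = true;
    c->reopen_count++;
    return DEMO_SA_AIS_OK;
}

/* Write once; on BAD_HANDLE re-open the checkpoint and retry a single time. */
static demo_rc_t demo_write_with_reopen(struct demo_ckpt *c)
{
    assert(c != NULL);
    demo_rc_t rc = demo_write(c);
    if (rc != DEMO_SA_AIS_ERR_BAD_HANDLE)
        return rc;
    rc = demo_reopen(c);
    if (rc != DEMO_SA_AIS_OK)
        return rc;
    return demo_write(c);
}
```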
>                                 Solution:
>                                 ---------
>                                 The solution for this enhancement includes 2 parts:
>
>                                 1. Destroy the un-recoverable checkpoints described above
>                                    when both SCs are down: the CPND deletes un-recoverable
>                                    checkpoint nodes and replicas on PLs. Then it requests
>                                    the CPA to destroy the corresponding checkpoint node by
>                                    using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>
>                                 2. Update the CPD with checkpoint information: when an
>                                    active SC is up after headless, the CPND updates the CPD
>                                    with checkpoint information by using the new message
>                                    CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
>                                    CPD_EVT_ND2D_CKPT_CREATE. This is because, if
>                                    CPD_EVT_ND2D_CKPT_CREATE is used, the CPND will create a
>                                    new ckpt_id for the checkpoint, which might differ from
>                                    the current ckpt id. The CPD collects checkpoint
>                                    information within 6s. During this update time, the
>                                    following requests are rejected with fault code
>                                    SA_AIS_ERR_TRY_AGAIN:
>                                    - CPD_EVT_ND2D_CKPT_CREATE
>                                    - CPD_EVT_ND2D_CKPT_UNLINK
>                                    - CPD_EVT_ND2D_ACTIVE_SET
>                                    - CPD_EVT_ND2D_CKPT_RDSET
>
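The rejection rule in part 2 of the solution can be sketched as a small gate function. This is a minimal sketch and not the code from the patch: the enum names are local stand-ins for the CPD_EVT_ND2D_* events and SaAisErrorT codes, and treating the 6s collection period as a fixed window of 6000 ms counted from a start timestamp (with the end of the window exclusive) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the CPD_EVT_ND2D_* events named in the review request. */
typedef enum {
    DEMO_EVT_CKPT_CREATE,
    DEMO_EVT_CKPT_UNLINK,
    DEMO_EVT_ACTIVE_SET,
    DEMO_EVT_CKPT_RDSET,
    DEMO_EVT_CKPT_INFO_UPDATE
} demo_evt_t;

/* Local stand-ins for SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN. */
typedef enum {
    DEMO_RC_OK = 1,
    DEMO_RC_TRY_AGAIN = 6
} demo_gate_rc_t;

/* Assumed length of the collection period (the "6s" in the text). */
#define DEMO_UPDATE_WINDOW_MS 6000u

/* True while the CPD is still collecting checkpoint information. */
static bool demo_in_update_window(uint64_t now_ms, uint64_t window_start_ms)
{
    return now_ms >= window_start_ms &&
           now_ms - window_start_ms < DEMO_UPDATE_WINDOW_MS;
}

/* Decide whether a request is processed or rejected with TRY_AGAIN. */
static demo_gate_rc_t demo_cpd_gate(demo_evt_t evt, uint64_t now_ms,
                                    uint64_t window_start_ms)
{
    assert(evt <= DEMO_EVT_CKPT_INFO_UPDATE);
    if (!demo_in_update_window(now_ms, window_start_ms))
        return DEMO_RC_OK;
    switch (evt) {
    case DEMO_EVT_CKPT_CREATE:
    case DEMO_EVT_CKPT_UNLINK:
    case DEMO_EVT_ACTIVE_SET:
    case DEMO_EVT_CKPT_RDSET:
        return DEMO_RC_TRY_AGAIN;   /* the four requests listed above */
    default:
        return DEMO_RC_OK;          /* e.g. the info update itself */
    }
}
```

With this shape, the info update message is always let through during the window, while the four listed requests are expected to be retried by the sender once the window has closed.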
>
>                                 Complete diffstat:
>                                 ------------------
>                                  osaf/libs/agents/saf/cpa/cpa_proc.c       |   52
>                                  osaf/libs/common/cpsv/cpsv_edu.c          |   43
>                                  osaf/libs/common/cpsv/include/cpd_cb.h    |    3
>                                  osaf/libs/common/cpsv/include/cpd_imm.h   |    1
>                                  osaf/libs/common/cpsv/include/cpd_proc.h  |    7
>                                  osaf/libs/common/cpsv/include/cpd_tmr.h   |    3
>                                  osaf/libs/common/cpsv/include/cpnd_cb.h   |    1
>                                  osaf/libs/common/cpsv/include/cpnd_init.h |    2
>                                  osaf/libs/common/cpsv/include/cpsv_evt.h  |   20
>                                  osaf/services/saf/cpsv/cpd/Makefile.am    |    3
>                                  osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229
>                                  osaf/services/saf/cpsv/cpd/cpd_imm.c      |  112
>                                  osaf/services/saf/cpsv/cpd/cpd_init.c     |   20
>                                  osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309
>                                  osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7
>                                  osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16
>                                  osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   22
>                                  osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   23
>                                  osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13
>                                  osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  314
>                                  20 files changed, 1189 insertions(+), 11 deletions(-)
>
>
>                                 Testing Commands:
>                                 -----------------
>                                 -
>
>                                 Testing, Expected Results:
>                                 --------------------------
>                                 -
>
>
>                                 Conditions of Submission:
>                                 -------------------------
>                                  <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>
>                                 Arch      Built     Started    Linux distro
>                                 -------------------------------------------
>
>                                 mips        n          n
>                                 mips64      n          n
>                                 x86         n          n
>                                 x86_64      n          n
>                                 powerpc     n          n
>                                 powerpc64   n          n
>
>
>                                 Reviewer Checklist:
>                                 -------------------
>                                 [Submitters: make sure that your review doesn't trigger any
>                                 checkmarks!]
>
>
>                                 Your checkin has not passed review because (see checked
>                                 entries):
>
>                                 ___ Your RR template is generally incomplete; it has too
>                                     many blank entries that need proper data filled in.
>
>                                 ___ You have failed to nominate the proper persons for
>                                     review and push.
>
>                                 ___ Your patches do not have proper short+long header
>
>                                 ___ You have grammar/spelling in your header that is
>                                     unacceptable.
>
>                                 ___ You have exceeded a sensible line length in your
>                                     headers/comments/text.
>
>                                 ___ You have failed to put in a proper Trac Ticket # into
>                                     your commits.
>
>                                 ___ You have incorrectly put/left internal data in your
>                                     comments/files (i.e. internal bug tracking tool IDs,
>                                     product names etc)
>
>                                 ___ You have not given any evidence of testing beyond basic
>                                     build tests. Demonstrate some level of runtime or other
>                                     sanity testing.
>
>                                 ___ You have ^M present in some of your files. These have
>                                     to be removed.
>
>                                 ___ You have needlessly changed whitespace or added
>                                     whitespace crimes like trailing spaces, or spaces
>                                     before tabs.
>
>                                 ___ You have mixed real technical changes with whitespace
>                                     and other cosmetic code cleanup changes. These have to
>                                     be separate commits.
>
>                                 ___ You need to refactor your submission into logical
>                                     chunks; there is too much content in a single commit.
>
>                                 ___ You have extraneous garbage in your review (merge
>                                     commits etc)
>
>                                 ___ You have giant attachments which should never have been
>                                     sent; instead you should place your content in a public
>                                     tree to be pulled.
>
>                                 ___ You have too many commits attached to an e-mail; resend
>                                     as threaded commits, or place in a public tree for a
>                                     pull.
>
>                                 ___ You have resent this content multiple times without a
>                                     clear indication of what has changed between each
>                                     re-send.
>
>                                 ___ You have failed to adequately and individually address
>                                     all of the comments and change requests that were
>                                     proposed in the initial review.
>
>                                 ___ You have a misconfigured ~/.hgrc file (i.e. username,
>                                     email etc)
>
>                                 ___ Your computer has a badly configured date and time,
>                                     confusing the threaded patch review.
>
>                                 ___ Your changes affect IPC mechanism, and you don't
>                                     present any results for in-service upgradability test.
>
>                                 ___ Your changes affect user manual and documentation, and
>                                     your patch series does not contain the patch that
>                                     updates the Doxygen manual.
>

_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
