Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Nhat Pham Sun, 21 Feb 2016 20:09:07 -0800

Hi Mahesh,

Could you please clarify in which case the error below happened?


Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f

According to the log, PL-4 joined the cluster, which means the cluster is not
in headless state, doesn't it?

Best regards,
Nhat Pham

-----Original Message-----
From: Nhat Pham [mailto:nhat.p...@dektech.com.au] 
Sent: Monday, February 22, 2016 9:19 AM
To: 'A V Mahesh' <mahesh.va...@oracle.com>; 'Anders Widell'
<anders.wid...@ericsson.com>
Cc: 'Beatriz Brandao' <beatriz.bran...@ericsson.com>; 'Minh Chau H'
<minh.c...@dektech.com.au>; opensaf-devel@lists.sourceforge.net
Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
preserving and recovering checkpoint replicas during headless state V2
[#1621]

Hi Mahesh and Anders,

Please see my comment below.

BTW, have you finished the review and test?

Best regards,
Nhat Pham


From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Friday, February 19, 2016 2:28 PM
To: Nhat Pham <nhat.p...@dektech.com.au>; 'Anders Widell' <anders.wid...@ericsson.com>; 'Minh Chau H' <minh.c...@dektech.com.au>
Cc: opensaf-devel@lists.sourceforge.net; 'Beatriz Brandao' <beatriz.bran...@ericsson.com>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

On 2/19/2016 12:28 PM, Nhat Pham wrote:

Could you please give more detailed information about steps to reproduce the
problem below? Thanks.


I don't see this as a specific bug. We need to look at the issue from the
point of view of a CLM-integrated service. Considering Anders Widell's
explanation of CLM application behavior during headless state, we need to
reintegrate CPND with CLM (before this headless state feature there was no
case of CPND existing in the absence of CLMD, but now there is).

And this will be consistent across all the services that are integrated with
CLM (you may need some changes in CLM also).

[Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case.

@Anders: What do you think?

To start with, let us consider the case where CPND is restarted on a payload
(PL) during headless state while an application is running on that PL.

[Nhat Pham] Regarding CPND as a CLM application, I'm not sure what it can
do in this case. If it restarts, it is monitored by AMF.

If it blocks for too long, AMF will also trigger a node reboot.

In my test case, CPND gets blocked by CLM. It doesn't return from
saClmInitialize. How do you get the "ER cpnd clm init failed with return
value:31"?

The following is the cpnd trace.

Feb 22  8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
Feb 22  8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
Feb 22  8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
Feb 22  8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
Feb 22  8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
Feb 22  8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
Feb 22  8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
Feb 22  8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init

-AVM

On 2/19/2016 12:28 PM, Nhat Pham wrote:

Hi Mahesh,

Could you please give more detailed information about steps to reproduce the
problem below? Thanks.

Best regards,
Nhat Pham

From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Friday, February 19, 2016 1:06 PM
To: Anders Widell <anders.wid...@ericsson.com>; Nhat Pham <nhat.p...@dektech.com.au>; 'Minh Chau H' <minh.c...@dektech.com.au>
Cc: opensaf-devel@lists.sourceforge.net; 'Beatriz Brandao' <beatriz.bran...@ericsson.com>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621]

Hi Anders Widell,
Thanks for the detailed explanation about CLM during headless state.

Hi Nhat Pham,

Comment 3:
Please see the problem below. I am now seeing it in the absence of CLMD
(during headless state), so CPND/CLMA now need to address the case below.
Currently cpnd clm init fails with return value SA_AIS_ERR_UNAVAILABLE,
but it should be SA_AIS_ERR_TRY_AGAIN.
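
To illustrate why the return code matters here, the following is a minimal Python model (not OpenSAF code) of a caller that retries only on SA_AIS_ERR_TRY_AGAIN. The numeric values are the standard SA Forum AIS codes, with 31 matching the "return value:31" in the log below. The names `init_with_retry` and `make_fake_clm` are hypothetical, and `clm_initialize` is only a stand-in for saClmInitialize().

```python
import time
from enum import IntEnum


class SaAisErrorT(IntEnum):
    # Subset of the SA Forum AIS return codes used in this thread.
    SA_AIS_OK = 1
    SA_AIS_ERR_TRY_AGAIN = 6
    SA_AIS_ERR_UNAVAILABLE = 31


def init_with_retry(clm_initialize, max_attempts=10, delay_s=0.0, sleep=time.sleep):
    """Call clm_initialize() until it stops returning TRY_AGAIN.

    clm_initialize is a hypothetical zero-argument stand-in for
    saClmInitialize(). Returns (last_return_code, attempts_made).
    """
    rc = SaAisErrorT.SA_AIS_ERR_TRY_AGAIN
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        rc = clm_initialize()
        if rc != SaAisErrorT.SA_AIS_ERR_TRY_AGAIN:
            break  # success, or a non-retryable error such as UNAVAILABLE
        sleep(delay_s)
    return rc, attempts


def make_fake_clm(results):
    """Build a fake initializer that returns the given codes in order."""
    it = iter(results)
    return lambda: next(it)
```

With SA_AIS_ERR_UNAVAILABLE the loop gives up on the first attempt, which corresponds to the "cpnd init failed" path in the log; with SA_AIS_ERR_TRY_AGAIN the caller keeps waiting until CLM answers or the attempt budget is used up.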

==================================================
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
==================================================

-AVM

On 2/15/2016 5:11 PM, Anders Widell wrote:

Hi! 

Please find my answer inline, marked [AndersW]. 

regards,
Anders Widell 

On 02/15/2016 10:38 AM, Nhat Pham wrote: 

Hi Mahesh, 

It's good. Thank you. :) 

[AVM] Upon rejoining of the SCs, the replica should be re-created
regardless of whether another application has opened it on PL4.
(Note: this comment is based on your explanation; I have not yet
reviewed/tested it. I am currently struggling with the SCs not rejoining
after headless state. I can provide more on this once I complete my
review/testing.)

[Nhat] To make cloud resilience work, you need the patches from the other
services (log, amf, clm, ntf).
@Minh: I heard that you created a tar file which includes all patches. Could
you please send it to Mahesh? Thanks

[AVM] I understand that. Before I comment more on this, please allow me to
understand the design: I am still not very clear about the headless design
in detail.
For example, cluster membership of PLs during headless state: in the absence
of SCs (CLMD), are the PLs considered cluster nodes or not (cluster
membership)?

[Nhat] I don't know much about this.
@Anders: Could you please comment on this? Thanks

[AndersW] First of all, keep in mind that the "headless" state should
ideally not last a very long time. Once we have the spare SC feature in
place (ticket [#79]), a new SC should become active within a matter of a few
seconds after we have lost both the active and the standby SC. 

I think you should view the state of the cluster in the headless state in
the same way as you view the state of the cluster during a failover between
the active and the standby SC. Imagine that the active SC dies. It takes the
standby SC 1.5 seconds to detect the failure of the active SC (this is due
to the TIPC timeout). If you have configured the PROMOTE_ACTIVE_TIMER, there
is an additional delay before the standby takes over as active. What is the
state of the cluster during the time after the active SC failed and before
the standby takes over? 

The state of the cluster while it is headless is very similar. The
difference is that this state may last a little bit longer (though not more
than a few seconds, until one of the spare SCs becomes active). Another
difference is that we may have lost some state. With a "perfect"
implementation of the headless feature we should not lose any state at all,
but with the current set of patches we do lose state. 

So specifically if we talk about cluster membership and ask the question: is
a particular PL a member of the cluster or not during the headless state?
Well, if you ask CLM about this during the headless state, then you will not
know - because CLM doesn't provide any service during the headless state. If
you keep retrying your query to CLM, you will eventually get an answer - but
you will not get this answer until there is an active SC again and we have
exited the headless state. When viewed in this way, the answer to the
question about a node's membership is undefined during the headless state,
since CLM will not provide you with any answer until there is an active SC. 

However, if you asked CLM about the node's cluster membership status before
the cluster went headless, you probably saved a cached copy of the cluster
membership state. Maybe you also installed a CLM track callback and intend
to update this cached copy every time the cluster membership status changes.
The question then is: can you continue using this cached copy of the cluster
membership state during the headless state? The answer is YES: since CLM
doesn't provide any service during the headless state, it also means that
the cluster membership view cannot change during this time. Nodes can of
course reboot or die, but CLM will not notice and hence the cluster view
will not be updated. You can argue that this is bad because the cluster view
doesn't reflect reality, but notice that this will always be the case. We
can never propagate information instantaneously, and detection of node
failures will take 1.5 seconds due to the TIPC timeout. You can never be
sure that a node is alive at this very moment just because CLM tells you
that it is a member of the cluster. If we are unfortunate enough to lose
both system controller nodes simultaneously, updates to the cluster
membership view will be delayed a few seconds longer than usual. 
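
The cached-view argument above can be sketched with a small Python model. This is an illustration only, not the CLM API: the class `CachedClusterView` and its methods are hypothetical names, `on_track_callback` stands in for a CLM track callback, and `query_clm` returning `None` stands in for a query that gets no answer (and would have to be retried) during the headless state.

```python
class CachedClusterView:
    """Keeps the last membership reported by a simulated CLM track callback."""

    def __init__(self):
        self.members = set()
        self.clm_available = True

    def on_track_callback(self, node, is_member):
        # Invoked only while CLM is providing service.
        if is_member:
            self.members.add(node)
        else:
            self.members.discard(node)

    def set_headless(self, headless):
        # While headless, CLM provides no service and sends no updates.
        self.clm_available = not headless

    def is_member(self, node):
        # The cached copy stays usable during headless state (it may be stale).
        return node in self.members

    def query_clm(self, node):
        # A direct query gets no answer while headless; the caller must retry.
        if not self.clm_available:
            return None
        return node in self.members
```

The cache cannot change while the cluster is headless because no callback arrives, so `is_member` keeps returning the last known answer; updates are simply delayed until an SC is active again.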




Best regards,
Nhat Pham 

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Monday, February 15, 2016 11:19 AM
To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; 'Beatriz Brandao' <beatriz.bran...@ericsson.com>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621] 

Hi Nhat Pham, 

How was your holiday?

Please find my comments below 

On 2/15/2016 8:43 AM, Nhat Pham wrote: 

Hi Mahesh, 

For comment 1, the patch will be updated accordingly.

[AVM] Please hold. I will provide more comments this week, so we can
have a consolidated V3.

For comment 2, I think the CKPT service will not be backward compatible
if scAbsenceAllowed is true.
The client can't create a non-collocated checkpoint on SCs.

Furthermore, this solution only protects the CKPT service from the case "The
non-collocated checkpoint is created on a SC".
There are still cases where the replicas are completely lost. Ex:

- The non-collocated checkpoint is created on a PL. The PL reboots. Both
replicas are now located on SCs. Then headless state happens. All replicas
are lost.
- The non-collocated checkpoint has its active replica located on a PL and
this PL restarts during headless state.
- The non-collocated checkpoint is created on PL3. This checkpoint is also
opened on PL4. Then the SCs and PL3 reboot.

[AVM] Upon rejoining of the SCs, the replica should be re-created
regardless of whether another application has opened it on PL4.
(Note: this comment is based on your explanation; I have not yet
reviewed/tested it. I am currently struggling with the SCs not rejoining
after headless state. I can provide more on this once I complete my
review/testing.)

In this case, all replicas are lost and the client has to create the
checkpoint again.

In case multiple nodes (including SCs) reboot, losing replicas is
unpreventable. The patch is meant to recover the checkpoints in the cases
where that is possible.

What do you think?

[AVM] I understand that. Before I comment more on this, please allow me to
understand the design: I am still not very clear about the headless design
in detail.

For example, cluster membership of PLs during headless state: in the absence
of SCs (CLMD), are the PLs considered cluster nodes or not (cluster
membership)?

  - If they are not considered cluster nodes, the Checkpoint Service API
    should leverage the SA Forum Cluster Membership Service, and APIs can
    fail with SA_AIS_ERR_UNAVAILABLE.

  - If they are considered cluster nodes, we need to follow all the rules
    defined in the SAI-AIS-CKPT-B.02.02 specification.

So give me some more time to review it completely, so that we can have a
consolidated patch V3.

-AVM 

Best regards,
Nhat Pham 

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Friday, February 12, 2016 11:10 AM
To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net; Beatriz Brandao <beatriz.bran...@ericsson.com>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state V2 [#1621] 


Comment 2:

After incorporating comment 1, all the limitations should be prevented
whenever the Hydra configuration is enabled in IMM.

For example: if some application tries to create a non-collocated checkpoint
whose active replica would be located on an SC, then, regardless of whether
the heads (SCs) exist or not, the call should return
SA_AIS_ERR_NOT_SUPPORTED.

In other words, this is preferable to allowing a non-collocated checkpoint
to be created while the heads (SCs) exist and then having that checkpoint
become unrecoverable after the heads (SCs) rejoin.
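
The check proposed in this comment could look roughly like the Python model below. This is only a sketch of the proposal, not the behavior of the patch: `check_create` and its parameters are hypothetical names, and the numeric values are the standard SA Forum AIS codes.

```python
from enum import IntEnum


class SaAisErrorT(IntEnum):
    # Subset of the SA Forum AIS return codes relevant to this check.
    SA_AIS_OK = 1
    SA_AIS_ERR_NOT_SUPPORTED = 19


def check_create(collocated, active_replica_on_sc, sc_absence_allowed):
    """Decide up front whether a checkpoint create request is accepted.

    When the headless feature is enabled, a non-collocated checkpoint whose
    active replica would sit on an SC is refused at creation time, whether or
    not the SCs are currently present.
    """
    if sc_absence_allowed and not collocated and active_replica_on_sc:
        return SaAisErrorT.SA_AIS_ERR_NOT_SUPPORTED
    return SaAisErrorT.SA_AIS_OK
```

Refusing at creation time trades backward compatibility for predictability: the application learns immediately that the checkpoint could not survive a headless period, instead of losing it later.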

=============================================================================================

    Limitation: The CKPT service doesn't support recovering checkpoints in
    following cases:
    . The checkpoint which is unlinked before headless.
    . The non-collocated checkpoint has active replica locating on SC.
    . The non-collocated checkpoint has active replica locating on a PL and this PL
    restarts during headless state. In this cases, the checkpoint replica is
    destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client
    accesses the checkpoint in these cases. The client must re-open the
    checkpoint.

=============================================================================================

-AVM 


On 2/11/2016 12:52 PM, A V Mahesh wrote: 

Hi, 

I just started reviewing the patch. I will give comments as soon as I come
across any, to save some time.

Comment 1:
This functionality should be guarded by a check of whether the Hydra
configuration is enabled in IMM:
attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")

Please see how the LOG/AMF services implemented it as an example.
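
As a rough illustration of this kind of guard, the Python model below gates the new behavior on the scAbsenceAllowed attribute. It is a sketch under assumptions, not the LOG/AMF implementation: `imm_attrs` is a hypothetical dictionary standing in for values read from IMM, the function names are made up, and treating any positive integer as "enabled" is an assumption about the attribute's semantics.

```python
def headless_support_enabled(imm_attrs):
    """Return True if the (assumed) scAbsenceAllowed setting enables headless mode.

    imm_attrs is a hypothetical dict of attribute name -> value, standing in
    for what a service would read from IMM. A missing attribute or a value of
    0 is treated as "feature disabled".
    """
    value = imm_attrs.get("scAbsenceAllowed", 0)
    return isinstance(value, int) and value > 0


def on_both_scs_down(imm_attrs):
    # Only take the new code path when the feature is switched on.
    if headless_support_enabled(imm_attrs):
        return "preserve_replicas"
    return "legacy_behavior"
```

Keeping the check in one helper means every new code path in the service asks the same question, so existing deployments without the setting keep their current behavior.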

-AVM 


On 1/29/2016 1:02 PM, Nhat Pham wrote: 

Hi Mahesh, 

As described in the README, the CKPT service returns the SA_AIS_ERR_TRY_AGAIN
fault code in this case.
I guess it's the same for other services.

@Anders: Could you please confirm this? 

Best regards,
Nhat Pham 

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Friday, January 29, 2016 2:11 PM
To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state
V2 [#1621] 

Hi, 

On 1/29/2016 11:45 AM, Nhat Pham wrote: 

      -  The behavior of application will be consistent with other saf
services like imm/amf behavior  during headless state. 
[Nhat] I'm not clear what you mean about "consistent"? 

In the absence of the Director (SCs), what are the expected return values of
the SAF APIs (for all services) that are not in a position to provide
service at that moment?

I think all services should return the same SAF errors. I think we currently
don't have that; maybe Anders Widell can help us.

-AVM 


On 1/29/2016 11:45 AM, Nhat Pham wrote: 

Hi Mahesh, 

Please see the attachment for the README. Let me know if any more
information is required.

Regarding your comments:
      -  during headless state  applications may behave like during CPND
restart case
[Nhat] Headless state and CPND restart are different events.
Thus, the behavior is different.
Headless state is a case where both SCs go down.

      -  The behavior of application will be consistent with other saf
services like imm/amf behavior  during headless state.
[Nhat] I'm not clear what you mean by "consistent".

Best regards,
Nhat Pham 

-----Original Message-----
From: A V Mahesh [mailto:mahesh.va...@oracle.com]
Sent: Friday, January 29, 2016 11:12 AM
To: Nhat Pham <nhat.p...@dektech.com.au>; anders.wid...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and
recovering checkpoint replicas during headless state
V2 [#1621] 

Hi Nhat Pham, 

I started reviewing this patch, so can you please provide a README file with
scope and limitations? That will help to define the testing/reviewing scope.

The following are the minimum things we can keep in mind while
reviewing/accepting the patch:

- Not affecting existing functionality
      -  During headless state, applications may behave as they do in the
         CPND restart case
      -  The minimum functionality of the application works
      -  The behavior of the application will be consistent with
         other SAF services like imm/amf during headless state.

So please provide additional details in the README if the patch deviates
from any of the above, so that users know about the
limitations/deviations.

-AVM 

On 1/4/2016 3:15 PM, Nhat Pham wrote: 

Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
Review request for Trac Ticket(s): #1621
Peer Reviewer(s): mahesh.va...@oracle.com; anders.wid...@ericsson.com
Pull request to: mahesh.va...@oracle.com
Affected branch(es): default
Development branch: default

-------------------------------- 
Impacted area       Impact y/n 
-------------------------------- 
      Docs                    n 
      Build system            n 
      RPM/packaging           n 
      Configuration files     n 
      Startup scripts         n 
      SAF services            y 
      OpenSAF services        n 
      Core libraries          n 
      Samples                 n 
      Tests                   n 
      Other                   n 


Comments (indicate scope for each "y" above): 
--------------------------------------------- 

changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
Author:    Nhat Pham <nhat.p...@dektech.com.au>
Date:    Mon, 04 Jan 2016 16:34:33 +0700

      cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]

      Background:
      ----------
      This enhancement supports preserving checkpoint replicas in case both
      SCs are down (headless state) and recovering replicas in case one of
      the SCs is up again. If both SCs go down, checkpoint replicas on
      surviving nodes still remain. When an SC is available again, surviving
      replicas are automatically registered to the SC checkpoint database.
      Content in surviving replicas is intact and synchronized to new
      replicas.

      When no SC is available, client API calls changing checkpoint
      configuration, which require SC communication, are rejected. Client
      API calls reading and writing existing checkpoint replicas still work.

      Limitation: The CKPT service does not support recovering checkpoints
      in the following cases:
       - The checkpoint which is unlinked before headless.
       - The non-collocated checkpoint has active replica locating on SC.
       - The non-collocated checkpoint has active replica locating on a PL
         and this PL restarts during headless state.
      In these cases, the checkpoint replica is destroyed. The fault code
      SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
      checkpoint in these cases. The client must re-open the checkpoint.

      While in headless state, accessing checkpoint replicas does not work
      if the node which hosts the active replica goes down. It will be back
      working when an SC is available again.
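
The three unrecoverable cases listed above can be summarized as a small predicate. The Python model below is an illustration only, not code from the patch: the `Checkpoint` record, the `is_recoverable` function, and the convention that SC node names start with "SC" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str
    collocated: bool
    unlinked_before_headless: bool
    active_replica_node: str  # e.g. "SC-1" or "PL-3" (naming is an assumption)


def is_recoverable(ckpt, restarted_during_headless):
    """Apply the three limitation cases; restarted_during_headless is a set of node names."""
    if ckpt.unlinked_before_headless:
        return False  # case 1: unlinked before headless
    if not ckpt.collocated:
        if ckpt.active_replica_node.startswith("SC"):
            return False  # case 2: active replica was located on an SC
        if ckpt.active_replica_node in restarted_during_headless:
            return False  # case 3: hosting PL restarted while headless
    return True


def access_result(ckpt, restarted_during_headless):
    # Unrecoverable checkpoints are destroyed, so the client sees BAD_HANDLE
    # and has to re-open the checkpoint.
    if is_recoverable(ckpt, restarted_during_headless):
        return "SA_AIS_OK"
    return "SA_AIS_ERR_BAD_HANDLE"
```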

      Solution:
      ---------
      The solution for this enhancement includes 2 parts:

      1. To destroy un-recoverable checkpoints described above when both SCs
      are down: When both SCs are down, the CPND deletes un-recoverable
      checkpoint nodes and replicas on PLs. Then it requests CPA to destroy
      the corresponding checkpoint node by using the new message
      CPA_EVT_ND2A_CKPT_DESTROY.

      2. To update CPD with checkpoint information: When an active SC is up
      after headless, CPND will update CPD with checkpoint information by
      using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of using
      CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND will create a new
      ckpt_id for the checkpoint, which might be different from the current
      ckpt id, if CPD_EVT_ND2D_CKPT_CREATE is used. The CPD collects
      checkpoint information within 6s. During this updating time, the
      following requests are rejected with fault code SA_AIS_ERR_TRY_AGAIN:
      - CPD_EVT_ND2D_CKPT_CREATE
      - CPD_EVT_ND2D_CKPT_UNLINK
      - CPD_EVT_ND2D_ACTIVE_SET
      - CPD_EVT_ND2D_CKPT_RDSET

Complete diffstat:
------------------
      osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 +++++++++++++++++++++++++++++++++++
      osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++++++++++
      osaf/libs/common/cpsv/include/cpd_cb.h    |    3 ++
      osaf/libs/common/cpsv/include/cpd_imm.h   |    1 +
      osaf/libs/common/cpsv/include/cpd_proc.h  |    7 ++++
      osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
      osaf/libs/common/cpsv/include/cpnd_cb.h   |    1 +
      osaf/libs/common/cpsv/include/cpnd_init.h |    2 +
      osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 +++++++++++++
      osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
      osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      osaf/services/saf/cpsv/cpd/cpd_imm.c      |  112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      osaf/services/saf/cpsv/cpd/cpd_init.c     |   20 ++++++++++++-
      osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 ++++
      osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++++
      osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   22 +++++++++++++++
      osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   23 ++++++++++++++-
      osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++++
      osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  314 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
      20 files changed, 1189 insertions(+), 11 deletions(-)


Testing Commands: 
-----------------
- 

Testing, Expected Results: 
--------------------------
- 


Conditions of Submission: 
------------------------- 
      <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>> 


Arch      Built     Started    Linux distro 
------------------------------------------- 
mips        n          n 
mips64      n          n 
x86         n          n 
x86_64      n          n 
powerpc     n          n 
powerpc64   n          n 


Reviewer Checklist: 
------------------- 
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
         that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
         (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
         Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
         like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
         cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
         too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
         Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
         commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
         of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
         comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer have a badly configured date and time; confusing the
         threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
         for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
         do not contain the patch that updates the Doxygen manual.

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
<mailto:Opensaf-devel@lists.sourceforge.net> 
https://lists.sourceforge.net/lists/listinfo/opensaf-devel