Hi Nhat Pham, Please find my answers.
-AVM

On 2/26/2016 10:23 AM, Nhat Pham wrote:

Hi Mahesh,

Please see my answers below with [NhatPham5]

Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Friday, February 26, 2016 11:17 AM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

>> [NhatPham4] To be more correct, the application will get SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint because all data was destroyed.
>> [AndersW4] If this is a problem we could re-create the checkpoint with no sections in it.

I also came across this approach: instead of destroying the checkpoint information in the CPNDs of the payloads (as the current patch does) and returning SA_AIS_ERR_BAD_HANDLE to applications on the PLs, the new V3 patch should check the possibility of re-creating the checkpoint with sections (you can send this data from the PL to the SC once CPD is up).

[NhatPham5] By "all data" here I mean the checkpoint node information in the database controlled by cpnd (not the replica). In this case, all replicas were lost. How can the checkpoint be re-created with sections?

[AVM] I know the replicas are lost. I am suggesting to use the `checkpoint node information` available at the PL CPND (only one node would volunteer for this if multiple applications opened the checkpoint) and to try to re-create the checkpoint as if a fresh request had come from a CPA (CPD assumes the request came all the way from CPA --> CPND --> CPD, but it did not), so that CPD creates new replicas whose sections are clean/empty, instead of asking the application to re-create the checkpoint. I think the LOG streams get re-created with empty/fresh data like this, using the data available at the LGA.
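The "only one node volunteers" idea above could be sketched as follows. This is a hypothetical helper under my own naming, not code from the patch; it only illustrates picking one CPND deterministically (here, the lowest node id) among the payloads that still hold the checkpoint node information, so that exactly one of them sends the re-create request towards CPD:

```c
#include <assert.h>

/* Hypothetical sketch: choose a single volunteer among the payload
 * nodes that have the lost checkpoint open. Using the lowest node id
 * is one simple deterministic tie-breaker. */
unsigned pick_volunteer(const unsigned *node_ids, int n)
{
    unsigned best = node_ids[0];
    for (int i = 1; i < n; i++)
        if (node_ids[i] < best)
            best = node_ids[i];
    return best;
}
```

Because every CPND computes the same answer from the same membership information, no extra coordination message is needed to agree on the volunteer.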
For other cases, where the checkpoint replicas survive, the checkpoint is restored when the SC is up again.

Ex: A checkpoint is created on a PL. There are 3 replicas created on the SCs and the PL. The headless state happens. After the SC is up, the checkpoint is recovered.

-AVM

On 2/26/2016 8:11 AM, Nhat Pham wrote:

Hi,

Please see my comment below with [NhatPham4]

Best regards,
Nhat Pham

*From:* Anders Widell [mailto:[email protected]]
*Sent:* Thursday, February 25, 2016 9:25 PM
*To:* A V Mahesh <[email protected]>; Nhat Pham <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi!

See my comments inline, marked [AndersW4].

regards,
Anders Widell

On 02/25/2016 05:26 AM, A V Mahesh wrote:

Hi Nhat Pham,

Please see my comment below.

-AVM

On 2/25/2016 7:54 AM, Nhat Pham wrote:

Hi Mahesh,

Would you agree with the comment below?

To summarize, the comments so far are as follows:

*Comment 1*: This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM, attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed").

Action: The code will be updated accordingly.

[AndersW4] Just a question here: is this really needed? If the code is already 100% backwards compatible when the headless feature is disabled, what would be the point of reading the configuration and taking different paths in the code depending on it? Maybe the code is not 100% backwards compatible, and then I agree that we need to read the configuration.
The reason why I am asking is that I had the impression that the code would only cause different behaviour in the cases where both system controllers die at the same time, and this cannot happen when the headless feature is disabled (or rather: it can happen, but it would trigger an immediate cluster restart, so any difference in behaviour after that point is irrelevant).

[NhatPham4] The code is backwards compatible when the headless feature is disabled.

For the V2 patch, cpnd will update cpd with recoverable checkpoint data when the SC comes up after the headless state (from an implementation point of view). In the current system, if the headless feature is disabled, the whole cluster reboots, so all data is destroyed.

For the V2 patch plus the scAbsenceAllowed check, cpnd destroys all the checkpoint data, as in the original implementation (from an implementation point of view). In the current system, if the headless feature is disabled, the whole cluster reboots, so all data is destroyed.

So if you ask whether the check is really needed in the current situation, the answer is: not really. The check is just to make sure that all checkpoint data is destroyed in case the headless feature is disabled. What do you think?

*Comment 2*: Keep the scope of the CPSV service as non-collocated checkpoint creation NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; currently it is not configurable, so there is no chance of a run-time configuration change).

Action: No change in code. CPSV will keep supporting non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.

>> [AndersW3] No, I think we ought to support non-colocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system controllers" is an implementation detail of OpenSAF.
I don't think the CKPT SAF specification implies that non-colocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must allow for the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way to the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.

[AVM] I am not in agreement with either comment. We cannot handle this in a way similar to the section-expiration case here: in the case of section expiration the checkpoint replica still exists, only the section is deleted.

[AndersW4] If this is a problem we could re-create the checkpoint with no sections in it.

The CPSV specification says that if two replicas exist at a certain point in time (in our case only on the SCs), and the nodes hosting both of these replicas are administratively taken out of service, the Checkpoint Service should allocate another replica on another node while those nodes are not available. Please check section `3.1.7.2 Non-Collocated Checkpoints` of the cpsv specification.

[AndersW4] The spec actually says "may" rather than "should" in this section. And the purpose of allocating another replica is to "enhance the availability of checkpoints". When I read this section, I think it is quite clear that the spec does not perceive non-colocated checkpoints as guaranteed to preserve data in the case of node failures:

"The Checkpoint Service may create replicas other than the ones that may be created when opening a checkpoint. These other replicas can be useful to enhance the availability of checkpoints. For example, if two replicas exist at a certain point in time, and the node hosting one of these replicas is administratively taken out of service, the Checkpoint Service may allocate another replica on another node while this node is not available."
So, data can be lost due to (multiple) node failures. There are two other cases where data is lost: automatic deletion of the entire checkpoint if it has not been opened by any process for the duration of the retention time, and automatic deletion of sections within a checkpoint when the sections reach their expiration times. The APIs specify the return code SA_AIS_ERR_NOT_EXIST to signal that a specific section, or the entire checkpoint, doesn't exist. Thus, there is support in the API for reporting loss of checkpoint data (whatever the reason for the loss may be). If the headless feature is disabled, we cannot lose non-colocated checkpoints due to node failures, but when the headless feature is enabled we can.

For example, take the case of an application on a PL that is in the middle of writing to non-collocated checkpoint sections (the physical replicas exist only on the SCs): what will happen to the application on the PL? OK, let us say the user agrees to lose the checkpoint and wants to re-create it: what will happen to the cpnd DB on the PL, and what is the complexity involved in cleaning it up? This will lead to a lot of maintainability issues.

[AndersW4] The thing that will happen (from an application's perspective) is that you will get the SA_AIS_ERR_NOT_EXIST error code from the CKPT API when trying to access the lost checkpoint. I don't know the complexity at the code level for implementing this, but isn't this already supported by the code which is out on review (Nhat, correct me if I am wrong)?

[NhatPham4] To be more correct, the application will get SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint, because all data was destroyed. But for opening the checkpoint (not creating it), it will get SA_AIS_ERR_NOT_EXIST.
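From the application's side, the error handling discussed above boils down to mapping the AIS return code to a recovery action. A minimal sketch of that mapping (the numeric values follow the SAF AIS specifications, e.g. 31 for the SA_AIS_ERR_UNAVAILABLE seen later in the thread; the classifier itself is a hypothetical helper, not part of the patch):

```c
#include <assert.h>

/* Subset of the SAF AIS return codes discussed in this thread,
 * with their spec-defined numeric values. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_BAD_HANDLE = 9,
    SA_AIS_ERR_NOT_EXIST = 12,
} SaAisErrorT;

typedef enum {
    READ_OK,            /* data is there */
    REOPEN_CHECKPOINT,  /* handle died with the checkpoint data */
    RECREATE_SECTION,   /* section (or whole checkpoint) is gone */
    RETRY_LATER         /* e.g. TRY_AGAIN while the cluster is headless */
} RecoveryAction;

/* Map a checkpoint access result to the application's next step, per
 * the behaviour described above: BAD_HANDLE when the checkpoint node
 * data was destroyed, NOT_EXIST when opening a lost checkpoint. */
RecoveryAction classify_access_result(SaAisErrorT rc)
{
    switch (rc) {
    case SA_AIS_OK:             return READ_OK;
    case SA_AIS_ERR_BAD_HANDLE: return REOPEN_CHECKPOINT;
    case SA_AIS_ERR_NOT_EXIST:  return RECREATE_SECTION;
    default:                    return RETRY_LATER;
    }
}
```

The point of the classification is that both loss cases already have defined API signals, so an application written against the spec can recover without any CPSV-specific knowledge.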
On top of that, the CKPT SAF specification only says that a non-collocated checkpoint and all its sections should survive as long as the Checkpoint Service is running on the cluster. The replica is USER private data (not OpenSAF state), and losing any USER private data is not acceptable.

*Comment 3*: This is about the case where the checkpoint node director (cpnd) crashes during the headless state. In this case the cpnd can't finish starting because it can't initialize the CLM service. Then, after a timeout, AMF triggers a restart again. Finally, the node is rebooted. It is expected that this problem should not lead to a node reboot.

Action: No change in code. This is a limitation of the system during the headless state.

[AVM] Code changes are required: the CPSV CLM integration code needs to be revisited to handle TRY_AGAIN.

If you agree with the summary above, I'll update the code and send out V3 for review.

Best regards,
Nhat Pham

*From:* Anders Widell [mailto:[email protected]]
*Sent:* Wednesday, February 24, 2016 9:26 PM
*To:* Nhat Pham <[email protected]>; 'A V Mahesh' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

See my comments inline, marked [AndersW3].

regards,
Anders Widell

On 02/24/2016 07:32 AM, Nhat Pham wrote:

Hi Mahesh and Anders,

Please see my comments below.
Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Wednesday, February 24, 2016 11:06 AM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

If component (CPND) restart is allowed while the controllers are absent, and we change the return value to SA_AIS_ERR_TRY_AGAIN before requesting CLM, then we need to get clarification from the AMF guys on a few things. If CPND is stuck on SA_AIS_ERR_TRY_AGAIN and the component-restart timer expires, AMF will restart the component again (this becomes cyclic), and after the configured saAmfSGCompRestartMax value the node goes for reboot as the next level of escalation. In that case we may need changes in AMF as well, to not act on the component-restart timeout while the controllers are absent (I am not sure whether that would be a deviation from the AMF specification).

[Nhat Pham] In the headless state, I'm not sure about this either.
@Anders: Would you have comments about this?

[AndersW3] OK, first of all I would like to point out that normally, the OpenSAF checkpoint node director should not crash. So we are talking about a situation where multiple faults have occurred: first both the active and the standby system controllers have died, and then shortly afterwards, before we have a new active system controller, the checkpoint node director also crashes. Sure, these may not be totally independent events, but still there are a lot of faults that have happened within a short period of time.
We should test the node director and make sure it doesn't crash in this type of scenario.

Now, let's consider the case where we have a fault in the node director that causes it to crash during the headless state. The general philosophy of the headless feature is that when things work fine, i.e. in the absence of faults, we should be able to continue running while the system controllers are absent. However, if a fault happens during the headless state, we may not be able to recover from the fault until there is an active system controller. AMF does provide support for restarting components, but as you have pointed out, the node director will be stuck in a TRY_AGAIN loop immediately after it has been restarted. So this means that if the node director crashes during the headless state, we have lost the checkpoint functionality on that node, and we will not get it back until there is an active system controller. Other services like IMM will still work for a while, but AMF will, as you say, eventually escalate the checkpoint node director failure to a node restart, and then the whole node is gone. The node will not come back until we have an active system controller. So to summarize: there is very limited support for recovering from faults that happen during the headless state. Full recovery will not happen until we have an active system controller.

[AVM] Please do incorporate the current comments (from a design perspective) and republish the patch; I will re-test the V3 patch and provide review comments on functional issues/bugs if I find any.

One important note: in the new patch, let us not have the complexity of allowing non-collocated checkpoint creation and then documenting that in some scenarios the non-collocated checkpoint replicas are not recoverable, because a replica is USER private data (not OpenSAF state), and losing USER private data is not acceptable.
So let us keep the scope of the CPSV service as non-collocated checkpoint creation NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; currently it is not configurable, so there is no chance of a run-time configuration change).

We can provide support for non-collocated checkpoints in subsequent enhancements, with a solution such as also creating the non-collocated replica on the PL with the lowest node ID (max three replicas in the cluster, regardless of where the non-collocated checkpoint is opened).

So for now, regardless of whether the heads (the SCs) exist or not, CPSV should return SA_AIS_ERR_NOT_SUPPORTED in a cluster with IMMSV_SC_ABSENCE_ALLOWED enabled, and let us document it as well.

[Nhat Pham] The patch is to limit losing replicas and checkpoints in case of the headless state.

In case both replicas are located on the SCs and they reboot, losing the checkpoint is unpreventable with the current design after the headless state.

Even if we implement the proposal "max three replicas in the cluster regardless of where the non-collocated checkpoint is opened", there is still a case where the checkpoint is lost. Ex: the SCs and the PL which hosts the replica reboot at the same time.

In case IMMSV_SC_ABSENCE_ALLOWED is disabled, if both SCs reboot, this leads to a whole-cluster reboot. Then the checkpoint is lost.

What I mean is that there are cases where the checkpoint is lost. The point is what we can do to limit losing data.

As for the proposal to reject creating non-collocated checkpoints in case IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to an incompatibility problem.

@Anders: What do you think about rejecting creation of non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled?

[AndersW3] No, I think we ought to support non-colocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set.
The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-colocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must allow for the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way to the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.

-AVM

On 2/24/2016 6:51 AM, Nhat Pham wrote:

Hi Mahesh,

Do you have any further comments?

Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Monday, February 22, 2016 10:37 AM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi,

>> BTW, have you finished the review and test?

I will finish by today.

-AVM

On 2/22/2016 7:48 AM, Nhat Pham wrote:

Hi Mahesh and Anders,

Please see my comment below.

BTW, have you finished the review and test?
Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Friday, February 19, 2016 2:28 PM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>; 'Minh Chau H' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

On 2/19/2016 12:28 PM, Nhat Pham wrote:

Could you please give more detailed information about the steps to reproduce the problem below? Thanks.

Don't see this as a specific bug; we need to look at the issue from the point of view of a CLM-integrated service. Considering Anders Widell's explanation of CLM application behaviour during the headless state, we need to re-integrate CPND with CLM (before this headless-state feature there was no case of CPND existing in the absence of CLMD, but now there is).

And this will have to be consistent across all the services that integrate with CLM (you may need some changes in CLM as well).

[Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case. @Anders: What do you think?

To start with, let us consider the case where CPND is restarted on a PL during the headless state and an application is running on the PL.

[Nhat Pham] Regarding CPND as a CLM application, I'm not sure what it can do in this case. If it restarts, it is monitored by AMF. If it blocks for too long, AMF will also trigger a node reboot.

In my test case, CPND gets blocked by CLM. It doesn't get out of saClmInitialize. How do you get the "ER cpnd clm init failed with return value:31"?

Following is the cpnd trace:

Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init

-AVM

On 2/19/2016 12:28 PM, Nhat Pham wrote:

Hi Mahesh,

Could you please give more detailed information about the steps to reproduce the problem below? Thanks.

Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Friday, February 19, 2016 1:06 PM
*To:* Anders Widell <[email protected]>; Nhat Pham <[email protected]>; 'Minh Chau H' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Anders Widell,
Thanks for the detailed explanation about CLM during the headless state.
HI Nhat Pham,

Comment 3: Please see below the problem I anticipated; I am now seeing it during CLMD absence (during the headless state). So CPND/CLMA now needs to address the case below: currently the cpnd CLM init fails with return value SA_AIS_ERR_UNAVAILABLE, but it should be SA_AIS_ERR_TRY_AGAIN.

==================================================
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
*Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed*
Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
==================================================

-AVM

On 2/15/2016 5:11 PM, Anders Widell wrote:

Hi!

Please find my answer inline, marked [AndersW].

regards,
Anders Widell

On 02/15/2016 10:38 AM, Nhat Pham wrote:

Hi Mahesh,

It's good. Thank you. :)

[AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4.
(Note: this comment is based on your explanation; I have not yet reviewed/tested. Currently I am struggling with the SCs not rejoining after the headless state; I can provide more on this once I complete my review/testing.)

[Nhat] To make cloud resilience work, you need the patches from the other services (log, amf, clm, ntf). @Minh: I heard that you created a tar file which includes all the patches. Could you please send it to Mahesh? Thanks.

[AVM] I understand that. Before I comment more on this, please allow me to understand it; I am still not very clear on the headless design in detail. For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?

[Nhat] I don't know much about this. @Anders: Could you please comment on this? Thanks.

[AndersW] First of all, keep in mind that the "headless" state should ideally not last a very long time. Once we have the spare SC feature in place (ticket [#79]), a new SC should become active within a matter of a few seconds after we have lost both the active and the standby SC.

I think you should view the state of the cluster in the headless state in the same way as you view the state of the cluster during a failover between the active and the standby SC. Imagine that the active SC dies. It takes the standby SC 1.5 seconds to detect the failure of the active SC (this is due to the TIPC timeout). If you have configured the PROMOTE_ACTIVE_TIMER, there is an additional delay before the standby takes over as active. What is the state of the cluster during the time after the active SC failed and before the standby takes over?

The state of the cluster while it is headless is very similar.
The difference is that this state may last a little bit longer (though not more than a few seconds, until one of the spare SCs becomes active). Another difference is that we may have lost some state. With a "perfect" implementation of the headless feature we should not lose any state at all, but with the current set of patches we do lose state.

So specifically, if we talk about cluster membership and ask the question: is a particular PL a member of the cluster or not during the headless state? Well, if you ask CLM about this during the headless state, then you will not know, because CLM doesn't provide any service during the headless state. If you keep retrying your query to CLM, you will eventually get an answer, but you will not get this answer until there is an active SC again and we have exited the headless state. When viewed in this way, the answer to the question about a node's membership is undefined during the headless state, since CLM will not provide you with any answer until there is an active SC.

However, if you asked CLM about the node's cluster membership status before the cluster went headless, you probably saved a cached copy of the cluster membership state. Maybe you also installed a CLM track callback and intend to update this cached copy every time the cluster membership status changes. The question then is: can you continue using this cached copy of the cluster membership state during the headless state? The answer is YES: since CLM doesn't provide any service during the headless state, it also means that the cluster membership view cannot change during this time. Nodes can of course reboot or die, but CLM will not notice, and hence the cluster view will not be updated. You can argue that this is bad because the cluster view doesn't reflect reality, but notice that this will always be the case.
We can never propagate information instantaneously, and detection of node failures will take 1.5 seconds due to the TIPC timeout. You can never be sure that a node is alive at this very moment just because CLM tells you that it is a member of the cluster. If we are unfortunate enough to lose both system controller nodes simultaneously, updates to the cluster membership view will be delayed a few seconds longer than usual.

Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Monday, February 15, 2016 11:19 AM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

How was your holiday?

Please find my comments below.

On 2/15/2016 8:43 AM, Nhat Pham wrote:

Hi Mahesh,

For comment 1, the patch will be updated accordingly.

[AVM] Please hold; I will provide more comments this week, so we can have a consolidated V3.

For comment 2, I think the CKPT service will not be backward compatible if scAbsenceAllowed is true. The client can't create a non-collocated checkpoint on the SCs.

Furthermore, this solution only protects the CKPT service from the case "the non-collocated checkpoint is created on an SC"; there are still cases where the replicas are completely lost. Ex:

- The non-collocated checkpoint is created on a PL. The PL reboots. Both replicas now locate on the SCs. Then the headless state happens. All replicas are lost.
- The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
- The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.

[AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested. Currently I am struggling with the SCs not rejoining after the headless state; I can provide more on this once I complete my review/testing.)

In this case, all replicas are lost and the client has to create the checkpoint again.

In case multiple nodes (including the SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the cases where it is possible. What do you think?

[AVM] I understand that. Before I comment more on this, please allow me to understand it; I am still not very clear on the headless design in detail.

For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
> - If they are NOT considered cluster nodes: the Checkpoint Service
>   should leverage the SA Forum Cluster Membership Service, and APIs
>   can fail with SA_AIS_ERR_UNAVAILABLE.
> - If they ARE considered cluster nodes: we need to follow all the
>   rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>
> So give me some more time to review it completely, so that we can have
> a consolidated patch V3.
>
> -AVM
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, February 12, 2016 11:10 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]; Beatriz Brandao
> <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Comment 2:
>
> After incorporating comment 1, all the limitations should be prevented
> based on whether the Hydra configuration is enabled in IMM.
>
> For example: if some application tries to create a non-collocated
> checkpoint whose active replica would be generated/located on an SC,
> then, regardless of whether the heads (SCs) exist or not, the call
> should return SA_AIS_ERR_NOT_SUPPORTED.
>
> In other words, rather than allowing a non-collocated checkpoint to be
> created while the heads (SCs) exist, only for it to become
> unrecoverable after the heads (SCs) rejoin.
>
> =============================================================================
>
> Limitation: The CKPT service doesn't support recovering checkpoints in
> the following cases:
> - The checkpoint which is unlinked before headless.
> - The non-collocated checkpoint has its active replica located on an
>   SC.
> - The non-collocated checkpoint has its active replica located on a
>   PL, and this PL restarts during headless state.
>
> In these cases, the checkpoint replica is destroyed. The fault code
> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
> checkpoint, and the client must re-open the checkpoint.
>
> =============================================================================
>
> -AVM
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
> Hi,
>
> I just started reviewing the patch; I will give comments as soon as I
> come across any, to save some time.
>
> Comment 1:
> This functionality should be guarded by a check of whether the Hydra
> configuration is enabled in IMM:
> attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
> Please see how the LOG/AMF services implemented it as an example.
>
> -AVM
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
> Hi Mahesh,
>
> As described in the README, the CKPT service returns the
> SA_AIS_ERR_TRY_AGAIN fault code in this case. I guess it's the same
> for other services.
>
> @Anders: Could you please confirm this?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 2:11 PM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> - The behavior of the application will be consistent with other SAF
>   services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
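The "client must re-open the checkpoint" requirement above can be
sketched as a small client-side wrapper. This is an illustrative sketch
only: the `stub_ckpt_read`/`stub_ckpt_reopen` functions and the local
type/constant definitions below are simplified stand-ins, not the real
`saCkptCheckpointRead`/`saCkptCheckpointOpen` API from
SAI-AIS-CKPT-B.02.02 or the OpenSAF implementation.

```c
#include <stddef.h>

/* Simplified local stand-ins for SAF AIS types and error codes
 * (values as in saAis.h); NOT the real <saCkpt.h> declarations. */
typedef int SaAisErrorT;
typedef long long SaCkptCheckpointHandleT;
enum { SA_AIS_OK = 1, SA_AIS_ERR_BAD_HANDLE = 9 };

/* Stub: reads fail with BAD_HANDLE while the replica is "lost",
 * simulating the limitation described in the thread above. */
static int replica_lost = 1;

static SaAisErrorT stub_ckpt_read(SaCkptCheckpointHandleT hdl)
{
	(void)hdl;
	return replica_lost ? SA_AIS_ERR_BAD_HANDLE : SA_AIS_OK;
}

/* Stub: re-opening re-creates the checkpoint replica. */
static SaAisErrorT stub_ckpt_reopen(SaCkptCheckpointHandleT *hdl)
{
	replica_lost = 0;
	*hdl = 42; /* pretend a fresh handle was returned */
	return SA_AIS_OK;
}

/* Client-side pattern: on SA_AIS_ERR_BAD_HANDLE, re-open the
 * checkpoint once and retry the access. */
SaAisErrorT read_with_reopen(SaCkptCheckpointHandleT *hdl)
{
	SaAisErrorT rc = stub_ckpt_read(*hdl);

	if (rc == SA_AIS_ERR_BAD_HANDLE) {
		rc = stub_ckpt_reopen(hdl);
		if (rc == SA_AIS_OK)
			rc = stub_ckpt_read(*hdl);
	}
	return rc;
}
```

A real client would call the SAF Checkpoint API with the original
creation attributes when re-opening; the point is only that the
application, not the service, owns the recovery step in this case.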
> In the absence of the Directors (SCs), what should the expected
> return values of the SAF APIs be (for all services) when they are not
> in a position to provide service at that moment?
>
> I think all services should return the same SAF errors. I think we
> currently don't have that; maybe Anders Widell will help us.
>
> -AVM
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> Hi Mahesh,
>
> Please see the attachment for the README. Let me know if there is any
> more information required.
>
> Regarding your comments:
> - During headless state applications may behave like during the CPND
>   restart case.
> [Nhat] Headless state and CPND restart are different events. Thus, the
> behavior is different. Headless state is a case where both SCs go
> down.
>
> - The behavior of the application will be consistent with other SAF
>   services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 11:12 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> I started reviewing this patch, so can you please provide a README
> file with scope and limitations? That will help to define the
> testing/reviewing scope.
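Regarding the SA_AIS_ERR_TRY_AGAIN convention discussed in this thread,
the usual client-side handling is a bounded retry loop with a delay. A
minimal sketch, assuming a hypothetical `invoke` callback and locally
defined error constants (not real OpenSAF declarations):

```c
#include <unistd.h>

/* Simplified local stand-ins for SAF AIS error codes (values as in
 * saAis.h); NOT real OpenSAF declarations. */
typedef int SaAisErrorT;
enum { SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6 };

/* Retry an AIS call while it reports TRY_AGAIN, sleeping between
 * attempts, up to max_retries extra attempts. 'invoke' is a
 * hypothetical callback standing in for any AIS API call. */
SaAisErrorT retry_while_try_again(SaAisErrorT (*invoke)(void *),
				  void *arg, int max_retries,
				  unsigned delay_ms)
{
	SaAisErrorT rc = invoke(arg);

	while (rc == SA_AIS_ERR_TRY_AGAIN && max_retries-- > 0) {
		usleep(delay_ms * 1000); /* back off, then retry */
		rc = invoke(arg);
	}
	return rc;
}

/* Demo stub: reports TRY_AGAIN twice (e.g. while no SC is available),
 * then succeeds once "an SC is available again". */
static int attempts;

static SaAisErrorT demo_call(void *arg)
{
	(void)arg;
	return (++attempts < 3) ? SA_AIS_ERR_TRY_AGAIN : SA_AIS_OK;
}
```

If all services returned TRY_AGAIN consistently during headless state,
one wrapper like this would cover every blocked API call.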
> Following are the minimum things we can keep in mind while
> reviewing/accepting the patch:
>
> - Not affecting existing functionality.
> - During headless state applications may behave like during the CPND
>   restart case.
> - The minimum functionality of the application works.
> - The behavior of the application will be consistent with other SAF
>   services like IMM/AMF behavior during headless state.
>
> So please do provide any additional details in the README if any of
> the above is deviated from; that allows users to know about the
> limitations/deviations.
>
> -AVM
>
> On 1/4/2016 3:15 PM, Nhat Pham wrote:
> Summary: cpsv: Support preserving and recovering checkpoint replicas
> during headless state [#1621]
> Review request for Trac Ticket(s): #1621
> Peer Reviewer(s): [email protected]; [email protected]
> Pull request to: [email protected]
> Affected branch(es): default
> Development branch: default
>
> --------------------------------
> Impacted area        Impact y/n
> --------------------------------
> Docs                     n
> Build system             n
> RPM/packaging            n
> Configuration files      n
> Startup scripts          n
> SAF services             y
> OpenSAF services         n
> Core libraries           n
> Samples                  n
> Tests                    n
> Other                    n
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
> Author:    Nhat Pham <[email protected]>
> Date:      Mon, 04 Jan 2016 16:34:33 +0700
>
>     cpsv: Support preserving and recovering checkpoint replicas during
>     headless state [#1621]
>
>     Background:
>     ----------
>     This enhancement supports preserving checkpoint replicas in case
>     both SCs go down (headless state) and recovering replicas when one
>     of the SCs comes up
>     again. If both SCs go down, checkpoint replicas on surviving nodes
>     still remain. When an SC is available again, surviving replicas
>     are automatically registered to the SC checkpoint database.
>     Content in surviving replicas is intact and synchronized to new
>     replicas.
>
>     When no SC is available, client API calls changing checkpoint
>     configuration, which require SC communication, are rejected.
>     Client API calls reading and writing existing checkpoint replicas
>     still work.
>
>     Limitation: The CKPT service does not support recovering
>     checkpoints in the following cases:
>     - The checkpoint which is unlinked before headless.
>     - The non-collocated checkpoint has its active replica located on
>       an SC.
>     - The non-collocated checkpoint has its active replica located on
>       a PL, and this PL restarts during headless state.
>     In these cases, the checkpoint replica is destroyed. The fault
>     code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses
>     the checkpoint, and the client must re-open the checkpoint.
>
>     While in headless state, accessing checkpoint replicas does not
>     work if the node which hosts the active replica goes down. It will
>     resume working when an SC is available again.
>
>     Solution:
>     ---------
>     The solution for this enhancement includes 2 parts:
>
>     1. To destroy the un-recoverable checkpoints described above when
>        both SCs are down: when both SCs are down, the CPND deletes
>        un-recoverable checkpoint nodes and replicas on PLs. Then it
>        requests CPA to destroy the corresponding checkpoint node by
>        using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>
>     2.
>        To update CPD with checkpoint information: when an active SC is
>        up after headless, CPND will update CPD with the checkpoint
>        information by using the new message
>        CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
>        CPD_EVT_ND2D_CKPT_CREATE. This is because a new ckpt_id, which
>        might be different from the current ckpt_id, would be created
>        for the checkpoint if CPD_EVT_ND2D_CKPT_CREATE were used. The
>        CPD collects checkpoint information within 6s. During this
>        updating time, the following requests are rejected with fault
>        code SA_AIS_ERR_TRY_AGAIN:
>        - CPD_EVT_ND2D_CKPT_CREATE
>        - CPD_EVT_ND2D_CKPT_UNLINK
>        - CPD_EVT_ND2D_ACTIVE_SET
>        - CPD_EVT_ND2D_CKPT_RDSET
>
>     Complete diffstat:
>     ------------------
>     osaf/libs/agents/saf/cpa/cpa_proc.c        |  52 +++++
>     osaf/libs/common/cpsv/cpsv_edu.c           |  43 ++++
>     osaf/libs/common/cpsv/include/cpd_cb.h     |   3 ++
>     osaf/libs/common/cpsv/include/cpd_imm.h    |   1 +
>     osaf/libs/common/cpsv/include/cpd_proc.h   |   7 ++
>     osaf/libs/common/cpsv/include/cpd_tmr.h    |   3 +-
>     osaf/libs/common/cpsv/include/cpnd_cb.h    |   1 +
>     osaf/libs/common/cpsv/include/cpnd_init.h  |   2 +
>     osaf/libs/common/cpsv/include/cpsv_evt.h   |  20 ++
>     osaf/services/saf/cpsv/cpd/Makefile.am     |   3 +-
>     osaf/services/saf/cpsv/cpd/cpd_evt.c       | 229 ++++++++++++
>     osaf/services/saf/cpsv/cpd/cpd_imm.c       | 112 ++++++
>     osaf/services/saf/cpsv/cpd/cpd_init.c      |  20 +-
>     osaf/services/saf/cpsv/cpd/cpd_proc.c      | 309
>     ++++++++++++++
>     osaf/services/saf/cpsv/cpd/cpd_tmr.c       |   7 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_db.c      |  16 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_evt.c     |  22 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_init.c    |  23 +-
>     osaf/services/saf/cpsv/cpnd/cpnd_mds.c     |  13 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_proc.c    | 314 ++++++++++---
>
>     20 files changed, 1189 insertions(+), 11 deletions(-)
>
>     Testing Commands:
>     -----------------
>     -
>
>     Testing, Expected Results:
>     --------------------------
>     -
>
>     Conditions of Submission:
>     -------------------------
>     <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>     Arch      Built  Started  Linux distro
>     -------------------------------------------
>     mips        n      n
>     mips64      n      n
>     x86         n      n
>     x86_64      n      n
>     powerpc     n      n
>     powerpc64   n      n
>
>     Reviewer Checklist:
>     -------------------
>     [Submitters: make sure that your review doesn't trigger any
>     checkmarks!]
>
>     Your checkin has not passed review because (see checked entries):
>
>     ___ Your RR template is generally incomplete; it has too many
>         blank entries that need proper data filled in.
>     ___ You have failed to nominate the proper persons for review and
>         push.
>     ___ Your patches do not have proper short+long header.
>     ___ You have grammar/spelling in your header that is unacceptable.
>     ___ You have exceeded a sensible line length in your
>         headers/comments/text.
>     ___ You have failed to put a proper Trac Ticket # into your
>         commits.
>     ___ You have incorrectly put/left internal data in your
>         comments/files (i.e. internal bug tracking tool IDs, product
>         names etc).
>     ___ You have not given any evidence of testing beyond basic build
>         tests. Demonstrate some level of runtime or other sanity
>         testing.
>     ___ You have ^M present in some of your files. These have to be
>         removed.
>     ___ You have needlessly changed whitespace or added whitespace
>         crimes like trailing spaces, or spaces before tabs.
>     ___ You have mixed real technical changes with whitespace and
>         other cosmetic code cleanup changes. These have to be separate
>         commits.
>     ___ You need to refactor your submission into logical chunks;
>         there is too much content in a single commit.
>     ___ You have extraneous garbage in your review (merge commits
>         etc).
>     ___ You have giant attachments which should never have been sent;
>         instead you should place your content in a public tree to be
>         pulled.
>     ___ You have too many commits attached to an e-mail; resend as
>         threaded commits, or place in a public tree for a pull.
>     ___ You have resent this content multiple times without a clear
>         indication of what has changed between each re-send.
>     ___ You have failed to adequately and individually address all of
>         the comments and change requests that were proposed in the
>         initial review.
>     ___ You have a misconfigured ~/.hgrc file (i.e. username, email
>         etc).
>     ___ Your computer has a badly configured date and time, confusing
>         the threaded patch review.
>     ___ Your changes affect IPC mechanism, and you don't present any
>         results for in-service upgradability test.
>     ___ Your changes affect user manual and documentation; your patch
>         series does not contain the patch that updates the Doxygen
>         manual.
