See my comments, marked [AndersW5].

regards,
Anders Widell

On 02/25/2016 11:09 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comment below with [NhatPham3]
>
> Best regards,
>
> Nhat Pham
>
> *From:*A V Mahesh [mailto:mahesh.va...@oracle.com]
> *Sent:* Thursday, February 25, 2016 2:14 PM
> *To:* Nhat Pham <nhat.p...@dektech.com.au>; 'Anders Widell' 
> <anders.wid...@ericsson.com>
> *Cc:* opensaf-devel@lists.sourceforge.net; 'Beatriz Brandao' 
> <beatriz.bran...@ericsson.com>; 'Minh Chau H' <minh.c...@dektech.com.au>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support 
> preserving and recovering checkpoint replicas during headless state V2 
> [#1621]
>
> Hi Nhat Pham,
>
> Please see my comment.
>
> -AVM
>
> On 2/25/2016 12:07 PM, Nhat Pham wrote:
>
>     Hi Mahesh,
>
>     Please see my comment below with [NhatPham2].
>
>     Best regards,
>
>     Nhat Pham
>
>     *From:* A V Mahesh [mailto:mahesh.va...@oracle.com]
>     *Sent:* Thursday, February 25, 2016 11:26 AM
>     *To:* Nhat Pham <nhat.p...@dektech.com.au>
>     <mailto:nhat.p...@dektech.com.au>; 'Anders Widell'
>     <anders.wid...@ericsson.com> <mailto:anders.wid...@ericsson.com>
>     *Cc:* opensaf-devel@lists.sourceforge.net
>     <mailto:opensaf-devel@lists.sourceforge.net>; 'Beatriz Brandao'
>     <beatriz.bran...@ericsson.com>
>     <mailto:beatriz.bran...@ericsson.com>; 'Minh Chau H'
>     <minh.c...@dektech.com.au> <mailto:minh.c...@dektech.com.au>
>     *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
>     preserving and recovering checkpoint replicas during headless
>     state V2 [#1621]
>
>     Hi Nhat Pham,
>
>     Please see my comment below.
>
>     -AVM
>
>     On 2/25/2016 7:54 AM, Nhat Pham wrote:
>
>         Hi Mahesh,
>
>         Would you agree with the comments below?
>
>         To summarize, the comments so far are as follows:
>
>         *Comment 1*: This functionality should be enabled only after
>         checking that the Hydra configuration is enabled in IMM, attrName =
>
>         const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>
>         Action: The code will be updated accordingly.
>
>         *Comment 2*: Limit the scope of the CPSV service so that
>         non-collocated checkpoint creation is NOT_SUPPORTED if the cluster
>         is running with IMMSV_SC_ABSENCE_ALLOWED (the headless state
>         configuration is enabled at cluster startup and is currently
>         not configurable afterwards, so there is no chance of a run-time
>         configuration change).
>
>         Action: No change in code. The CPSV keeps supporting
>         non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is
>         enabled.
>
>      >>[AndersW3] No, I think we ought to support non-colocated
>      >>checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact
>      >>that we have "system controllers" is an implementation detail of
>      >>OpenSAF. I don't think the CKPT SAF specification implies that
>      >>non-colocated checkpoints must be fully replicated on all the
>      >>nodes in the cluster, and thus we must have the possibility that
>      >>all replicas are lost. It is not clear exactly what to expect from
>      >>the APIs when this happens, but you could handle it in a similar
>      >>way as the case when all sections have been automatically
>      >>deleted by the checkpoint service because the sections have expired.
>
>     [AVM] I disagree with both comments. We cannot handle it in a way
>     similar to the section expiration case here: when sections expire,
>     the checkpoint replica still exists and only the sections are
>     deleted.
>
>     The CPSV specification says that if two replicas exist (in our case
>     only on the SCs) at a certain point in time, and the nodes hosting
>     both of these replicas are administratively taken out of service,
>     the Checkpoint Service should allocate another replica on another
>     node while this node is not available. Please check section
>     `3.1.7.2 Non-Collocated Checkpoints` of the CPSV specification.
>
>     For example, take the case where an application on a PL is in the
>     middle of writing to non-collocated checkpoint sections (the
>     physical replicas exist only on the SCs). What will happen to the
>     application on the PL? OK, let us say the user has agreed to lose
>     the checkpoint and wants to recreate it: what will happen to the
>     cpnd DB on the PL, and what about the complexity involved in
>     cleaning it up? This will lead to a lot of maintainability issues.
>
>     On top of that, the CKPT SAF specification only says that a
>     non-collocated checkpoint and all its sections should survive if
>     the Checkpoint Service is running on the cluster, and a replica is
>     USER private data (not OpenSAF state). Losing any USER private data
>     is not acceptable.
>
>     [NhatPham2] According to SAI-AIS-CKPT-B.02.02 (chapter 3.1.8
>     Persistence of Checkpoints):
>
>     “As has been stated in Section 2.1 on page 13, the Checkpoint
>     Service typically stores
>
>     checkpoint data in the main memory of the nodes. *Regardless of
>     the retention time, a *
>
>     *checkpoint and all its sections do not survive if the Checkpoint
>     Service stops running *
>
>     *on all nodes hosting replicas for this checkpoint. The stop of
>     the Checkpoint Service *
>
>     *can be caused by administrative actions or node failures*.”
>
>     This states that the checkpoint does not survive if the nodes
>     hosting its replicas fail (i.e. the SCs in our case).
>
> [AVM] If we read further, section `3.1.7.2 Non-Collocated Checkpoints`
> explains this with an example:
>
> "For example, if two replicas exist at a certain point in time, and 
> the node hosting one of these replicas is
> administratively taken out of service, the Checkpoint Service may 
> allocate another
> replica on another node while this node is not available."
>
> [NhatPham3] I think this example is there to support the idea of
> enhancing the availability of checkpoints by creating multiple replicas.
> Furthermore, it talks about administrative actions, whereas the headless
> state is about multiple node failures.
>
> @Anders: What do you think?
>
[AndersW5] Yes, as I already replied in the previous mail, I don't think 
the spec views non-colocated checkpoints as guaranteed to preserve data 
in the presence of (multiple) node failures. The only way to provide such 
a guarantee would be to have full replication of the checkpoint data on 
all nodes in the cluster, which would be expensive and wasteful.
>
>
>     Regarding the case you mentioned about the lost checkpoint, what
>     will happen to cpnd DB on PL.
>
>     With this patch the CPND detects un-recoverable checkpoints and
>     deletes them all from the DB in case the headless state happens.
>
> [AVM] I know. I was saying that maintaining such a flow, which is tied
> to the transport `no active timer`, will open up a lot of new issues in
> CPSV and become a code maintainability problem.
> For example:
>
> 1) If both SCs rejoin quickly ( below the `no active timer` timeout,
> I think it is currently  ), we will end up not deleting the DB.
> To address this we need to collect evidence to detect that the
> headless state happened.
>
> [NhatPham3] I’m not sure this is a real case. But if so, the problem
> affects the whole system, not just CPSV, regardless of the headless state.
>
> @Anders: What do you think?
>
[AndersW5] Not sure what problem you are referring to here. Do you mean 
that if the system controllers come back very quickly after headless, 
the node director has no way of knowing that we have been in the 
headless state? I thought it was possible to determine this from the MDS 
events you receive. If not, you may have to add this information to the 
protocol you use on top of MDS.
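
To make the idea concrete, here is a minimal sketch of how a node director could remember that it has passed through a headless period, however short, by counting up/down events. This is only an illustration under my own assumptions: DirectorEvent and HeadlessTracker are hypothetical names, and the real MDS callback delivers richer events than a plain up/down flag.

```cpp
// Hypothetical event kinds standing in for the MDS up/down notifications
// about the checkpoint directors on the system controllers.
enum class DirectorEvent { kUp, kDown };

// Tracks how many directors are currently reachable and remembers whether
// there has been a period with none at all (the headless state).
class HeadlessTracker {
 public:
  // Feed one event. Returns true exactly when a director comes back after
  // a period in which no director was reachable.
  bool OnEvent(DirectorEvent ev) {
    if (ev == DirectorEvent::kDown) {
      if (up_count_ > 0) --up_count_;
      if (up_count_ == 0) headless_seen_ = true;  // latch, even if brief
      return false;
    }
    ++up_count_;
    const bool recovered = headless_seen_;
    headless_seen_ = false;  // report each headless period only once
    return recovered;
  }

  // True while no director is reachable.
  bool headless() const { return up_count_ == 0; }

 private:
  unsigned up_count_ = 0;
  bool headless_seen_ = false;
};
```

The point of latching the flag on the down event is that the decision to clean up un-recoverable checkpoints would not depend on any timer having expired before the controllers return.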
>
>         *Comment 3*: This is about the case where the checkpoint node
>         director (cpnd) crashes during the headless state. In this case
>         the cpnd can’t finish starting because it can’t initialize the
>         CLM service.
>
>         Then, after a timeout, AMF triggers another restart.
>         Finally, the node is rebooted.
>
>         It is expected that this problem should not lead to a node reboot.
>
>         Action: No change in code. This is the limitation of the
>         system during headless state.
>
>
>     [AVM] Code changes are required: the CPSV CLM integration code
>     needs to be revisited to handle TRY_AGAIN.
>
>     [NhatPham2] Agreed. The CPND code will be updated to re-initialize
>     CLM when it gets the TRY_AGAIN error code.
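
A rough sketch of what such a re-initialization loop could look like is given below. It is not the CPND code: AisResult and InitWithRetry are names local to this example, the enum only mirrors the SaAisErrorT values discussed in this thread (SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6, SA_AIS_ERR_UNAVAILABLE = 31), and the CLM call is passed in as a callable so that the example stays self-contained.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Local stand-ins for the SaAisErrorT values involved in this discussion.
enum class AisResult { kOk = 1, kTryAgain = 6, kUnavailable = 31 };

// Calls `init` until it returns something other than kTryAgain or the
// attempt budget is used up. Returns the last result seen.
AisResult InitWithRetry(const std::function<AisResult()>& init,
                        int max_attempts,
                        std::chrono::milliseconds delay) {
  AisResult rc = AisResult::kTryAgain;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    rc = init();
    if (rc != AisResult::kTryAgain) break;  // success or a hard error
    // Sleep only if another attempt will follow.
    if (attempt + 1 < max_attempts) std::this_thread::sleep_for(delay);
  }
  return rc;
}
```

Note that a bounded retry alone does not solve the escalation problem discussed in this thread: if the total retry time exceeds the AMF component registration timeout, AMF will still treat the instantiation as failed.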
>
>         If you agree with the summary above, I’ll update code and send
>         out the V3 for review.
>
>         Best regards,
>
>         Nhat Pham
>
>         *From:* Anders Widell [mailto:anders.wid...@ericsson.com]
>         *Sent:* Wednesday, February 24, 2016 9:26 PM
>         *To:* Nhat Pham <nhat.p...@dektech.com.au>
>         <mailto:nhat.p...@dektech.com.au>; 'A V Mahesh'
>         <mahesh.va...@oracle.com> <mailto:mahesh.va...@oracle.com>
>         *Cc:* opensaf-devel@lists.sourceforge.net
>         <mailto:opensaf-devel@lists.sourceforge.net>; 'Beatriz
>         Brandao' <beatriz.bran...@ericsson.com>
>         <mailto:beatriz.bran...@ericsson.com>; 'Minh Chau H'
>         <minh.c...@dektech.com.au> <mailto:minh.c...@dektech.com.au>
>         *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
>         preserving and recovering checkpoint replicas during headless
>         state V2 [#1621]
>
>         See my comments inline, marked [AndersW3].
>
>         regards,
>         Anders Widell
>
>         On 02/24/2016 07:32 AM, Nhat Pham wrote:
>
>             Hi Mahesh and Anders,
>
>             Please see my comments below.
>
>             Best regards,
>
>             Nhat Pham
>
>             *From:* A V Mahesh [mailto:mahesh.va...@oracle.com]
>             *Sent:* Wednesday, February 24, 2016 11:06 AM
>             *To:* Nhat Pham <nhat.p...@dektech.com.au>
>             <mailto:nhat.p...@dektech.com.au>; 'Anders Widell'
>             <anders.wid...@ericsson.com>
>             <mailto:anders.wid...@ericsson.com>
>             *Cc:* opensaf-devel@lists.sourceforge.net
>             <mailto:opensaf-devel@lists.sourceforge.net>; 'Beatriz
>             Brandao' <beatriz.bran...@ericsson.com>
>             <mailto:beatriz.bran...@ericsson.com>; 'Minh Chau H'
>             <minh.c...@dektech.com.au> <mailto:minh.c...@dektech.com.au>
>             *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv:
>             Support preserving and recovering checkpoint replicas
>             during headless state V2 [#1621]
>
>             Hi Nhat Pham,
>
>             If a component (CPND) restart is allowed while the
>             controllers are absent, then before asking CLM to change
>             its return value to SA_AIS_ERR_TRY_AGAIN
>             we need clarification from the AMF developers on a few
>             things. If CPND is stuck on SA_AIS_ERR_TRY_AGAIN
>             and the component restart times out,
>             AMF will restart the component again (this becomes cyclic),
>             and after the configured saAmfSGCompRestartMax value the
>             node goes for a reboot as the next level of escalation.
>             In that case we may require changes in AMF as well, so
>             that it does not act on the component restart timeout
>             while the controllers are absent (I am not sure whether
>             that is a deviation from the AMF specification).
>
>             */[Nhat Pham] In headless state, I’m not sure about this
>             either. /*
>
>             */@Anders: Would you have comments about this?/*
>
>         [AndersW3] Ok, first of all I would like to point out that
>         normally, the OpenSAF checkpoint node director should not
>         crash. So we are talking about a situation where multiple
>         faults have occurred: first both the active and the standby
>         system controllers have died, and then shortly afterwards -
>         before we have a new active system controller - the checkpoint
>         node director also crashes. Sure, these may not be totally
>         independent events, but still there are a lot of faults that
>         have happened within a short period of time. We should test
>         the node director and make sure it doesn't crash in this type
>         of scenario.
>
>         Now, let's consider the case where we have a fault in the node
>         director that causes it to crash during the headless state.
>         The general philosophy of the headless feature is that when
>         things work fine - i.e. in the absence of fault - we should be
>         able to continue running while the system controllers are
>         absent. However, if a fault happens during the headless state,
>         we may not be able to recover from the fault until there is an
>         active system controller. AMF does provide support for
>         restarting components, but as you have pointed out, the node
>         director will be stuck in a TRY_AGAIN loop immediately after
>         it has been restarted. So this means that if the node director
>         crashes during the headless state, we have lost the checkpoint
>         functionality on that node and we will not get it back until
>         there is an active system controller. Other services like IMM
>         will still work for a while, but AMF will as you say
>         eventually escalate the checkpoint node director failure to a
>         node restart and then the whole node is gone. The node will
>         not come back until we have an active system controller. So to
>         summarize: there is very limited support for recovering from
>         faults that happen during the headless state. The full
>         recovery will not happen until we have an active system
>         controller.
>
>             Please incorporate the current comments (from a design
>             perspective) and republish the patch. I will re-test the V3
>             patch and provide review comments on functional issues/bugs
>             if I find any.
>
>             One important note: in the new patch, let us not add the
>             complexity of allowing non-collocated checkpoint creation
>             and then documenting that in some scenarios
>             non-collocated checkpoint replicas are recoverable,
>             because a replica is USER private data (not OpenSAF state)
>             and losing USER private data is not acceptable.
>             So let us limit the scope of the CPSV service: non-collocated
>             checkpoint creation is NOT_SUPPORTED if the cluster is running
>             with
>             IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration
>             is enabled at cluster startup and is currently not
>             configurable afterwards, so there is no chance of a run-time
>             configuration change).
>
>             We can provide support for non-collocated checkpoints in
>             subsequent enhancements, with a solution such as also
>             creating a replica on the PL with the lower node ID
>             (max three replicas in the cluster regardless
>             of where the non-collocated checkpoint is opened).
>
>             So for now, regardless of whether the heads (SCs) exist or
>             not, CPSV should return SA_AIS_ERR_NOT_SUPPORTED in
>             a cluster with IMMSV_SC_ABSENCE_ALLOWED enabled,
>             and let us document it as well.
>
>             */[Nhat Pham] The patch is meant to limit the loss of
>             replicas and checkpoints in the headless state./*
>
>             */If both replicas are located on the SCs and they reboot,
>             losing the checkpoint is unavoidable with the current design
>             after the headless state./*
>
>             */Even if we implement the proposal “/*max three replicas
>             in cluster regardless of where non-collocated is
>             opened*/”, there is still a case where the checkpoint is
>             lost, e.g. when the SCs and the PL that hosts the replica
>             reboot at the same time./*
>
>             */If /*IMMSV_SC_ABSENCE_ALLOWED is disabled and both SCs
>             reboot, the whole cluster reboots. Then the
>             checkpoint is lost.
>
>             */What I mean is that there are cases where the checkpoint
>             is lost. The point is what we can do to limit data loss./*
>
>             */As for the proposal to reject creation of non-collocated
>             checkpoints when/* IMMSV_SC_ABSENCE_ALLOWED is enabled,
>             I think this will lead to a compatibility problem.
>
>             */@Anders: What do you think about rejecting creation of
>             non-collocated checkpoints when
>             /*IMMSV_SC_ABSENCE_ALLOWED is enabled?
>
>         [AndersW3] No, I think we ought to support non-colocated
>         checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The
>         fact that we have "system controllers" is an implementation
>         detail of OpenSAF. I don't think the CKPT SAF specification
>         implies that non-colocated checkpoints must be fully
>         replicated on all the nodes in the cluster, and thus we must
>         have the possibility that all replicas are lost. It is not
>         clear exactly what to expect from the APIs when this happens,
>         but you could handle it in a similar way as the case when all
>         sections have been automatically deleted by the checkpoint
>         service because the sections have expired.
>
>
>             -AVM
>
>             On 2/24/2016 6:51 AM, Nhat Pham wrote:
>
>                 Hi Mahesh,
>
>                 Do you have any further comments?
>
>                 Best regards,
>
>                 Nhat Pham
>
>                 *From:* A V Mahesh [mailto:mahesh.va...@oracle.com]
>                 *Sent:* Monday, February 22, 2016 10:37 AM
>                 *To:* Nhat Pham <nhat.p...@dektech.com.au>
>                 <mailto:nhat.p...@dektech.com.au>; 'Anders Widell'
>                 <anders.wid...@ericsson.com>
>                 <mailto:anders.wid...@ericsson.com>
>                 *Cc:* opensaf-devel@lists.sourceforge.net
>                 <mailto:opensaf-devel@lists.sourceforge.net>; 'Beatriz
>                 Brandao' <beatriz.bran...@ericsson.com>
>                 <mailto:beatriz.bran...@ericsson.com>; 'Minh Chau H'
>                 <minh.c...@dektech.com.au>
>                 <mailto:minh.c...@dektech.com.au>
>                 *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv:
>                 Support preserving and recovering checkpoint replicas
>                 during headless state V2 [#1621]
>
>                 Hi,
>
>                 >>BTW, have you finished the review and test?
>
>                 I will finish by today.
>
>                 -AVM
>
>                 On 2/22/2016 7:48 AM, Nhat Pham wrote:
>
>                     Hi Mahesh and Anders,
>
>                     Please see my comment below.
>
>                     BTW, have you finished the review and test?
>
>                     Best regards,
>
>                     Nhat Pham
>
>                     *From:* A V Mahesh [mailto:mahesh.va...@oracle.com]
>                     *Sent:* Friday, February 19, 2016 2:28 PM
>                     *To:* Nhat Pham <nhat.p...@dektech.com.au>
>                     <mailto:nhat.p...@dektech.com.au>; 'Anders Widell'
>                     <anders.wid...@ericsson.com>
>                     <mailto:anders.wid...@ericsson.com>; 'Minh Chau H'
>                     <minh.c...@dektech.com.au>
>                     <mailto:minh.c...@dektech.com.au>
>                     *Cc:* opensaf-devel@lists.sourceforge.net
>                     <mailto:opensaf-devel@lists.sourceforge.net>;
>                     'Beatriz Brandao' <beatriz.bran...@ericsson.com>
>                     <mailto:beatriz.bran...@ericsson.com>
>                     *Subject:* Re: [PATCH 0 of 1] Review Request for
>                     cpsv: Support preserving and recovering checkpoint
>                     replicas during headless state V2 [#1621]
>
>                     Hi Nhat Pham,
>
>                     On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
>                         Could you please give more detailed
>                         information about steps to reproduce the
>                         problem below? Thanks.
>
>
>                     Don't see this as a specific bug. We need to look
>                     at the issue from the point of view of a
>                     CLM-integrated service. Considering Anders Widell's
>                     explanation of CLM application behavior during the
>                     headless state, we need to reintegrate CPND with
>                     CLM (before this headless state feature there was
>                     no case of CPND existing in the absence of CLMD,
>                     but now there is).
>
>                     And this should be consistent across all
>                     services that are integrated with CLM (you may need
>                     some changes in CLM as well).
>
>                     */[Nhat Pham] I think CLM should return
>                     /*SA_AIS_ERR_TRY_AGAIN in this case.
>
>                     @Anders: What do you think?
>
>                     To start with, let us consider the case where CPND
>                     is restarted on a PL during the headless state
>                     while an application is running on that PL.
>
>                     */[Nhat Pham] Regarding the CPND as a CLM
>                     application, I’m not sure what it can do in this
>                     case. If it restarts, it is monitored by AMF./*
>
>                     */If it blocks for too long, AMF will also trigger
>                     a node reboot./*
>
>                     */In my test case, the CPND gets blocked by CLM. It
>                     never returns from saClmInitialize. How do you
>                     get the “/ER cpnd clm init failed with return
>                     value:31/”?/*
>
>                     */The following is the cpnd trace./*
>
>                     Feb 22  8:56:41.188122 osafckptnd
>                     [736:cpnd_init.c:0183] >> cpnd_lib_init
>
>                     Feb 22  8:56:41.188332 osafckptnd
>                     [736:cpnd_init.c:0412] >> cpnd_cb_db_init
>
>                     Feb 22  8:56:41.188600 osafckptnd
>                     [736:cpnd_init.c:0437] << cpnd_cb_db_init
>
>                     Feb 22  8:56:41.188778 osafckptnd
>                     [736:clma_api.c:0503] >> saClmInitialize
>
>                     Feb 22  8:56:41.188945 osafckptnd
>                     [736:clma_api.c:0593] >> clmainitialize
>
>                     Feb 22  8:56:41.190052 osafckptnd
>                     [736:clma_util.c:0100] >> clma_startup:
>                     clma_use_count: 0
>
>                     Feb 22  8:56:41.190273 osafckptnd
>                     [736:clma_mds.c:1124] >> clma_mds_init
>
>                     Feb 22  8:56:41.190825 osafckptnd
>                     [736:clma_mds.c:1170] << clma_mds_init
>
>                     -AVM
>
>                     On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
>                         Hi Mahesh,
>
>                         Could you please give more detailed
>                         information about steps to reproduce the
>                         problem below? Thanks.
>
>                         Best regards,
>
>                         Nhat Pham
>
>                         *From:* A V Mahesh
>                         [mailto:mahesh.va...@oracle.com]
>                         *Sent:* Friday, February 19, 2016 1:06 PM
>                         *To:* Anders Widell
>                         <anders.wid...@ericsson.com>
>                         <mailto:anders.wid...@ericsson.com>; Nhat Pham
>                         <nhat.p...@dektech.com.au>
>                         <mailto:nhat.p...@dektech.com.au>; 'Minh Chau
>                         H' <minh.c...@dektech.com.au>
>                         <mailto:minh.c...@dektech.com.au>
>                         *Cc:* opensaf-devel@lists.sourceforge.net
>                         <mailto:opensaf-devel@lists.sourceforge.net>;
>                         'Beatriz Brandao'
>                         <beatriz.bran...@ericsson.com>
>                         <mailto:beatriz.bran...@ericsson.com>
>                         *Subject:* Re: [PATCH 0 of 1] Review Request
>                         for cpsv: Support preserving and recovering
>                         checkpoint replicas during headless state V2
>                         [#1621]
>
>                         Hi Anders Widell,
>                         Thanks for the detailed explanation about CLM
>                         during headless state.
>
>                         Hi Nhat Pham,
>
>                         Comment 3:
>                         Please see below. The problem I had
>                         anticipated I am now seeing during CLMD
>                         absence (during the headless state),
>                         so CPND/CLMA now need to address the case
>                         below. Currently cpnd clm init fails with
>                         return value SA_AIS_ERR_UNAVAILABLE,
>                         but it should be SA_AIS_ERR_TRY_AGAIN.
>
>                         ==================================================
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE
>                         STATE-> IMM_NODE_FULLY_AVAILABLE 17418
>                         Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync
>                         ending normally
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch
>                         set to 9 in ImmModel
>                         Feb 19 11:18:28 PL-4 cpsv_app: IN Received
>                         PROC_STALE_CLIENTS
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>                         Implementer connected: 42
>                         (MsgQueueService132111) <108, 2040f>
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>                         Implementer connected: 43
>                         (MsgQueueService131855) <0, 2030f>
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>                         Implementer connected: 44 (safLogService) <0,
>                         2010f>
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>                         SERVER STATE: IMM_SERVER_SYNC_SERVER -->
>                         IMM_SERVER_READY
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>                         Implementer connected: 45 (safClmService) <0,
>                         2010f>
>                         *Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER
>                         cpnd clm init failed with return value:31
>                         Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd
>                         init failed
>                         Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER
>                         cpnd_lib_req FAILED
>                         Feb 19 11:18:28 PL-4 osafckptnd[7718]:
>                         __init_cpnd() failed*
>                         Feb 19 11:18:28 PL-4 osafclmna[5432]: NO
>                         safNode=PL-4,safCluster=myClmCluster Joined
>                         cluster, nodeid=2040f
>                         Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD
>                         NEW_ACTIVE, adest:1
>                         Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO
>                         Sending node up due to NCSMDS_NEW_ACTIVE
>                         Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1
>                         SISU states sent
>                         Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU
>                         states sent
>                         Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7
>                         CSICOMP states synced
>                         Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU
>                         states sent
>                         Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>                         Implementer connected: 46 (safAmfService) <0,
>                         2010f>
>                         Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO
>                         'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>                         Component or SU restart probation timer expired
>                         Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
>                         Instantiation of
>                         'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>                         failed
>                         Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
>                         Reason: component registration timer expired
>                         Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA
>                         'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>                         Presence State RESTARTING => INSTANTIATION_FAILED
>                         Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
>                         Component Failover trigerred for
>                         'safSu=PL-4,safSg=NoRed,safApp=OpenSAF':
>                         Failed component:
>                         'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>                         Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER
>                         'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got
>                         Inst failed
>                         Feb 19 11:18:35 PL-4 osafamfnd[5441]:
>                         Rebooting OpenSAF NodeId = 132111 EE Name = ,
>                         Reason: NCS component Instantiation failed,
>                         OwnNodeId = 132111, SupervisionTime = 60
>                         Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting
>                         local node; timeout=60
>                         Feb 19 11:18:39 PL-4 kernel: [ 4877.338518]
>                         md: stopping all md devices.
>                         ==================================================
>
>                         -AVM
>
>                         On 2/15/2016 5:11 PM, Anders Widell wrote:
>
>                             Hi!
>
>                             Please find my answer inline, marked
>                             [AndersW].
>
>                             regards,
>                             Anders Widell
>
>                             On 02/15/2016 10:38 AM, Nhat Pham wrote:
>
>                                 Hi Mahesh,
>
>                                 It's good. Thank you. :)
>
>                                 [AVM] Upon rejoining of the SCs, the
>                                 replica should be re-created regardless
>                                 of whether another application opens it
>                                 on PL4.
>                                 (Note: this comment is based on your
>                                 explanation; I have not yet
>                                 reviewed/tested. Currently I am
>                                 struggling with the SCs not rejoining
>                                 after the headless state. I can tell
>                                 you more about this once I complete my
>                                 review/testing.)
>
>                                 [Nhat] To make cloud resilience work,
>                                 you need the patches from the other
>                                 services (log, amf, clm, ntf).
>                                 @Minh: I heard that you created a tar
>                                 file which includes all the patches.
>                                 Could you
>                                 please send it to Mahesh? Thanks
>
>                                 [AVM] I understand that. Before I
>                                 comment more on this, please allow me to
>                                 understand the design: I am still not
>                                 very clear on the headless design in
>                                 detail.
>                                 For example, take the cluster membership
>                                 of PLs during the headless state: in the
>                                 absence of SCs (CLMD), are the PLs
>                                 considered cluster nodes or not (cluster
>                                 membership)?
>
>                                 [Nhat] I don't know much about this.
>                                 @Anders: Could you please comment on
>                                 this? Thanks
>
>                             [AndersW] First of all, keep in mind that
>                             the "headless" state should ideally not
>                             last a very long time. Once we have the
>                             spare SC feature in place (ticket [#79]),
>                             a new SC should become active within a
>                             matter of a few seconds after we have lost
>                             both the active and the standby SC.
>
>                             I think you should view the state of the
>                             cluster in the headless state in the same
>                             way as you view the state of the cluster
>                             during a failover between the active and
>                             the standby SC. Imagine that the active SC
>                             dies. It takes the standby SC 1.5 seconds
>                             to detect the failure of the active SC
>                             (this is due to the TIPC timeout). If you
>                             have configured the PROMOTE_ACTIVE_TIMER,
>                             there is an additional delay before the
>                             standby takes over as active. What is the
>                             state of the cluster during the time after
>                             the active SC failed and before the
>                             standby takes over?
>
>                             The state of the cluster while it is
>                             headless is very similar. The difference
>                             is that this state may last a little bit
>                             longer (though not more than a few
>                             seconds, until one of the spare SCs
>                             becomes active). Another difference is
>                             that we may have lost some state. With a
>                             "perfect" implementation of the headless
>                             feature we should not lose any state at
>                             all, but with the current set of patches
>                             we do lose state.
>
>                             So specifically if we talk about cluster
>                             membership and ask the question: is a
>                             particular PL a member of the cluster or
>                             not during the headless state? Well, if
>                             you ask CLM about this during the headless
>                             state, then you will not know - because
>                             CLM doesn't provide any service during the
>                             headless state. If you keep retrying your
>                             query to CLM, you will eventually get an
>                             answer - but you will not get this answer
>                             until there is an active SC again and we
>                             have exited the headless state. When
>                             viewed in this way, the answer to the
>                             question about a node's membership is
>                             undefined during the headless state, since
>                             CLM will not provide you with any answer
>                             until there is an active SC.
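
The "keep retrying until there is an active SC again" pattern described above can be sketched roughly as follows. This is only an illustration in Python, not OpenSAF code: it assumes the service reports SA_AIS_ERR_TRY_AGAIN while the cluster is headless (as the README in this thread describes for CKPT), and call_with_retry and make_headless_stub are hypothetical names invented for the example.

```python
import time

# Numeric values follow the SaAisErrorT enumeration of the SA Forum AIS;
# they are used here only as stand-ins for the real C constants.
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6


def call_with_retry(api_call, max_attempts=10, delay_s=0.5, sleep=time.sleep):
    """Hypothetical helper: repeat api_call while it reports TRY_AGAIN.

    api_call is any zero-argument callable returning (return_code, value).
    The last (return_code, value) pair is returned, so the caller can tell
    whether the retries were exhausted while still headless.
    """
    rc, value = SA_AIS_ERR_TRY_AGAIN, None
    for attempt in range(max_attempts):
        rc, value = api_call()
        if rc != SA_AIS_ERR_TRY_AGAIN:
            break
        if attempt + 1 < max_attempts:
            sleep(delay_s)  # back off before asking again
    return rc, value


def make_headless_stub(headless_calls, answer):
    """Hypothetical stub: TRY_AGAIN for the first headless_calls calls."""
    remaining = [headless_calls]

    def query():
        if remaining[0] > 0:
            remaining[0] -= 1
            return SA_AIS_ERR_TRY_AGAIN, None
        return SA_AIS_OK, answer

    return query
```

A caller would wrap the real query in a callable and decide for itself how long it is willing to wait; the answer only arrives once the headless state has ended.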
>
>                             However, if you asked CLM about the node's
>                             cluster membership status before the
>                             cluster went headless, you probably saved
>                             a cached copy of the cluster membership
>                             state. Maybe you also installed a CLM
>                             track callback and intend to update this
>                             cached copy every time the cluster
>                             membership status changes. The question
>                             then is: can you continue using this
>                             cached copy of the cluster membership
>                             state during the headless state? The
>                             answer is YES: since CLM doesn't provide
>                             any service during the headless state, it
>                             also means that the cluster membership
>                             view cannot change during this time. Nodes
>                             can of course reboot or die, but CLM will
>                             not notice and hence the cluster view will
>                             not be updated. You can argue that this is
>                             bad because the cluster view doesn't
>                             reflect reality, but notice that this will
>                             always be the case. We can never propagate
>                             information instantaneously, and detection
>                             of node failures will take 1.5 seconds due
>                             to the TIPC timeout. You can never be sure
>                             that a node is alive at this very moment
>                             just because CLM tells you that it is a
>                             member of the cluster. If we are
>                             unfortunate enough to lose both system
>                             controller nodes simultaneously, updates
>                             to the cluster membership view will be
>                             delayed a few seconds longer than usual.
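
The cached-copy reasoning above can be sketched as a small Python model. It is an assumption-laden illustration, not the CLM API: CachedClusterView and on_track_callback are hypothetical names, and the only point is that the cache changes solely when a track callback arrives, so it stays usable (if possibly stale) while no callbacks are delivered.

```python
class CachedClusterView:
    """Hypothetical client-side cache of the cluster membership view."""

    def __init__(self, members):
        self._members = set(members)
        self.updates_seen = 0  # number of track callbacks applied so far

    def on_track_callback(self, joined=(), left=()):
        # Called only when CLM delivers a membership change. During the
        # headless state CLM delivers nothing, so the cache is unchanged.
        self._members.update(joined)
        self._members.difference_update(left)
        self.updates_seen += 1

    def is_member(self, node):
        # Answers from the cache; like any CLM answer, it may lag reality.
        return node in self._members
```

For example, a view created with SC-1, SC-2, PL-3 and PL-4 keeps reporting PL-4 as a member throughout the headless state, and only drops it once a callback reporting PL-4 as having left is applied after an SC is active again.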
>
>
>                                 Best regards,
>                                 Nhat Pham
>
>                                 -----Original Message-----
>                                 From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>                                 Sent: Monday, February 15, 2016 11:19 AM
>                                 To: Nhat Pham <nhat.p...@dektech.com.au>;
>                                 anders.wid...@ericsson.com
>                                 Cc: opensaf-devel@lists.sourceforge.net;
>                                 'Beatriz Brandao' <beatriz.bran...@ericsson.com>
>                                 Subject: Re: [PATCH 0 of 1] Review
>                                 Request for cpsv: Support preserving and
>                                 recovering checkpoint replicas during
>                                 headless state V2 [#1621]
>
>                                 Hi Nhat Pham,
>
>                                 How was your holiday?
>
>                                 Please find my comments below
>
>                                 On 2/15/2016 8:43 AM, Nhat Pham wrote:
>
>                                     Hi Mahesh,
>
>                                     For the comment 1, the patch will
>                                     be updated accordingly.
>
>                                 [AVM] Please hold; I will provide
>                                 more comments this week, so that we can
>                                 have a consolidated V3.
>
>                                     For the comment 2, I think the
>                                     CKPT service will not be backward
>                                     compatible if scAbsenceAllowed
>                                     is true: the client can't create
>                                     a non-collocated checkpoint on SCs.
>
>                                     Furthermore, this solution only
>                                     protects the CKPT service from the
>                                     case "The non-collocated
>                                     checkpoint is created on an SC";
>                                     there are still cases where
>                                     the replicas are completely lost. Ex:
>
>                                     - The non-collocated checkpoint is
>                                     created on a PL. The PL reboots. Both
>                                     replicas are now located on SCs. Then
>                                     the headless state happens. All
>                                     replicas are lost.
>                                     - The non-collocated checkpoint
>                                     has its active replica located on a PL
>                                     and this PL restarts during the
>                                     headless state.
>                                     - The non-collocated checkpoint is
>                                     created on PL3. This checkpoint is
>                                     also opened on PL4. Then the SCs and
>                                     PL3 reboot.
>
>                                 [AVM] Upon rejoining of the SCs, the
>                                 replica should be re-created regardless
>                                 of whether another application has it
>                                 open on PL4.
>                                 (Note: this comment is based on your
>                                 explanation; I have not yet
>                                 reviewed/tested it. Currently I am
>                                 struggling with the SCs not rejoining
>                                 after the headless state. I can provide
>                                 more on this once I complete my
>                                 review/testing.)
>
>                                     In this case, all replicas are
>                                     lost and the client has to create
>                                     it again.
>
>                                     If multiple nodes (including the
>                                     SCs) reboot, losing replicas
>                                     cannot be prevented. The patch aims to
>                                     recover the checkpoints in the cases
>                                     where that is possible.
>                                     What do you think?
>
>                                 [AVM] I understand that. Before I
>                                 comment more on this, please allow
>                                 me to understand the design: I am still
>                                 not very clear on the headless design
>                                 in detail.
>
>                                 For example, take the cluster membership
>                                 of PLs during the headless state: in the
>                                 absence of SCs (CLMD), are the PLs
>                                 considered cluster nodes or not (cluster
>                                 membership)?
>
>                                 - If they are not considered cluster
>                                 nodes, the Checkpoint Service API
>                                 should leverage the SA Forum Cluster
>                                 Membership Service, and the APIs can
>                                 fail with SA_AIS_ERR_UNAVAILABLE.
>
>                                 - If they are considered cluster nodes,
>                                 we need to follow all the rules
>                                 defined in the SAI-AIS-CKPT-B.02.02
>                                 specification.
>
>                                 So give me some more time to review it
>                                 completely, so that we can have a
>                                 consolidated patch V3.
>
>                                 -AVM
>
>                                     Best regards,
>                                     Nhat Pham
>
>                                     -----Original Message-----
>                                     From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>                                     Sent: Friday, February 12, 2016 11:10 AM
>                                     To: Nhat Pham <nhat.p...@dektech.com.au>;
>                                     anders.wid...@ericsson.com
>                                     Cc: opensaf-devel@lists.sourceforge.net;
>                                     Beatriz Brandao <beatriz.bran...@ericsson.com>
>                                     Subject: Re: [PATCH 0 of 1] Review
>                                     Request for cpsv: Support
>                                     preserving and recovering
>                                     checkpoint replicas during
>                                     headless state V2 [#1621]
>
>
>                                     Comment 2:
>
>                                     After incorporating comment 1,
>                                     all the limitations should be
>                                     prevented whenever the Hydra
>                                     configuration is enabled in IMM.
>
>                                     For example: if some application
>                                     tries to create a non-collocated
>                                     checkpoint whose active replica would
>                                     be generated/located on an SC, then,
>                                     regardless of whether the heads (SCs)
>                                     exist or not, the call should
>                                     return SA_AIS_ERR_NOT_SUPPORTED.
>
>                                     In other words, reject the request up
>                                     front, rather than allowing a
>                                     non-collocated checkpoint to be
>                                     created while the heads (SCs) exist
>                                     and then having it become
>                                     unrecoverable after the heads (SCs)
>                                     rejoin.
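
The rule proposed in Comment 2 can be written down as a small decision function. This is a sketch of the proposal only, not of what the patch or the CKPT service actually does: check_checkpoint_create is a hypothetical name, and scAbsenceAllowed is modelled as a plain boolean for simplicity.

```python
# Numeric values follow the SaAisErrorT enumeration of the SA Forum AIS;
# they are used here only as stand-ins for the real C constants.
SA_AIS_OK = 1
SA_AIS_ERR_NOT_SUPPORTED = 19


def check_checkpoint_create(sc_absence_allowed, collocated, node_is_sc):
    """Hypothetical pre-check for a checkpoint create/open request.

    sc_absence_allowed: whether the headless (Hydra) feature is enabled.
    collocated:         whether the checkpoint is a collocated checkpoint.
    node_is_sc:         whether the active replica would live on an SC.
    """
    if sc_absence_allowed and not collocated and node_is_sc:
        # The replica would be lost with the SCs, so refuse up front,
        # whether or not the SCs are currently present.
        return SA_AIS_ERR_NOT_SUPPORTED
    return SA_AIS_OK
```

With the feature disabled the function always returns SA_AIS_OK, which is the backward-compatibility concern raised in the reply to this comment: behaviour only changes when scAbsenceAllowed is set.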
>
>                                     ======================================================================
>
>                                             Limitation: The CKPT service
>                                             doesn't support recovering
>                                             checkpoints in the following cases:
>                                             . The checkpoint is unlinked
>                                             before headless.
>                                             . The non-collocated checkpoint
>                                             has its active replica located
>                                             on an SC.
>                                             . The non-collocated checkpoint
>                                             has its active replica located
>                                             on a PL and this PL restarts
>                                             during the headless state.
>                                             In these cases, the checkpoint
>                                             replica is destroyed. The fault
>                                             code SA_AIS_ERR_BAD_HANDLE is
>                                             returned when the client accesses
>                                             the checkpoint in these cases. The
>                                             client must re-open the checkpoint.
>
>                                     ======================================================================
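
From the client's side, the limitation quoted above ("SA_AIS_ERR_BAD_HANDLE is returned ... the client must re-open the checkpoint") could be handled along these lines. This is a minimal Python simulation under stated assumptions: FakeCheckpointService is an in-memory stand-in invented for the example, not the CKPT API, and write_with_reopen is a hypothetical helper.

```python
# Numeric values follow the SaAisErrorT enumeration of the SA Forum AIS;
# they are used here only as stand-ins for the real C constants.
SA_AIS_OK = 1
SA_AIS_ERR_BAD_HANDLE = 9


class FakeCheckpointService:
    """Hypothetical in-memory stand-in for a checkpoint service."""

    def __init__(self):
        self._valid_handles = set()
        self._next_handle = 1
        self.sections = {}

    def open(self, name):
        handle = self._next_handle
        self._next_handle += 1
        self._valid_handles.add(handle)
        return SA_AIS_OK, handle

    def write(self, handle, section, data):
        if handle not in self._valid_handles:
            return SA_AIS_ERR_BAD_HANDLE
        self.sections[section] = data
        return SA_AIS_OK

    def lose_replicas(self):
        # Models one of the unrecoverable cases: replica and handles gone.
        self._valid_handles.clear()
        self.sections.clear()


def write_with_reopen(svc, name, handle, section, data):
    """Hypothetical helper: re-open once if the old handle went bad."""
    rc = svc.write(handle, section, data)
    if rc == SA_AIS_ERR_BAD_HANDLE:
        rc, handle = svc.open(name)
        if rc == SA_AIS_OK:
            rc = svc.write(handle, section, data)
    return rc, handle
```

Note that re-opening only gives the client a usable handle again; since the replica was destroyed, the application still has to rewrite whatever data it needs in the checkpoint.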
>
>                                     -AVM
>
>
>                                     On 2/11/2016 12:52 PM, A V Mahesh
>                                     wrote:
>
>                                         Hi,
>
>                                         I just started reviewing the patch.
>                                         I will give comments as soon as
>                                         I come across any, to save some
>                                         time.
>
>                                         Comment 1:
>                                         This functionality should be
>                                         guarded by a check of whether the
>                                         Hydra configuration is enabled in
>                                         IMM, attrName =
>                                         const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
>
>                                         Please see how the LOG/AMF
>                                         services implemented it, for example.
>
>                                         -AVM
>
>
>                                         On 1/29/2016 1:02 PM, Nhat
>                                         Pham wrote:
>
>                                             Hi Mahesh,
>
>                                             As described in the
>                                             README, the CKPT service
>                                             returns
>                                             SA_AIS_ERR_TRY_AGAIN fault
>                                             code in this case.
>                                             I guess it's the same for
>                                             other services.
>
>                                             @Anders: Could you please
>                                             confirm this?
>
>                                             Best regards,
>                                             Nhat Pham
>
>                                             -----Original Message-----
>                                             From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>                                             Sent: Friday, January 29, 2016 2:11 PM
>                                             To: Nhat Pham <nhat.p...@dektech.com.au>;
>                                             anders.wid...@ericsson.com
>                                             Cc: opensaf-devel@lists.sourceforge.net
>                                             Subject: Re: [PATCH 0 of
>                                             1] Review Request for
>                                             cpsv: Support
>                                             preserving and recovering
>                                             checkpoint replicas during
>                                             headless state V2 [#1621]
>
>                                             Hi,
>
>                                             On 1/29/2016 11:45 AM,
>                                             Nhat Pham wrote:
>
>                                                 -  The behavior of
>                                                 application will be
>                                                 consistent with other
>                                                 saf services like
>                                                 imm/amf behavior 
>                                                 during headless state.
>                                                 [Nhat] I'm not clear
>                                                 what you mean by
>                                                 "consistent"?
>
>                                             In the absence of the
>                                             Director (SCs), what return
>                                             values are expected from the
>                                             SAF APIs (of all services)
>                                             that are not in a position to
>                                             provide service at that moment?
>
>                                             I think all services
>                                             should return the same SAF
>                                             errors. I think we
>                                             currently don't have
>                                             that; maybe Anders Widell
>                                             can help us.
>
>                                             -AVM
>
>
>                                             On 1/29/2016 11:45 AM,
>                                             Nhat Pham wrote:
>
>                                                 Hi Mahesh,
>
>                                                 Please see the
>                                                 attachment for the
>                                                 README. Let me know if
>                                                 there is
>                                                 any more information
>                                                 required.
>
>                                                 Regarding your comments:
>                                                       -  during
>                                                 headless state 
>                                                 applications may
>                                                 behave like during
>                                                 CPND restart case
>                                                 [Nhat] Headless state
>                                                 and CPND restart are
>                                                 different events.
>                                                 Thus, the behavior is
>                                                 different.
>                                                 Headless state is a
>                                                 case where both SCs go
>                                                 down.
>
>                                                       -  The behavior
>                                                 of application will be
>                                                 consistent with other
>                                                 saf services like
>                                                 imm/amf behavior 
>                                                 during headless state.
>                                                 [Nhat] I'm not clear
>                                                 what you mean by
>                                                 "consistent"?
>
>                                                 Best regards,
>                                                 Nhat Pham
>
>                                                 -----Original Message-----
>                                                 From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>                                                 Sent: Friday, January 29, 2016 11:12 AM
>                                                 To: Nhat Pham <nhat.p...@dektech.com.au>;
>                                                 anders.wid...@ericsson.com
>                                                 Cc: opensaf-devel@lists.sourceforge.net
>                                                 Subject: Re: [PATCH 0
>                                                 of 1] Review Request
>                                                 for cpsv: Support
>                                                 preserving and
>                                                 recovering checkpoint
>                                                 replicas during
>                                                 headless state
>                                                 V2 [#1621]
>
>                                                 Hi Nhat Pham,
>
>                                                 I started reviewing
>                                                 this patch, so can you
>                                                 please provide a README
>                                                 file with scope and
>                                                 limitations? That
>                                                 will help to define the
>                                                 testing/reviewing scope.
>
>                                                 Following are the minimum
>                                                 things we can keep in
>                                                 mind while
>                                                 reviewing/accepting the
>                                                 patch:
>
>                                                 - Not affecting
>                                                 existing functionality
>                                                       - During the
>                                                 headless state,
>                                                 applications may
>                                                 behave like during the
>                                                 CPND restart case
>                                                       - The minimum
>                                                 functionality of the
>                                                 application works
>                                                       - The behavior
>                                                 of the application will be
>                                                 consistent with the
>                                                 behavior of other SAF
>                                                 services like imm/amf
>                                                 during the
>                                                 headless state.
>
>                                                 So please do provide
>                                                 additional details in
>                                                 the README if any of
>                                                 the above is deviated
>                                                 from; that allows users to
>                                                 know about the
>                                                 limitations/deviations.
>
>                                                 -AVM
>
>                                                 On 1/4/2016 3:15 PM,
>                                                 Nhat Pham wrote:
>
>                                                     Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>                                                     Review request for Trac Ticket(s): #1621
>                                                     Peer Reviewer(s): mahesh.va...@oracle.com; anders.wid...@ericsson.com
>                                                     Pull request to: mahesh.va...@oracle.com
>                                                     Affected branch(es): default
>                                                     Development branch: default
>
>                                                     --------------------------------
>                                                     Impacted area       Impact y/n
>                                                     --------------------------------
>                                                      Docs                    n
>                                                      Build system            n
>                                                      RPM/packaging           n
>                                                      Configuration files     n
>                                                      Startup scripts         n
>                                                      SAF services            y
>                                                      OpenSAF services        n
>                                                      Core libraries          n
>                                                      Samples                 n
>                                                      Tests                   n
>                                                      Other                   n
>
>
>                                                     Comments (indicate scope for each "y" above):
>                                                     ---------------------------------------------
>
>                                                     changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>                                                     Author:    Nhat Pham <nhat.p...@dektech.com.au>
>                                                     Date:      Mon, 04 Jan 2016 16:34:33 +0700
>
>                                                           cpsv: Support preserving and recovering checkpoint
>                                                           replicas during headless state [#1621]
>
>                                                           Background:
>                                                           ----------
>                                                           This enhancement preserves checkpoint replicas when
>                                                           both SCs are down (headless state) and recovers the
>                                                           replicas when one of the SCs comes up again. If both
>                                                           SCs go down, checkpoint replicas on surviving nodes
>                                                           still remain. When an SC is available again, surviving
>                                                           replicas are automatically registered in the SC
>                                                           checkpoint database. The content of surviving replicas
>                                                           is kept intact and synchronized to new replicas.
>
>                                                           When no SC is available, client API calls that change
>                                                           checkpoint configuration, which requires SC
>                                                           communication, are rejected. Client API calls that read
>                                                           and write existing checkpoint replicas still work.
>
>                                                           Limitation: The CKPT service does not support
>                                                           recovering checkpoints in the following cases:
>                                                            - The checkpoint was unlinked before headless state.
>                                                            - The non-collocated checkpoint has its active replica
>                                                              located on an SC.
>                                                            - The non-collocated checkpoint has its active replica
>                                                              located on a PL and this PL restarts during headless
>                                                              state.
>                                                           In these cases, the checkpoint replica is destroyed.
>                                                           The fault code SA_AIS_ERR_BAD_HANDLE is returned when
>                                                           the client accesses the checkpoint in these cases. The
>                                                           client must re-open the checkpoint.
>
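A minimal sketch of the client-side handling this limitation implies: on SA_AIS_ERR_BAD_HANDLE, re-open the checkpoint and retry the access once. Everything named demo_* below is a hypothetical stand-in so the example compiles without the OpenSAF headers; a real client would use the AIS checkpoint API and its SaAisErrorT codes instead. Since the replica was destroyed, the re-opened checkpoint should not be assumed to hold the earlier data.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the AIS return codes named in the text. */
typedef enum { DEMO_OK, DEMO_ERR_BAD_HANDLE } demo_err_t;

/* Hypothetical checkpoint handle. `valid` models whether the replica
 * behind the handle still exists; `generation` counts (re-)opens. */
typedef struct {
    int valid;
    int generation;
} demo_ckpt_t;

/* Stub write: fails with BAD_HANDLE when the replica was destroyed. */
static demo_err_t demo_write(demo_ckpt_t *ckpt, const char *data)
{
    (void)data; /* payload is ignored in this sketch */
    return ckpt->valid ? DEMO_OK : DEMO_ERR_BAD_HANDLE;
}

/* Stub open: always succeeds and yields a fresh, usable handle. */
static demo_err_t demo_open(demo_ckpt_t *ckpt)
{
    ckpt->valid = 1;
    ckpt->generation++;
    return DEMO_OK;
}

/* Write to the checkpoint; if the handle turned out to be stale,
 * re-open the checkpoint once and repeat the write. */
static demo_err_t write_with_reopen(demo_ckpt_t *ckpt, const char *data)
{
    demo_err_t rc = demo_write(ckpt, data);
    if (rc == DEMO_ERR_BAD_HANDLE) {
        rc = demo_open(ckpt);
        if (rc == DEMO_OK)
            rc = demo_write(ckpt, data);
    }
    return rc;
}
```

The re-open is attempted only once here, so a persistent failure is returned to the caller instead of looping.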
>                                                           While in headless state, accessing checkpoint replicas
>                                                           does not work if the node which hosts the active
>                                                           replica goes down. It will work again when an SC is
>                                                           available again.
>
>                                                           Solution:
>                                                           ---------
>                                                           The solution for this enhancement includes 2 parts:
>
>                                                           1. To destroy the un-recoverable checkpoints described
>                                                           above when both SCs are down: When both SCs are down,
>                                                           the CPND deletes un-recoverable checkpoint nodes and
>                                                           replicas on PLs. Then it requests the CPA to destroy
>                                                           the corresponding checkpoint node by using the new
>                                                           message CPA_EVT_ND2A_CKPT_DESTROY.
>
>                                                           2. To update CPD with checkpoint information: When an
>                                                           active SC is up after headless, CPND will update CPD
>                                                           with checkpoint information by using the new message
>                                                           CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of using
>                                                           CPD_EVT_ND2D_CKPT_CREATE. This is because, if
>                                                           CPD_EVT_ND2D_CKPT_CREATE were used, the CPND would
>                                                           create a new ckpt_id for the checkpoint, which might
>                                                           differ from the current ckpt_id. The CPD collects
>                                                           checkpoint information within 6s. During this updating
>                                                           time, the following requests are rejected with fault
>                                                           code SA_AIS_ERR_TRY_AGAIN:
>                                                           - CPD_EVT_ND2D_CKPT_CREATE
>                                                           - CPD_EVT_ND2D_CKPT_UNLINK
>                                                           - CPD_EVT_ND2D_ACTIVE_SET
>                                                           - CPD_EVT_ND2D_CKPT_RDSET
>
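The rejection with SA_AIS_ERR_TRY_AGAIN during the update window suggests a bounded retry on the requesting side. The following is only a sketch under assumptions: the sk_* names, the callback signature, and the attempt limit are hypothetical and not taken from the patch, and the stub request merely simulates a service that rejects a few calls before accepting one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the AIS return codes named in the text. */
typedef enum { SK_OK, SK_ERR_TRY_AGAIN, SK_ERR_OTHER } sk_rc_t;

/* Hypothetical request callback, e.g. a wrapper around a create or
 * unlink call. */
typedef sk_rc_t (*sk_request_fn)(void *ctx);

/* Repeat the request while it is rejected with TRY_AGAIN, up to
 * max_attempts calls. Any other result ends the loop immediately.
 * The number of calls made is stored in *attempts_used if non-NULL. */
static sk_rc_t sk_retry(sk_request_fn fn, void *ctx, int max_attempts,
                        int *attempts_used)
{
    assert(max_attempts > 0);
    sk_rc_t rc = SK_ERR_TRY_AGAIN;
    int n = 0;
    while (n < max_attempts) {
        rc = fn(ctx);
        n++;
        if (rc != SK_ERR_TRY_AGAIN)
            break;
        /* A real client would sleep here before the next attempt. */
    }
    if (attempts_used != NULL)
        *attempts_used = n;
    return rc;
}

/* Stub request: rejects the first *remaining calls, then succeeds. */
static sk_rc_t sk_stub_request(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return SK_ERR_TRY_AGAIN;
    }
    return SK_OK;
}
```

With the update window stated as 6s, a delay between attempts and a limit that together cover somewhat more than that window would be a natural choice, but the exact values are left to the caller.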
>
>
>                                                     Complete diffstat:
>                                                     ------------------
>                                                      osaf/libs/agents/saf/cpa/cpa_proc.c       |   52
>                                                      osaf/libs/common/cpsv/cpsv_edu.c          |   43
>                                                      osaf/libs/common/cpsv/include/cpd_cb.h    |    3
>                                                      osaf/libs/common/cpsv/include/cpd_imm.h   |    1
>                                                      osaf/libs/common/cpsv/include/cpd_proc.h  |    7
>                                                      osaf/libs/common/cpsv/include/cpd_tmr.h   |    3
>                                                      osaf/libs/common/cpsv/include/cpnd_cb.h   |    1
>                                                      osaf/libs/common/cpsv/include/cpnd_init.h |    2
>                                                      osaf/libs/common/cpsv/include/cpsv_evt.h  |   20
>                                                      osaf/services/saf/cpsv/cpd/Makefile.am    |    3
>                                                      osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229
>                                                      osaf/services/saf/cpsv/cpd/cpd_imm.c      |  112
>                                                      osaf/services/saf/cpsv/cpd/cpd_init.c     |   20
>                                                      osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309
>                                                      osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7
>                                                      osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16
>                                                      osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   22
>                                                      osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   23
>                                                      osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13
>                                                      osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  314
>                                                      20 files changed, 1189 insertions(+), 11 deletions(-)
>
>
>                                                     Testing Commands:
>                                                     -----------------
>                                                     -
>
>                                                     Testing, Expected Results:
>                                                     --------------------------
>                                                     -
>
>                                                     Conditions of Submission:
>                                                     -------------------------
>                                                     <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>                                                     Arch         Built     Started    Linux distro
>                                                     -------------------------------------------
>                                                     mips           n          n
>                                                     mips64         n          n
>                                                     x86            n          n
>                                                     x86_64         n          n
>                                                     powerpc        n          n
>                                                     powerpc64      n          n
>
>
>                                                     Reviewer Checklist:
>                                                     -------------------
>                                                     [Submitters: make sure that your review doesn't trigger
>                                                     any checkmarks!]
>
>                                                     Your checkin has not passed review because (see checked
>                                                     entries):
>
>                                                     ___ Your RR template is generally incomplete; it has too
>                                                         many blank entries that need proper data filled in.
>
>                                                     ___ You have failed to nominate the proper persons for
>                                                         review and push.
>
>                                                     ___ Your patches do not have proper short+long header
>
>                                                     ___ You have grammar/spelling in your header that is
>                                                         unacceptable.
>
>                                                     ___ You have exceeded a sensible line length in your
>                                                         headers/comments/text.
>
>                                                     ___ You have failed to put in a proper Trac Ticket # into
>                                                         your commits.
>
>                                                     ___ You have incorrectly put/left internal data in your
>                                                         comments/files (i.e. internal bug tracking tool IDs,
>                                                         product names etc)
>
>                                                     ___ You have not given any evidence of testing beyond
>                                                         basic build tests. Demonstrate some level of runtime
>                                                         or other sanity testing.
>
>                                                     ___ You have ^M present in some of your files. These have
>                                                         to be removed.
>
>                                                     ___ You have needlessly changed whitespace or added
>                                                         whitespace crimes like trailing spaces, or spaces
>                                                         before tabs.
>
>                                                     ___ You have mixed real technical changes with whitespace
>                                                         and other cosmetic code cleanup changes. These have
>                                                         to be separate commits.
>
>                                                     ___ You need to refactor your submission into logical
>                                                         chunks; there is too much content in a single commit.
>
>                                                     ___ You have extraneous garbage in your review (merge
>                                                         commits etc)
>
>                                                     ___ You have giant attachments which should never have
>                                                         been sent; instead you should place your content in a
>                                                         public tree to be pulled.
>
>                                                     ___ You have too many commits attached to an e-mail;
>                                                         resend as threaded commits, or place in a public tree
>                                                         for a pull.
>
>                                                     ___ You have resent this content multiple times without a
>                                                         clear indication of what has changed between each
>                                                         re-send.
>
>                                                     ___ You have failed to adequately and individually address
>                                                         all of the comments and change requests that were
>                                                         proposed in the initial review.
>
>                                                     ___ You have a misconfigured ~/.hgrc file (i.e. username,
>                                                         email etc)
>
>                                                     ___ Your computer has a badly configured date and time,
>                                                         confusing the threaded patch review.
>
>                                                     ___ Your changes affect the IPC mechanism, and you don't
>                                                         present any results for an in-service upgradability
>                                                         test.
>
>                                                     ___ Your changes affect the user manual and documentation,
>                                                         but your patch series does not contain the patch that
>                                                         updates the Doxygen manual.
>
