Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

A V Mahesh Thu, 25 Feb 2016 03:15:54 -0800

Fine.

Please publish the v3 patch.


-AVM

On 2/25/2016 3:50 PM, Nhat Pham wrote:
> Hi Mahesh,
>
> Please see my answers below with [NhatPham4]
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Thursday, February 25, 2016 4:31 PM
> To: Nhat Pham <[email protected]>; 'Anders Widell'
> <[email protected]>
> Cc: 'Beatriz Brandao' <[email protected]>; 'Minh Chau H'
> <[email protected]>; [email protected]
> Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless state V2
> [#1621]
>
> Hi Nhat Pham,
>
>   >> With this patch the CPND detects un-recoverable checkpoints and
> deletes them all from the DB in case the headless state happens.
>
>    By the way, I didn't test some cases. Can you clarify the following:
>
> - Which error will be received by a cpsv application on a PL for the
> unrecoverable checkpoint?
> - Is accessing SaCkptHandleT valid after head recovery?
> [NhatPham4] It's still valid during headless state and after head recovery.
> During headless state, saCkptCheckpointOpen() returns SA_AIS_ERR_TRY_AGAIN.
> It works again after head recovery.
>
> - Does accessing SaCkptCheckpointHandleT return SA_AIS_ERR_BAD_HANDLE
> after head recovery?
> [NhatPham4] Yes, it returns SA_AIS_ERR_BAD_HANDLE during headless state and
> after head recovery. But the SaCkptHandleT is still valid, so the application
> can re-create the checkpoint.
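
The application-side flow described in the [NhatPham4] answers can be sketched roughly as follows. This is a minimal Python model, not the real C API: FakeCkptService, open_checkpoint, write, lose_all_replicas and write_with_recovery are hypothetical names invented for the illustration. Only the error codes come from the thread, with the numeric values assumed to match saAis.h.

```python
import time

# Subset of SaAisErrorT (values assumed to match saAis.h).
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_BAD_HANDLE = 9


class FakeCkptService:
    """Hypothetical stand-in for the checkpoint service as seen from a PL."""

    def __init__(self):
        self.headless = False
        self.valid_ckpt_handles = set()
        self._next_handle = 1

    def open_checkpoint(self, name):
        # During headless state the open call is answered with TRY_AGAIN.
        if self.headless:
            return SA_AIS_ERR_TRY_AGAIN, None
        handle = self._next_handle
        self._next_handle += 1
        self.valid_ckpt_handles.add(handle)
        return SA_AIS_OK, handle

    def write(self, ckpt_handle, data):
        # The handle of an unrecoverable checkpoint is rejected.
        if ckpt_handle not in self.valid_ckpt_handles:
            return SA_AIS_ERR_BAD_HANDLE
        return SA_AIS_OK

    def lose_all_replicas(self):
        # Models both SCs going down while they held the only replicas.
        self.headless = True
        self.valid_ckpt_handles.clear()


def write_with_recovery(svc, name, ckpt_handle, data, retries=5, delay=0.0):
    """Write; on BAD_HANDLE re-open the checkpoint, retrying on TRY_AGAIN."""
    rc = svc.write(ckpt_handle, data)
    if rc != SA_AIS_ERR_BAD_HANDLE:
        return rc, ckpt_handle
    for _ in range(retries):
        rc, new_handle = svc.open_checkpoint(name)
        if rc == SA_AIS_OK:
            return svc.write(new_handle, data), new_handle
        if rc != SA_AIS_ERR_TRY_AGAIN:
            break
        time.sleep(delay)
    return rc, ckpt_handle
```

The point of the sketch is that the library handle stays usable throughout, so the application only has to replace the checkpoint handle once the open call stops returning TRY_AGAIN.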
>
>     -AVM
>
> On 2/25/2016 12:43 PM, A V Mahesh wrote:
>> Hi Nhat Pham,
>>
>> Please see my comment.
>>
>> -AVM
>>
>> On 2/25/2016 12:07 PM, Nhat Pham wrote:
>>> Hi Mahesh,
>>>
>>> Please see my comment below with [NhatPham2].
>>>
>>> Best regards,
>>>
>>> Nhat Pham
>>>
>>> *From:*A V Mahesh [mailto:[email protected]]
>>> *Sent:* Thursday, February 25, 2016 11:26 AM
>>> *To:* Nhat Pham <[email protected]>; 'Anders Widell'
>>> <[email protected]>
>>> *Cc:* [email protected]; 'Beatriz Brandao'
>>> <[email protected]>; 'Minh Chau H' <[email protected]>
>>> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
>>> preserving and recovering checkpoint replicas during headless state V2
>>> [#1621]
>>>
>>> Hi Nhat Pham,
>>>
>>> Please see my comment below.
>>>
>>> -AVM
>>>
>>> On 2/25/2016 7:54 AM, Nhat Pham wrote:
>>>
>>>       Hi Mahesh,
>>>
>>>       Would you agree with the comment below?
>>>
>>>       To summarize, the following are the comments so far:
>>>
>>>       *Comment 1*: This functionality should be guarded by a check that
>>>       the Hydra configuration is enabled in IMM, attrName =
>>>
>>>       const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>>>
>>>       Action: The code will be updated accordingly.
>>>
>>>       *Comment 2*: Keep the scope of the CPSV service such that
>>>       non-collocated checkpoint creation is NOT_SUPPORTED if the cluster
>>>       is running with IMMSV_SC_ABSENCE_ALLOWED ( the headless state
>>>       configuration is enabled at cluster startup and is currently not
>>>       configurable, so there is no chance of a run-time configuration
>>>       change ).
>>>
>>>       Action: No change in code. The CPSV keeps supporting
>>>       non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is
>>>       enabled.
>>>
>>> [AndersW3] No, I think we ought to support non-colocated
>>> checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that
>>> we have "system controllers" is an implementation detail of OpenSAF. I
>>> don't think the CKPT SAF specification implies that
>>> non-colocated checkpoints must be fully replicated on all the nodes
>>> in the cluster, and thus we must have the possibility that all
>>> replicas are lost. It is not clear exactly what to expect from the
>>> APIs when this happens, but you could handle it in a similar way as
>>> the case when all sections have been automatically deleted by the
>>> checkpoint service because the sections have expired.
>>>
>>> [AVM] I do not agree with either comment. We cannot handle it in a
>>> way similar to the section expiration case here: when sections
>>> expire, the checkpoint replica still exists and only the sections are
>>> deleted.
>>>
>>> The CPSV specification says that if two replicas exist ( in our case
>>> only on the SCs ) at a certain point in time, and the nodes hosting
>>> both of these replicas are administratively taken out of service, the
>>> Checkpoint Service should allocate another replica on another node
>>> while this node is not available. Please check section `3.1.7.2
>>> Non-Collocated Checkpoints` of the CPSV specification.
>>>
>>> For example, take the case of an application on a PL that is in the
>>> middle of writing to non-collocated checkpoint sections ( the physical
>>> replicas exist only on the SCs ). What will happen to the application
>>> on the PL? OK, let us assume the user agrees to lose the checkpoint
>>> and wants to recreate it: what will happen to the cpnd DB on the PL,
>>> and what about the complexity involved in cleaning it up? This will
>>> lead to a lot of maintainability issues.
>>>
>>> On top of that, the CKPT SAF specification only says that a
>>> non-collocated checkpoint and all its sections should survive if the
>>> Checkpoint Service is running on the cluster, and a replica is USER
>>> private data ( not OpenSAF state ), so losing any USER private data is
>>> not acceptable.
>>>
>>> [NhatPham2] According to SAI-AIS-CKPT-B.02.02 (chapter 3.1.8
>>> Persistence of Checkpoints):
>>>
>>> "As has been stated in Section 2.1 on page 13, the Checkpoint Service
>>> typically stores
>>>
>>> checkpoint data in the main memory of the nodes. *Regardless of the
>>> retention time, a *
>>>
>>> *checkpoint and all its sections do not survive if the Checkpoint
>>> Service stops running *
>>>
>>> *on all nodes hosting replicas for this checkpoint. The stop of the
>>> Checkpoint Service *
>>>
>>> *can be caused by administrative actions or node failures*."
>>>
>>> This states that the checkpoint does not survive if the nodes
>>> hosting its replicas fail (i.e. the SCs in our case).
>>>
>> [AVM] If we read further, section `3.1.7.2 Non-Collocated Checkpoints`
>> explains this with an example:
>>
>> "For example, if two replicas exist at a certain point in time, and the
>> node hosting one of these replicas is
>> administratively taken out of service, the Checkpoint Service may
>> allocate another
>> replica on another node while this node is not available."
>>
>>> Regarding the case you mentioned about the lost checkpoint and what
>>> will happen to the cpnd DB on the PL:
>>>
>>> With this patch the CPND detects un-recoverable checkpoints and
>>> deletes them all from the DB in case the headless state happens.
>>>
>> [AVM] I know. I was saying that maintaining such a flow, tied to the
>> transport `no active timer`, will open up a lot of new issues in CPSV
>> and become a code maintainability problem.
>>                 For example:
>>
>>                    1) If both SCs rejoin quickly ( below the `no active
>> timer` timeout, i think it is currently  ), we will end up not deleting
>> the DB. To address this we need to collect evidence to detect that
>> headless state happened.
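
A rough sketch of the clean-up being discussed, assuming the node director keeps a map from checkpoint name to the set of nodes hosting a replica. The function name purge_unrecoverable and the shape of the map are hypothetical and are not taken from the patch. The sketch also makes the concern in 1) concrete: the purge only happens if something passes in the set of lost nodes, so a headless period shorter than the timer would never trigger it.

```python
def purge_unrecoverable(ckpt_db, lost_nodes):
    """Remove checkpoints whose replicas all lived on lost nodes.

    ckpt_db maps a checkpoint name to the set of node names hosting a
    replica (hypothetical layout). Returns the sorted names of the
    checkpoints that were removed from ckpt_db.
    """
    lost = set(lost_nodes)
    removed = sorted(
        name for name, hosts in ckpt_db.items() if set(hosts) <= lost
    )
    for name in removed:
        del ckpt_db[name]
    return removed
```

For instance, with replicas of one checkpoint on SC-1 and SC-2 only and another on SC-1 and PL-4, losing both SCs removes only the first one.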
>>
>>
>>>       *Comment 3*: This is about case where checkpoint node director
>>>       (cpnd) crashes during headless state. In this case the cpnd can't
>>>       finish starting because it can't initialize CLM service.
>>>
>>>       Then after time out, the AMF triggers a restart again. Finally,
>>>       the node is rebooted.
>>>
>>>       It is expected that this problem should not lead to a node reboot.
>>>
>>>       Action: No change in code. This is the limitation of the system
>>>       during headless state.
>>>
>>>
>>> [AVM] Code changes are required: the CPSV CLM integration code needs to
>>> be revisited to handle TRY_AGAIN.
>>>
>>> [NhatPham2] Agreed. The CPND code will be updated to re-initialize CLM
>>> on the TRY_AGAIN error code.
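
The agreed change amounts to a bounded retry around the CLM initialization. Below is a minimal Python sketch of that idea, not the CPND code: init_with_retry and its parameters are hypothetical, the initialize callable stands in for the real saClmInitialize call, and the numeric values are assumed to match saAis.h (31 is the value seen in the "return value:31" log line in this thread).

```python
import time

# Subset of SaAisErrorT (values assumed to match saAis.h).
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_UNAVAILABLE = 31


def init_with_retry(initialize, max_attempts=10, delay=0.1, sleep=time.sleep):
    """Call initialize() until it stops returning TRY_AGAIN.

    Returns (last return code, number of attempts made). Any code other
    than TRY_AGAIN, success or failure, ends the loop immediately.
    """
    rc = SA_AIS_ERR_TRY_AGAIN
    for attempt in range(1, max_attempts + 1):
        rc = initialize()
        if rc != SA_AIS_ERR_TRY_AGAIN:
            return rc, attempt
        sleep(delay)
    return rc, max_attempts
```

Bounding the attempts is a choice made for this sketch. The log quoted further down in this thread shows AMF's component registration timer expiring while CPND is stuck in init, so an unbounded loop would only move the problem.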
>>>
>>>       If you agree with the summary above, I'll update code and send out
>>>       the V3 for review.
>>>
>>>       Best regards,
>>>
>>>       Nhat Pham
>>>
>>>       *From:* Anders Widell [mailto:[email protected]]
>>>       *Sent:* Wednesday, February 24, 2016 9:26 PM
>>>       *To:* Nhat Pham <[email protected]>
>>>       <mailto:[email protected]>; 'A V Mahesh'
>>>       <[email protected]> <mailto:[email protected]>
>>>       *Cc:* [email protected]
>>>       <mailto:[email protected]>; 'Beatriz Brandao'
>>>       <[email protected]>
>>>       <mailto:[email protected]>; 'Minh Chau H'
>>>       <[email protected]> <mailto:[email protected]>
>>>       *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
>>>       preserving and recovering checkpoint replicas during headless
>>>       state V2 [#1621]
>>>
>>>       See my comments inline, marked [AndersW3].
>>>
>>>       regards,
>>>       Anders Widell
>>>
>>>       On 02/24/2016 07:32 AM, Nhat Pham wrote:
>>>
>>>           Hi Mahesh and Anders,
>>>
>>>           Please see my comments below.
>>>
>>>           Best regards,
>>>
>>>           Nhat Pham
>>>
>>>           *From:* A V Mahesh [mailto:[email protected]]
>>>           *Sent:* Wednesday, February 24, 2016 11:06 AM
>>>           *To:* Nhat Pham <[email protected]>
>>>           <mailto:[email protected]>; 'Anders Widell'
>>>           <[email protected]> <mailto:[email protected]>
>>>           *Cc:* [email protected]
>>>           <mailto:[email protected]>; 'Beatriz
>>>           Brandao' <[email protected]>
>>>           <mailto:[email protected]>; 'Minh Chau H'
>>>           <[email protected]> <mailto:[email protected]>
>>>           *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
>>>           preserving and recovering checkpoint replicas during headless
>>>           state V2 [#1621]
>>>
>>>           Hi Nhat Pham,
>>>
>>>           If component ( CPND ) restart is allowed while the Controllers
>>>           are absent, then before asking CLM to change the return value
>>>           to SA_AIS_ERR_TRY_AGAIN,
>>>           we need to get clarification from the AMF guys on a few things,
>>>           because if CPND is on SA_AIS_ERR_TRY_AGAIN and the component
>>>           restart times out,
>>>           then AMF will restart the component again ( this becomes cyclic )
>>>           and after the configured saAmfSGCompRestartMax value the Node
>>>           goes for reboot as the next level of escalation.
>>>           In that case we may require changes in AMF as well, so that it
>>>           does not act on the component restart timeout when the
>>>           Controllers are absent ( i am not sure whether that is a
>>>           deviation from the AMF specification ).
>>>
>>>           */[Nhat Pham] In headless state, I'm not sure about this
>>>           either. /*
>>>
>>>           */@Anders: Would you have comments about this?/*
>>>
>>>       [AndersW3] Ok, first of all I would like to point out that
>>>       normally, the OpenSAF checkpoint node director should not crash.
>>>       So we are talking about a situation where multiple faults have
>>>       occurred: first both the active and the standby system controllers
>>>       have died, and then shortly afterwards - before we have a new
>>>       active system controller - the checkpoint node director also
>>>       crashes. Sure, these may not be totally independent events, but
>>>       still there are a lot of faults that have happened within a short
>>>       period of time. We should test the node director and make sure it
>>>       doesn't crash in this type of scenario.
>>>
>>>       Now, let's consider the case where we have a fault in the node
>>>       director that causes it to crash during the headless state. The
>>>       general philosophy of the headless feature is that when things
>>>       work fine - i.e. in the absence of fault - we should be able to
>>>       continue running while the system controllers are absent. However,
>>>       if a fault happens during the headless state, we may not be able
>>>       to recover from the fault until there is an active system
>>>       controller. AMF does provide support for restarting components,
>>>       but as you have pointed out, the node director will be stuck in a
>>>       TRY_AGAIN loop immediately after it has been restarted. So this
>>>       means that if the node director crashes during the headless state,
>>>       we have lost the checkpoint functionality on that node and we will
>>>       not get it back until there is an active system controller. Other
>>>       services like IMM will still work for a while, but AMF will as you
>>>       say eventually escalate the checkpoint node director failure to a
>>>       node restart and then the whole node is gone. The node will not
>>>       come back until we have an active system controller. So to
>>>       summarize: there is very limited support for recovering from
>>>       faults that happen during the headless state. The full recovery
>>>       will not happen until we have an active system controller.
>>>
>>>           Please do incorporate the current comments ( from a design
>>>           perspective ) and republish the patch. I will re-test the V3
>>>           patch and provide review comments on functional issues/bugs if
>>>           I find any.
>>>
>>>           One important note: in the new patch let us not have any
>>>           complexity of allowing non-collocated checkpoint creation
>>>           and then documenting that in some scenarios
>>>           non-collocated checkpoint replicas are recoverable, because
>>>           a replica is USER private data ( not OpenSAF state ), and
>>>           losing USER private data is not acceptable.
>>>           So let us keep the scope of the CPSV service as non-collocated
>>>           checkpoint creation NOT_SUPPORTED if the cluster is running with
>>>           IMMSV_SC_ABSENCE_ALLOWED ( headless state configuration
>>>           enabled at the time of cluster startup; currently it is not
>>>           configurable, so there is no chance of a run-time configuration
>>>           change ).
>>>
>>>           We can provide support for non-collocated checkpoints in
>>>           subsequent enhancements with a solution such as also creating
>>>           a replica on the PL with the lower node ID
>>>           ( max three replicas in the cluster regardless of
>>>           where the non-collocated checkpoint is opened ).
>>>
>>>           So for now, regardless of whether the heads (SCs) exist or not,
>>>           CPSV should return SA_AIS_ERR_NOT_SUPPORTED in an
>>>           IMMSV_SC_ABSENCE_ALLOWED enabled cluster,
>>>           and let us document it as well.
>>>
>>>           */[Nhat Pham] The patch is to limit losing replicas and
>>>           checkpoints in case of headless state./*
>>>
>>>           */In case both replicas are located on the SCs and they reboot,
>>>           losing the checkpoint is unavoidable with the current design
>>>           after headless state./*
>>>
>>>           */Even if we implement the proposal "/*max three replicas in
>>>           cluster regardless of where non-collocated is opened*/", there
>>>           is still the case where the checkpoint is lost. Ex. The SCs
>>>           and the PL which hosts the replica reboot at the same time./*
>>>
>>>           */In case /*IMMSV_SC_ABSENCE_ALLOWED is disabled, if both SCs
>>>           reboot, this leads to a whole cluster reboot. Then the
>>>           checkpoint is lost.
>>>
>>>           */What I mean is there are cases where the checkpoint is lost.
>>>           The point is what we can do to limit losing data./*
>>>
>>>           */For the proposal of rejecting creation of non-collocated
>>>           checkpoints in case of/* IMMSV_SC_ABSENCE_ALLOWED enabled, I
>>>           think this will lead to a compatibility problem.
>>>
>>>           */@Anders: What do you think about rejecting creation of
>>>           non-collocated checkpoints in case of
>>>           /*IMMSV_SC_ABSENCE_ALLOWED enabled?
>>>
>>>       [AndersW3] No, I think we ought to support non-colocated
>>>       checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact
>>>       that we have "system controllers" is an implementation detail of
>>>       OpenSAF. I don't think the CKPT SAF specification implies that
>>>       non-colocated checkpoints must be fully replicated on all the
>>>       nodes in the cluster, and thus we must have the possibility that
>>>       all replicas are lost. It is not clear exactly what to expect from
>>>       the APIs when this happens, but you could handle it in a similar
>>>       way as the case when all sections have been automatically deleted
>>>       by the checkpoint service because the sections have expired.
>>>
>>>
>>>           -AVM
>>>
>>>           On 2/24/2016 6:51 AM, Nhat Pham wrote:
>>>
>>>               Hi Mahesh,
>>>
>>>               Do you have any further comments?
>>>
>>>               Best regards,
>>>
>>>               Nhat Pham
>>>
>>>               *From:* A V Mahesh [mailto:[email protected]]
>>>               *Sent:* Monday, February 22, 2016 10:37 AM
>>>               *To:* Nhat Pham <[email protected]>
>>>               <mailto:[email protected]>; 'Anders Widell'
>>>               <[email protected]>
>>>               <mailto:[email protected]>
>>>               *Cc:* [email protected]
>>>               <mailto:[email protected]>; 'Beatriz
>>>               Brandao' <[email protected]>
>>>               <mailto:[email protected]>; 'Minh Chau H'
>>>               <[email protected]> <mailto:[email protected]>
>>>               *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv:
>>>               Support preserving and recovering checkpoint replicas
>>>               during headless state V2 [#1621]
>>>
>>>               Hi,
>>>
>>>               >>BTW, have you finished the review and test?
>>>
>>>               I will finish by today.
>>>
>>>               -AVM
>>>
>>>               On 2/22/2016 7:48 AM, Nhat Pham wrote:
>>>
>>>                   Hi Mahesh and Anders,
>>>
>>>                   Please see my comment below.
>>>
>>>                   BTW, have you finished the review and test?
>>>
>>>                   Best regards,
>>>
>>>                   Nhat Pham
>>>
>>>                   *From:* A V Mahesh [mailto:[email protected]]
>>>                   *Sent:* Friday, February 19, 2016 2:28 PM
>>>                   *To:* Nhat Pham <[email protected]>
>>>                   <mailto:[email protected]>; 'Anders Widell'
>>>                   <[email protected]>
>>>                   <mailto:[email protected]>; 'Minh Chau H'
>>>                   <[email protected]>
>>>                   <mailto:[email protected]>
>>>                   *Cc:* [email protected]
>>>                   <mailto:[email protected]>; 'Beatriz
>>>                   Brandao' <[email protected]>
>>>                   <mailto:[email protected]>
>>>                   *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv:
>>>                   Support preserving and recovering checkpoint replicas
>>>                   during headless state V2 [#1621]
>>>
>>>                   Hi Nhat Pham,
>>>
>>>                   On 2/19/2016 12:28 PM, Nhat Pham wrote:
>>>
>>>                       Could you please give more detailed information
>>>                       about steps to reproduce the problem below? Thanks.
>>>
>>>
>>>                   Don't see this as a specific bug; we need to look at
>>>                   the issue from the point of view of a CLM-integrated
>>>                   service. Considering Anders Widell's explanation of CLM
>>>                   application behavior during headless state,
>>>                   we need to reintegrate CPND with CLM ( before this
>>>                   headless state feature there was no case of CPND existing
>>>                   in the absence of CLMD, but now there is ).
>>>
>>>                   And this will be consistent across all services
>>>                   that are integrated with CLM ( you may need some
>>>                   changes in CLM also ).
>>>
>>>                   */[Nhat Pham] I think CLM should return
>>>                   /*SA_AIS_ERR_TRY_AGAIN in this case.
>>>
>>>                   @Anders: What do you think?
>>>
>>>                   To start with, let us consider the case where CPND is
>>>                   restarted on a PL during headless state
>>>                   while an application is running on that PL.
>>>
>>>                   */[Nhat Pham] Regarding the CPND as CLM application,
>>>                   I'm not sure what it can do in this case. In case it
>>>                   restarts, it is monitored by AMF./*
>>>
>>>                   */If it blocks for too long, AMF will also trigger a
>>>                   node reboot./*
>>>
>>>                   */In my test case, the CPND gets blocked by CLM. It
>>>                   never returns from saClmInitialize. How do you get
>>>                   the "/ER cpnd clm init failed with return value:31/"?/*
>>>
>>>                   */Following is the cpnd trace./*
>>>
>>>                   Feb 22  8:56:41.188122 osafckptnd
>>>                   [736:cpnd_init.c:0183] >> cpnd_lib_init
>>>
>>>                   Feb 22  8:56:41.188332 osafckptnd
>>>                   [736:cpnd_init.c:0412] >> cpnd_cb_db_init
>>>
>>>                   Feb 22  8:56:41.188600 osafckptnd
>>>                   [736:cpnd_init.c:0437] << cpnd_cb_db_init
>>>
>>>                   Feb 22  8:56:41.188778 osafckptnd
>>>                   [736:clma_api.c:0503] >> saClmInitialize
>>>
>>>                   Feb 22  8:56:41.188945 osafckptnd
>>>                   [736:clma_api.c:0593] >> clmainitialize
>>>
>>>                   Feb 22  8:56:41.190052 osafckptnd
>>>                   [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
>>>
>>>                   Feb 22  8:56:41.190273 osafckptnd
>>>                   [736:clma_mds.c:1124] >> clma_mds_init
>>>
>>>                   Feb 22  8:56:41.190825 osafckptnd
>>>                   [736:clma_mds.c:1170] << clma_mds_init
>>>
>>>                   -AVM
>>>
>>>                   On 2/19/2016 12:28 PM, Nhat Pham wrote:
>>>
>>>                       Hi Mahesh,
>>>
>>>                       Could you please give more detailed information
>>>                       about steps to reproduce the problem below? Thanks.
>>>
>>>                       Best regards,
>>>
>>>                       Nhat Pham
>>>
>>>                       *From:* A V Mahesh [mailto:[email protected]]
>>>                       *Sent:* Friday, February 19, 2016 1:06 PM
>>>                       *To:* Anders Widell <[email protected]>
>>>                       <mailto:[email protected]>; Nhat Pham
>>>                       <[email protected]>
>>>                       <mailto:[email protected]>; 'Minh Chau H'
>>>                       <[email protected]>
>>>                       <mailto:[email protected]>
>>>                       *Cc:* [email protected]
>>>                       <mailto:[email protected]>;
>>>                       'Beatriz Brandao' <[email protected]>
>>>                       <mailto:[email protected]>
>>>                       *Subject:* Re: [PATCH 0 of 1] Review Request for
>>>                       cpsv: Support preserving and recovering checkpoint
>>>                       replicas during headless state V2 [#1621]
>>>
>>>                       Hi Anders Widell,
>>>                       Thanks for the detailed explanation  about CLM
>>>                       during headless state.
>>>
>>>                       Hi Nhat Pham,
>>>
>>>                       Comment 3:
>>>                       Please see below the problem I was expecting;
>>>                       I am now seeing it during CLMD absence ( during
>>>                       headless state ),
>>>                       so CPND/CLMA now need to address the case below.
>>>                       Currently cpnd clm init fails with return
>>>                       value SA_AIS_ERR_UNAVAILABLE,
>>>                       but it should be SA_AIS_ERR_TRY_AGAIN.
>>>
>>>                       ==================================================
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE
>>>                       STATE-> IMM_NODE_FULLY_AVAILABLE 17418
>>>                       Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending
>>>                       normally
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set
>>>                       to 9 in ImmModel
>>>                       Feb 19 11:18:28 PL-4 cpsv_app: IN Received
>>>                       PROC_STALE_CLIENTS
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>>>                       Implementer connected: 42 (MsgQueueService132111)
>>>                       <108, 2040f>
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>>>                       Implementer connected: 43 (MsgQueueService131855)
>>>                       <0, 2030f>
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>>>                       Implementer connected: 44 (safLogService) <0, 2010f>
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER
>>>                       STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>>>                       Implementer connected: 45 (safClmService) <0, 2010f>
>>>                       *Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd
>>>                       clm init failed with return value:31
>>>                       Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd
>>>                       init failed
>>>                       Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER
>>>                       cpnd_lib_req FAILED
>>>                       Feb 19 11:18:28 PL-4 osafckptnd[7718]:
>>>                       __init_cpnd() failed*
>>>                       Feb 19 11:18:28 PL-4 osafclmna[5432]: NO
>>>                       safNode=PL-4,safCluster=myClmCluster Joined
>>>                       cluster, nodeid=2040f
>>>                       Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD
>>>                       NEW_ACTIVE, adest:1
>>>                       Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending
>>>                       node up due to NCSMDS_NEW_ACTIVE
>>>                       Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU
>>>                       states sent
>>>                       Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU
>>>                       states sent
>>>                       Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP
>>>                       states synced
>>>                       Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU
>>>                       states sent
>>>                       Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
>>>                       Implementer connected: 46 (safAmfService) <0, 2010f>
>>>                       Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO
>>>                       'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component
>>>                       or SU restart probation timer expired
>>>                       Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
>>>                       Instantiation of
>>>                       'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>>>                       failed
>>>                       Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason:
>>>                       component registration timer expired
>>>                       Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA
>>>                       'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>>>                       Presence State RESTARTING => INSTANTIATION_FAILED
>>>                       Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component
>>>                       Failover trigerred for
>>>                       'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed
>>>                       component:
>>>                       'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>>>                       Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER
>>>                       'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got
>>>                       Inst failed
>>>                       Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting
>>>                       OpenSAF NodeId = 132111 EE Name = , Reason: NCS
>>>                       component Instantiation failed, OwnNodeId =
>>>                       132111, SupervisionTime = 60
>>>                       Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting
>>>                       local node; timeout=60
>>>                       Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md:
>>>                       stopping all md devices.
>>>                       ==================================================
>>>
>>>                       -AVM
>>>
>>>                       On 2/15/2016 5:11 PM, Anders Widell wrote:
>>>
>>>                           Hi!
>>>
>>>                           Please find my answer inline, marked [AndersW].
>>>
>>>                           regards,
>>>                           Anders Widell
>>>
>>>                           On 02/15/2016 10:38 AM, Nhat Pham wrote:
>>>
>>>                               Hi Mahesh,
>>>
>>>                               It's good. Thank you. :)
>>>
>>>                               [AVM]  Upon rejoining of the SCs, the
>>>                               replica should be re-created regardless
>>>                               of whether another application opens it on
>>>                               PL4.
>>>                                              ( Note: this comment is
>>>                               based on your explanation; I have not yet
>>>                               reviewed/tested.
>>>                                                 Currently I am
>>>                               struggling with the SCs not rejoining
>>>                               after headless state. I can tell you
>>>                               more on this once I complete my
>>>                               review/testing. )
>>>
>>>                               [Nhat] To make cloud resilience work, you
>>>                               need the patches from other
>>>                               services (log, amf, clm, ntf).
>>>                               @Minh: I heard that you created tar file
>>>                               which includes all patches. Could you
>>>                               please send it to Mahesh? Thanks
>>>
>>>                               [AVM] I understand that. Before I comment
>>>                               more on this, please allow me to
>>>                               understand;
>>>                                             I am still not very
>>>                               clear about the headless design in detail.
>>>                                             For example, the cluster
>>>                               membership of PLs during headless state:
>>>                                              in the absence of the SCs
>>>                               (CLMD), are the PLs considered
>>>                               cluster nodes or not (cluster membership)?
>>>
>>>                               [Nhat] I don't know much about this.
>>>                               @ Anders: Could you please comment
>>>                               on this? Thanks
>>>
>>>                           [AndersW] First of all, keep in mind that the
>>>                           "headless" state should ideally not last a
>>>                           very long time. Once we have the spare SC
>>>                           feature in place (ticket [#79]), a new SC
>>>                           should become active within a matter of a few
>>>                           seconds after we have lost both the active and
>>>                           the standby SC.
>>>
>>>                           I think you should view the state of the
>>>                           cluster in the headless state in the same way
>>>                           as you view the state of the cluster during a
>>>                           failover between the active and the standby
>>>                           SC. Imagine that the active SC dies. It takes
>>>                           the standby SC 1.5 seconds to detect the
>>>                           failure of the active SC (this is due to the
>>>                           TIPC timeout). If you have configured the
>>>                           PROMOTE_ACTIVE_TIMER, there is an additional
>>>                           delay before the standby takes over as active.
>>>                           What is the state of the cluster during the
>>>                           time after the active SC failed and before the
>>>                           standby takes over?
>>>
>>>                           The state of the cluster while it is headless
>>>                           is very similar. The difference is that this
>>>                           state may last a little bit longer (though not
>>>                           more than a few seconds, until one of the
>>>                           spare SCs becomes active). Another difference
>>>                           is that we may have lost some state. With a
>>>                           "perfect" implementation of the headless
>>>                           feature we should not lose any state at all,
>>>                           but with the current set of patches we do lose
>>>                           state.
>>>
>>>                           So specifically if we talk about cluster
>>>                           membership and ask the question: is a
>>>                           particular PL a member of the cluster or not
>>>                           during the headless state? Well, if you ask
>>>                           CLM about this during the headless state, then
>>>                           you will not know - because CLM doesn't
>>>                           provide any service during the headless state.
>>>                           If you keep retrying your query to CLM, you
>>>                           will eventually get an answer - but you will
>>>                           not get this answer until there is an active
>>>                           SC again and we have exited the headless
>>>                           state. When viewed in this way, the answer to
>>>                           the question about a node's membership is
>>>                           undefined during the headless state, since CLM
>>>                           will not provide you with any answer until
>>>                           there is an active SC.
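
The "keep retrying until there is an active SC again" behaviour described above can be sketched as a small retry loop. This is only an illustration under stated assumptions: clm_demo_err_t, clm_demo_query_with_retry() and clm_demo_fake_query() are hypothetical stand-ins invented for this sketch and are not part of the CLM API, and it assumes the service reports a TRY_AGAIN-style code while the cluster is headless.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for AIS return codes (assumption: the real ones are in saAis.h). */
typedef enum {
    CLM_DEMO_OK,
    CLM_DEMO_ERR_TRY_AGAIN,
    CLM_DEMO_ERR_OTHER
} clm_demo_err_t;

/* Hypothetical query: reports TRY_AGAIN while no SC is active. */
typedef clm_demo_err_t (*clm_demo_query_fn)(void *ctx, int *is_member);

/* Repeat the query until it stops reporting TRY_AGAIN or attempts run out.
 * A real caller would sleep between attempts; omitted to keep this short. */
clm_demo_err_t clm_demo_query_with_retry(clm_demo_query_fn query, void *ctx,
                                         int max_attempts, int *is_member)
{
    assert(query != NULL && is_member != NULL && max_attempts > 0);
    clm_demo_err_t rc = CLM_DEMO_ERR_TRY_AGAIN;
    for (int attempt = 0;
         attempt < max_attempts && rc == CLM_DEMO_ERR_TRY_AGAIN;
         attempt++) {
        rc = query(ctx, is_member);
    }
    return rc;
}

/* Fake service: "headless" for the first *ctx calls, then answers. */
clm_demo_err_t clm_demo_fake_query(void *ctx, int *is_member)
{
    int *headless_calls_left = ctx;
    if (*headless_calls_left > 0) {
        (*headless_calls_left)--;
        return CLM_DEMO_ERR_TRY_AGAIN;
    }
    *is_member = 1;
    return CLM_DEMO_OK;
}
```

While the fake service is "headless" the caller learns nothing about membership; the answer only arrives once the service responds again, which mirrors the "undefined during the headless state" view.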
>>>
>>>                           However, if you asked CLM about the node's
>>>                           cluster membership status before the cluster
>>>                           went headless, you probably saved a cached
>>>                           copy of the cluster membership state. Maybe
>>>                           you also installed a CLM track callback and
>>>                           intend to update this cached copy every time
>>>                           the cluster membership status changes. The
>>>                           question then is: can you continue using this
>>>                           cached copy of the cluster membership state
>>>                           during the headless state? The answer is YES:
>>>                           since CLM doesn't provide any service during
>>>                           the headless state, it also means that the
>>>                           cluster membership view cannot change during
>>>                           this time. Nodes can of course reboot or die,
>>>                           but CLM will not notice and hence the cluster
>>>                           view will not be updated. You can argue that
>>>                           this is bad because the cluster view doesn't
>>>                           reflect reality, but notice that this will
>>>                           always be the case. We can never propagate
>>>                           information instantaneously, and detection of
>>>                           node failures will take 1.5 seconds due to the
>>>                           TIPC timeout. You can never be sure that a
>>>                           node is alive at this very moment just because
>>>                           CLM tells you that it is a member of the
>>>                           cluster. If we are unfortunate enough to lose
>>>                           both system controller nodes simultaneously,
>>>                           updates to the cluster membership view will be
>>>                           delayed a few seconds longer than usual.
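
The cached-copy idea above can be illustrated with a tiny local cache that is only modified when a track update is delivered. This is a sketch under assumptions: the type and function names (track_demo_cache_t, track_demo_on_update(), track_demo_is_member()) are hypothetical, and the wiring to a real CLM track callback is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRACK_DEMO_MAX_NODES 8

/* Hypothetical local copy of the last membership view delivered by CLM. */
typedef struct {
    bool member[TRACK_DEMO_MAX_NODES];
    unsigned long updates_seen; /* bumped on every delivered update */
} track_demo_cache_t;

/* Would be invoked from a CLM track callback (assumption). During the
 * headless state no callback arrives, so the cache simply stays unchanged. */
void track_demo_on_update(track_demo_cache_t *cache, size_t node, bool is_member)
{
    assert(cache != NULL && node < TRACK_DEMO_MAX_NODES);
    cache->member[node] = is_member;
    cache->updates_seen++;
}

/* Reads only local data, so it works whether or not an SC is active. */
bool track_demo_is_member(const track_demo_cache_t *cache, size_t node)
{
    assert(cache != NULL && node < TRACK_DEMO_MAX_NODES);
    return cache->member[node];
}
```

The lookup never contacts CLM, which is why the cached view remains usable (if possibly stale) while the cluster is headless.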
>>>
>>>
>>>                               Best regards,
>>>                               Nhat Pham
>>>
>>>                               -----Original Message-----
>>>                               From: A V Mahesh
>>>                               [mailto:[email protected]]
>>>                               Sent: Monday, February 15, 2016 11:19 AM
>>>                               To: Nhat Pham <[email protected]>;
>>>                               [email protected]
>>>                               Cc: [email protected];
>>>                               'Beatriz Brandao' <[email protected]>
>>>                               Subject: Re: [PATCH 0 of 1] Review Request
>>>                               for cpsv: Support preserving and
>>>                               recovering checkpoint replicas during
>>>                               headless state V2 [#1621]
>>>
>>>                               Hi Nhat Pham,
>>>
>>>                               How did your holiday go?
>>>
>>>                               Please find my comments below
>>>
>>>                               On 2/15/2016 8:43 AM, Nhat Pham wrote:
>>>
>>>                                   Hi Mahesh,
>>>
>>>                                   For the comment 1, the patch will be
>>>                                   updated accordingly.
>>>
>>>                               [AVM] Please hold; I will provide more
>>>                               comments this week, so we can
>>>                               have a consolidated V3.
>>>
>>>                                   For the comment 2, I think the CKPT
>>>                                   service will not be backward
>>>                                   compatible if the scAbsenceAllowed is
>>>                                   true.
>>>                                   The client can't create non-collocated
>>>                                   checkpoint on SCs.
>>>
>>>                                   Furthermore, this solution only
>>>                                   protects the CKPT service from the
>>>                                   case "The non-collocated checkpoint is
>>>                                   created on a SC".
>>>                                   There are still cases where the
>>>                                   replicas are completely lost. For example:
>>>
>>>                                   - The non-collocated checkpoint is
>>>                                   created on a PL. The PL reboots. Both
>>>                                   replicas are now located on SCs. Then
>>>                                   the headless state happens. All replicas
>>>                                   are lost.
>>>                                   - The non-collocated checkpoint has its
>>>                                   active replica located on a PL,
>>>                                   and this PL restarts during headless
>>>                                   state.
>>>                                   - The non-collocated checkpoint is
>>>                                   created on PL3. This checkpoint is
>>>                                   also opened on PL4. Then the SCs and PL3
>>>                                   reboot.
>>>
>>>                               [AVM] Upon rejoining of the SCs, the
>>>                               replica should be re-created regardless
>>>                               of another application opening it on PL4.
>>>                               (Note: this comment is based on your
>>>                               explanation; I have not yet
>>>                               reviewed/tested it. Currently I am
>>>                               struggling with SCs not rejoining
>>>                               after headless state; I can provide you
>>>                               more on this once I complete my
>>>                               review/testing.)
>>>
>>>                                   In this case, all replicas are lost
>>>                                   and the client has to create it again.
>>>
>>>                                   In case multiple nodes (including
>>>                                   SCs) reboot, losing replicas
>>>                                   is unpreventable. The patch is meant to
>>>                                   recover the checkpoints in the cases
>>>                                   where that is possible.
>>>                                   What do you think?
>>>
>>>                               [AVM] I understand that. Before I comment
>>>                               more on this, please allow me to
>>>                               understand it; I am still not very
>>>                               clear on the headless design in detail.
>>>
>>>                               For example, consider the cluster
>>>                               membership of PLs during headless state:
>>>                               in the absence of SCs (CLMD), are the
>>>                               PLs considered cluster nodes or not
>>>                               (cluster membership)?
>>>
>>>                               - If they are not considered cluster
>>>                                 nodes, the Checkpoint Service API
>>>                                 should leverage the SA Forum Cluster
>>>                                 Membership Service, and APIs can fail
>>>                                 with SA_AIS_ERR_UNAVAILABLE.
>>>
>>>                               - If they are considered cluster nodes,
>>>                                 we need to follow all the rules
>>>                                 defined in the SAI-AIS-CKPT-B.02.02
>>>                                 specification.
>>>
>>>                               So give me some more time to review it
>>>                               completely, so that we can have a
>>>                               consolidated patch V3.
>>>
>>>                               -AVM
>>>
>>>                                   Best regards,
>>>                                   Nhat Pham
>>>
>>>                                   -----Original Message-----
>>>                                   From: A V Mahesh
>>>                                   [mailto:[email protected]]
>>>                                   Sent: Friday, February 12, 2016 11:10 AM
>>>                                   To: Nhat Pham <[email protected]>;
>>>                                   [email protected]
>>>                                   Cc: [email protected];
>>>                                   Beatriz Brandao <[email protected]>
>>>                                   Subject: Re: [PATCH 0 of 1] Review
>>>                                   Request for cpsv: Support
>>>                                   preserving and recovering checkpoint
>>>                                   replicas during headless state V2
>>>                                   [#1621]
>>>
>>>
>>>                                   Comment 2 :
>>>
>>>                                   After incorporating comment 1, all the
>>>                                   limitations should be prevented whenever
>>>                                   the Hydra configuration is enabled in IMM.
>>>
>>>                                   For example: if an application tries to
>>>                                   create a non-collocated checkpoint whose
>>>                                   active replica would be generated/located
>>>                                   on an SC, then regardless of whether the
>>>                                   heads (SCs) exist or not, the call should
>>>                                   return SA_AIS_ERR_NOT_SUPPORTED.
>>>
>>>                                   In other words, do this rather than
>>>                                   allowing a non-collocated checkpoint to
>>>                                   be created while the heads (SCs) exist
>>>                                   and having it become unrecoverable after
>>>                                   the heads (SCs) rejoin.
>>>
>>>
>>>                                   =============================================
>>>
>>>                                       Limitation: The CKPT service
>>>                                       doesn't support recovering
>>>                                       checkpoints in the following cases:
>>>                                       . The checkpoint which is
>>>                                         unlinked before headless.
>>>                                       . The non-collocated checkpoint
>>>                                         has active replica locating on SC.
>>>                                       . The non-collocated checkpoint
>>>                                         has active replica locating on a
>>>                                         PL and this PL restarts during
>>>                                         headless state.
>>>                                       In these cases, the checkpoint
>>>                                       replica is destroyed. The fault code
>>>                                       SA_AIS_ERR_BAD_HANDLE is returned
>>>                                       when the client accesses the
>>>                                       checkpoint in these cases. The
>>>                                       client must re-open the checkpoint.
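
The client-side handling implied by this limitation (bad handle on access, then re-open) might look roughly like the sketch below. It is an illustration only: ckpt_demo_err_t, ckpt_demo_ops_t, ckpt_demo_write_with_reopen() and the fake service are hypothetical names invented here, standing in for the real CKPT calls, and the single-retry policy is an assumption rather than anything the patch prescribes.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for AIS return codes (assumption: the real ones are in saAis.h). */
typedef enum {
    CKPT_DEMO_OK,
    CKPT_DEMO_ERR_TRY_AGAIN,
    CKPT_DEMO_ERR_BAD_HANDLE
} ckpt_demo_err_t;

/* Hypothetical client-side operations standing in for the real CKPT calls. */
typedef struct {
    ckpt_demo_err_t (*open)(void *svc, int *ckpt_handle);
    ckpt_demo_err_t (*write)(void *svc, int ckpt_handle, const char *data);
} ckpt_demo_ops_t;

/* Write once; if the checkpoint handle is reported bad, re-open the
 * checkpoint through the still-valid service context and retry once. */
ckpt_demo_err_t ckpt_demo_write_with_reopen(const ckpt_demo_ops_t *ops, void *svc,
                                            int *ckpt_handle, const char *data)
{
    assert(ops != NULL && ckpt_handle != NULL && data != NULL);
    ckpt_demo_err_t rc = ops->write(svc, *ckpt_handle, data);
    if (rc != CKPT_DEMO_ERR_BAD_HANDLE)
        return rc;
    rc = ops->open(svc, ckpt_handle);   /* replica was destroyed: re-open */
    if (rc != CKPT_DEMO_OK)
        return rc;
    return ops->write(svc, *ckpt_handle, data);
}

/* Fake service used only to exercise the sketch. */
typedef struct { int valid_handle; int writes; } ckpt_demo_fake_svc_t;

ckpt_demo_err_t ckpt_demo_fake_open(void *svc, int *ckpt_handle)
{
    ckpt_demo_fake_svc_t *s = svc;
    s->valid_handle++;                  /* older handles become stale */
    *ckpt_handle = s->valid_handle;
    return CKPT_DEMO_OK;
}

ckpt_demo_err_t ckpt_demo_fake_write(void *svc, int ckpt_handle, const char *data)
{
    ckpt_demo_fake_svc_t *s = svc;
    (void)data;
    if (ckpt_handle != s->valid_handle)
        return CKPT_DEMO_ERR_BAD_HANDLE;
    s->writes++;
    return CKPT_DEMO_OK;
}
```

Note that re-opening yields a fresh replica; data written before the replica was destroyed is not recovered by this wrapper.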
>>>
>>>
>>>                                   =============================================
>>>
>>>                                   -AVM
>>>
>>>
>>>                                   On 2/11/2016 12:52 PM, A V Mahesh wrote:
>>>
>>>                                       Hi,
>>>
>>>                                       I just started reviewing the patch; I
>>>                                       will give comments as soon as I come
>>>                                       across any, to save some time.
>>>
>>>                                       Comment 1 :
>>>                                       This functionality should be
>>>                                       guarded by a check that the Hydra
>>>                                       configuration is enabled in IMM:
>>>                                       attrName =
>>>                                       const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>>>
>>>                                       Please see how the LOG/AMF
>>>                                       services implemented it, for example.
>>>
>>>                                       -AVM
>>>
>>>
>>>                                       On 1/29/2016 1:02 PM, Nhat Pham
>>>                                       wrote:
>>>
>>>                                           Hi Mahesh,
>>>
>>>                                           As described in the README,
>>>                                           the CKPT service returns
>>>                                           SA_AIS_ERR_TRY_AGAIN fault
>>>                                           code in this case.
>>>                                           I guess it's the same for other
>>>                                           services.
>>>
>>>                                           @Anders: Could you please
>>>                                           confirm this?
>>>
>>>                                           Best regards,
>>>                                           Nhat Pham
>>>
>>>                                           -----Original Message-----
>>>                                           From: A V Mahesh
>>>                                           [mailto:[email protected]]
>>>                                           Sent: Friday, January 29, 2016
>>>                                           2:11 PM
>>>                                           To: Nhat Pham <[email protected]>;
>>>                                           [email protected]
>>>                                           Cc: [email protected]
>>>                                           Subject: Re: [PATCH 0 of 1]
>>>                                           Review Request for cpsv: Support
>>>                                           preserving and recovering
>>>                                           checkpoint replicas during
>>>                                           headless state
>>>                                           V2 [#1621]
>>>
>>>                                           Hi,
>>>
>>>                                           On 1/29/2016 11:45 AM, Nhat
>>>                                           Pham wrote:
>>>
>>>                                                     - The behavior of
>>>                                               application will be
>>>                                               consistent with other
>>>                                               saf services like imm/amf
>>>                                               behavior  during headless
>>>                                               state.
>>>                                               [Nhat] I'm not clear what
>>>                                               you mean about "consistent"?
>>>
>>>                                           In the absence of the Directors
>>>                                           (SCs), what are the expected
>>>                                           return values of the SAF APIs
>>>                                           (for all services) that are not
>>>                                           in a position to provide service
>>>                                           at that moment?
>>>
>>>                                           I think all services should
>>>                                           return the same SAF errors. I think
>>>                                           currently we don't have that;
>>>                                           maybe Anders Widell will
>>>                                           help us.
>>>
>>>                                           -AVM
>>>
>>>
>>>                                           On 1/29/2016 11:45 AM, Nhat
>>>                                           Pham wrote:
>>>
>>>                                               Hi Mahesh,
>>>
>>>                                               Please see the attachment
>>>                                               for the README. Let me
>>>                                               know if there is
>>>                                               any more information
>>>                                               required.
>>>
>>>                                               Regarding your comments:
>>>                                                     -  during headless
>>>                                               state  applications may
>>>                                               behave like during
>>>                                               CPND restart case [Nhat]
>>>                                               Headless state and CPND
>>>                                               restart are
>>>                                               different events. Thus,
>>>                                               the behavior is different.
>>>                                               Headless state is a case
>>>                                               where both SCs go down.
>>>
>>>                                                     -  The behavior of
>>>                                               application will be
>>>                                               consistent with other
>>>                                               saf services like imm/amf
>>>                                               behavior  during headless
>>>                                               state.
>>>                                               [Nhat] I'm not clear what
>>>                                               you mean about "consistent"?
>>>
>>>                                               Best regards,
>>>                                               Nhat Pham
>>>
>>>                                               -----Original Message-----
>>>                                               From: A V Mahesh
>>>                                               [mailto:[email protected]]
>>>                                               Sent: Friday, January 29,
>>>                                               2016 11:12 AM
>>>                                               To: Nhat Pham <[email protected]>;
>>>                                               [email protected]
>>>                                               Cc: [email protected]
>>>                                               Subject: Re: [PATCH 0 of
>>>                                               1] Review Request for
>>>                                               cpsv: Support
>>>                                               preserving and recovering
>>>                                               checkpoint replicas during
>>>                                               headless state
>>>                                               V2 [#1621]
>>>
>>>                                               Hi Nhat Pham,
>>>
>>>                                               I started reviewing this
>>>                                               patch, so can you please
>>>                                               provide a README file
>>>                                               with scope and limitations?
>>>                                               That will help to define the
>>>                                               testing/reviewing scope.
>>>
>>>                                               The following are the minimum
>>>                                               things we can keep in mind
>>>                                               while reviewing/accepting the
>>>                                               patch:
>>>
>>>                                               - Not affecting existing
>>>                                               functionality
>>>                                                     -  During headless
>>>                                               state, applications may
>>>                                               behave as they do in the
>>>                                               CPND restart case
>>>                                                     -  The minimum
>>>                                               functionality of the
>>>                                               application works
>>>                                                     -  The behavior of the
>>>                                               application will be
>>>                                               consistent with other
>>>                                               SAF services such as imm/amf
>>>                                               during headless state.
>>>
>>>                                               So please provide additional
>>>                                               details in the README if the
>>>                                               patch deviates from any of
>>>                                               the above, so that users know
>>>                                               about the
>>>                                               limitations/deviations.
>>>
>>>                                               -AVM
>>>
>>>                                               On 1/4/2016 3:15 PM, Nhat
>>>                                               Pham wrote:
>>>
>>>                                                   Summary: cpsv: Support preserving and
>>>                                                   recovering checkpoint replicas during
>>>                                                   headless state [#1621]
>>>                                                   Review request for Trac Ticket(s): #1621
>>>                                                   Peer Reviewer(s): [email protected];
>>>                                                   [email protected]
>>>                                                   Pull request to: [email protected]
>>>                                                   Affected branch(es): default
>>>                                                   Development branch: default
>>>
>>>
>>>                                                   --------------------------------
>>>                                                   Impacted area       Impact y/n
>>>                                                   --------------------------------
>>>                                                    Docs                    n
>>>                                                    Build system            n
>>>                                                    RPM/packaging           n
>>>                                                    Configuration files     n
>>>                                                    Startup scripts         n
>>>                                                    SAF services            y
>>>                                                    OpenSAF services        n
>>>                                                    Core libraries          n
>>>                                                    Samples                 n
>>>                                                    Tests                   n
>>>                                                    Other                   n
>>>
>>>                                                   Comments (indicate scope for each "y"
>>>                                                   above):
>>>                                                   ---------------------------------------------
>>>
>>>                                                   changeset
>>>                                                   faec4a4445a4c23e8f630857b19aabb43b5af18d
>>>                                                   Author:    Nhat Pham <[email protected]>
>>>                                                   Date:    Mon, 04 Jan 2016 16:34:33 +0700
>>>
>>>                                                         cpsv: Support
>>>                                                   preserving and
>>>                                                   recovering checkpoint
>>>                                                   replicas
>>>                                                   during headless state
>>>                                                   [#1621]
>>>
>>> Background:
>>> ----------
>>> This enhancement preserves checkpoint replicas when both SCs are down
>>> (headless state) and recovers the replicas when one of the SCs comes up
>>> again. If both SCs go down, checkpoint replicas on surviving nodes still
>>> remain. When an SC is available again, surviving replicas are
>>> automatically registered to the SC checkpoint database. Content in
>>> surviving replicas is intact and is synchronized to new replicas.
>>>
>>> When no SC is available, client API calls that change the checkpoint
>>> configuration, which requires SC communication, are rejected. Client API
>>> calls that read and write existing checkpoint replicas still work.
>>>
>>> Limitation: The CKPT service does not support recovering checkpoints in
>>> the following cases:
>>>  - The checkpoint was unlinked before the headless state.
>>>  - The non-collocated checkpoint has its active replica located on an SC.
>>>  - The non-collocated checkpoint has its active replica located on a PL
>>>    and this PL restarts during the headless state.
>>> In these cases, the checkpoint replica is destroyed. The fault code
>>> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
>>> checkpoint. The client must re-open the checkpoint.
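
For application writers, the recovery rule in the limitation text could be sketched roughly as below. This is only an illustration under stated assumptions: the `demo_*` names and status values are local stand-ins (not the definitions from saAis.h), the re-open callback is hypothetical and stands where a real client would call saCkptCheckpointOpen(), and the retry count is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the status codes named in this thread.
 * These are demo values, not the ones defined in saAis.h. */
typedef enum {
    DEMO_OK,
    DEMO_ERR_TRY_AGAIN,
    DEMO_ERR_BAD_HANDLE
} demo_status_t;

/* Hypothetical callback that re-opens the checkpoint. A real client
 * would call saCkptCheckpointOpen() from inside it. */
typedef demo_status_t (*demo_reopen_fn)(void *ctx);

/* Decide what to do after a checkpoint access returned rc.
 * On BAD_HANDLE the checkpoint is assumed destroyed, so it is re-opened,
 * retrying while the service answers TRY_AGAIN (for example, still
 * headless). Any other rc is passed through unchanged.
 * Returns the last status seen after at most max_retries attempts. */
demo_status_t demo_recover(demo_status_t rc, demo_reopen_fn reopen,
                           void *ctx, unsigned max_retries)
{
    assert(reopen != NULL);
    if (rc != DEMO_ERR_BAD_HANDLE)
        return rc; /* nothing to recover */

    for (unsigned i = 0; i < max_retries; i++) {
        rc = reopen(ctx);
        if (rc != DEMO_ERR_TRY_AGAIN)
            return rc; /* success or a hard error */
        /* A real client would sleep or back off here before retrying. */
    }
    return rc;
}

/* Demo re-open: answers TRY_AGAIN until the counter in ctx reaches zero. */
demo_status_t demo_reopen_after(void *ctx)
{
    unsigned *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return DEMO_ERR_TRY_AGAIN;
    }
    return DEMO_OK;
}
```

Calling `demo_recover(DEMO_ERR_BAD_HANDLE, demo_reopen_after, &n, 5)` with `n = 2` re-opens successfully on the third attempt. The SaCkptHandleT itself is not touched here, since only the checkpoint handle is reported as invalid.
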
>>>
>>> While in headless state, accessing checkpoint replicas does not work if
>>> the node which hosts the active replica goes down. It works again when
>>> an SC is available again.
>>>
>>> Solution:
>>> ---------
>>> The solution for this enhancement includes 2 parts:
>>>
>>> 1. To destroy the un-recoverable checkpoints described above when both
>>>    SCs are down: When both SCs are down, the CPND deletes un-recoverable
>>>    checkpoint nodes and replicas on PLs. Then it requests the CPA to
>>>    destroy the corresponding checkpoint node by using the new message
>>>    CPA_EVT_ND2A_CKPT_DESTROY.
>>>
>>> 2. To update the CPD with checkpoint information: When an active SC is
>>>    up after headless, the CPND updates the CPD with checkpoint
>>>    information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE
>>>    instead of CPD_EVT_ND2D_CKPT_CREATE. This is because, if
>>>    CPD_EVT_ND2D_CKPT_CREATE were used, the CPND would create a new
>>>    ckpt_id for the checkpoint, which might differ from the current
>>>    ckpt_id. The CPD collects checkpoint information within 6s. During
>>>    this updating time, the following requests are rejected with fault
>>>    code SA_AIS_ERR_TRY_AGAIN:
>>>    - CPD_EVT_ND2D_CKPT_CREATE
>>>    - CPD_EVT_ND2D_CKPT_UNLINK
>>>    - CPD_EVT_ND2D_ACTIVE_SET
>>>    - CPD_EVT_ND2D_CKPT_RDSET
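
The rejection rule during the update window could be sketched as below. This is a rough illustration and not the code from the patch: the `demo_*` enum is a local stand-in whose values do not match cpsv_evt.h, the function name is hypothetical, and treating the 6s window as half-open (second 6 is already outside) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the event names listed above. The numeric values
 * are arbitrary and do not match cpsv_evt.h. */
typedef enum {
    DEMO_EVT_CKPT_CREATE,
    DEMO_EVT_CKPT_UNLINK,
    DEMO_EVT_ACTIVE_SET,
    DEMO_EVT_CKPT_RDSET,
    DEMO_EVT_CKPT_INFO_UPDATE
} demo_evt_t;

/* Collection time quoted in the patch description. */
#define DEMO_UPDATE_WINDOW_SEC 6

/* Returns true when the request should be answered with TRY_AGAIN:
 * the CPD is still collecting checkpoint information and the event is
 * one of the four configuration-changing requests. */
bool demo_must_try_again(demo_evt_t evt, long now_sec, long window_start_sec)
{
    assert(now_sec >= window_start_sec);

    /* Assumption: the window is [start, start + 6s). */
    bool in_window = (now_sec - window_start_sec) < DEMO_UPDATE_WINDOW_SEC;
    if (!in_window)
        return false;

    switch (evt) {
    case DEMO_EVT_CKPT_CREATE:
    case DEMO_EVT_CKPT_UNLINK:
    case DEMO_EVT_ACTIVE_SET:
    case DEMO_EVT_CKPT_RDSET:
        return true;
    default:
        /* Info updates from the CPNDs must get through during the window. */
        return false;
    }
}
```

With a window starting at second 0, a create request at second 3 is rejected, the same request at second 6 is accepted, and an info update is accepted at any time.
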
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>  osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 +++++++++++++++++++++++++++++++++++
>>>  osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++++++++++
>>>  osaf/libs/common/cpsv/include/cpd_cb.h    |    3 ++
>>>  osaf/libs/common/cpsv/include/cpd_imm.h   |    1 +
>>>  osaf/libs/common/cpsv/include/cpd_proc.h  |    7 ++++
>>>  osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>>>  osaf/libs/common/cpsv/include/cpnd_cb.h   |    1 +
>>>  osaf/libs/common/cpsv/include/cpnd_init.h |    2 +
>>>  osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 +++++++++++++
>>>  osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>>>  osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  osaf/services/saf/cpsv/cpd/cpd_imm.c      |  112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  osaf/services/saf/cpsv/cpd/cpd_init.c     |   20 ++++++++++++-
>>>  osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 ++++
>>>  osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++++
>>>  osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   22 +++++++++++++++
>>>  osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   23 ++++++++++++++-
>>>  osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++++
>>>  osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  314 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>  20 files changed, 1189 insertions(+), 11 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> -
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> -
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>>
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      n          n
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>     that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>     Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>     like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>     cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>     too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>     Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>     commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>     of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>     comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time; confusing the
>>>     threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>     for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>>     do not contain the patch that updates the Doxygen manual.
>>>
>>>
>> ------------------------------------------------------------------------------
>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>> Monitor end-to-end web transactions and take corrective actions now
>> Troubleshoot faster and improve end-user experience. Signup Now!
>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>> _______________________________________________
>> Opensaf-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>


