See my comments marked [AndersW6]. regards, Anders Widell
On 02/26/2016 10:31 AM, Nhat Pham wrote: > > Hi Mahesh, > > Please see my comment below with [NhatPham6] > > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 26, 2016 12:31 PM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]> > *Cc:* [email protected]; 'Beatriz Brandao' > <[email protected]>; 'Minh Chau H' <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless state V2 > [#1621] > > Hi Nhat Pham, > > Please find my answers. > > -AVM > > On 2/26/2016 10:23 AM, Nhat Pham wrote: > > Hi Mahesh, > > Please see my answers below with [NhatPham5] > > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 26, 2016 11:17 AM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]> > *Cc:* [email protected]; 'Beatriz Brandao' > <[email protected]>; 'Minh Chau H' > <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless > state V2 [#1621] > > Hi Nhat Pham, > > >>[NhatPham4] To be more correct, the application will get > SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint > because all data was destroyed. > >>[AndersW4] If this is a problem we could re-create the > checkpoint with no sections in it. > > I also came across this approach: instead of destroying the > checkpoint information from the payload CPNDs (as the current patch does) and > returning SA_AIS_ERR_BAD_HANDLE to applications on the PLs, > the new patch V3 could check the possibility of re-creating > the checkpoint with its sections (this data can be sent from the PL to the > SC once the CPD comes up).
> > [NhatPham5] By "all data" here I mean the checkpoint node > information in the database controlled by cpnd (not the replica). In > this case, all replicas were lost. How can the checkpoint be > re-created with sections? > > [AVM] I know the replicas are lost; I am suggesting using the > `checkpoint node information` available at the PL CPND (only one CPND > will volunteer for this if multiple applications have it open) > to try to recreate the checkpoint as if a fresh > request had come from the CPA (the CPD assumes the request came all the way from > CPA-->CPND-->CPD, but it did not), > so that the CPD will create new replicas with clean/empty sections > instead of asking the application to recreate them. > > [Nhat Pham] I think it would be simpler and safer for the application to > re-create the checkpoint in this case. > > I checked the implementation. The cpnd which doesn't host the replica > doesn't maintain a section database. Thus, it can't restore the checkpoint > with sections. > [AndersW6] My suggestion was to re-create the checkpoint without any sections. If the sections were re-created, the application wouldn't know that data has been lost. I think the BAD_HANDLE approach is okay since we have used it in other services, but I see it as something of a hack solution that is not really in line with the specs. The specs never intended BAD_HANDLE to be something that can happen spontaneously on a previously valid handle, unless you are suffering from memory corruption. In the future we could consider the feasibility of avoiding spontaneous BAD_HANDLE where possible, and in CKPT I think it may be possible by re-creating the checkpoints. > > > I think the LOG streams are getting recreated > empty/fresh/with no data like this, using the data available at the LGA. > > For other cases, where the checkpoint replicas survive, the > checkpoint is restored when the SC is up again. > > Ex: A checkpoint is created on a PL. There are 3 replicas, created > on the SCs and the PL. The headless state happens.
After the SC is up, the > checkpoint is recovered. > > > > -AVM > > On 2/26/2016 8:11 AM, Nhat Pham wrote: > > Hi, > > Please see my comment below with [NhatPham4] > > Best regards, > > Nhat Pham > > *From:* Anders Widell [mailto:[email protected]] > *Sent:* Thursday, February 25, 2016 9:25 PM > *To:* A V Mahesh <[email protected]>; Nhat Pham > <[email protected]> > *Cc:* [email protected]; 'Beatriz > Brandao' <[email protected]>; 'Minh Chau H' > <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless > state V2 [#1621] > > Hi! > > See my comments inline, marked [AndersW4]. > > regards, > Anders Widell > > On 02/25/2016 05:26 AM, A V Mahesh wrote: > > Hi Nhat Pham, > > Please see my comment below. > > -AVM > > On 2/25/2016 7:54 AM, Nhat Pham wrote: > > Hi Mahesh, > > Would you agree with the comment below? > > To summarize, the following are the comments so far: > > *Comment 1*: This functionality should be guarded by a check of > whether the Hydra configuration is enabled in IMM (attrName = > const_cast<SaImmAttrNameT>("scAbsenceAllowed")). > > Action: The code will be updated accordingly. > > [AndersW4] Just a question here: is this really needed? If the > code is already 100% backwards compatible when the headless > feature is disabled, what would be the point of reading the > configuration and taking different paths in the code depending > on it? Maybe the code is not 100% backwards compatible, and > in that case I agree that we need to read the configuration.
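For what it's worth, the check under discussion is small either way: cpnd would read the attribute once at start-up and branch when the last SC disappears. A minimal sketch of that gate, with the IMM read stubbed out (in the real code this would come from the IMM OM accessor API reading scAbsenceAllowed; the struct and helper names here are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stub: the real cpnd would fetch the "scAbsenceAllowed" attribute via the
 * IMM OM accessor API (saImmOmAccessorGet_2). Hard-wired here so the
 * sketch is self-contained; 0 means the headless feature is disabled. */
static uint32_t imm_get_sc_absence_allowed(void) { return 900; }

struct cpnd_state {
    bool sc_absence_allowed;  /* configuration, cached at start-up */
    bool replicas_preserved;  /* what we did when the last SC went down */
};

/* Read the configuration once during cpnd initialization. */
static void cpnd_read_config(struct cpnd_state *cb) {
    cb->sc_absence_allowed = (imm_get_sc_absence_allowed() != 0);
}

/* Called when the last system controller disappears: either keep the
 * replica data for recovery when an SC returns (headless enabled), or
 * destroy it as the original implementation does (headless disabled). */
static void cpnd_on_scs_down(struct cpnd_state *cb) {
    cb->replicas_preserved = cb->sc_absence_allowed;
}
```

As noted in the discussion, when scAbsenceAllowed is 0 the loss of both SCs triggers a cluster reboot anyway, so the disabled branch mostly documents intent rather than changing observable behaviour.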
> > The reason why I am asking is that I had the impression that > the code would only cause different behaviour in the cases > where both system controllers die at the same time, and this > cannot happen when the headless feature is disabled (or > rather: it can happen, but it would trigger an immediate > cluster restart, so any difference in behaviour after that > point is irrelevant). > > [NhatPham4] The code is backwards compatible when the headless > feature is disabled. > > With the V2 patch, cpnd will update cpd with recoverable checkpoint > data when an SC comes up after the headless state (from an implementation > point of view). > > In the current system, if the headless feature is disabled, the whole > cluster reboots. Thus all data is destroyed. > > With the V2 patch plus the scAbsenceAllowed check, cpnd destroys all > the checkpoint data (as in the original implementation). (From an > implementation point of view.) > > In the current system, if the headless feature is disabled, the whole > cluster reboots. Thus all data is destroyed. > > So if you ask whether the check is really needed in the current > situation, the answer is: not really. > > The check is just to make sure that all checkpoint data is > destroyed in case the headless feature is disabled. > > What do you think? > > *Comment 2*: To keep the scope of the CPSV service such that > non-collocated checkpoint creation is NOT_SUPPORTED if the > cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the > headless-state configuration is enabled at the time of > cluster startup; currently it is not configurable at run time, so > there is no chance of a run-time configuration change). > > Action: No change in code. The CPSV still keeps > supporting non-collocated checkpoints even if > IMMSV_SC_ABSENCE_ALLOWED is enabled. > > >>[AndersW3] No, I think we ought to support > non-colocated checkpoints also when > IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have > "system controllers" is an implementation detail of > OpenSAF.
I don't think the CKPT SAF specification implies > that > >>non-colocated checkpoints must be fully replicated on > all the nodes in the cluster, and thus we must allow for the > possibility that all replicas are lost. It is not clear > exactly what to expect from the APIs when this happens, > but you could handle it in a similar way as the case >> > when all sections have been automatically deleted by the > checkpoint service because the sections have expired. > > [AVM] I am not in agreement with either comment; we cannot > handle this in a way similar to the section-expiration case > here. In the case of section expiration, the checkpoint replica > still exists; only the section is deleted. > > [AndersW4] If this is a problem we could re-create the > checkpoint with no sections in it. > > > The CPSV specification says that if two replicas exist > (in our case, only on the SCs) at a certain point in time, > and the nodes hosting both of these replicas are > administratively taken out of service, the > Checkpoint Service should allocate another replica on > another node while this node is not available; > please check section `3.1.7.2 Non-Collocated > Checkpoints` of the cpsv specification. > > [AndersW4] The spec actually says "may" rather than "should" > in this section. And the purpose of allocating another replica > is to "enhance the availability of checkpoints". When I read > this section, I think it is quite clear that the spec does not > perceive non-colocated checkpoints as guaranteed to preserve > data in the case of node failures: > > "The Checkpoint Service may create replicas > other than the ones that may be created when opening a > checkpoint. These other > replicas can be useful to enhance the availability of > checkpoints. For example, if two > replicas exist at a certain point in time, and the node > hosting one of these replicas is > administratively taken out of service, the Checkpoint Service > may allocate another > replica on another node while this node is not available."
> > So, data can be lost due to (multiple) node failures. There > are two other cases where data is lost: automatic deletion of > the entire checkpoint if it has not been opened by any process > for the duration of the retention time, and automatic deletion > of sections within a checkpoint when the sections reach their > expiration times. The APIs specify the return code > SA_AIS_ERR_NOT_EXIST to signal that a specific section, or the > entire checkpoint, doesn't exist. Thus, there is support in the > API for reporting loss of checkpoint data (whatever the reason > for the loss may be). If the headless feature is disabled, we > cannot lose non-colocated checkpoints due to node failures, > but when the headless feature is enabled we can. > > > For example, take the case of an application on a > PL that is in the middle of writing to non-collocated checkpoint > sections (physical replicas exist only on the SCs): > what will happen to the application on the PL? > OK, let us consider that the user agrees to lose the checkpoint > and wants to recreate it: what will happen to the cpnd DB > on the PL, and what about the complexity involved in cleaning it up? > This will lead to a lot of maintainability > issues. > > [AndersW4] The thing that will happen (from an application's > perspective) is that you will get the SA_AIS_ERR_NOT_EXIST > error code from the CKPT API when trying to access the lost > checkpoint. I don't know the complexity at the code level for > implementing this, but isn't this already supported by the > code which is out on review (Nhat, correct me if I am wrong)? > > [NhatPham4] To be more correct, the application will get > SA_AIS_ERR_BAD_HANDLE when trying to access the lost > checkpoint because all data was destroyed. > > But for opening the checkpoint (not creating), it will get > SA_AIS_ERR_NOT_EXIST.
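The two return codes described above suggest one recovery path on the application side: treat both BAD_HANDLE and NOT_EXIST as "the data is gone", re-open the checkpoint with the create flag, and re-populate it. A hedged sketch of that logic, with the SAF calls replaced by stubs so the fragment is self-contained (the SaAisErrorT values are the ones from the AIS specifications; the helper names are invented):

```c
#include <assert.h>

/* SaAisErrorT values involved, as defined by the AIS specifications. */
enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_BAD_HANDLE = 9,
    SA_AIS_ERR_NOT_EXIST = 12,
};

/* Stub for saCkptCheckpointWrite() against a checkpoint whose replicas
 * were lost during the headless state: the first write fails. */
static int stub_write_rc = SA_AIS_ERR_BAD_HANDLE;
static int ckpt_write(int handle) { (void)handle; return stub_write_rc; }

/* Stub for saCkptCheckpointOpen() with SA_CKPT_CHECKPOINT_CREATE:
 * re-creates the checkpoint (empty) and returns a fresh handle. */
static int ckpt_open_create(int *handle) {
    *handle = 42;              /* arbitrary fresh handle */
    stub_write_rc = SA_AIS_OK; /* stub: writes work again after re-create */
    return SA_AIS_OK;
}

/* Application-side recovery: a stale handle (BAD_HANDLE) and a vanished
 * checkpoint (NOT_EXIST) both mean the data is gone, so re-create the
 * checkpoint and write the application's data from scratch. */
static int write_with_recovery(int *handle) {
    int rc = ckpt_write(*handle);
    if (rc == SA_AIS_ERR_BAD_HANDLE || rc == SA_AIS_ERR_NOT_EXIST) {
        if (ckpt_open_create(handle) != SA_AIS_OK)
            return rc;            /* still headless: caller retries later */
        rc = ckpt_write(*handle); /* re-populate the empty checkpoint */
    }
    return rc;
}
```

Either way the application learns that data was lost, which is the property Anders argues the re-created-but-empty checkpoint must preserve.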
> > > On top of that, the CKPT SAF specification says > that a non-collocated checkpoint and all its sections > should survive as long as the Checkpoint Service is running on the > cluster, and the > replica is USER private data (not OpenSAF > state); losing any USER private data is not acceptable. > > *Comment 3*: This is about the case where the checkpoint node > director (cpnd) crashes during the headless state. In this > case the cpnd can't finish starting because it can't > initialize the CLM service. > > Then, after a timeout, the AMF triggers a restart again. > Finally, the node is rebooted. > > It is expected that this problem should not lead to a > node reboot. > > Action: No change in code. This is a limitation of > the system during the headless state. > > > [AVM] Code changes are required: the CPSV CLM integration code > needs to be revisited to handle TRY_AGAIN. > > If you agree with the summary above, I'll update the code > and send out the V3 for review. > > Best regards, > > Nhat Pham > > *From:* Anders Widell [mailto:[email protected]] > *Sent:* Wednesday, February 24, 2016 9:26 PM > *To:* Nhat Pham <[email protected]>; 'A V Mahesh' > <[email protected]> > *Cc:* [email protected]; 'Beatriz > Brandao' <[email protected]>; 'Minh Chau H' > <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: > Support preserving and recovering checkpoint replicas > during headless state V2 [#1621] > > See my comments inline, marked [AndersW3]. > > regards, > Anders Widell > > On 02/24/2016 07:32 AM, Nhat Pham wrote: > > Hi Mahesh and Anders, > > Please see my comments below.
> > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Wednesday, February 24, 2016 11:06 AM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]> > *Cc:* [email protected]; > 'Beatriz Brandao' <[email protected]>; 'Minh Chau > H' <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for > cpsv: Support preserving and recovering checkpoint > replicas during headless state V2 [#1621] > > Hi Nhat Pham, > > If component (CPND) restart is allowed while the > controllers are absent, then before requesting that CLM > change the return value to **SA_AIS_ERR_TRY_AGAIN**, > we need to get clarification from the AMF folks on a few > things. If CPND keeps getting > SA_AIS_ERR_TRY_AGAIN and the component-restart timer expires, > then AMF will restart the component again (this > becomes cyclic), and after the configured saAmfSGCompRestartMax > value the node goes for reboot as the next > level of escalation. In that case we may require changes in AMF as > well, so that it does not act on the component-restart timeout > while the controllers are absent (I am not sure whether that > is a deviation from the AMF specification). > > */[Nhat Pham] In headless state, I'm not sure > about this either. /* > > */@Anders: Would you have comments about this?/* > > [AndersW3] Ok, first of all I would like to point out > that normally, the OpenSAF checkpoint node director > should not crash. So we are talking about a situation > where multiple faults have occurred: first both the > active and the standby system controllers have died, > and then shortly afterwards - before we have a new > active system controller - the checkpoint node > director also crashes. Sure, these may not be totally > independent events, but still there are a lot of > faults that have happened within a short period of > time.
We should test the node director and make sure > it doesn't crash in this type of scenario. > > Now, let's consider the case where we have a fault in > the node director that causes it to crash during the > headless state. The general philosophy of the headless > feature is that when things work fine - i.e. in the > absence of faults - we should be able to continue > running while the system controllers are absent. > However, if a fault happens during the headless state, > we may not be able to recover from the fault until > there is an active system controller. AMF does provide > support for restarting components, but as you have > pointed out, the node director will be stuck in a > TRY_AGAIN loop immediately after it has been > restarted. So this means that if the node director > crashes during the headless state, we have lost the > checkpoint functionality on that node and we will not > get it back until there is an active system > controller. Other services like IMM will still work > for a while, but AMF will as you say eventually > escalate the checkpoint node director failure to a > node restart and then the whole node is gone. The node > will not come back until we have an active system > controller. So to summarize: there is very limited > support for recovering from faults that happen during > the headless state. The full recovery will not happen > until we have an active system controller. > > Please do incorporate the current comments (from a design > perspective) and republish the patch; I will > re-test the V3 patch and provide review comments on > functional issues/bugs if I find any. > > One important note: in the new patch, let us not > have the complexity of allowing non-collocated > checkpoint creation and then documenting that in > some scenarios > non-collocated checkpoint replicas are not > recoverable. Why? Because the replica is USER > private data (not OpenSAF state), and losing USER > private data is not acceptable.
> So let us keep the scope of the CPSV service such that > non-collocated checkpoint creation is NOT_SUPPORTED > if the cluster is running with > IMMSV_SC_ABSENCE_ALLOWED (the headless-state > configuration is enabled at the time of cluster > startup; currently it is not configurable at run time, so > there is no chance of a run-time configuration change). > > We can provide support for non-collocated checkpoints in > subsequent enhancements, with a solution such as: > a replica is also created on the lowest-node-ID PL for > non-collocated checkpoints (a maximum of three replicas in the > cluster, regardless of where the non-collocated checkpoint is opened). > > So for now, regardless of whether the heads (SCs) > exist or not, CPSV should return > SA_AIS_ERR_NOT_SUPPORTED in an > IMMSV_SC_ABSENCE_ALLOWED-enabled cluster, > and let us document it as well. > > */[Nhat Pham] The patch is to limit losing > replicas and checkpoints in the headless case./* > > */In case both replicas are located on the SCs and they > reboot, losing the checkpoint is unpreventable with the > current design after the headless state./* > > */Even if we implement the proposal "/*a maximum of three > replicas in the cluster regardless of where the > non-collocated checkpoint is opened*/", there is still the > case where the checkpoint is lost. Ex.: the SCs and > the PL which hosts the replica reboot at the same time./* > > */In case /*IMMSV_SC_ABSENCE_ALLOWED is disabled, if > both SCs reboot, this leads to a whole-cluster reboot. > Then the checkpoint is lost. > > */What I mean is that there are cases where the > checkpoint is lost. The point is what we can do to > limit losing data./* > > */For the proposal of rejecting creation of > non-collocated checkpoints in case > of/* IMMSV_SC_ABSENCE_ALLOWED being enabled, I think > this will lead to a backward-compatibility problem. > > */@Anders: What do you think about rejecting > creation of non-collocated checkpoints in case of > /*IMMSV_SC_ABSENCE_ALLOWED being enabled? > > [AndersW3] No, I think we ought to support > non-colocated checkpoints also when > IMMSV_SC_ABSENCE_ALLOWED is set.
The fact that we have > "system controllers" is an implementation detail of > OpenSAF. I don't think the CKPT SAF specification > implies that non-colocated checkpoints must be fully > replicated on all the nodes in the cluster, and thus > we must have the possibility that all replicas are > lost. It is not clear exactly what to expect from the > APIs when this happens, but you could handle it in a > similar way as the case when all sections have been > automatically deleted by the checkpoint service > because the sections have expired. > > > -AVM > > On 2/24/2016 6:51 AM, Nhat Pham wrote: > > Hi Mahesh, > > Do you have any further comments? > > Best regards, > > Nhat Pham > > *From:* A V Mahesh > [mailto:[email protected]] > *Sent:* Monday, February 22, 2016 10:37 AM > *To:* Nhat Pham <[email protected]> > <mailto:[email protected]>; 'Anders > Widell' <[email protected]> > <mailto:[email protected]> > *Cc:* [email protected] > <mailto:[email protected]>; > 'Beatriz Brandao' > <[email protected]> > <mailto:[email protected]>; 'Minh > Chau H' <[email protected]> > <mailto:[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request > for cpsv: Support preserving and recovering > checkpoint replicas during headless state V2 > [#1621] > > Hi, > > >>BTW, have you finished the review and test? > > I will finish by today. > > -AVM > > On 2/22/2016 7:48 AM, Nhat Pham wrote: > > Hi Mahesh and Anders, > > Please see my comment below. > > BTW, have you finished the review and test? 
> > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 19, 2016 2:28 PM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]>; 'Minh Chau H' > <[email protected]> > *Cc:* [email protected]; > 'Beatriz Brandao' <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review > Request for cpsv: Support preserving and > recovering checkpoint replicas during > headless state V2 [#1621] > > Hi Nhat Pham, > > On 2/19/2016 12:28 PM, Nhat Pham wrote: > > Could you please give more detailed > information about steps to reproduce > the problem below? Thanks. > > > Don't see this as a specific bug; we need > to look at the issue from the point of view of a > CLM-integrated service. > Considering Anders Widell's explanation > of CLM application behavior during the > headless state, > we need to re-integrate CPND with CLM > (before this headless-state feature there was no > case of CPND existing in the absence of > CLMD, but now there is). > > And this will need to be consistent across > all services that are integrated with CLM > (you may need some changes in CLM as well). > > */[Nhat Pham] I think CLM should return > /*SA_AIS_ERR_TRY_AGAIN in this case. > > @Anders: What do you think? > > To start with, let us consider the case where CPND > is restarted on a payload (PL) during the headless state > while an application is running on that PL. > > */[Nhat Pham] Regarding CPND as a CLM > application, I'm not sure what it can do > in this case. In case it restarts, it is > monitored by AMF./* > > */If it blocks for too long, AMF will also > trigger a node reboot./* > > */In my test case, the CPND gets blocked by > CLM. It doesn't get out of the > saClmInitialize.
How do you get the “/ER > cpnd clm init failed with return value:31/”?/* > > */Following is the cpnd trace./* > > Feb 22 8:56:41.188122 osafckptnd > [736:cpnd_init.c:0183] >> cpnd_lib_init > > Feb 22 8:56:41.188332 osafckptnd > [736:cpnd_init.c:0412] >> cpnd_cb_db_init > > Feb 22 8:56:41.188600 osafckptnd > [736:cpnd_init.c:0437] << cpnd_cb_db_init > > Feb 22 8:56:41.188778 osafckptnd > [736:clma_api.c:0503] >> saClmInitialize > > Feb 22 8:56:41.188945 osafckptnd > [736:clma_api.c:0593] >> clmainitialize > > Feb 22 8:56:41.190052 osafckptnd > [736:clma_util.c:0100] >> clma_startup: > clma_use_count: 0 > > Feb 22 8:56:41.190273 osafckptnd > [736:clma_mds.c:1124] >> clma_mds_init > > Feb 22 8:56:41.190825 osafckptnd > [736:clma_mds.c:1170] << clma_mds_init > > -AVM > > On 2/19/2016 12:28 PM, Nhat Pham wrote: > > Hi Mahesh, > > Could you please give more detailed > information about steps to reproduce > the problem below? Thanks. > > Best regards, > > Nhat Pham > > *From:* A V Mahesh > [mailto:[email protected]] > *Sent:* Friday, February 19, 2016 1:06 PM > *To:* Anders Widell > <[email protected]> > <mailto:[email protected]>; > Nhat Pham <[email protected]> > <mailto:[email protected]>; > 'Minh Chau H' > <[email protected]> > <mailto:[email protected]> > *Cc:* > [email protected] > <mailto:[email protected]>; > 'Beatriz Brandao' > <[email protected]> > <mailto:[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review > Request for cpsv: Support preserving > and recovering checkpoint replicas > during headless state V2 [#1621] > > Hi Anders Widell, > Thanks for the detailed explanation > about CLM during headless state. 
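The fix being discussed for comment 3 amounts to a bounded retry loop around CLM initialization in cpnd, instead of failing cpnd_lib_init on the first error. A sketch of such a loop, with saClmInitialize() stubbed so the fragment is self-contained (the retry bound and helper names are assumptions; in practice the bound must stay below the AMF component-registration timeout, or AMF will still escalate to a node reboot):

```c
#include <assert.h>

/* SaAisErrorT values involved; 31 (SA_AIS_ERR_UNAVAILABLE) matches the
 * "return value:31" seen in the syslog below. */
enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_UNAVAILABLE = 31,
};

/* Stub for saClmInitialize(): fails while there is no active CLM
 * director (headless state) and succeeds once an SC is back. */
static int failures_left = 3;
static int clm_initialize(void) {
    return (failures_left-- > 0) ? SA_AIS_ERR_TRY_AGAIN : SA_AIS_OK;
}

/* Bounded retry around CLM initialization at cpnd start-up: retry on
 * TRY_AGAIN (and, until CLM itself is changed, on UNAVAILABLE) instead
 * of failing cpnd_lib_init outright - the immediate failure is what
 * escalates into the node reboot discussed in this thread. */
static int cpnd_clm_init_with_retry(int max_attempts, int *attempts_used) {
    int attempts = 0;
    int rc;
    do {
        rc = clm_initialize();
        attempts++;
        /* a real implementation would sleep/back off between attempts */
    } while ((rc == SA_AIS_ERR_TRY_AGAIN || rc == SA_AIS_ERR_UNAVAILABLE) &&
             attempts < max_attempts);
    *attempts_used = attempts;
    return rc;
}
```

Note that this only helps if CLM service becomes reachable again before the bound is hit; during a long headless period the loop still ends in failure, which is the limited-recovery situation Anders describes above.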
> > Hi Nhat Pham, > > Comment 3: > Please see below the problem I was > anticipating; I am now seeing it during > CLMD absence (during the headless state), > so CPND/CLMA now need to address the > case below. Currently cpnd clm init > fails with return value > SA_AIS_ERR_UNAVAILABLE, > but it should be SA_AIS_ERR_TRY_AGAIN. > > > ================================================== > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO NODE STATE-> > IMM_NODE_FULLY_AVAILABLE 17418 > Feb 19 11:18:28 PL-4 osafimmloadd: NO > Sync ending normally > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Epoch set to 9 in ImmModel > Feb 19 11:18:28 PL-4 cpsv_app: IN > Received PROC_STALE_CLIENTS > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 42 > (MsgQueueService132111) <108, 2040f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 43 > (MsgQueueService131855) <0, 2030f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 44 > (safLogService) <0, 2010f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO SERVER STATE: > IMM_SERVER_SYNC_SERVER --> > IMM_SERVER_READY > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 45 > (safClmService) <0, 2010f> > *Feb 19 11:18:28 PL-4 > osafckptnd[7718]: ER cpnd clm init > failed with return value:31 > Feb 19 11:18:28 PL-4 osafckptnd[7718]: > ER cpnd init failed > Feb 19 11:18:28 PL-4 osafckptnd[7718]: > ER cpnd_lib_req FAILED > Feb 19 11:18:28 PL-4 osafckptnd[7718]: > __init_cpnd() failed* > Feb 19 11:18:28 PL-4 osafclmna[5432]: > NO > safNode=PL-4,safCluster=myClmCluster > Joined cluster, nodeid=2040f > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO AVD NEW_ACTIVE, adest:1 > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO Sending node up due to > NCSMDS_NEW_ACTIVE > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 1 SISU states sent > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 1 SU states sent > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 7 CSICOMP states synced > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 7 SU states sent > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 46 > (safAmfService) <0, 2010f> > Feb 19 11:18:30 PL-4 osafamfnd[5441]: > NO > 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Component > or SU restart probation timer expired > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > NO Instantiation of > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > failed > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > NO Reason: component registration > timer expired > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > WA > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Presence State RESTARTING => > INSTANTIATION_FAILED > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > NO Component Failover trigerred for > 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': > Failed component: > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > ER > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got > Inst failed > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > Rebooting OpenSAF NodeId = 132111 EE > Name = , Reason: NCS component > Instantiation failed, OwnNodeId = > 132111, SupervisionTime = 60 > Feb 19 11:18:36 PL-4 opensaf_reboot: > Rebooting local node; timeout=60 > Feb 19 11:18:39 PL-4 kernel: [ > 4877.338518] md: stopping all md devices. > > ================================================== > > -AVM > > On 2/15/2016 5:11 PM, Anders Widell wrote: > > Hi! > > Please find my answer inline, > marked [AndersW]. > > regards, > Anders Widell > > On 02/15/2016 10:38 AM, Nhat Pham > wrote: > > Hi Mahesh, > > It's good. Thank you. :) > > [AVM] Upon rejoining of the > SCs, the replica should be > re-created regardless > of whether another application opens > it on PL4.
> (Note: this > comment is based on your > explanation; I have not yet > reviewed/tested. > Currently I > am struggling with the SCs not > rejoining > after the headless state; I can > provide you more on this once > I complete my > review/testing.) > > [Nhat] To make cloud > resilience work, you need the > patches from the other > services (log, amf, clm, ntf). > @Minh: I heard that you > created a tar file which > includes all the patches. Could you > please send it to Mahesh? Thanks > > [AVM] I understand that. > Before I comment more on > this, please allow me to > understand; > I am still > not very clear on the headless > design in detail. > For example, > cluster membership of the PLs > during the headless state: > in the absence > of the SCs (CLMD), are the PLs > considered > cluster nodes or not (cluster > membership)? > > [Nhat] I don't know much about > this. > @Anders: Could you please > comment on this? Thanks > > [AndersW] First of all, keep in > mind that the "headless" state > should ideally not last a very > long time. Once we have the spare > SC feature in place (ticket > [#79]), a new SC should become > active within a matter of a few > seconds after we have lost both > the active and the standby SC. > > I think you should view the state > of the cluster in the headless > state in the same way as you view > the state of the cluster during a > failover between the active and > the standby SC. Imagine that the > active SC dies. It takes the > standby SC 1.5 seconds to detect > the failure of the active SC (this > is due to the TIPC timeout). If > you have configured the > PROMOTE_ACTIVE_TIMER, there is an > additional delay before the > standby takes over as active. What > is the state of the cluster during > the time after the active SC > failed and before the standby > takes over? > > The state of the cluster while it > is headless is very similar.
The > difference is that this state may > last a little bit longer (though > not more than a few seconds, until > one of the spare SCs becomes > active). Another difference is > that we may have lost some state. > With a "perfect" implementation of > the headless feature we should not > lose any state at all, but with > the current set of patches we do > lose state. > > So specifically if we talk about > cluster membership and ask the > question: is a particular PL a > member of the cluster or not > during the headless state? Well, > if you ask CLM about this during > the headless state, then you will > not know - because CLM doesn't > provide any service during the > headless state. If you keep > retrying your query to CLM, you > will eventually get an answer - > but you will not get this answer > until there is an active SC again > and we have exited the headless > state. When viewed in this way, > the answer to the question about a > node's membership is undefined > during the headless state, since > CLM will not provide you with any > answer until there is an active SC. > > However, if you asked CLM about > the node's cluster membership > status before the cluster went > headless, you probably saved a > cached copy of the cluster > membership state. Maybe you also > installed a CLM track callback and > intend to update this cached copy > every time the cluster membership > status changes. The question then > is: can you continue using this > cached copy of the cluster > membership state during the > headless state? The answer is YES: > since CLM doesn't provide any > service during the headless state, > it also means that the cluster > membership view cannot change > during this time. Nodes can of > course reboot or die, but CLM will > not notice and hence the cluster > view will not be updated. You can > argue that this is bad because the > cluster view doesn't reflect > reality, but notice that this will > always be the case.
We can never > propagate information > instantaneously, and detection of > node failures will take 1.5 > seconds due to the TIPC timeout. > You can never be sure that a node > is alive at this very moment just > because CLM tells you that it is a > member of the cluster. If we are > unfortunate enough to lose both > system controller nodes > simultaneously, updates to the > cluster membership view will be > delayed a few seconds longer than > usual. > > > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh > [mailto:[email protected]] > Sent: Monday, February 15, > 2016 11:19 AM > To: Nhat Pham > <[email protected]>; > [email protected] > > Cc: > [email protected]; > 'Beatriz Brandao' > <[email protected]> > > Subject: Re: [PATCH 0 of 1] > Review Request for cpsv: > Support preserving and > recovering checkpoint replicas > during headless state V2 [#1621] > > Hi Nhat Pham, > > How did your holiday go? > > Please find my comments below. > > On 2/15/2016 8:43 AM, Nhat > Pham wrote: > > Hi Mahesh, > > For comment 1, the > patch will be updated > accordingly. > > [AVM] Please hold; I will > provide more comments this week, so we can > have a consolidated V3. > > For comment 2, I think > the CKPT service will not > be backward > compatible if > scAbsenceAllowed is true. > The client can't create a > non-collocated checkpoint > on the SCs. > > Furthermore, this solution > only protects the CKPT > service from the > case "the non-collocated > checkpoint is created on > an SC"; > there are still cases > where the replicas are > completely lost. Ex.: > > - The non-collocated > checkpoint is created on a > PL. The PL reboots. Both > replicas are now located on the > SCs. Then the headless state > happens. All replicas are > lost.
> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during headless state.
> - The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.
>
> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opened it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with SCs not rejoining after headless state; I can provide more on this once I complete my review/testing.)
>
> In this case, all replicas are lost and the client has to create it again.
>
> In case multiple nodes (including SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the possible cases. What do you think?
>
> [AVM] I understand that. Before I comment more on this, please allow me to understand; I am still not very clear on the headless design in detail.
>
> For example, regarding cluster membership of PLs during headless state: in the absence of SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
> - If they are considered non-cluster nodes, the Checkpoint Service API should leverage the SA Forum Cluster Membership Service, and APIs can fail with SA_AIS_ERR_UNAVAILABLE.
> - If they are considered cluster nodes, we need to follow all the rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>
> So give me some more time to review it completely, so that we can have a consolidated patch V3.
>
> -AVM
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, February 12, 2016 11:10 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]; Beatriz Brandao <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Comment 2:
>
> After incorporating comment 1, all the limitations should be prevented based on whether the Hydra configuration is enabled in IMM.
>
> For example: if some application tries to create a non-collocated checkpoint whose active replica would be generated/located on a SC, then regardless of whether the heads (SCs) exist, it should return SA_AIS_ERR_NOT_SUPPORTED. In other words, rather than allowing a non-collocated checkpoint to be created while the heads (SCs) exist, only for it to become unrecoverable after the heads (SCs) rejoin.
>
> =============================================================================
> Limitation: The CKPT service doesn't support recovering checkpoints in the following cases:
> - The checkpoint which is unlinked before headless.
> - The non-collocated checkpoint has its active replica located on a SC.
> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during headless state.
> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint, and the client must re-open the checkpoint.
> =============================================================================
>
> -AVM
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
> Hi,
>
> I just started reviewing the patch; I will be giving comments as soon as I come across any, to save some time.
>
> Comment 1:
> This functionality should be under a check for whether the Hydra configuration is enabled in IMM: attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
> Please see the example of how the LOG/AMF services implemented it.
>
> -AVM
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
> Hi Mahesh,
>
> As described in the README, the CKPT service returns the SA_AIS_ERR_TRY_AGAIN fault code in this case. I guess it's the same for other services.
>
> @Anders: Could you please confirm this?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 2:11 PM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> - The behavior of applications will be consistent with other SAF services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
> In the absence of the Directors (SCs), what are the expected return values of the SAF APIs (for all services) which are not in a position to provide service at that moment?
>
> I think all services should return the same SAF errors. I think currently we don't have that; maybe Anders Widell will help us.
>
> -AVM
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> Hi Mahesh,
>
> Please see the attachment for the README. Let me know if there is any more information required.
>
> Regarding your comments:
> - During headless state applications may behave like during the CPND restart case.
> [Nhat] Headless state and CPND restart are different events. Thus, the behavior is different. Headless state is a case where both SCs go down.
>
> - The behavior of applications will be consistent with other SAF services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 11:12 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> I started reviewing this patch, so could you please provide a README file with scope and limitations? That will help define the testing/reviewing scope.
> Following are the minimum things we can keep in mind while reviewing/accepting the patch:
>
> - Not affecting existing functionality.
> - During headless state applications may behave like during the CPND restart case.
> - The minimum functionality of applications works.
> - The behavior of applications will be consistent with other SAF services like IMM/AMF behavior during headless state.
>
> So please do provide any additional details in the README if any of the above is deviated from, to allow users to know about the limitations/deviations.
>
> -AVM
>
> On 1/4/2016 3:15 PM, Nhat Pham wrote:
> Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
> Review request for Trac Ticket(s): #1621
> Peer Reviewer(s): [email protected]; [email protected]
> Pull request to: [email protected]
> Affected branch(es): default
> Development branch: default
>
> --------------------------------
> Impacted area        Impact y/n
> --------------------------------
> Docs                     n
> Build system             n
> RPM/packaging            n
> Configuration files      n
> Startup scripts          n
> SAF services             y
> OpenSAF services         n
> Core libraries           n
> Samples                  n
> Tests                    n
> Other                    n
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
> Author: Nhat Pham <[email protected]>
> Date: Mon, 04 Jan 2016 16:34:33 +0700
>
> cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>
> Background:
> ----------
> This enhancement supports preserving checkpoint replicas in case both SCs go down (headless state) and
> recovering replicas in case one of the SCs comes up again. If both SCs go down, checkpoint replicas on surviving nodes still remain. When a SC is available again, surviving replicas are automatically registered in the SC checkpoint database. Content in surviving replicas is kept intact and synchronized to new replicas.
>
> When no SC is available, client API calls changing checkpoint configuration, which require SC communication, are rejected. Client API calls reading and writing existing checkpoint replicas still work.
>
> Limitation: The CKPT service does not support recovering checkpoints in the following cases:
> - The checkpoint which is unlinked before headless.
> - The non-collocated checkpoint has its active replica located on a SC.
> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during headless state.
> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint, and the client must re-open the checkpoint.
>
> While in headless state, accessing checkpoint replicas does not work if the node which hosts the active replica goes down. It will resume working when a SC is available again.
>
> Solution:
> ---------
> The solution for this enhancement includes 2 parts:
>
> 1. To destroy the un-recoverable checkpoints described above when both SCs are down: when both SCs are down, the CPND deletes un-recoverable checkpoint nodes and replicas on PLs. Then it requests the CPA to destroy the corresponding checkpoint node by using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>
> 2.
> To update the CPD with checkpoint information: when an active SC is up after headless, the CPND will update the CPD with checkpoint information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND would create a new ckpt_id for the checkpoint, which might differ from the current ckpt_id, if CPD_EVT_ND2D_CKPT_CREATE were used. The CPD collects checkpoint information within 6 s. During this updating time, the following requests are rejected with fault code SA_AIS_ERR_TRY_AGAIN:
>
> - CPD_EVT_ND2D_CKPT_CREATE
> - CPD_EVT_ND2D_CKPT_UNLINK
> - CPD_EVT_ND2D_ACTIVE_SET
> - CPD_EVT_ND2D_CKPT_RDSET
>
> Complete diffstat:
> ------------------
> osaf/libs/agents/saf/cpa/cpa_proc.c        |  52 +
> osaf/libs/common/cpsv/cpsv_edu.c           |  43 +
> osaf/libs/common/cpsv/include/cpd_cb.h     |   3 +
> osaf/libs/common/cpsv/include/cpd_imm.h    |   1 +
> osaf/libs/common/cpsv/include/cpd_proc.h   |   7 +
> osaf/libs/common/cpsv/include/cpd_tmr.h    |   3 +-
> osaf/libs/common/cpsv/include/cpnd_cb.h    |   1 +
> osaf/libs/common/cpsv/include/cpnd_init.h  |   2 +
> osaf/libs/common/cpsv/include/cpsv_evt.h   |  20 +
> osaf/services/saf/cpsv/cpd/Makefile.am     |   3 +-
> osaf/services/saf/cpsv/cpd/cpd_evt.c       | 229 +
> osaf/services/saf/cpsv/cpd/cpd_imm.c       | 112 +
> osaf/services/saf/cpsv/cpd/cpd_init.c      |  20 +-
> osaf/services/saf/cpsv/cpd/cpd_proc.c      | 309 +
> osaf/services/saf/cpsv/cpd/cpd_tmr.c       |   7 +
> osaf/services/saf/cpsv/cpnd/cpnd_db.c      |  16 +
> osaf/services/saf/cpsv/cpnd/cpnd_evt.c     |  22 +
> osaf/services/saf/cpsv/cpnd/cpnd_init.c    |  23 +-
> osaf/services/saf/cpsv/cpnd/cpnd_mds.c     |  13 +
> osaf/services/saf/cpsv/cpnd/cpnd_proc.c    | 314 +-
> 20 files changed, 1189 insertions(+), 11 deletions(-)
>
> Testing Commands:
> -----------------
> -
>
> Testing, Expected Results:
> --------------------------
> -
>
> Conditions of Submission:
> -------------------------
> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
> Arch        Built  Started  Linux distro
> -------------------------------------------
> mips          n      n
> mips64        n      n
> x86           n      n
> x86_64        n      n
> powerpc       n      n
> powerpc64     n      n
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
> ___ You have failed to nominate the proper persons for review and push.
> ___ Your patches do not have a proper short+long header.
> ___ You have grammar/spelling in your header that is unacceptable.
> ___ You have exceeded a sensible line length in your headers/comments/text.
> ___ You have failed to put a proper Trac Ticket # into your commits.
> ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc).
> ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
> ___ You have ^M present in some of your files. These have to be removed.
> ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
> ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
> ___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
> ___ You have extraneous garbage in your review (merge commits etc).
> ___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
> ___ You have too many commits attached to an e-mail; resend as threaded commits, or place them in a public tree for a pull.
> ___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
> ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
> ___ You have a misconfigured ~/.hgrc file (i.e.
> username, email etc).
> ___ Your computer has a badly configured date and time, confusing the threaded patch review.
> ___ Your changes affect the IPC mechanism, and you don't present any results for an in-service upgradability test.
> ___ Your changes affect the user manual and documentation, and your patch series does not contain the patch that updates the Doxygen manual.

_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
