Hi Mahesh,

Please see my answers below, marked [NhatPham4].
Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Thursday, February 25, 2016 4:31 PM
To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
Cc: 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>; [email protected]
Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

>> With this patch the CPND detects un-recoverable checkpoints and deletes them all from the DB in case the headless state happens.

By the way, I haven't tested some cases yet; can you clarify the points below:

- Which error will be received by a CPSV application on a PL for an unrecoverable checkpoint?

- Is the SaCkptHandleT still valid after head recovery?

[NhatPham4] It is still valid during the headless state and after head recovery. During headless state, saCkptCheckpointOpen() returns SA_AIS_ERR_TRY_AGAIN; it works again after head recovery.

- Does accessing the SaCkptCheckpointHandleT return SA_AIS_ERR_BAD_HANDLE after head recovery?

[NhatPham4] Yes, it returns SA_AIS_ERR_BAD_HANDLE during the headless state and after head recovery. But the SaCkptHandleT is still valid, so the application can re-create the checkpoint.

-AVM

On 2/25/2016 12:43 PM, A V Mahesh wrote:
> Hi Nhat Pham,
>
> Please see my comment.
>
> -AVM
>
> On 2/25/2016 12:07 PM, Nhat Pham wrote:
>> Hi Mahesh,
>>
>> Please see my comments below, marked [NhatPham2].
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Thursday, February 25, 2016 11:26 AM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> Please see my comment below.
>>
>> -AVM
>>
>> On 2/25/2016 7:54 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Would you agree with the comments below?
>>
>> To summarize, these are the comments so far:
>>
>> Comment 1: This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM, attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>>
>> Action: The code will be updated accordingly.
>>
>> Comment 2: Keep the scope of the CPSV service such that non-collocated checkpoint creation is NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; it is currently not configurable, so there is no chance of a run-time configuration change).
>>
>> Action: No change in code. The CPSV keeps supporting non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>>
>> [AndersW3] No, I think we ought to support non-collocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-collocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must accept the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way as the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.
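To make the handle behavior in the [NhatPham4] answers near the top of this message concrete, here is a minimal client-side sketch. It is illustration only, not code from the patch: it assumes the standard saCkpt.h API, and the retry cadence is invented.

/*
 * Illustration only (not code from the patch): retry on
 * SA_AIS_ERR_TRY_AGAIN while the cluster is headless, and re-create the
 * checkpoint through the still-valid SaCkptHandleT once the checkpoint
 * handle starts returning SA_AIS_ERR_BAD_HANDLE after head recovery.
 */
#include <unistd.h>
#include <saCkpt.h>

static SaAisErrorT reopen_checkpoint(SaCkptHandleT ckptHandle,
				     const SaNameT *name,
				     const SaCkptCheckpointCreationAttributesT *attrs,
				     SaCkptCheckpointHandleT *ckpt)
{
	const SaTimeT timeout = 1000000000LL; /* 1 s, in nanoseconds */
	SaAisErrorT rc;

	do {
		/* SA_CKPT_CHECKPOINT_CREATE re-creates the checkpoint if
		 * its replicas were destroyed as unrecoverable. */
		rc = saCkptCheckpointOpen(ckptHandle, name, attrs,
					  SA_CKPT_CHECKPOINT_CREATE |
					  SA_CKPT_CHECKPOINT_READ |
					  SA_CKPT_CHECKPOINT_WRITE,
					  timeout, ckpt);
		if (rc == SA_AIS_ERR_TRY_AGAIN)
			sleep(1); /* headless: back off and retry */
	} while (rc == SA_AIS_ERR_TRY_AGAIN);

	return rc;
}

After head recovery, a client whose write fails with SA_AIS_ERR_BAD_HANDLE would call reopen_checkpoint() with its original creation attributes and retry; any data written before the headless period is gone in that case, which is exactly the point debated below.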
>> [AVM] I am not in agreement with either comment. We cannot handle it in a way similar to the section-expiration case here: when sections expire, the checkpoint replica still exists and only the sections are deleted.
>>
>> The CPSV specification says that if two replicas exist (in our case only on the SCs) at a certain point in time, and the nodes hosting both of these replicas are administratively taken out of service, the Checkpoint Service should allocate another replica on another node while those nodes are unavailable; please check section `3.1.7.2 Non-Collocated Checkpoints` of the CPSV specification.
>>
>> For example, take the case of an application on a PL that is in the middle of writing to non-collocated checkpoint sections (the physical replicas exist only on the SCs). What will happen to the application on the PL? OK, let us assume the user agrees to lose the checkpoint and wants to re-create it: then what happens to the CPND DB on the PL, and what complexity is involved in cleaning it up? This will lead to a lot of maintainability issues.
>>
>> On top of that, the CKPT SAF specification only says that a non-collocated checkpoint and all its sections survive as long as the Checkpoint Service is running on the cluster. The replica is USER private data (not OpenSAF state), and losing any USER private data is not acceptable.
>>
>> [NhatPham2] According to SAI-AIS-CKPT-B.02.02 (chapter 3.1.8, Persistence of Checkpoints):
>>
>> "As has been stated in Section 2.1 on page 13, the Checkpoint Service typically stores checkpoint data in the main memory of the nodes. Regardless of the retention time, a checkpoint and all its sections do not survive if the Checkpoint Service stops running on all nodes hosting replicas for this checkpoint. The stop of the Checkpoint Service can be caused by administrative actions or node failures."
>>
>> This states that the checkpoint does not survive if the nodes hosting its replicas fail (i.e. the SCs in our case).
>>
> [AVM] If we read further, section `3.1.7.2 Non-Collocated Checkpoints` explains with an example:
>
> "For example, if two replicas exist at a certain point in time, and the node hosting one of these replicas is administratively taken out of service, the Checkpoint Service may allocate another replica on another node while this node is not available."
>
>> Regarding the case you mentioned about the lost checkpoint and what will happen to the CPND DB on the PL:
>>
>> With this patch the CPND detects un-recoverable checkpoints and deletes them all from the DB in case the headless state happens.
>>
> [AVM] I know. I was saying that maintaining such a flow, tied to the transport `no active timer`, will open up a lot of new issues in CPSV and becomes a code-maintainability problem. For example:
>
> 1) If both SCs rejoin quickly (within the `no active timer` timeout, I think, as it currently stands), we will end up not deleting the DB. To address this we need to collect evidence to detect that the headless state actually happened.
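A hypothetical sketch of the kind of evidence collection being asked for here; every name in it is invented for illustration, and none of it is the patch's actual mechanism:

/* Hypothetical sketch: let the CPND remember that it actually saw the
 * director disappear, instead of relying only on a timer. All names
 * here are invented; they are not the patch's real identifiers. */
#include <stdbool.h>
#include <time.h>

struct headless_evidence {
	bool cpd_is_up;    /* tracks MDS up/down events for the CPD   */
	bool was_headless; /* latched when the CPD went away          */
	time_t cpd_down_at; /* when we lost the active director       */
};

static void on_cpd_mds_event(struct headless_evidence *ev, bool up)
{
	if (!up) {
		ev->cpd_is_up = false;
		ev->was_headless = true; /* latch: survives a quick rejoin */
		ev->cpd_down_at = time(NULL);
	} else {
		ev->cpd_is_up = true;
		/* was_headless stays set until the post-recovery DB
		 * reconciliation has run, even if the SCs rejoined
		 * before any 'no active' timer expired. */
	}
}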
>> Comment 3: This is about the case where the checkpoint node director (CPND) crashes during the headless state. In this case the CPND cannot finish starting because it cannot initialize the CLM service.
>>
>> Then, after a timeout, AMF triggers a restart again. Finally, the node is rebooted.
>>
>> It is expected that this problem should not lead to a node reboot.
>>
>> Action: No change in code. This is a limitation of the system during the headless state.
>>
>> [AVM] Code changes are required: the CPSV CLM integration code needs to be revisited to handle TRY_AGAIN.
>>
>> [NhatPham2] Agree. The CPND code will be updated to re-initialize CLM on the TRY_AGAIN fault code.
>>
>> If you agree with the summary above, I'll update the code and send out V3 for review.
>>
>> Best regards,
>> Nhat Pham
>>
>> From: Anders Widell [mailto:[email protected]]
>> Sent: Wednesday, February 24, 2016 9:26 PM
>> To: Nhat Pham <[email protected]>; 'A V Mahesh' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> See my comments inline, marked [AndersW3].
>>
>> regards,
>> Anders Widell
>>
>> On 02/24/2016 07:32 AM, Nhat Pham wrote:
>>
>> Hi Mahesh and Anders,
>>
>> Please see my comments below.
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Wednesday, February 24, 2016 11:06 AM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> If component (CPND) restart is allowed while the controllers are absent, and we change the return value to SA_AIS_ERR_TRY_AGAIN before requesting CLM, we need to get clarification from the AMF people on a few things. If the CPND keeps getting SA_AIS_ERR_TRY_AGAIN and the component restart times out, AMF will restart the component again (this becomes cyclic), and after the configured saAmfSGCompRestartMax value the node goes for reboot as the next level of escalation. In that case we may require changes in AMF as well, so that it does not act on a component restart timeout while the controllers are absent (I am not sure whether that would be a deviation from the AMF specification).
>>
>> [Nhat Pham] In headless state, I'm not sure about this either.
>>
>> @Anders: Would you have comments about this?
>>
>> [AndersW3] Ok, first of all I would like to point out that normally, the OpenSAF checkpoint node director should not crash. So we are talking about a situation where multiple faults have occurred: first both the active and the standby system controllers have died, and then shortly afterwards - before we have a new active system controller - the checkpoint node director also crashes. Sure, these may not be totally independent events, but still there are a lot of faults that have happened within a short period of time. We should test the node director and make sure it doesn't crash in this type of scenario.
>>
>> Now, let's consider the case where we have a fault in the node director that causes it to crash during the headless state. The general philosophy of the headless feature is that when things work fine - i.e. in the absence of faults - we should be able to continue running while the system controllers are absent.
>> However, if a fault happens during the headless state, we may not be able to recover from the fault until there is an active system controller. AMF does provide support for restarting components, but as you have pointed out, the node director will be stuck in a TRY_AGAIN loop immediately after it has been restarted. So this means that if the node director crashes during the headless state, we have lost the checkpoint functionality on that node and we will not get it back until there is an active system controller. Other services like IMM will still work for a while, but AMF will, as you say, eventually escalate the checkpoint node director failure to a node restart and then the whole node is gone. The node will not come back until we have an active system controller. So to summarize: there is very limited support for recovering from faults that happen during the headless state. The full recovery will not happen until we have an active system controller.
>>
>> Please do incorporate the current comments (from the design perspective) and republish the patch. I will re-test the V3 patch and provide review comments on functional issues/bugs if I find any.
>>
>> One important note: in the new patch let us not have the complexity of allowing non-collocated checkpoint creation and then documenting that in some scenarios non-collocated checkpoint replicas are not recoverable. The replica is USER private data (not OpenSAF state), and losing USER private data is not acceptable.
>>
>> So let us keep the scope of the CPSV service such that non-collocated checkpoint creation is NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; it is currently not configurable, so there is no chance of a run-time configuration change).
>>
>> We can provide support for non-collocated checkpoints in subsequent enhancements, with a solution such as also creating a replica on the PL with the lowest node ID (at most three replicas in the cluster, regardless of where the non-collocated checkpoint is opened).
>>
>> So for now, regardless of whether the heads (SCs) exist or not, CPSV should return SA_AIS_ERR_NOT_SUPPORTED in an IMMSV_SC_ABSENCE_ALLOWED-enabled cluster, and let us document it as well.
>>
>> [Nhat Pham] The patch is meant to limit the loss of replicas and checkpoints in the headless case.
>>
>> In case both replicas are located on the SCs and they reboot, losing the checkpoint is unpreventable with the current design after the headless state.
>>
>> Even if we implement the proposal "max three replicas in the cluster regardless of where the non-collocated checkpoint is opened", there are still cases where the checkpoint is lost, e.g. the SCs and the PL hosting the replica reboot at the same time.
>>
>> With IMMSV_SC_ABSENCE_ALLOWED disabled, if both SCs reboot, the whole cluster reboots. Then the checkpoint is lost as well.
>>
>> What I mean is that there are cases where the checkpoint is lost; the point is what we can do to limit losing data.
>>
>> As for the proposal to reject creating non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to a backward-compatibility problem.
>>
>> @Anders: What do you think about rejecting creation of non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled?
>>
>> [AndersW3] No, I think we ought to support non-collocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set.
>> The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-collocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must accept the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way as the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.
>>
>> -AVM
>>
>> On 2/24/2016 6:51 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Do you have any further comments?
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, February 22, 2016 10:37 AM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi,
>>
>> >> BTW, have you finished the review and test?
>>
>> I will finish by today.
>>
>> -AVM
>>
>> On 2/22/2016 7:48 AM, Nhat Pham wrote:
>>
>> Hi Mahesh and Anders,
>>
>> Please see my comment below.
>>
>> BTW, have you finished the review and test?
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 19, 2016 2:28 PM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>>
>> Could you please give more detailed information about the steps to reproduce the problem below? Thanks.
>>
>> Don't see this as a specific bug; we need to look at the issue from the point of view of a CLM-integrated service. Considering Anders Widell's explanation of CLM application behavior during the headless state, we need to re-integrate the CPND with CLM (before this headless-state feature there was no case of the CPND existing in the absence of the CLMD, but now there is).
>>
>> And this needs to be consistent across all services integrated with CLM (you may need some changes in CLM as well).
>>
>> [Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case.
>>
>> @Anders: What do you think?
>>
>> To start with, let us consider the case where the CPND is restarted on a PL during the headless state while an application is running on that PL.
>>
>> [Nhat Pham] Regarding the CPND as a CLM application, I'm not sure what it can do in this case. In case it restarts, it is monitored by AMF.
>>
>> If it blocks for too long, AMF will also trigger a node reboot.
>>
>> In my test case, the CPND gets blocked by CLM. It doesn't get out of saClmInitialize.
>> How did you get the "ER cpnd clm init failed with return value:31"?
>>
>> Following is the cpnd trace:
>>
>> Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
>> Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
>> Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
>> Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
>> Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
>> Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
>> Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
>> Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init
>>
>> -AVM
>>
>> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Could you please give more detailed information about the steps to reproduce the problem below? Thanks.
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 19, 2016 1:06 PM
>> To: Anders Widell <[email protected]>; Nhat Pham <[email protected]>; 'Minh Chau H' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Anders Widell,
>> Thanks for the detailed explanation about CLM during the headless state.
>>
>> Hi Nhat Pham,
>>
>> Comment 3:
>> Please see below the problem I was anticipating; I am now seeing it during CLMD absence (during the headless state). The CPND/CLMA now needs to address the case below: currently the cpnd CLM init fails with return value SA_AIS_ERR_UNAVAILABLE, but it should be SA_AIS_ERR_TRY_AGAIN.
>>
>> ==================================================
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
>> Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
>> Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
>> Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
>> Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
>> Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
>> Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
>> ==================================================
>>
>> -AVM
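A minimal sketch of the CPND-side retry that the agreement above ([NhatPham2]: re-initialize CLM on TRY_AGAIN instead of failing cpnd_lib_init) implies. The function name and retry cadence are invented; the call is the standard saClm.h one, here against the B.01.01 API.

/* Sketch only: keep retrying saClmInitialize() while CLM reports
 * SA_AIS_ERR_TRY_AGAIN (e.g. during the headless state), instead of
 * turning the error into a CPND instantiation failure. */
#include <unistd.h>
#include <saClm.h>

static SaAisErrorT cpnd_clm_init_with_retry(SaClmHandleT *clm_hdl,
					    const SaClmCallbacksT *cbs)
{
	SaAisErrorT rc;

	for (;;) {
		SaVersionT ver = { 'B', 1, 1 }; /* may be rewritten on error */
		rc = saClmInitialize(clm_hdl, cbs, &ver);
		if (rc != SA_AIS_ERR_TRY_AGAIN)
			break;	/* SA_AIS_OK, or a real error to escalate */
		sleep(1);	/* no CLM director yet; back off and retry */
	}
	return rc;
}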
On 2/15/2016 5:11 PM, Anders Widell wrote:

>> Hi!
>>
>> Please find my answer inline, marked [AndersW].
>>
>> regards,
>> Anders Widell
>>
>> On 02/15/2016 10:38 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> It's good. Thank you. :)
>>
>> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after headless state; I can provide more on this once I complete my review/testing.)
>>
>> [Nhat] To make cloud resilience work, you need the patches from the other services (log, amf, clm, ntf).
>> @Minh: I heard that you created a tar file which includes all patches. Could you please send it to Mahesh? Thanks.
>>
>> [AVM] I understand that. Before I comment more on this, please allow me to understand; I am still not very clear about the headless design in detail.
>> For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
>>
>> [Nhat] I don't know much about this.
>> @Anders: Could you please comment on this? Thanks.
>>
>> [AndersW] First of all, keep in mind that the "headless" state should ideally not last a very long time. Once we have the spare SC feature in place (ticket [#79]), a new SC should become active within a matter of a few seconds after we have lost both the active and the standby SC.
>>
>> I think you should view the state of the cluster in the headless state in the same way as you view the state of the cluster during a failover between the active and the standby SC. Imagine that the active SC dies. It takes the standby SC 1.5 seconds to detect the failure of the active SC (this is due to the TIPC timeout).
>> If you have configured the PROMOTE_ACTIVE_TIMER, there is an additional delay before the standby takes over as active. What is the state of the cluster during the time after the active SC has failed and before the standby takes over?
>>
>> The state of the cluster while it is headless is very similar. The difference is that this state may last a little bit longer (though not more than a few seconds, until one of the spare SCs becomes active). Another difference is that we may have lost some state. With a "perfect" implementation of the headless feature we should not lose any state at all, but with the current set of patches we do lose state.
>>
>> So specifically, if we talk about cluster membership and ask the question: is a particular PL a member of the cluster or not during the headless state? Well, if you ask CLM about this during the headless state, then you will not know - because CLM doesn't provide any service during the headless state. If you keep retrying your query to CLM, you will eventually get an answer - but you will not get this answer until there is an active SC again and we have exited the headless state. When viewed in this way, the answer to the question about a node's membership is undefined during the headless state, since CLM will not provide you with any answer until there is an active SC.
>>
>> However, if you asked CLM about the node's cluster membership status before the cluster went headless, you probably saved a cached copy of the cluster membership state. Maybe you also installed a CLM track callback and intend to update this cached copy every time the cluster membership status changes. The question then is: can you continue using this cached copy of the cluster membership state during the headless state? The answer is YES: since CLM doesn't provide any service during the headless state, it also means that the cluster membership view cannot change during this time. Nodes can of course reboot or die, but CLM will not notice, and hence the cluster view will not be updated. You can argue that this is bad because the cluster view doesn't reflect reality, but notice that this will always be the case. We can never propagate information instantaneously, and detection of node failures takes 1.5 seconds due to the TIPC timeout. You can never be sure that a node is alive at this very moment just because CLM tells you that it is a member of the cluster. If we are unfortunate enough to lose both system controller nodes simultaneously, updates to the cluster membership view will be delayed a few seconds longer than usual.
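To illustrate the cached-membership pattern Anders describes, here is a minimal sketch against the B.01.01 CLM track API. The cache layout and MAX_NODES are invented, and the saClmSelectionObjectGet()/saClmDispatch() plumbing that actually drives the callback is elided.

/* Sketch: cache the CLM cluster view via a track callback. The cache
 * keeps answering membership queries during the headless state, when
 * CLM itself cannot. */
#include <saClm.h>

#define MAX_NODES 64

static SaClmClusterNodeT node_cache[MAX_NODES];
static SaUint32T cached_nodes;

/* Invoked on every membership change while there is an active SC;
 * silent while headless, so the cache keeps its last-known view. */
static void track_cb(const SaClmClusterNotificationBufferT *buf,
		     SaUint32T numberOfMembers, SaAisErrorT error)
{
	if (error != SA_AIS_OK)
		return;
	cached_nodes = 0;
	for (SaUint32T i = 0; i < buf->numberOfItems && i < MAX_NODES; i++)
		node_cache[cached_nodes++] = buf->notification[i].clusterNode;
}

/* One-time setup: fetch the current view and subscribe to changes. */
static SaAisErrorT start_membership_cache(SaClmHandleT *clm_hdl)
{
	SaVersionT ver = { 'B', 1, 1 };
	const SaClmCallbacksT cbs = { NULL, track_cb };
	SaAisErrorT rc = saClmInitialize(clm_hdl, &cbs, &ver);
	if (rc != SA_AIS_OK)
		return rc;
	return saClmClusterTrack(*clm_hdl,
				 SA_TRACK_CURRENT | SA_TRACK_CHANGES_ONLY,
				 NULL);
}

In a real component the callback runs from saClmDispatch(), driven by the handle's selection object.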
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, February 15, 2016 11:19 AM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> How did your holiday go?
>>
>> Please find my comments below.
>>
>> On 2/15/2016 8:43 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> For comment 1, the patch will be updated accordingly.
>>
>> [AVM] Please hold; I will provide more comments this week, so we can have a consolidated V3.
>>
>> For comment 2, I think the CKPT service will not be backward compatible if scAbsenceAllowed is true: the client can't create a non-collocated checkpoint on the SCs.
>>
>> Furthermore, this solution only protects the CKPT service in the case "the non-collocated checkpoint is created on an SC"; there are still cases where the replicas are completely lost, e.g.:
>>
>> - The non-collocated checkpoint is created on a PL. The PL reboots, so both replicas are now located on the SCs. Then the headless state happens; all replicas are lost.
>> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
>> - The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.
>>
>> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after headless state; I can provide more on this once I complete my review/testing.)
>>
>> In this case, all replicas are lost and the client has to create the checkpoint again.
>>
>> In case multiple nodes (including the SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the cases where it is possible. What do you think?
>>
>> [AVM] I understand that. Before I comment more on this, please allow me to understand; I am still not very clear about the headless design in detail.
>>
>> For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
>>
>> - If they are considered NON-cluster nodes, the Checkpoint Service API should leverage the SA Forum Cluster Membership Service, and the APIs can fail with SA_AIS_ERR_UNAVAILABLE.
>>
>> - If they are considered cluster nodes, we need to follow all the rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>>
>> So give me some more time to review it completely, so that we can have a consolidated patch V3.
>>
>> -AVM
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 12, 2016 11:10 AM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]; Beatriz Brandao <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Comment 2:
>>
>> After incorporating comment 1, all of the limitations should be prevented based on whether the Hydra configuration is enabled in IMM.
>> For example: if some application tries to create a non-collocated checkpoint whose active replica would be generated/located on an SC, then regardless of whether the heads (SCs) exist or not, the call should return SA_AIS_ERR_NOT_SUPPORTED.
>>
>> In other words, this is better than allowing a non-collocated checkpoint to be created while the heads (SCs) are present, only for it to become unrecoverable after the heads (SCs) rejoin.
>>
>> =============================================================================
>> Limitation: The CKPT service doesn't support recovering checkpoints in the following cases:
>> . The checkpoint was unlinked before the headless state.
>> . The non-collocated checkpoint has its active replica located on an SC.
>> . The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
>> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint in these cases. The client must re-open the checkpoint.
>> =============================================================================
>>
>> -AVM
>>
>> On 2/11/2016 12:52 PM, A V Mahesh wrote:
>>
>> Hi,
>>
>> I just started reviewing the patch; I will give comments as soon as I come across any, to save some time.
>>
>> Comment 1:
>> This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM, attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>>
>> Please see the example of how the LOG/AMF services implemented it.
>>
>> -AVM
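A sketch of what such a check could look like, using the plain IMM OM accessor API. Illustration only: the object DN below is an assumption (the opensafImm configuration object), so verify it against the LOG/AMF implementations before copying, and error handling is minimal.

/* Sketch: read the scAbsenceAllowed attribute via the IMM OM accessor
 * API, in the style LOG/AMF use. The DN is an assumption. */
#include <string.h>
#include <saImmOm.h>

static SaUint32T get_sc_absence_allowed(void)
{
	SaUint32T sc_absence_allowed = 0;
	SaVersionT ver = { 'A', 2, 1 };
	SaImmHandleT om_hdl;
	SaImmAccessorHandleT acc_hdl;
	SaNameT obj;
	SaImmAttrNameT attr = (SaImmAttrNameT)"scAbsenceAllowed";
	SaImmAttrNameT attr_names[] = { attr, NULL };
	SaImmAttrValuesT_2 **attrs;
	const char *dn = "opensafImm=opensafImm,safApp=safImmService";

	obj.length = strlen(dn);
	memcpy(obj.value, dn, obj.length);

	if (saImmOmInitialize(&om_hdl, NULL, &ver) != SA_AIS_OK)
		return 0;
	if (saImmOmAccessorInitialize(om_hdl, &acc_hdl) == SA_AIS_OK) {
		if (saImmOmAccessorGet_2(acc_hdl, &obj, attr_names,
					 &attrs) == SA_AIS_OK &&
		    attrs[0] != NULL && attrs[0]->attrValuesNumber > 0)
			sc_absence_allowed =
			    *(SaUint32T *)attrs[0]->attrValues[0];
		saImmOmAccessorFinalize(acc_hdl);
	}
	saImmOmFinalize(om_hdl);
	return sc_absence_allowed; /* nonzero => headless state allowed */
}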
>> On 1/29/2016 1:02 PM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> As described in the README, the CKPT service returns the SA_AIS_ERR_TRY_AGAIN fault code in this case.
>> I guess it's the same for other services.
>>
>> @Anders: Could you please confirm this?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, January 29, 2016 2:11 PM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi,
>>
>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>
>> - The behavior of the application will be consistent with the behavior of other SAF services like IMM/AMF during the headless state.
>> [Nhat] I'm not clear what you mean by "consistent"?
>>
>> In the absence of the Directors (SCs), what return values should the SAF APIs of all services give when they are not in a position to provide service at that moment?
>>
>> I think all services should return the same SAF errors. I think we currently don't have that; maybe Anders Widell can help us.
>>
>> -AVM
>>
>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Please see the attachment for the README. Let me know if there is any more information required.
>>
>> Regarding your comments:
>> - During the headless state applications may behave like during a CPND restart.
>> [Nhat] Headless state and CPND restart are different events; thus, the behavior is different. Headless state is a case where both SCs go down.
>>
>> - The behavior of the application will be consistent with the behavior of other SAF services like IMM/AMF during the headless state.
>> [Nhat] I'm not clear what you mean by "consistent"?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, January 29, 2016 11:12 AM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> I started reviewing this patch, so could you please provide a README file with the scope and limitations? That will help to define the testing/reviewing scope.
>>
>> The following are the minimum things we can keep in mind while reviewing/accepting the patch:
>>
>> - Not affecting existing functionality.
>> - During the headless state applications may behave like during a CPND restart.
>> - The minimum functionality of the application works.
>> - The behavior of the application will be consistent with the behavior of other SAF services like IMM/AMF during the headless state.
>>
>> So please do provide any additional details in the README if any of the above is deviated from, to let users know about the limitations/deviations.
>>
>> -AVM
>>
>> On 1/4/2016 3:15 PM, Nhat Pham wrote:
>>
>> Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>> Review request for Trac Ticket(s): #1621
>> Peer Reviewer(s): [email protected]; [email protected]
>> Pull request to: [email protected]
>> Affected branch(es): default
>> Development branch: default
>>
>> --------------------------------
>>  Impacted area       Impact y/n
>> --------------------------------
>>  Docs                    n
>>  Build system            n
>>  RPM/packaging           n
>>  Configuration files     n
>>  Startup scripts         n
>>  SAF services            y
>>  OpenSAF services        n
>>  Core libraries          n
>>  Samples                 n
>>  Tests                   n
>>  Other                   n
>>
>> Comments (indicate scope for each "y" above):
>> ---------------------------------------------
>>
>> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>> Author: Nhat Pham <[email protected]>
>> Date:   Mon, 04 Jan 2016 16:34:33 +0700
>>
>> cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>>
>> Background:
>> ----------
>> This enhancement supports preserving checkpoint replicas in case both SCs go down (headless state) and recovering the replicas when one of the SCs comes up again. If both SCs go down, checkpoint replicas on surviving nodes remain. When an SC is available again, surviving replicas are automatically registered in the SC checkpoint database. The content of surviving replicas is kept intact and synchronized to new replicas.
>>
>> When no SC is available, client API calls that change the checkpoint configuration, which requires SC communication, are rejected. Client API calls reading and writing existing checkpoint replicas still work.
>> Limitation: The CKPT service does not support recovering checkpoints in the following cases:
>> - The checkpoint was unlinked before the headless state.
>> - The non-collocated checkpoint has its active replica located on an SC.
>> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
>> In these cases the checkpoint replica is destroyed, and the fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint. The client must re-open the checkpoint.
>>
>> While in the headless state, accessing checkpoint replicas does not work if the node hosting the active replica goes down. It works again when an SC is available.
>>
>> Solution:
>> ---------
>> The solution for this enhancement includes 2 parts:
>>
>> 1. Destroy the un-recoverable checkpoints described above when both SCs are down: the CPND deletes un-recoverable checkpoint nodes and replicas on the PLs, then requests the CPA to destroy the corresponding checkpoint node using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>>
>> 2. Update the CPD with checkpoint information: when an active SC comes up after the headless state, the CPND updates the CPD with checkpoint information using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of CPD_EVT_ND2D_CKPT_CREATE. This is because with CPD_EVT_ND2D_CKPT_CREATE the CPD would create a new ckpt_id for the checkpoint, which might differ from the current ckpt_id. The CPD collects checkpoint information within 6 s. During this update window the following requests are rejected with fault code SA_AIS_ERR_TRY_AGAIN (see the sketch after this list):
>> - CPD_EVT_ND2D_CKPT_CREATE
>> - CPD_EVT_ND2D_CKPT_UNLINK
>> - CPD_EVT_ND2D_ACTIVE_SET
>> - CPD_EVT_ND2D_CKPT_RDSET
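A sketch of how the CPD-side gating during that 6 s window could look. Everything here is an invented stand-in; only the CPD_EVT_ND2D_* message names and the SA_AIS_ERR_TRY_AGAIN behavior come from the patch description above.

/* Illustrative sketch only of the 6 s update window. The enum mirrors
 * a few cpsv_evt.h message types; the control-block field and helper
 * are hypothetical. */
#include <stdbool.h>

enum cpd_evt_type {
	CPD_EVT_ND2D_CKPT_CREATE,
	CPD_EVT_ND2D_CKPT_UNLINK,
	CPD_EVT_ND2D_ACTIVE_SET,
	CPD_EVT_ND2D_CKPT_RDSET,
	CPD_EVT_ND2D_CKPT_INFO_UPDATE,
};

struct cpd_cb_sketch {
	bool ckpt_update_in_progress; /* set when the SC becomes active,
					 cleared by a 6 s timer */
};

/* Returns true when the event must be answered with
 * SA_AIS_ERR_TRY_AGAIN instead of being processed. */
static bool cpd_evt_rejected_during_update(const struct cpd_cb_sketch *cb,
					   enum cpd_evt_type evt)
{
	if (!cb->ckpt_update_in_progress)
		return false;
	switch (evt) {
	case CPD_EVT_ND2D_CKPT_CREATE:
	case CPD_EVT_ND2D_CKPT_UNLINK:
	case CPD_EVT_ND2D_ACTIVE_SET:
	case CPD_EVT_ND2D_CKPT_RDSET:
		return true;  /* configuration changes wait until the
				 checkpoint info has been collected */
	default:
		return false; /* e.g. CPD_EVT_ND2D_CKPT_INFO_UPDATE runs */
	}
}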
>> Complete diffstat:
>> ------------------
>>  osaf/libs/agents/saf/cpa/cpa_proc.c       |  52 +++++
>>  osaf/libs/common/cpsv/cpsv_edu.c          |  43 ++++
>>  osaf/libs/common/cpsv/include/cpd_cb.h    |   3 +
>>  osaf/libs/common/cpsv/include/cpd_imm.h   |   1 +
>>  osaf/libs/common/cpsv/include/cpd_proc.h  |   7 +
>>  osaf/libs/common/cpsv/include/cpd_tmr.h   |   3 +-
>>  osaf/libs/common/cpsv/include/cpnd_cb.h   |   1 +
>>  osaf/libs/common/cpsv/include/cpnd_init.h |   2 +
>>  osaf/libs/common/cpsv/include/cpsv_evt.h  |  20 ++
>>  osaf/services/saf/cpsv/cpd/Makefile.am    |   3 +-
>>  osaf/services/saf/cpsv/cpd/cpd_evt.c      | 229 ++++++++++++
>>  osaf/services/saf/cpsv/cpd/cpd_imm.c      | 112 ++++++
>>  osaf/services/saf/cpsv/cpd/cpd_init.c     |  20 +-
>>  osaf/services/saf/cpsv/cpd/cpd_proc.c     | 309 ++++++++++++++++
>>  osaf/services/saf/cpsv/cpd/cpd_tmr.c      |   7 +
>>  osaf/services/saf/cpsv/cpnd/cpnd_db.c     |  16 ++
>>  osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |  22 ++
>>  osaf/services/saf/cpsv/cpnd/cpnd_init.c   |  23 +-
>>  osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |  13 +
>>  osaf/services/saf/cpsv/cpnd/cpnd_proc.c   | 314 ++++++++++++++---
>>
>>  20 files changed, 1189 insertions(+), 11 deletions(-)
>>
>> Testing Commands:
>> -----------------
>>  -
>>
>> Testing, Expected Results:
>> --------------------------
>>  -
>>
>> Conditions of Submission:
>> -------------------------
>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>
>> Arch      Built  Started  Linux distro
>> -------------------------------------------
>> mips        n      n
>> mips64      n      n
>> x86         n      n
>> x86_64      n      n
>> powerpc     n      n
>> powerpc64   n      n
>>
>> Reviewer Checklist:
>> -------------------
>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>
>> Your checkin has not passed review because (see checked entries):
>>
>> ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
>>
>> ___ You have failed to nominate the proper persons for review and push.
>>
>> ___ Your patches do not have a proper short+long header.
>>
>> ___ You have grammar/spelling in your header that is unacceptable.
>>
>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>
>> ___ You have failed to put a proper Trac Ticket # into your commits.
>> ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc).
>>
>> ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
>>
>> ___ You have ^M present in some of your files. These have to be removed.
>>
>> ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
>>
>> ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
>>
>> ___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
>>
>> ___ You have extraneous garbage in your review (merge commits etc).
>>
>> ___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
>>
>> ___ You have too many commits attached to an e-mail; resend as threaded commits, or place them in a public tree for a pull.
>>
>> ___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
>>
>> ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
>>
>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc).
>>
>> ___ Your computer has a badly configured date and time, confusing the threaded patch review.
>>
>> ___ Your changes affect an IPC mechanism, and you don't present any results for an in-service upgradability test.
>>
>> ___ Your changes affect the user manual and documentation, and your patch series does not contain the patch that updates the Doxygen manual.
