Hi Nhat Pham,

>> [NhatPham4] To be more correct, the application will get SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint because all data was destroyed.
>> [AndersW4] If this is a problem we could re-create the checkpoint with no sections in it.

I also came across this approach. Instead of destroying the checkpoint
information on the payloads' CPNDs (as the current patch does) and
returning SA_AIS_ERR_BAD_HANDLE to applications on the PLs, in the new
V3 patch please check the possibility of re-creating the checkpoint
with its sections (you can send this data from the PLs to the SC when
CPD comes up).
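
For reference, a rough sketch of the kind of record a payload CPND could
send up to CPD for such a re-registration. All field names below are
illustrative assumptions; only the event CPD_EVT_ND2D_CKPT_INFO_UPDATE is
taken from the patch description quoted further down:

    /* Hypothetical sketch only: data a payload CPND might send to CPD
     * after the headless state so the checkpoint can be re-created.
     * Every name here is an assumption except CPD_EVT_ND2D_CKPT_INFO_UPDATE,
     * which the patch description below introduces. */
    #include <saCkpt.h>

    typedef struct cpnd_ckpt_info_update {
            SaNameT ckpt_name;                          /* checkpoint name */
            SaCkptCheckpointCreationAttributesT attrs;  /* original creation attributes */
            SaUint32T num_sections;                     /* sections held by this replica */
            SaBoolT is_active;                          /* node holds the active replica? */
    } CPND_CKPT_INFO_UPDATE;  /* carried in CPD_EVT_ND2D_CKPT_INFO_UPDATE */
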
-AVM
On 2/26/2016 8:11 AM, Nhat Pham wrote:
>
> Hi,
>
> Please see my comment below with [NhatPham4]
>
> Best regards,
>
> Nhat Pham
>
> *From:* Anders Widell [mailto:[email protected]]
> *Sent:* Thursday, February 25, 2016 9:25 PM
> *To:* A V Mahesh <[email protected]>; Nhat Pham
> <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao'
> <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless state V2
> [#1621]
>
> Hi!
>
> See my comments inline, marked [AndersW4].
>
> regards,
> Anders Widell
>
> On 02/25/2016 05:26 AM, A V Mahesh wrote:
>
> Hi Nhat Pham,
>
> Please see my comment below.
>
> -AVM
>
> On 2/25/2016 7:54 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Would you agree with the comment below?
>
> To summarize, the following are the comments so far:
>
>     *Comment 1*: This functionality should be guarded by a check of
>     whether the Hydra configuration is enabled in IMM, i.e. attrName =
>     const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>
>     Action: The code will be updated accordingly.
>
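
For illustration, a minimal sketch (C, IMM OM accessor API) of how such a
check could read the attribute at start-up. The object DN here is an
assumption; please verify against how the LOG/AMF services read it:

    /* Sketch: read scAbsenceAllowed via the IMM OM accessor API.
     * The object DN is an assumption; check the LOG/AMF code. */
    #include <string.h>
    #include <saImmOm.h>

    static SaUint32T read_sc_absence_allowed(void)
    {
            SaImmHandleT imm_handle;
            SaImmAccessorHandleT acc_handle;
            SaVersionT version = {'A', 2, 11};
            SaNameT obj;
            SaImmAttrNameT attr_names[] = {(SaImmAttrNameT)"scAbsenceAllowed", NULL};
            SaImmAttrValuesT_2 **attrs;
            SaUint32T sc_absence_allowed = 0;  /* 0: headless not allowed */

            /* Assumed DN of the OpenSAF IMM configuration object */
            strcpy((char *)obj.value, "opensafImm=opensafImm,safApp=safImmService");
            obj.length = strlen((char *)obj.value);

            if (saImmOmInitialize(&imm_handle, NULL, &version) != SA_AIS_OK)
                    return 0;
            if (saImmOmAccessorInitialize(imm_handle, &acc_handle) == SA_AIS_OK) {
                    if (saImmOmAccessorGet_2(acc_handle, &obj, attr_names,
                                             &attrs) == SA_AIS_OK &&
                        attrs[0] != NULL && attrs[0]->attrValuesNumber == 1)
                            sc_absence_allowed =
                                *(SaUint32T *)attrs[0]->attrValues[0];
                    saImmOmAccessorFinalize(acc_handle);
            }
            saImmOmFinalize(imm_handle);
            return sc_absence_allowed;
    }
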
> [AndersW4] Just a question here: is this really needed? If the code is
> already 100% backwards compatible when the headless feature is
> disabled, what would be the point of reading the configuration and
> taking different paths in the code depending on it? Maybe the code is
> not 100% backwards compatible and then I agree that we need to read
> the configuration.
>
> The reason why I am asking is that I had the impression that the code
> would only cause different behaviour in the cases where both system
> controllers die at the same time, and this cannot happen when the
> headless feature is disabled (or rather: it can happen, but it would
> trigger an immediate cluster restart so any difference in behaviour
> after that point is irrelevant).
>
> [NhatPham4] The code is backwards compatible when the headless feature
> is disabled.
>
> For the V2 patch, cpnd updates cpd with recoverable checkpoint data
> when an SC comes up after the headless state (from the implementation
> point of view).
>
> In the current system, if the headless feature is disabled the whole
> cluster reboots, so all data is destroyed.
>
>
> For the V2 patch + checking scAbsenceAllowed, cpnd destroys all the
> checkpoint data (as in the original implementation), again from the
> implementation point of view.
>
> And again, in the current system, if the headless feature is disabled
> the whole cluster reboots and all data is destroyed.
>
> So if you ask whether the check is really needed in the current
> situation, the answer is: not really.
>
> The check is just to make sure that all checkpoint data is destroyed
> in case the headless feature is disabled.
>
> What do you think?
>
>     *Comment 2*: Keep the scope of the CPSV service such that
>     non-collocated checkpoint creation is NOT_SUPPORTED if the cluster
>     is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state
>     configuration is enabled at cluster startup; it is currently not
>     configurable at run time, so there is no chance of a run-time
>     configuration change).
>
>     Action: No change in code. CPSV keeps supporting non-collocated
>     checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>
>     >>[AndersW3] No, I think we ought to support non-colocated
>     checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact
>     that we have "system controllers" is an implementation detail of
>     OpenSAF. I don't think the CKPT SAF specification implies that
>     non-colocated checkpoints must be fully replicated on all the
>     nodes in the cluster, and thus we must accept the possibility that
>     all replicas are lost. It is not clear exactly what to expect from
>     the APIs when this happens, but you could handle it in a similar
>     way as the case when all sections have been automatically
>     deleted by the checkpoint service because the sections have expired.
>
>     [AVM] I do not agree with either comment; we cannot handle this
>     similarly to the section-expiration case here. In the case of
>     section expiration the checkpoint replica still exists; only the
>     section is deleted.
>
> [AndersW4] If this is a problem we could re-create the checkpoint with
> no sections in it.
>
>
>     The CPSV specification says that if two replicas exist (in our
>     case only on the SCs) at a certain point in time, and the nodes
>     hosting these replicas are administratively taken out of service,
>     the Checkpoint Service should allocate another replica on another
>     node while those nodes are unavailable. Please check section
>     `3.1.7.2 Non-Collocated Checkpoints` of the CPSV specification.
>
> [AndersW4] The spec actually says "may" rather than "should" in this
> section. And the purpose of allocating another replica is to "enhance
> the availability of checkpoints". When I read this section, I think it
> is quite clear that the spec does not perceive non-colocated
> checkpoints as guaranteed to preserve data in the case of node failures:
>
> "The Checkpoint Service may create replicas
> other than the ones that may be created when opening a checkpoint.
> These other
> replicas can be useful to enhance the availability of checkpoints. For
> example, if two
> replicas exist at a certain point in time, and the node hosting one of
> these replicas is
> administratively taken out of service, the Checkpoint Service may
> allocate another
> replica on another node while this node is not available."
>
> So, data can be lost due to (multiple) node failures. There are two
> other cases where data is lost: automatic deletion of the entire
> checkpoint if it has not been opened by any process for the duration
> of the retention time, and automatic deletion of sections within a
> checkpoint when the sections reach their expiration times. The APIs
> specify the return code SA_AIS_ERR_NOT_EXIST to signal that a specific
> section, or the entire checkpoint, doesn't exist. Thus, there is
> support in the API for reporting loss of checkpoint data (whatever the reason
> of the loss may be). If the headless feature is disabled, we cannot
> lose non-colocated checkpoints due to node failures, but when the
> headless feature is enabled we can.
>
>
>     For example, take the case of an application on a PL that is in
>     the middle of writing to non-collocated checkpoint sections (the
>     physical replicas exist only on the SCs). What will happen to the
>     application on the PL? OK, let us assume the user accepts losing
>     the checkpoint and wants to re-create it: what will happen to the
>     cpnd DB on the PL, and what is the complexity involved in cleaning
>     it up? This will lead to a lot of maintainability issues.
>
> [AndersW4] The thing that will happen (from an application's
> perspective) is that you will get the SA_AIS_ERR_NOT_EXIST error code
> from the CKPT API when trying to access the lost checkpoint. I don't
> know the complexity at the code level for implementing this, but isn't
> this already supported by the code which is out on review (Nhat,
> correct me if I am wrong)?
>
> [NhatPham4] To be more correct, the application will get
> SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint
> because all data was destroyed.
>
> But for opening the checkpoint (not creating), it will get
> SA_AIS_ERR_NOT_EXIST.
>
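
To make the application-side contract concrete, here is a hedged sketch of
how a CKPT client could react to these codes; the function shape, names and
the 1-second timeout are assumptions, not part of the patch:

    /* Sketch: client-side recovery after all replicas were lost.
     * On SA_AIS_ERR_BAD_HANDLE (or SA_AIS_ERR_NOT_EXIST), re-open the
     * checkpoint with the CREATE flag; the re-created checkpoint is
     * empty, so sections must be re-created and re-written. */
    #include <saCkpt.h>

    static SaAisErrorT read_or_reopen(SaCkptHandleT svc_handle,
                                      const SaNameT *ckpt_name,
                                      const SaCkptCheckpointCreationAttributesT *attrs,
                                      SaCkptCheckpointHandleT *ckpt_handle,
                                      SaCkptIOVectorElementT *io_vec)
    {
            SaAisErrorT rc = saCkptCheckpointRead(*ckpt_handle, io_vec, 1, NULL);

            if (rc == SA_AIS_ERR_BAD_HANDLE || rc == SA_AIS_ERR_NOT_EXIST)
                    rc = saCkptCheckpointOpen(svc_handle, ckpt_name, attrs,
                                              SA_CKPT_CHECKPOINT_CREATE |
                                              SA_CKPT_CHECKPOINT_READ |
                                              SA_CKPT_CHECKPOINT_WRITE,
                                              1000000000LL /* 1 s, in ns */,
                                              ckpt_handle);
            return rc;
    }
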
>
>     On top of that, the CKPT SAF specification only says that a
>     non-collocated checkpoint and all its sections should survive as
>     long as the Checkpoint Service is running in the cluster. A
>     replica is USER private data (not OpenSAF state); losing any USER
>     private data is not acceptable.
>
>
>
>     *Comment 3*: This is about the case where the checkpoint node director
>     (cpnd) crashes during the headless state. In this case the cpnd
>     can't finish starting because it can't initialize the CLM service.
>
> Then after time out, the AMF triggers a restart again.
> Finally, the node is rebooted.
>
> It is expected that this problem should not lead to a node reboot.
>
> Action: No change in code. This is the limitation of the
> system during headless state.
>
>
>     [AVM] Code changes are required: the CPSV CLM integration code needs
>     to be revisited to handle TRY_AGAIN.
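
For instance, a hedged sketch of what "handle TRY_AGAIN" could look like in
the CPND CLM initialization path; the retry interval and the treatment of
SA_AIS_ERR_UNAVAILABLE are assumptions:

    /* Sketch: keep retrying saClmInitialize while CLM cannot serve us
     * (headless state). Note that AMF may still escalate if this blocks
     * past the component registration timer, as discussed below. */
    #include <unistd.h>
    #include <saClm.h>

    static SaAisErrorT cpnd_clm_init_retry(SaClmHandleT *clm_handle,
                                           const SaClmCallbacksT *cbs)
    {
            SaAisErrorT rc;

            do {
                    SaVersionT version = {'B', 1, 1};
                    rc = saClmInitialize(clm_handle, cbs, &version);
                    if (rc == SA_AIS_ERR_TRY_AGAIN ||
                        rc == SA_AIS_ERR_UNAVAILABLE)
                            usleep(100 * 1000);  /* 100 ms between retries */
            } while (rc == SA_AIS_ERR_TRY_AGAIN || rc == SA_AIS_ERR_UNAVAILABLE);

            return rc;
    }
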
>
>
> If you agree with the summary above, I’ll update the code and send
> out the V3 for review.
>
> Best regards,
>
> Nhat Pham
>
> *From:* Anders Widell [mailto:[email protected]]
> *Sent:* Wednesday, February 24, 2016 9:26 PM
> *To:* Nhat Pham <[email protected]>; 'A V Mahesh'
> <[email protected]>
> *Cc:* [email protected]; 'Beatriz
> Brandao' <[email protected]>; 'Minh Chau H'
> <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless
> state V2 [#1621]
>
> See my comments inline, marked [AndersW3].
>
> regards,
> Anders Widell
>
> On 02/24/2016 07:32 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comments below.
>
> Best regards,
>
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Wednesday, February 24, 2016 11:06 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell'
> <[email protected]>
> *Cc:* [email protected]; 'Beatriz
> Brandao' <[email protected]>; 'Minh Chau H'
> <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv:
> Support preserving and recovering checkpoint replicas
> during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
>     If component (CPND) restart is allowed while the controllers are
>     absent, and before requesting CLM we change the return value to
>     SA_AIS_ERR_TRY_AGAIN, we need clarification from the AMF guys on a
>     few things. If CPND keeps returning SA_AIS_ERR_TRY_AGAIN and the
>     component restart times out, AMF will restart the component again
>     (this becomes cyclic), and after the configured saAmfSGCompRestartMax
>     value the node goes for reboot as the next level of escalation. In
>     that case we may require changes in AMF as well, so that it does not
>     act on the component restart timeout while the controllers are
>     absent (I am not sure whether that would be a deviation from the
>     AMF specification).
>
>     [Nhat Pham] In headless state, I’m not sure about this either.
>
>     @Anders: Would you have comments about this?
>
> [AndersW3] Ok, first of all I would like to point out that
> normally, the OpenSAF checkpoint node director should not
> crash. So we are talking about a situation where multiple
> faults have occurred: first both the active and the standby
> system controllers have died, and then shortly afterwards -
> before we have a new active system controller - the checkpoint
> node director also crashes. Sure, these may not be totally
> independent events, but still there are a lot of faults that
> have happened within a short period of time. We should test
> the node director and make sure it doesn't crash in this type
> of scenario.
>
> Now, let's consider the case where we have a fault in the node
> director that causes it to crash during the headless state.
> The general philosophy of the headless feature is that when
> things work fine - i.e. in the absence of fault - we should be
> able to continue running while the system controllers are
> absent. However, if a fault happens during the headless state,
> we may not be able to recover from the fault until there is an
> active system controller. AMF does provide support for
> restarting components, but as you have pointed out, the node
> director will be stuck in a TRY_AGAIN loop immediately after
> it has been restarted. So this means that if the node director
> crashes during the headless state, we have lost the checkpoint
> functionality on that node and we will not get it back until
> there is an active system controller. Other services like IMM
> will still work for a while, but AMF will as you say
> eventually escalate the checkpoint node director failure to a
> node restart and then the whole node is gone. The node will
> not come back until we have an active system controller. So to
> summarize: there is very limited support for recovering from
> faults that happen during the headless state. The full
> recovery will not happen until we have an active system
> controller.
>
>     Please do incorporate the current comments (from a design
>     perspective) and republish the patch. I will re-test the V3 patch
>     and provide review comments on functional issues/bugs if I find
>     any.
>
>     One important note: in the new patch let us not have the
>     complexity of allowing non-collocated checkpoint creation and then
>     documenting that in some scenarios non-collocated checkpoint
>     replicas are not recoverable, because a replica is USER private
>     data (not OpenSAF state), and losing USER private data is not
>     acceptable. So let us keep the scope of the CPSV service such that
>     non-collocated checkpoint creation is NOT_SUPPORTED if the cluster
>     is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state
>     configuration is enabled at cluster startup; it is currently not
>     configurable, so there is no chance of a run-time configuration
>     change).
>
>     We can provide support for non-collocated checkpoints in
>     subsequent enhancements, with a solution such as also creating a
>     replica on the PL with the lowest node ID (max three replicas in
>     the cluster regardless of where the non-collocated checkpoint is
>     opened).
>
>     So for now, regardless of whether the heads (SCs) exist or not,
>     CPSV should return SA_AIS_ERR_NOT_SUPPORTED in an
>     IMMSV_SC_ABSENCE_ALLOWED-enabled cluster, and let us document it
>     as well.
>
>     [Nhat Pham] The patch is meant to limit losing replicas and
>     checkpoints in case of the headless state.
>
>     In case both replicas are located on the SCs and they reboot,
>     losing the checkpoint is unpreventable with the current design
>     after the headless state.
>
>     Even if we implement the proposal "max three replicas in the
>     cluster regardless of where the non-collocated checkpoint is
>     opened", there is still a case where the checkpoint is lost, e.g.
>     the SCs and the PL which hosts the replica reboot at the same
>     time.
>
>     In case IMMSV_SC_ABSENCE_ALLOWED is disabled, if both SCs reboot
>     the whole cluster reboots, and then the checkpoint is lost.
>
>     What I mean is that there are cases where the checkpoint is lost.
>     The point is what we can do to limit losing data.
>
>     As for the proposal of rejecting creation of non-collocated
>     checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this
>     will lead to a compatibility problem.
>
>     @Anders: What do you think about rejecting creation of
>     non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is
>     enabled?
>
> [AndersW3] No, I think we ought to support non-colocated
> checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The
> fact that we have "system controllers" is an implementation
> detail of OpenSAF. I don't think the CKPT SAF specification
> implies that non-colocated checkpoints must be fully
>     replicated on all the nodes in the cluster, and thus we must
>     accept the possibility that all replicas are lost. It is not
> clear exactly what to expect from the APIs when this happens,
> but you could handle it in a similar way as the case when all
> sections have been automatically deleted by the checkpoint
> service because the sections have expired.
>
>
> -AVM
>
> On 2/24/2016 6:51 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Do you have any further comments?
>
> Best regards,
>
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Monday, February 22, 2016 10:37 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell'
> <[email protected]>
> *Cc:* [email protected]; 'Beatriz
> Brandao' <[email protected]>; 'Minh Chau H'
> <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv:
> Support preserving and recovering checkpoint replicas
> during headless state V2 [#1621]
>
> Hi,
>
> >>BTW, have you finished the review and test?
>
> I will finish by today.
>
> -AVM
>
> On 2/22/2016 7:48 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comment below.
>
> BTW, have you finished the review and test?
>
> Best regards,
>
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Friday, February 19, 2016 2:28 PM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell'
> <[email protected]>; 'Minh Chau H'
> <[email protected]>
> *Cc:* [email protected];
> 'Beatriz Brandao' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for
> cpsv: Support preserving and recovering checkpoint
> replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
> Could you please give more detailed
> information about steps to reproduce the
> problem below? Thanks.
>
>
>     Don't see this as a specific bug; we need to look at the issue
>     from the point of view of a CLM-integrated service. Considering
>     Anders Widell's explanation about CLM application behavior during
>     the headless state, we need to re-integrate CPND with CLM (before
>     this headless-state feature there was no case of CPND existing in
>     the absence of CLMD, but now there is).
>
>     And this will have to be consistent across all services that
>     integrate with CLM (you may need some changes in CLM also).
>
>     [Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in
>     this case.
>
>     @Anders: What do you think?
>
>     To start with, let us consider the case where CPND on a payload
>     is restarted during the headless state and an application is
>     running on that PL.
>
>     [Nhat Pham] Regarding CPND as a CLM application, I'm not sure
>     what it can do in this case. In case it restarts, it is monitored
>     by AMF.
>
>     If it blocks for too long, AMF will also trigger a node reboot.
>
>     In my test case, the CPND gets blocked by CLM. It doesn't get out
>     of saClmInitialize. How do you get the "ER cpnd clm init failed
>     with return value:31"?
>
>     Following is the cpnd trace.
>
> Feb 22 8:56:41.188122 osafckptnd
> [736:cpnd_init.c:0183] >> cpnd_lib_init
>
> Feb 22 8:56:41.188332 osafckptnd
> [736:cpnd_init.c:0412] >> cpnd_cb_db_init
>
> Feb 22 8:56:41.188600 osafckptnd
> [736:cpnd_init.c:0437] << cpnd_cb_db_init
>
> Feb 22 8:56:41.188778 osafckptnd
> [736:clma_api.c:0503] >> saClmInitialize
>
> Feb 22 8:56:41.188945 osafckptnd
> [736:clma_api.c:0593] >> clmainitialize
>
> Feb 22 8:56:41.190052 osafckptnd
> [736:clma_util.c:0100] >> clma_startup:
> clma_use_count: 0
>
> Feb 22 8:56:41.190273 osafckptnd
> [736:clma_mds.c:1124] >> clma_mds_init
>
> Feb 22 8:56:41.190825 osafckptnd
> [736:clma_mds.c:1170] << clma_mds_init
>
> -AVM
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Could you please give more detailed
> information about steps to reproduce the
> problem below? Thanks.
>
> Best regards,
>
> Nhat Pham
>
> *From:* A V Mahesh
> [mailto:[email protected]]
> *Sent:* Friday, February 19, 2016 1:06 PM
> *To:* Anders Widell <[email protected]>; Nhat Pham
> <[email protected]>; 'Minh Chau H' <[email protected]>
> *Cc:* [email protected];
> 'Beatriz Brandao' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request
> for cpsv: Support preserving and recovering
> checkpoint replicas during headless state V2
> [#1621]
>
> Hi Anders Widell,
> Thanks for the detailed explanation about CLM
> during headless state.
>
>     Hi Nhat Pham,
>
>     Comment 3:
>     Please see below the problem I described earlier; I am now seeing
>     it while CLMD is absent (during the headless state). So CPND/CLMA
>     now need to address the case below: currently the cpnd CLM init
>     fails with return value SA_AIS_ERR_UNAVAILABLE, but it should be
>     SA_AIS_ERR_TRY_AGAIN.
>
> ==================================================
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE
> STATE-> IMM_NODE_FULLY_AVAILABLE 17418
> Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync
> ending normally
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch
> set to 9 in ImmModel
> Feb 19 11:18:28 PL-4 cpsv_app: IN Received
> PROC_STALE_CLIENTS
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
> Implementer connected: 42
> (MsgQueueService132111) <108, 2040f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
> Implementer connected: 43
> (MsgQueueService131855) <0, 2030f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
> Implementer connected: 44 (safLogService) <0,
> 2010f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
> SERVER STATE: IMM_SERVER_SYNC_SERVER -->
> IMM_SERVER_READY
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
> Implementer connected: 45 (safClmService) <0,
> 2010f>
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
>     Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
> Feb 19 11:18:28 PL-4 osafclmna[5432]: NO
> safNode=PL-4,safCluster=myClmCluster Joined
> cluster, nodeid=2040f
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD
> NEW_ACTIVE, adest:1
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO
> Sending node up due to NCSMDS_NEW_ACTIVE
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1
> SISU states sent
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU
> states sent
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7
> CSICOMP states synced
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU
> states sent
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO
> Implementer connected: 46 (safAmfService) <0,
> 2010f>
> Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO
> 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
> Component or SU restart probation timer expired
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
> Instantiation of
> 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
> failed
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
> Reason: component registration timer expired
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA
> 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
> Presence State RESTARTING => INSTANTIATION_FAILED
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO
> Component Failover trigerred for
> 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF':
> Failed component:
> 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER
>
> 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got
> Inst failed
> Feb 19 11:18:35 PL-4 osafamfnd[5441]:
> Rebooting OpenSAF NodeId = 132111 EE Name = ,
> Reason: NCS component Instantiation failed,
> OwnNodeId = 132111, SupervisionTime = 60
> Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting
> local node; timeout=60
> Feb 19 11:18:39 PL-4 kernel: [ 4877.338518]
> md: stopping all md devices.
> ==================================================
>
> -AVM
>
> On 2/15/2016 5:11 PM, Anders Widell wrote:
>
> Hi!
>
> Please find my answer inline, marked
> [AndersW].
>
> regards,
> Anders Widell
>
> On 02/15/2016 10:38 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> It's good. Thank you. :)
>
>         [AVM] Upon rejoining of the SCs, the replica should be
>         re-created regardless of whether another application has
>         opened it on PL4. (Note: this comment is based on your
>         explanation; I have not yet reviewed/tested. Currently I am
>         struggling with the SCs not rejoining after the headless
>         state; I can provide you more on this once I complete my
>         review/testing.)
>
>         [Nhat] To make cloud resilience work, you need the patches
>         from other services (log, amf, clm, ntf).
>         @Minh: I heard that you created a tar file which includes all
>         the patches. Could you please send it to Mahesh? Thanks.
>
>         [AVM] I understand that. Before I comment more on this,
>         please allow me to understand: I am still not very clear on
>         the headless design in detail. For example, the cluster
>         membership of the PLs during the headless state: in the
>         absence of the SCs (CLMD), are the PLs considered cluster
>         nodes or not (cluster membership)?
>
>         [Nhat] I don't know much about this.
>         @Anders: Could you please comment on this? Thanks.
>
> [AndersW] First of all, keep in mind that
> the "headless" state should ideally not
> last a very long time. Once we have the
> spare SC feature in place (ticket [#79]),
> a new SC should become active within a
> matter of a few seconds after we have lost
> both the active and the standby SC.
>
> I think you should view the state of the
> cluster in the headless state in the same
> way as you view the state of the cluster
> during a failover between the active and
> the standby SC. Imagine that the active SC
> dies. It takes the standby SC 1.5 seconds
> to detect the failure of the active SC
> (this is due to the TIPC timeout). If you
> have configured the PROMOTE_ACTIVE_TIMER,
> there is an additional delay before the
> standby takes over as active. What is the
> state of the cluster during the time after
> the active SC failed and before the
> standby takes over?
>
> The state of the cluster while it is
> headless is very similar. The difference
> is that this state may last a little bit
> longer (though not more than a few
> seconds, until one of the spare SCs
> becomes active). Another difference is
> that we may have lost some state. With a
> "perfect" implementation of the headless
> feature we should not lose any state at
> all, but with the current set of patches
> we do lose state.
>
> So specifically if we talk about cluster
> membership and ask the question: is a
> particular PL a member of the cluster or
> not during the headless state? Well, if
> you ask CLM about this during the headless
> state, then you will not know - because
> CLM doesn't provide any service during the
>         headless state. If you keep retrying your
>         query to CLM, you will eventually get an
> answer - but you will not get this answer
> until there is an active SC again and we
> have exited the headless state. When
> viewed in this way, the answer to the
> question about a node's membership is
> undefined during the headless state, since
> CLM will not provide you with any answer
> until there is an active SC.
>
> However, if you asked CLM about the node's
> cluster membership status before the
> cluster went headless, you probably saved
> a cached copy of the cluster membership
> state. Maybe you also installed a CLM
> track callback and intend to update this
> cached copy every time the cluster
> membership status changes. The question
> then is: can you continue using this
> cached copy of the cluster membership
> state during the headless state? The
> answer is YES: since CLM doesn't provide
> any service during the headless state, it
> also means that the cluster membership
> view cannot change during this time. Nodes
> can of course reboot or die, but CLM will
> not notice and hence the cluster view will
> not be updated. You can argue that this is
> bad because the cluster view doesn't
> reflect reality, but notice that this will
> always be the case. We can never propagate
> information instantaneously, and detection
> of node failures will take 1.5 seconds due
> to the TIPC timeout. You can never be sure
> that a node is alive at this very moment
> just because CLM tells you that it is a
> member of the cluster. If we are
> unfortunate enough to lose both system
> controller nodes simultaneously, updates
> to the cluster membership view will be
> delayed a few seconds longer than usual.
>
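
As a sketch of this cached-copy approach (the buffer size and bookkeeping
are assumptions; the callback signature follows the CLM B.01.01 API):

    /* Sketch: cache the cluster membership view delivered by CLM track
     * callbacks. During the headless state no callbacks arrive, so the
     * cache simply keeps the last view from before the SCs went down. */
    #include <saClm.h>

    #define MAX_NODES 256
    static SaClmNodeIdT member_nodes[MAX_NODES];
    static SaUint32T num_members;

    static void track_cb(const SaClmClusterNotificationBufferT *buf,
                         SaUint32T numberOfMembers, SaAisErrorT error)
    {
            if (error != SA_AIS_OK)
                    return;
            num_members = 0;  /* rebuild the cached view from this buffer */
            for (SaUint32T i = 0;
                 i < buf->numberOfItems && num_members < MAX_NODES; i++)
                    if (buf->notification[i].clusterNode.member)
                            member_nodes[num_members++] =
                                buf->notification[i].clusterNode.nodeId;
    }

    /* SA_TRACK_CURRENT delivers the initial view through the callback
     * (registered via saClmInitialize); SA_TRACK_CHANGES then delivers
     * the full membership on every change while an SC is active. */
    static SaAisErrorT start_membership_cache(SaClmHandleT clm_handle)
    {
            return saClmClusterTrack(clm_handle,
                                     SA_TRACK_CURRENT | SA_TRACK_CHANGES,
                                     NULL);
    }
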
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh
> [mailto:[email protected]]
> Sent: Monday, February 15, 2016 11:19 AM
>         To: Nhat Pham <[email protected]>;
>         [email protected]
>         Cc: [email protected];
>         'Beatriz Brandao' <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review
> Request for cpsv: Support preserving and
> recovering checkpoint replicas during
> headless state V2 [#1621]
>
> Hi Nhat Pham,
>
>         How was your holiday?
>
> Please find my comments below
>
> On 2/15/2016 8:43 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> For the comment 1, the patch will
> be updated accordingly.
>
>         [AVM] Please hold; I will provide more comments this week, so
>         we can have a consolidated V3.
>
> For the comment 2, I think the
> CKPT service will not be backward
> compatible if the scAbsenceAllowed
> is true.
> The client can't create
>         non-collocated checkpoints on SCs.
>
>         Furthermore, this solution only protects the CKPT service
>         from the case "The non-collocated checkpoint is created on an
>         SC"; there are still cases where the replicas are completely
>         lost. Ex:
>
> - The non-collocated checkpoint
> created on a PL. The PL reboots. Both
> replicas now locate on SCs. Then,
> headless state happens. All
> replicas are
> lost.
>         - The non-collocated checkpoint is created on a PL. The PL
>           reboots, so both replicas are now located on SCs. Then the
>           headless state happens. All replicas are lost.
>         - The non-collocated checkpoint has its active replica located
>           on a PL, and this PL restarts during the headless state.
>         - The non-collocated checkpoint is created on PL3. This
>           checkpoint is also opened on PL4. Then the SCs and PL3
>           reboot.
>
>         [AVM] Upon rejoining of the SCs, the replica should be
>         re-created regardless of whether another application has
>         opened it on PL4. (Note: this comment is based on your
>         explanation; I have not yet reviewed/tested. Currently I am
>         struggling with the SCs not rejoining after the headless
>         state; I can provide you more on this once I complete my
>         review/testing.)
>
>         In this case, all replicas are lost and the client has to
>         create the checkpoint again.
>
>         In case multiple nodes (including the SCs) reboot, losing
>         replicas is unpreventable. The patch is to recover the
>         checkpoints in the cases where that is possible.
>         What do you think?
>
>         [AVM] I understand that. Before I comment more on this,
>         please allow me to understand: I am still not very clear on
>         the headless design in detail.
>
>         For example, the cluster membership of the PLs during the
>         headless state: in the absence of the SCs (CLMD), are the PLs
>         considered cluster nodes or not (cluster membership)?
>
>         - If they are considered NON-cluster nodes, the Checkpoint
>           Service API should leverage the SA Forum Cluster Membership
>           Service, and the APIs can fail with SA_AIS_ERR_UNAVAILABLE.
>
>         - If they are considered cluster nodes, we need to follow all
>           the rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>
>         So give me some more time to review it completely, so that we
>         can have a consolidated V3 patch.
>
> -AVM
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh
> [mailto:[email protected]]
> Sent: Friday, February 12, 2016
> 11:10 AM
>         To: Nhat Pham <[email protected]>;
>         [email protected]
>         Cc: [email protected];
>         Beatriz Brandao <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review
> Request for cpsv: Support
> preserving and recovering
> checkpoint replicas during
> headless state V2
> [#1621]
>
>
>         Comment 2:
>
>         After incorporating comment one, all the limitations should
>         be prevented based on whether the Hydra configuration is
>         enabled in IMM.
>
>         For example: if some application tries to create a
>         non-collocated checkpoint whose active replica would be
>         generated/located on an SC, then regardless of whether the
>         heads (SCs) exist or not, we should return
>         SA_AIS_ERR_NOT_SUPPORTED.
>
>         In other words, rather than allowing a non-collocated
>         checkpoint to be created while the heads (SCs) exist, and
>         then having the non-collocated checkpoint be unrecoverable
>         after the heads (SCs) rejoin.
>
>
>         =============================================================
>
>         Limitation: The CKPT service doesn't support recovering
>         checkpoints in the following cases:
>         . The checkpoint which is unlinked before headless.
>         . The non-collocated checkpoint whose active replica is
>           located on an SC.
>         . The non-collocated checkpoint whose active replica is
>           located on a PL and this PL restarts during the headless
>           state.
>         In these cases, the checkpoint replica is destroyed. The
>         fault code SA_AIS_ERR_BAD_HANDLE is returned when the client
>         accesses the checkpoint, and the client must re-open the
>         checkpoint.
>
>         =============================================================
>
> -AVM
>
>
> On 2/11/2016 12:52 PM, A V Mahesh
> wrote:
>
> Hi,
>
>         I just started reviewing the patch; I will give comments as
>         soon as I come across any, to save some time.
>
>         Comment 1:
>         This functionality should be guarded by a check of whether
>         the Hydra configuration is enabled in IMM, i.e. attrName =
>         const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
>         Please see, as an example, how the LOG/AMF services
>         implemented it.
>
> -AVM
>
>
> On 1/29/2016 1:02 PM, Nhat
> Pham wrote:
>
> Hi Mahesh,
>
>         As described in the README, the CKPT service returns the
>         SA_AIS_ERR_TRY_AGAIN fault code in this case.
>         I guess it's the same for other services.
>
> @Anders: Could you please
> confirm this?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh
> [mailto:[email protected]]
>
> Sent: Friday, January 29,
> 2016 2:11 PM
>         To: Nhat Pham <[email protected]>;
>         [email protected]
>         Cc: [email protected]
>
> Subject: Re: [PATCH 0 of
> 1] Review Request for
> cpsv: Support
> preserving and recovering
> checkpoint replicas during
> headless state
> V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM,
> Nhat Pham wrote:
>
>         - The behavior of the application should be consistent with
>           the behavior of other SAF services like IMM/AMF during the
>           headless state.
>         [Nhat] I'm not clear what you mean by "consistent"?
>
>         In the absence of the directors (SCs), what are the expected
>         return values of the SAF APIs (for all services) that are not
>         in a position to provide service at that moment?
>
>         I think all services should return the same SAF errors. I
>         think we currently don't have that; maybe Anders Widell can
>         help us.
>
> -AVM
>
>
> On 1/29/2016 11:45 AM,
> Nhat Pham wrote:
>
> Hi Mahesh,
>
>         Please see the attachment for the README. Let me know if
>         there is any more information required.
>
>         Regarding your comments:
>
>         - During the headless state applications may behave as they
>           do during a CPND restart.
>         [Nhat] Headless state and CPND restart are different events,
>         so the behavior is different. The headless state is the case
>         where both SCs go down.
>
>         - The behavior of the application should be consistent with
>           the behavior of other SAF services like IMM/AMF during the
>           headless state.
>         [Nhat] I'm not clear what you mean by "consistent"?
>
> Best regards,
> Nhat Pham
>
> -----Original
> Message-----
> From: A V Mahesh
>
> [mailto:[email protected]]
>
> Sent: Friday, January
> 29, 2016 11:12 AM
>         To: Nhat Pham <[email protected]>;
>         [email protected]
>         Cc: [email protected]
>
> Subject: Re: [PATCH 0
> of 1] Review Request
> for cpsv: Support
> preserving and
> recovering checkpoint
> replicas during
> headless state
> V2 [#1621]
>
> Hi Nhat Pham,
>
>         I started reviewing this patch, so could you please provide a
>         README file with the scope and limitations? That will help to
>         define the testing/reviewing scope.
>
>         The following are the minimum things we can keep in mind
>         while reviewing/accepting the patch:
>
>         - Not affecting existing functionality.
>         - During the headless state applications may behave as they
>           do during a CPND restart.
>         - The minimum functionality of applications works.
>         - The behavior of the application should be consistent with
>           the behavior of other SAF services like IMM/AMF during the
>           headless state.
>
>         So please do provide additional details in the README if any
>         of the above is deviated from, to let users know about the
>         limitations/deviations.
>
> -AVM
>
>         On 1/4/2016 3:15 PM, Nhat Pham wrote:
>
>         Summary: cpsv: Support preserving and recovering checkpoint
>         replicas during headless state [#1621]
>         Review request for Trac Ticket(s): #1621
>         Peer Reviewer(s): [email protected];
>         [email protected]
>         Pull request to: [email protected]
>         Affected branch(es): default
>         Development branch: default
>
>         --------------------------------
>         Impacted area        Impact y/n
>         --------------------------------
>         Docs                     n
>         Build system             n
>         RPM/packaging            n
>         Configuration files      n
>         Startup scripts          n
>         SAF services             y
>         OpenSAF services         n
>         Core libraries           n
>         Samples                  n
>         Tests                    n
>         Other                    n
>
>         Comments (indicate scope for each "y" above):
>         ---------------------------------------------
>
>
>         changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>         Author:    Nhat Pham <[email protected]>
>         Date:      Mon, 04 Jan 2016 16:34:33 +0700
>
>                 cpsv: Support preserving and recovering checkpoint
>                 replicas during headless state [#1621]
>
>         Background:
>         -----------
>         This enhancement preserves checkpoint replicas in case both
>         SCs go down (headless state) and recovers the replicas when
>         one of the SCs comes up again. If both SCs go down,
>         checkpoint replicas on the surviving nodes remain. When an SC
>         is available again, surviving replicas are automatically
>         registered in the SC checkpoint database. The content of
>         surviving replicas is kept intact and synchronized to new
>         replicas.
>
>         When no SC is available, client API calls that change the
>         checkpoint configuration, which requires SC communication,
>         are rejected. Client API calls that read and write existing
>         checkpoint replicas still work.
>
>         Limitation:
>         The CKPT service does not support recovering checkpoints in
>         the following cases:
>         - The checkpoint which is unlinked before headless.
>         - The non-collocated checkpoint whose active replica is
>           located on an SC.
>         - The non-collocated checkpoint whose active replica is
>           located on a PL, and this PL restarts during the headless
>           state.
>         In these cases, the checkpoint replica is destroyed. The
>         fault code SA_AIS_ERR_BAD_HANDLE is returned when the client
>         accesses the checkpoint, and the client must re-open the
>         checkpoint.
>
>         While in the headless state, accessing checkpoint replicas
>         does not work if the node which hosts the active replica goes
>         down. It will resume working when an SC is available again.
>
>         Solution:
>         ---------
>         The solution for this enhancement includes 2 parts:
>
>         1. Destroy the un-recoverable checkpoints described above
>            when both SCs are down: When both SCs are down, the CPND
>            deletes un-recoverable checkpoint nodes and replicas on
>            the PLs. Then it requests CPA to destroy the corresponding
>            checkpoint node by using the new message
>            CPA_EVT_ND2A_CKPT_DESTROY.
>
>         2. Update CPD with checkpoint information: When an active SC
>            is up after headless, CPND updates CPD with checkpoint
>            information by using the new message
>            CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
>            CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND would
>            create a new ckpt_id for the checkpoint, which might
>            differ from the current ckpt id, if
>            CPD_EVT_ND2D_CKPT_CREATE were used. The CPD collects
>            checkpoint information within 6 s. During this update
>            window, the following requests are rejected with fault
>            code SA_AIS_ERR_TRY_AGAIN:
>            - CPD_EVT_ND2D_CKPT_CREATE
>            - CPD_EVT_ND2D_CKPT_UNLINK
>            - CPD_EVT_ND2D_ACTIVE_SET
>            - CPD_EVT_ND2D_CKPT_RDSET
>
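
For clarity, a tiny sketch of the guard this implies in the CPD event
handling; apart from the event ids and SA_AIS_ERR_TRY_AGAIN, every name
here (flag, event layout, response helper) is an assumption:

    /* Sketch: while CPD is still collecting checkpoint info from the
     * CPNDs (the ~6 s window), reject configuration-changing requests. */
    #include <stdbool.h>

    static bool cpd_is_collecting_info;  /* set while the 6 s window runs */

    /* Returns true if the request was rejected with SA_AIS_ERR_TRY_AGAIN. */
    static bool cpd_reject_if_updating(const CPD_EVT *evt)
    {
            switch (evt->type) {  /* event layout is an assumption */
            case CPD_EVT_ND2D_CKPT_CREATE:
            case CPD_EVT_ND2D_CKPT_UNLINK:
            case CPD_EVT_ND2D_ACTIVE_SET:
            case CPD_EVT_ND2D_CKPT_RDSET:
                    if (cpd_is_collecting_info) {
                            /* send_error_rsp() is a hypothetical helper */
                            send_error_rsp(evt, SA_AIS_ERR_TRY_AGAIN);
                            return true;
                    }
                    break;
            default:
                    break;
            }
            return false;
    }
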
>
>
>         Complete diffstat:
>         ------------------
>          osaf/libs/agents/saf/cpa/cpa_proc.c       |  52 +++++++++++
>          osaf/libs/common/cpsv/cpsv_edu.c          |  43 +++++++++
>          osaf/libs/common/cpsv/include/cpd_cb.h    |   3 ++
>          osaf/libs/common/cpsv/include/cpd_imm.h   |   1 +
>          osaf/libs/common/cpsv/include/cpd_proc.h  |   7 ++++
>          osaf/libs/common/cpsv/include/cpd_tmr.h   |   3 +-
>          osaf/libs/common/cpsv/include/cpnd_cb.h   |   1 +
>          osaf/libs/common/cpsv/include/cpnd_init.h |   2 +
>          osaf/libs/common/cpsv/include/cpsv_evt.h  |  20 ++++++
>          osaf/services/saf/cpsv/cpd/Makefile.am    |   3 +-
>          osaf/services/saf/cpsv/cpd/cpd_evt.c      | 229 ++++++++++++++
>          osaf/services/saf/cpsv/cpd/cpd_imm.c      | 112 ++++++++
>          osaf/services/saf/cpsv/cpd/cpd_init.c     |  20 +++++-
>          osaf/services/saf/cpsv/cpd/cpd_proc.c     | 309 ++++++++++++++
>          osaf/services/saf/cpsv/cpd/cpd_tmr.c      |   7 ++++
>          osaf/services/saf/cpsv/cpnd/cpnd_db.c     |  16 +++++
>          osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |  22 +++++++
>          osaf/services/saf/cpsv/cpnd/cpnd_init.c   |  23 +++++-
>          osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |  13 ++++
>          osaf/services/saf/cpsv/cpnd/cpnd_proc.c   | 314 ++++++++++---
>
>          20 files changed, 1189 insertions(+), 11 deletions(-)
>
>
>         Testing Commands:
>         -----------------
>         -
>
>         Testing, Expected Results:
>         --------------------------
>         -
>
>         Conditions of Submission:
>         -------------------------
>         <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>
>         Arch        Built  Started  Linux distro
>         -------------------------------------------
>         mips          n      n
>         mips64        n      n
>         x86           n      n
>         x86_64        n      n
>         powerpc       n      n
>         powerpc64     n      n
>
>
>         Reviewer Checklist:
>         -------------------
>         [Submitters: make sure that your review doesn't trigger any
>         checkmarks!]
>
>         Your checkin has not passed review because (see checked
>         entries):
>
>         ___ Your RR template is generally incomplete; it has too many
>             blank entries that need proper data filled in.
>         ___ You have failed to nominate the proper persons for review
>             and push.
>         ___ Your patches do not have proper short+long header.
>         ___ You have grammar/spelling in your header that is
>             unacceptable.
>         ___ You have exceeded a sensible line length in your
>             headers/comments/text.
>         ___ You have failed to put a proper Trac Ticket # into your
>             commits.
>         ___ You have incorrectly put/left internal data in your
>             comments/files (i.e. internal bug tracking tool IDs,
>             product names etc).
>         ___ You have not given any evidence of testing beyond basic
>             build tests. Demonstrate some level of runtime or other
>             sanity testing.
>         ___ You have ^M present in some of your files. These have to
>             be removed.
>         ___ You have needlessly changed whitespace or added
>             whitespace crimes like trailing spaces, or spaces before
>             tabs.
>         ___ You have mixed real technical changes with whitespace and
>             other cosmetic code cleanup changes. These have to be
>             separate commits.
>         ___ You need to refactor your submission into logical chunks;
>             there is too much content in a single commit.
>         ___ You have extraneous garbage in your review (merge commits
>             etc).
>         ___ You have giant attachments which should never have been
>             sent; instead you should place your content in a public
>             tree to be pulled.
>         ___ You have too many commits attached to an e-mail; resend
>             as threaded commits, or place in a public tree for a
>             pull.
>         ___ You have resent this content multiple times without a
>             clear indication of what has changed between each
>             re-send.
>         ___ You have failed to adequately and individually address
>             all of the comments and change requests that were
>             proposed in the initial review.
>         ___ You have a misconfigured ~/.hgrc file (i.e. username,
>             email etc).
>         ___ Your computer has a badly configured date and time,
>             confusing the threaded patch review.
>         ___ Your changes affect the IPC mechanism, and you don't
>             present any results for in-service upgradability tests.
>         ___ Your changes affect the user manual and documentation,
>             but your patch series does not contain the patch that
>             updates the Doxygen manual.
>