Hi Nhat Pham,

Well, in any case let's go forward as below:

(a) For now, let's just document that the saCkptCheckpoint APIs will return SA_AIS_ERR_NOT_EXIST in the headless state. In the future we can look at ways to create more than two replicas.
(b) For the cpnd restart scenario, w.r.t. CPSV-CLM integration, handle the error code received.

Please publish the V3 patch.

-AVM

On 2/25/2016 3:39 PM, Nhat Pham wrote:
> Hi Mahesh and Anders,
>
> Please see my comments below, marked [NhatPham3].
>
> Best regards,
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Thursday, February 25, 2016 2:14 PM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> Please see my comments.
>
> -AVM
>
> On 2/25/2016 12:07 PM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Please see my comments below, marked [NhatPham2].
>
> Best regards,
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Thursday, February 25, 2016 11:26 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> Please see my comments below.
>
> -AVM
>
> On 2/25/2016 7:54 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Would you agree with the comments below?
>
> To summarize, the following are the comments so far:
>
> *Comment 1*: This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM (attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
>
> Action: The code will be updated accordingly.
> *Comment 2*: To keep the scope of the CPSV service, non-collocated checkpoint creation is NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; it is currently not configurable, so there is no chance of a run-time configuration change).
>
> Action: No change in code. The CPSV still keeps supporting non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>
> >> [AndersW3] No, I think we ought to support non-collocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-collocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must accept the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way as the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.
>
> [AVM] I am not in agreement with either comment; we cannot handle this in a way similar to the section-expiration case here. In the case of section expiration the checkpoint replica still exists; only the section is deleted.
>
> The CPSV specification says that if two replicas exist (in our case only on the SCs) at a certain point in time, and the nodes hosting both of these replicas are administratively taken out of service, the Checkpoint Service should allocate another replica on another node while these nodes are not available. Please check section `3.1.7.2 Non-Collocated Checkpoints` of the CKPT specification.
>
> For example, take the case of an application on a PL that is in the middle of writing to non-collocated checkpoint sections (the physical replicas exist only on the SCs): what will happen to the application on the PL?
OK, let us consider that the user agreed to lose the checkpoint and wants to recreate it: what will happen to the cpnd DB on the PL, and what about the complexity involved in that (clean-up)? This will lead to a lot of maintainability issues.
>
> On top of that, the CKPT SAF specification only says that a non-collocated checkpoint and all its sections should survive as long as the Checkpoint Service is running in the cluster, and the replica is USER private data (not OpenSAF state); losing any USER private data is not acceptable.
>
> [NhatPham2] According to SAI-AIS-CKPT-B.02.02 (chapter 3.1.8 Persistence of Checkpoints):
>
> "As has been stated in Section 2.1 on page 13, the Checkpoint Service typically stores checkpoint data in the main memory of the nodes. *Regardless of the retention time, a checkpoint and all its sections do not survive if the Checkpoint Service stops running on all nodes hosting replicas for this checkpoint. The stop of the Checkpoint Service can be caused by administrative actions or node failures*."
>
> This states that the checkpoint does not survive in case the nodes hosting its replicas fail (i.e. the SCs in our case).
>
> [AVM] If we read further, section `3.1.7.2 Non-Collocated Checkpoints` explains with an example:
>
> "For example, if two replicas exist at a certain point in time, and the node hosting one of these replicas is administratively taken out of service, the Checkpoint Service may allocate another replica on another node while this node is not available."
>
> [NhatPham3] I think this example is there to support the idea of enhancing the availability of checkpoints by creating multiple replicas. Furthermore, it talks about administrative actions, while the headless state is about multiple node failures.
>
> @Anders: What do you think?
>
> Regarding the case you mentioned about the lost checkpoint: what will happen to the cpnd DB on the PL?
> With this patch the CPND detects unrecoverable checkpoints and deletes them all from the DB when the headless state happens.
>
> [AVM] I know. I was saying that maintaining such a flow, involving the transport `no active timer`, will open up a lot of new issues in CPSV, and this becomes a code-maintainability issue. For example:
>
> 1) If both SCs rejoin quickly (below the `no active timer` timeout, I think), we will end up not deleting the DB; to address this we need to collect evidence to detect that the headless state happened.
>
> [NhatPham3] I'm not sure this is really a case. But if so, this problem impacts the whole system, not just CPSV, regardless of the headless state.
>
> @Anders: What do you think?
>
> *Comment 3*: This is about the case where the checkpoint node director (cpnd) crashes during the headless state. In this case the cpnd can't finish starting because it can't initialize the CLM service. Then, after a timeout, AMF triggers a restart again. Finally, the node is rebooted.
>
> It is expected that this problem should not lead to a node reboot.
>
> Action: No change in code. This is a limitation of the system during the headless state.
>
> [AVM] Code changes are required: the CPSV CLM integration code needs to be revisited to handle TRY_AGAIN.
>
> [NhatPham2] Agree. The CPND code will be updated to re-initialize CLM on the TRY_AGAIN fault code.
>
> If you agree with the summary above, I'll update the code and send out the V3 for review.
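For reference, the TRY_AGAIN handling being agreed on here could be sketched roughly as below. This is an illustration only: the error-code values are the ones from saAis.h, but a stub stands in for the real saClmInitialize (which takes a handle, a callback structure, and a version) so the loop is self-contained.

```c
#include <unistd.h>

/* Subset of the SAF return codes involved here (values as in saAis.h). */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_UNAVAILABLE = 31
} SaAisErrorT;

/* Stub standing in for saClmInitialize(): pretend CLM reports
 * TRY_AGAIN twice (no active SC yet) and then succeeds. */
static int stub_calls;
static SaAisErrorT stub_clm_initialize(void)
{
    return (++stub_calls < 3) ? SA_AIS_ERR_TRY_AGAIN : SA_AIS_OK;
}

/* Bounded retry loop: keep re-trying the CLM initialization while it
 * reports TRY_AGAIN, instead of treating the error as fatal (the fatal
 * path is what escalated into the AMF-driven node reboot discussed in
 * this thread). */
SaAisErrorT cpnd_clm_init_with_retry(int max_attempts)
{
    SaAisErrorT rc = SA_AIS_ERR_TRY_AGAIN;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        rc = stub_clm_initialize();
        if (rc != SA_AIS_ERR_TRY_AGAIN)
            break;
        usleep(100 * 1000); /* back off briefly before retrying */
    }
    return rc;
}
```

In real code the retry would of course sit around the actual saClmInitialize call, and the back-off/attempt limit would need to be balanced against the AMF component registration timer.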
> Best regards,
> Nhat Pham
>
> *From:* Anders Widell [mailto:[email protected]]
> *Sent:* Wednesday, February 24, 2016 9:26 PM
> *To:* Nhat Pham <[email protected]>; 'A V Mahesh' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> See my comments inline, marked [AndersW3].
>
> regards,
> Anders Widell
>
> On 02/24/2016 07:32 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comments below.
>
> Best regards,
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Wednesday, February 24, 2016 11:06 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> If component (CPND) restart is allowed while the controllers are absent, and we are going to change the return value to SA_AIS_ERR_TRY_AGAIN before requesting CLM, we need to get clarification from the AMF guys on a few things. Why? Because if CPND is stuck on SA_AIS_ERR_TRY_AGAIN and the component restart times out, then AMF will restart the component again (this becomes cyclic), and after the configured saAmfSGCompRestartMax value the node goes for a reboot as the next level of escalation. In that case we may require changes in AMF as well, so as not to act on the component restart timeout while the controllers are absent (I am not
sure whether that is a deviation from the AMF specification).
>
> [Nhat Pham] In the headless state, I'm not sure about this either.
>
> @Anders: Would you have comments about this?
>
> [AndersW3] Ok, first of all I would like to point out that normally, the OpenSAF checkpoint node director should not crash. So we are talking about a situation where multiple faults have occurred: first both the active and the standby system controllers have died, and then shortly afterwards - before we have a new active system controller - the checkpoint node director also crashes. Sure, these may not be totally independent events, but still there are a lot of faults that have happened within a short period of time. We should test the node director and make sure it doesn't crash in this type of scenario.
>
> Now, let's consider the case where we have a fault in the node director that causes it to crash during the headless state. The general philosophy of the headless feature is that when things work fine - i.e. in the absence of faults - we should be able to continue running while the system controllers are absent. However, if a fault happens during the headless state, we may not be able to recover from the fault until there is an active system controller. AMF does provide support for restarting components, but as you have pointed out, the node director will be stuck in a TRY_AGAIN loop immediately after it has been restarted. So this means that if the node director crashes during the headless state, we have lost the checkpoint functionality on that node and we will not get it back until there is an active system controller. Other services like IMM will still work for a while, but AMF will, as you say, eventually escalate the checkpoint node director failure to a node restart and then the whole node is gone. The node will not come back until we have an active system controller.
So to summarize: there is very limited support for recovering from faults that happen during the headless state. The full recovery will not happen until we have an active system controller.
>
> Please do incorporate the current comments (from a design perspective) and republish the patch. I will re-test the V3 patch and provide review comments on functional issues/bugs if I find any.
>
> One important note: in the new patch, let us not have the complexity of allowing non-collocated checkpoint creation and then documenting that in some scenarios the non-collocated checkpoint replicas are not recoverable. Why? Because a replica is USER private data (not OpenSAF state), and losing USER private data is not acceptable. So let us keep the scope of the CPSV service as non-collocated checkpoint creation NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; it is currently not configurable, so there is no chance of a run-time configuration change).
>
> We can provide support for non-collocated checkpoints in subsequent enhancements with a solution like also creating a replica on the lowest-node-ID PL (a maximum of three replicas in the cluster regardless of where the non-collocated checkpoint is opened).
>
> So for now, regardless of whether the heads (SCs) exist or not, CPSV should return SA_AIS_ERR_NOT_SUPPORTED in an IMMSV_SC_ABSENCE_ALLOWED-enabled cluster, and let us document it as well.
>
> [Nhat Pham] The patch is to limit losing replicas and checkpoints in case of the headless state.
>
> In case both replicas are located on the SCs and they reboot, losing the checkpoint is unpreventable with the current design after the headless state.
>
> Even if we implement the proposal "a maximum of three replicas in the cluster regardless of where the non-collocated checkpoint is opened", there is still the case where the checkpoint is lost. Ex.
The SCs and the PL which hosts the replica reboot at the same time.
>
> In case IMMSV_SC_ABSENCE_ALLOWED is disabled, if both SCs reboot, the whole cluster reboots. Then the checkpoint is lost.
>
> What I mean is that there are cases where the checkpoint is lost. The point is what we can do to limit losing data.
>
> As for the proposal of rejecting the creation of non-collocated checkpoints in case IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to an incompatibility problem.
>
> @Anders: What do you think about rejecting the creation of non-collocated checkpoints in case IMMSV_SC_ABSENCE_ALLOWED is enabled?
>
> [AndersW3] No, I think we ought to support non-collocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-collocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must accept the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way as the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.
>
> -AVM
>
> On 2/24/2016 6:51 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Do you have any further comments?
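For what it's worth, the client-side handling Anders describes (treating a checkpoint whose replicas were all lost the same way as one whose sections have expired: re-open/re-create and retry) could be sketched like this. The stubs are placeholders for the real saCkptSectionRead and saCkptCheckpointOpen calls; only the error-handling shape is the point.

```c
#include <stdbool.h>

/* Subset of the SAF return codes involved (values as in saAis.h). */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_NOT_EXIST = 12
} SaAisErrorT;

/* Stub replica state: false models "all replicas were lost while the
 * cluster was headless". */
static bool replica_exists = false;

/* Placeholder for saCkptSectionRead(). */
static SaAisErrorT stub_section_read(void)
{
    return replica_exists ? SA_AIS_OK : SA_AIS_ERR_NOT_EXIST;
}

/* Placeholder for re-opening/re-creating the checkpoint (in real code,
 * saCkptCheckpointOpen with the SA_CKPT_CHECKPOINT_CREATE flag). */
static SaAisErrorT stub_checkpoint_reopen(void)
{
    replica_exists = true; /* re-creation restores a replica */
    return SA_AIS_OK;
}

/* Client pattern: on NOT_EXIST after a headless period, re-create the
 * checkpoint and retry the access, just as a client already has to do
 * when sections were deleted on retention-time expiry. */
SaAisErrorT read_with_recovery(void)
{
    SaAisErrorT rc = stub_section_read();
    if (rc == SA_AIS_ERR_NOT_EXIST && stub_checkpoint_reopen() == SA_AIS_OK)
        rc = stub_section_read();
    return rc;
}
```

Note that the data in the re-created checkpoint is of course gone; this pattern only restores the checkpoint object so the application can continue writing.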
> Best regards,
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Monday, February 22, 2016 10:37 AM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> >> BTW, have you finished the review and test?
>
> I will finish by today.
>
> -AVM
>
> On 2/22/2016 7:48 AM, Nhat Pham wrote:
>
> Hi Mahesh and Anders,
>
> Please see my comment below.
>
> BTW, have you finished the review and test?
>
> Best regards,
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Friday, February 19, 2016 2:28 PM
> *To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>; 'Minh Chau H' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
> Could you please give more detailed information about the steps to reproduce the problem below? Thanks.
>
> Don't see this as a specific bug; we need to look at the issue from the point of view of a CLM-integrated service. Considering Anders Widell's explanation of CLM application behavior during the headless state, we need to reintegrate CPND with CLM (before this headless-state feature there was no case of CPND existing in the absence of CLMD, but now there is).
> And this will be consistent across all the services integrated with CLM (you may need some changes in CLM also).
>
> [Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case.
>
> @Anders: What do you think?
>
> To start with, let us consider the case where CPND is restarted on a payload (PL) during the headless state while an application is running on that PL.
>
> [Nhat Pham] Regarding the CPND as a CLM application, I'm not sure what it can do in this case. In case it restarts, it is monitored by AMF. If it blocks for too long, AMF will also trigger a node reboot.
>
> In my test case, the CPND gets blocked by CLM. It doesn't get out of saClmInitialize. How do you get the "ER cpnd clm init failed with return value:31"?
>
> Following is the cpnd trace:
>
> Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
> Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
> Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
> Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
> Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
> Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
> Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
> Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init
>
> -AVM
>
> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> Could you please give more detailed information about the steps to reproduce the problem below? Thanks.
> Best regards,
> Nhat Pham
>
> *From:* A V Mahesh [mailto:[email protected]]
> *Sent:* Friday, February 19, 2016 1:06 PM
> *To:* Anders Widell <[email protected]>; Nhat Pham <[email protected]>; 'Minh Chau H' <[email protected]>
> *Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>
> *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Anders Widell,
> Thanks for the detailed explanation about CLM during the headless state.
>
> Hi Nhat Pham,
>
> Comment 3:
> Please see below the problem I was describing; I am now seeing it in the absence of CLMD (during the headless state). So CPND/CLMA need to address the case below: currently the cpnd CLM init fails with return value SA_AIS_ERR_UNAVAILABLE, but it should be SA_AIS_ERR_TRY_AGAIN.
>
> ==================================================
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
> Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
> Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
> *Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
> Feb 19 11:18:28 PL-4
osafckptnd[7718]: ER cpnd init failed
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
> Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed*
> Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
> Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' got Inst failed
> Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
> Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
> Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
> ==================================================
>
> -AVM
>
> On 2/15/2016 5:11 PM, Anders Widell wrote:
>
> Hi!
>
> Please find my answer inline, marked [AndersW].
>
> regards,
> Anders Widell
>
> On 02/15/2016 10:38 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> It's good. Thank you. :)
>
> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after the headless state; I can provide you more on this once I complete my review/testing.)
>
> [Nhat] To make cloud resilience work, you need the patches from the other services (log, amf, clm, ntf).
> @Minh: I heard that you created a tar file which includes all patches. Could you please send it to Mahesh? Thanks.
>
> [AVM] I understand that. Before I comment more on this, please allow me to understand: I am still not very clear on the headless design in detail. For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
>
> [Nhat] I don't know much about this.
> @Anders: Could you please comment on this? Thanks.
>
> [AndersW] First of all, keep in mind that the "headless" state should ideally not last a very long time. Once we have the spare SC feature in place (ticket [#79]), a new SC should become active within a matter of a few seconds after we have lost both the active and the standby SC.
>
> I think you should view the state of the cluster in the headless state in the same way as you view the state of the cluster during a failover between the active and the standby SC. Imagine that the active SC dies. It takes the standby SC 1.5 seconds to detect the failure of the active SC (this is due to the TIPC timeout).
If you have configured the PROMOTE_ACTIVE_TIMER, there is an additional delay before the standby takes over as active. What is the state of the cluster during the time after the active SC failed and before the standby takes over?
>
> The state of the cluster while it is headless is very similar. The difference is that this state may last a little bit longer (though not more than a few seconds, until one of the spare SCs becomes active). Another difference is that we may have lost some state. With a "perfect" implementation of the headless feature we should not lose any state at all, but with the current set of patches we do lose state.
>
> So specifically, if we talk about cluster membership and ask the question: is a particular PL a member of the cluster or not during the headless state? Well, if you ask CLM about this during the headless state, then you will not know - because CLM doesn't provide any service during the headless state. If you keep retrying your query to CLM, you will eventually get an answer - but you will not get this answer until there is an active SC again and we have exited the headless state. When viewed in this way, the answer to the question about a node's membership is undefined during the headless state, since CLM will not provide you with any answer until there is an active SC.
>
> However, if you asked CLM about the node's cluster membership status before the cluster went headless, you probably saved a cached copy of the cluster membership state. Maybe you also installed a CLM track callback and intend to update this cached copy every time the cluster membership status changes. The question then is: can you continue using this cached copy of the cluster membership state during the headless state?
The answer is YES: since CLM doesn't provide any service during the headless state, it also means that the cluster membership view cannot change during this time. Nodes can of course reboot or die, but CLM will not notice and hence the cluster view will not be updated. You can argue that this is bad because the cluster view doesn't reflect reality, but notice that this will always be the case. We can never propagate information instantaneously, and detection of node failures will take 1.5 seconds due to the TIPC timeout. You can never be sure that a node is alive at this very moment just because CLM tells you that it is a member of the cluster. If we are unfortunate enough to lose both system controller nodes simultaneously, updates to the cluster membership view will be delayed a few seconds longer than usual.
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, February 15, 2016 11:19 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> How did your holiday go?
>
> Please find my comments below.
>
> On 2/15/2016 8:43 AM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> For comment 1, the patch will be updated accordingly.
>
> [AVM] Please hold; I will provide more comments this week, so we can have a consolidated V3.
>
> For comment 2, I think the CKPT service will not be backward compatible if scAbsenceAllowed is true. The client can't create a non-collocated checkpoint on the SCs.
> Furthermore, this solution only protects the CKPT service from the case "the non-collocated checkpoint is created on an SC"; there are still cases where the replicas are completely lost. Ex:
>
> - The non-collocated checkpoint is created on a PL. The PL reboots. Both replicas now reside on the SCs. Then the headless state happens. All replicas are lost.
> - The non-collocated checkpoint has its active replica on a PL and this PL restarts during the headless state.
> - The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.
>
> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after the headless state; I can provide you more on this once I complete my review/testing.)
>
> In this case, all replicas are lost and the client has to create the checkpoint again.
>
> In case multiple nodes (including the SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the cases where that is possible. What do you think?
>
> [AVM] I understand that. Before I comment more on this, please allow me to understand: I am still not very clear on the headless design in detail.
>
> For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
> - If they are not considered cluster nodes, the Checkpoint Service API should leverage the SA Forum Cluster Membership Service, and the APIs can fail with SA_AIS_ERR_UNAVAILABLE.
>
> - If they are considered cluster nodes, we need to follow all the rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>
> So give me some more time to review it completely, so that we can have a consolidated patch V3.
>
> -AVM
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, February 12, 2016 11:10 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]; Beatriz Brandao <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Comment 2:
>
> After incorporating comment one, all the limitations should be prevented based on whether the Hydra configuration is enabled in IMM.
>
> For example: if some application is trying to create a non-collocated checkpoint whose active replica would be generated/located on an SC, then regardless of whether the heads (SCs) exist or not, it should return SA_AIS_ERR_NOT_SUPPORTED.
>
> In other words, rather than allowing a non-collocated checkpoint to be created while the heads (SCs) exist, and the non-collocated checkpoint becoming unrecoverable after the heads (SCs) rejoin:
>
> =============================================================================
> Limitation: The CKPT service doesn't support recovering checkpoints in the following cases:
> . The checkpoint which is unlinked before headless.
> . The non-collocated checkpoint has its active replica located on an SC.
> .
The non-collocated checkpoint has its active replica located on a PL and this PL restarts during the headless state.
>
> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint in these cases. The client must re-open the checkpoint.
> =============================================================================
>
> -AVM
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
>
> Hi,
>
> I just started reviewing the patch. I will give comments as soon as I come across any, to save some time.
>
> Comment 1:
> This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM (attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
>
> Please see, for example, how the LOG/AMF services implemented it.
>
> -AVM
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
>
> Hi Mahesh,
>
> As described in the README, the CKPT service returns the SA_AIS_ERR_TRY_AGAIN fault code in this case. I guess it's the same for other services.
>
> @Anders: Could you please confirm this?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 2:11 PM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>
> - The behavior of the application will be consistent with other SAF services, like IMM/AMF behavior during the headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
> > > In the absence of the Directors (SCs), what are the expected
> > > return values of the SAF APIs (for all services) that are not in a
> > > position to provide service at that moment?
> > >
> > > I think all services should return the same SAF errors. I think we
> > > currently don't have this; maybe Anders Widell will help us.
> > >
> > > -AVM
> > >
> > > On 1/29/2016 11:45 AM, Nhat Pham wrote:
> > > > Hi Mahesh,
> > > >
> > > > Please see the attachment for the README. Let me know if there is
> > > > any more information required.
> > > >
> > > > Regarding your comments:
> > > > - During the headless state, applications may behave like during
> > > >   the CPND restart case.
> > > > [Nhat] Headless state and CPND restart are different events.
> > > > Thus, the behavior is different.
> > > > Headless state is a case where both SCs go down.
> > > >
> > > > - The behavior of the application will be consistent with other
> > > >   SAF services, like imm/amf behavior during the headless state.
> > > > [Nhat] I'm not clear what you mean by "consistent"?
> > > >
> > > > Best regards,
> > > > Nhat Pham
> > > >
> > > > -----Original Message-----
> > > > From: A V Mahesh [mailto:[email protected]]
> > > > Sent: Friday, January 29, 2016 11:12 AM
> > > > To: Nhat Pham <[email protected]>; [email protected]
> > > > Cc: [email protected]
> > > > Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
> > > > preserving and recovering checkpoint replicas during headless
> > > > state V2 [#1621]
> > > >
> > > > Hi Nhat Pham,
> > > >
> > > > I started reviewing this patch, so could you please provide a
> > > > README file with the scope and limitations? That will help to
> > > > define the testing/reviewing scope.
> > > >
> > > > The following are the minimum things we can keep in mind while
> > > > reviewing/accepting the patch:
> > > >
> > > > - Not affecting existing functionality.
> > > > - During the headless state, applications may behave like during
> > > >   the CPND restart case.
> > > > - The minimum functionality of the application works.
> > > > - The behavior of the application will be consistent with other
> > > >   SAF services, like imm/amf behavior during the headless state.
> > > >
> > > > So please do provide any additional details in the README if any
> > > > of the above is deviated from, to allow users to know about the
> > > > limitations/deviations.
> > > >
> > > > -AVM
> > > >
> > > > On 1/4/2016 3:15 PM, Nhat Pham wrote:
> > > > > Summary: cpsv: Support preserving and recovering checkpoint
> > > > > replicas during headless state [#1621]
> > > > > Review request for Trac Ticket(s): #1621
> > > > > Peer Reviewer(s): [email protected];
> > > > >                   [email protected]
> > > > > Pull request to: [email protected]
> > > > > Affected branch(es): default
> > > > > Development branch: default
> > > > >
> > > > > --------------------------------
> > > > > Impacted area       Impact y/n
> > > > > --------------------------------
> > > > > Docs                    n
> > > > > Build system            n
> > > > > RPM/packaging           n
> > > > > Configuration files     n
> > > > > Startup scripts         n
> > > > > SAF services            y
> > > > > OpenSAF services        n
> > > > > Core libraries          n
> > > > > Samples                 n
> > > > > Tests                   n
> > > > > Other                   n
> > > > >
> > > > > Comments (indicate scope for each "y" above):
> > > > > ---------------------------------------------
> > > > >
> > > > > changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
> > > > > Author: Nhat Pham <[email protected]>
> > > > > Date: Mon, 04 Jan 2016 16:34:33 +0700
> > > > >
> > > > > cpsv: Support preserving and recovering checkpoint replicas
> > > > > during headless state [#1621]
> > > > >
> > > > > Background:
> > > > > -----------
> > > > > This enhancement supports preserving checkpoint replicas in
> > > > > case both SCs go down (headless state) and recovering replicas
> > > > > in case one of the SCs comes up again.
> > > > > If both SCs go down, checkpoint replicas on surviving nodes
> > > > > still remain. When an SC is available again, surviving replicas
> > > > > are automatically registered to the SC checkpoint database.
> > > > > Content in surviving replicas is intact and synchronized to new
> > > > > replicas.
> > > > >
> > > > > When no SC is available, client API calls changing checkpoint
> > > > > configuration, which require SC communication, are rejected.
> > > > > Client API calls reading and writing existing checkpoint
> > > > > replicas still work.
> > > > >
> > > > > Limitation: The CKPT service does not support recovering
> > > > > checkpoints in the following cases:
> > > > > - The checkpoint which is unlinked before headless.
> > > > > - The non-collocated checkpoint whose active replica is
> > > > >   located on an SC.
> > > > > - The non-collocated checkpoint whose active replica is
> > > > >   located on a PL, and this PL restarts during the headless
> > > > >   state.
> > > > > In these cases, the checkpoint replica is destroyed. The fault
> > > > > code SA_AIS_ERR_BAD_HANDLE is returned when the client
> > > > > accesses the checkpoint, and the client must re-open the
> > > > > checkpoint.
> > > > >
> > > > > While in the headless state, accessing checkpoint replicas
> > > > > does not work if the node which hosts the active replica goes
> > > > > down. It will be back working when an SC is available again.
> > > > >
> > > > > Solution:
> > > > > ---------
> > > > > The solution for this enhancement includes 2 parts:
> > > > >
> > > > > 1. To destroy the un-recoverable checkpoints described above
> > > > >    when both SCs are down: When both SCs are down, the CPND
> > > > >    deletes un-recoverable checkpoint nodes and replicas on
> > > > >    PLs. Then it requests the CPA to destroy the corresponding
> > > > >    checkpoint node by using the new message
> > > > >    CPA_EVT_ND2A_CKPT_DESTROY.
> > > > >
> > > > > 2. To update the CPD with checkpoint information: When an
> > > > >    active SC is up after headless, the CPND will update the
> > > > >    CPD with checkpoint information by using the new message
> > > > >    CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
> > > > >    CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND will
> > > > >    create a new ckpt_id for the checkpoint, which might be
> > > > >    different from the current ckpt id, if
> > > > >    CPD_EVT_ND2D_CKPT_CREATE is used. The CPD collects
> > > > >    checkpoint information within 6s. During this update time,
> > > > >    the following requests are rejected with fault code
> > > > >    SA_AIS_ERR_TRY_AGAIN:
> > > > >    - CPD_EVT_ND2D_CKPT_CREATE
> > > > >    - CPD_EVT_ND2D_CKPT_UNLINK
> > > > >    - CPD_EVT_ND2D_ACTIVE_SET
> > > > >    - CPD_EVT_ND2D_CKPT_RDSET
> > > > >
> > > > > Complete diffstat:
> > > > > ------------------
> > > > > osaf/libs/agents/saf/cpa/cpa_proc.c        |   52 ++++
> > > > > osaf/libs/common/cpsv/cpsv_edu.c           |   43 +++
> > > > > osaf/libs/common/cpsv/include/cpd_cb.h     |    3 +
> > > > > osaf/libs/common/cpsv/include/cpd_imm.h    |    1 +
> > > > > osaf/libs/common/cpsv/include/cpd_proc.h   |    7 +
> > > > > osaf/libs/common/cpsv/include/cpd_tmr.h    |    3 +-
> > > > > osaf/libs/common/cpsv/include/cpnd_cb.h    |    1 +
> > > > > osaf/libs/common/cpsv/include/cpnd_init.h  |    2 +
> > > > > osaf/libs/common/cpsv/include/cpsv_evt.h   |   20 ++
> > > > > osaf/services/saf/cpsv/cpd/Makefile.am     |    3 +-
> > > > > osaf/services/saf/cpsv/cpd/cpd_evt.c       |  229 ++++++++++
> > > > > osaf/services/saf/cpsv/cpd/cpd_imm.c       |  112 ++++++
> > > > > osaf/services/saf/cpsv/cpd/cpd_init.c      |   20 +-
> > > > > osaf/services/saf/cpsv/cpd/cpd_proc.c      |  309 ++++++++++++
> > > > > osaf/services/saf/cpsv/cpd/cpd_tmr.c       |    7 +
> > > > > osaf/services/saf/cpsv/cpnd/cpnd_db.c      |   16 +
> > > > > osaf/services/saf/cpsv/cpnd/cpnd_evt.c     |   22 ++
> > > > > osaf/services/saf/cpsv/cpnd/cpnd_init.c    |   23 +-
> > > > > osaf/services/saf/cpsv/cpnd/cpnd_mds.c     |   13 +
> > > > > osaf/services/saf/cpsv/cpnd/cpnd_proc.c    |  314 ++++++++---
> > > > > 20 files changed, 1189 insertions(+), 11 deletions(-)
> > > > >
> > > > > Testing Commands:
> > > > > -----------------
> > > > > -
> > > > >
> > > > > Testing, Expected Results:
> > > > > --------------------------
> > > > > -
> > > > >
> > > > > Conditions of Submission:
> > > > > -------------------------
> > > > > <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
> > > > >
> > > > > Arch        Built  Started  Linux distro
> > > > > -------------------------------------------
> > > > > mips          n      n
> > > > > mips64        n      n
> > > > > x86           n      n
> > > > > x86_64        n      n
> > > > > powerpc       n      n
> > > > > powerpc64     n      n
> > > > >
> > > > > Reviewer Checklist:
> > > > > -------------------
> > > > > [Submitters: make sure that your review doesn't trigger any
> > > > > checkmarks!]
> > > > >
> > > > > Your checkin has not passed review because (see checked
> > > > > entries):
> > > > >
> > > > > ___ Your RR template is generally incomplete; it has too many
> > > > >     blank entries that need proper data filled in.
> > > > > ___ You have failed to nominate the proper persons for review
> > > > >     and push.
> > > > > ___ Your patches do not have proper short+long headers.
> > > > > ___ You have grammar/spelling in your header that is
> > > > >     unacceptable.
> > > > > ___ You have exceeded a sensible line length in your
> > > > >     headers/comments/text.
> > > > > ___ You have failed to put a proper Trac Ticket # into your
> > > > >     commits.
> > > > > ___ You have incorrectly put/left internal data in your
> > > > >     comments/files (i.e. internal bug tracking tool IDs,
> > > > >     product names etc).
> > > > > ___ You have not given any evidence of testing beyond basic
> > > > >     build tests. Demonstrate some level of runtime or other
> > > > >     sanity testing.
> > > > > ___ You have ^M present in some of your files. These have to
> > > > >     be removed.
> > > > > ___ You have needlessly changed whitespace or added whitespace
> > > > >     crimes like trailing spaces, or spaces before tabs.
> > > > > ___ You have mixed real technical changes with whitespace and
> > > > >     other cosmetic code cleanup changes. These have to be
> > > > >     separate commits.
> > > > > ___ You need to refactor your submission into logical chunks;
> > > > >     there is too much content in a single commit.
> > > > > ___ You have extraneous garbage in your review (merge commits
> > > > >     etc).
> > > > > ___ You have giant attachments which should never have been
> > > > >     sent; instead you should place your content in a public
> > > > >     tree to be pulled.
> > > > > ___ You have too many commits attached to an e-mail; resend as
> > > > >     threaded commits, or place in a public tree for a pull.
> > > > > ___ You have resent this content multiple times without a
> > > > >     clear indication of what has changed between each re-send.
> > > > > ___ You have failed to adequately and individually address all
> > > > >     of the comments and change requests that were proposed in
> > > > >     the initial review.
> > > > > ___ You have a misconfigured ~/.hgrc file (i.e. username,
> > > > >     email etc).
> > > > > ___ Your computer has a badly configured date and time,
> > > > >     confusing the threaded patch review.
> > > > > ___ Your changes affect the IPC mechanism, and you don't
> > > > >     present any results for the in-service upgradability test.
> > > > > ___ Your changes affect the user manual and documentation;
> > > > >     your patch series does not contain the patch that updates
> > > > >     the Doxygen manual.
> ------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
