Hi Mahesh,

Please see my answers below, marked [NhatPham4].
Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Thursday, February 25, 2016 4:31 PM
To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
Cc: 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>; [email protected]
Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

>> With this patch the CPND detects un-recoverable checkpoints and deletes them all from the DB in case the headless state happens.

By the way, I haven't tested some cases yet; can you clarify the points below:

- Which error will be received by a CPSV application on a PL for an unrecoverable checkpoint?

- Is the SaCkptHandleT still valid after head recovery?

[NhatPham4] It is still valid during the headless state and after head recovery. During headless state, saCkptCheckpointOpen() returns SA_AIS_ERR_TRY_AGAIN; it works again after head recovery.

- Does accessing the SaCkptCheckpointHandleT return SA_AIS_ERR_BAD_HANDLE after head recovery?

[NhatPham4] Yes, it returns SA_AIS_ERR_BAD_HANDLE during the headless state and after head recovery. But the SaCkptHandleT is still valid, so the application can re-create the checkpoint.

-AVM

On 2/25/2016 12:43 PM, A V Mahesh wrote:
> Hi Nhat Pham,
>
> Please see my comment.
>
> -AVM
>
> On 2/25/2016 12:07 PM, Nhat Pham wrote:
>> Hi Mahesh,
>>
>> Please see my comments below, marked [NhatPham2].
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Thursday, February 25, 2016 11:26 AM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> Please see my comment below.
>>
>> -AVM
>>
>> On 2/25/2016 7:54 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Would you agree with the comments below?
>>
>> To summarize, these are the comments so far:
>>
>> Comment 1: This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM, attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>>
>> Action: The code will be updated accordingly.
>>
>> Comment 2: Keep the scope of the CPSV service such that non-collocated checkpoint creation is NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; it is currently not configurable, so there is no chance of a run-time configuration change).
>>
>> Action: No change in code. The CPSV keeps supporting non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>>
>> [AndersW3] No, I think we ought to support non-collocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-collocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must accept the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way as the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.
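To make the handle behavior in the [NhatPham4] answers near the top of this message concrete, here is a minimal client-side sketch. It is illustration only, not code from the patch: it assumes the standard saCkpt.h API, and the retry cadence is invented.

/*
 * Illustration only (not code from the patch): retry on
 * SA_AIS_ERR_TRY_AGAIN while the cluster is headless, and re-create the
 * checkpoint through the still-valid SaCkptHandleT once the checkpoint
 * handle starts returning SA_AIS_ERR_BAD_HANDLE after head recovery.
 */
#include <unistd.h>
#include <saCkpt.h>

static SaAisErrorT reopen_checkpoint(SaCkptHandleT ckptHandle,
				     const SaNameT *name,
				     const SaCkptCheckpointCreationAttributesT *attrs,
				     SaCkptCheckpointHandleT *ckpt)
{
	const SaTimeT timeout = 1000000000LL; /* 1 s, in nanoseconds */
	SaAisErrorT rc;

	do {
		/* SA_CKPT_CHECKPOINT_CREATE re-creates the checkpoint if
		 * its replicas were destroyed as unrecoverable. */
		rc = saCkptCheckpointOpen(ckptHandle, name, attrs,
					  SA_CKPT_CHECKPOINT_CREATE |
					  SA_CKPT_CHECKPOINT_READ |
					  SA_CKPT_CHECKPOINT_WRITE,
					  timeout, ckpt);
		if (rc == SA_AIS_ERR_TRY_AGAIN)
			sleep(1); /* headless: back off and retry */
	} while (rc == SA_AIS_ERR_TRY_AGAIN);

	return rc;
}

After head recovery, a client whose write fails with SA_AIS_ERR_BAD_HANDLE would call reopen_checkpoint() with its original creation attributes and retry; any data written before the headless period is gone in that case, which is exactly the point debated below.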
>> [AVM] I am not in agreement with either comment. We cannot handle it in a way similar to the section-expiration case here: when sections expire, the checkpoint replica still exists and only the sections are deleted.
>>
>> The CPSV specification says that if two replicas exist (in our case only on the SCs) at a certain point in time, and the nodes hosting both of these replicas are administratively taken out of service, the Checkpoint Service should allocate another replica on another node while those nodes are unavailable; please check section `3.1.7.2 Non-Collocated Checkpoints` of the CPSV specification.
>>
>> For example, take the case of an application on a PL that is in the middle of writing to non-collocated checkpoint sections (the physical replicas exist only on the SCs). What will happen to the application on the PL? OK, let us assume the user agrees to lose the checkpoint and wants to re-create it: then what happens to the CPND DB on the PL, and what complexity is involved in cleaning it up? This will lead to a lot of maintainability issues.
>>
>> On top of that, the CKPT SAF specification only says that a non-collocated checkpoint and all its sections survive as long as the Checkpoint Service is running on the cluster. The replica is USER private data (not OpenSAF state), and losing any USER private data is not acceptable.
>>
>> [NhatPham2] According to SAI-AIS-CKPT-B.02.02 (chapter 3.1.8, Persistence of Checkpoints):
>>
>> "As has been stated in Section 2.1 on page 13, the Checkpoint Service typically stores checkpoint data in the main memory of the nodes. Regardless of the retention time, a checkpoint and all its sections do not survive if the Checkpoint Service stops running on all nodes hosting replicas for this checkpoint. The stop of the Checkpoint Service can be caused by administrative actions or node failures."
>>
>> This states that the checkpoint does not survive if the nodes hosting its replicas fail (i.e. the SCs in our case).
>>
> [AVM] If we read further, section `3.1.7.2 Non-Collocated Checkpoints` explains with an example:
>
> "For example, if two replicas exist at a certain point in time, and the node hosting one of these replicas is administratively taken out of service, the Checkpoint Service may allocate another replica on another node while this node is not available."
>
>> Regarding the case you mentioned about the lost checkpoint and what will happen to the CPND DB on the PL:
>>
>> With this patch the CPND detects un-recoverable checkpoints and deletes them all from the DB in case the headless state happens.
>>
> [AVM] I know. I was saying that maintaining such a flow, tied to the transport `no active timer`, will open up a lot of new issues in CPSV and becomes a code-maintainability problem. For example:
>
> 1) If both SCs rejoin quickly (within the `no active timer` timeout, I think, as it currently stands), we will end up not deleting the DB. To address this we need to collect evidence to detect that the headless state actually happened.
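A hypothetical sketch of the kind of evidence collection being asked for here; every name in it is invented for illustration, and none of it is the patch's actual mechanism:

/* Hypothetical sketch: let the CPND remember that it actually saw the
 * director disappear, instead of relying only on a timer. All names
 * here are invented; they are not the patch's real identifiers. */
#include <stdbool.h>
#include <time.h>

struct headless_evidence {
	bool cpd_is_up;    /* tracks MDS up/down events for the CPD   */
	bool was_headless; /* latched when the CPD went away          */
	time_t cpd_down_at; /* when we lost the active director       */
};

static void on_cpd_mds_event(struct headless_evidence *ev, bool up)
{
	if (!up) {
		ev->cpd_is_up = false;
		ev->was_headless = true; /* latch: survives a quick rejoin */
		ev->cpd_down_at = time(NULL);
	} else {
		ev->cpd_is_up = true;
		/* was_headless stays set until the post-recovery DB
		 * reconciliation has run, even if the SCs rejoined
		 * before any 'no active' timer expired. */
	}
}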
>> Comment 3: This is about the case where the checkpoint node director (CPND) crashes during the headless state. In this case the CPND cannot finish starting because it cannot initialize the CLM service.
>>
>> Then, after a timeout, AMF triggers a restart again. Finally, the node is rebooted.
>>
>> It is expected that this problem should not lead to a node reboot.
>>
>> Action: No change in code. This is a limitation of the system during the headless state.
>>
>> [AVM] Code changes are required: the CPSV CLM integration code needs to be revisited to handle TRY_AGAIN.
>>
>> [NhatPham2] Agree. The CPND code will be updated to re-initialize CLM on the TRY_AGAIN fault code.
>>
>> If you agree with the summary above, I'll update the code and send out V3 for review.
>>
>> Best regards,
>> Nhat Pham
>>
>> From: Anders Widell [mailto:[email protected]]
>> Sent: Wednesday, February 24, 2016 9:26 PM
>> To: Nhat Pham <[email protected]>; 'A V Mahesh' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> See my comments inline, marked [AndersW3].
>>
>> regards,
>> Anders Widell
>>
>> On 02/24/2016 07:32 AM, Nhat Pham wrote:
>>
>> Hi Mahesh and Anders,
>>
>> Please see my comments below.
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Wednesday, February 24, 2016 11:06 AM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> If component (CPND) restart is allowed while the controllers are absent, and we change the return value to SA_AIS_ERR_TRY_AGAIN before requesting CLM, we need to get clarification from the AMF people on a few things. If the CPND keeps getting SA_AIS_ERR_TRY_AGAIN and the component restart times out, AMF will restart the component again (this becomes cyclic), and after the configured saAmfSGCompRestartMax value the node goes for reboot as the next level of escalation. In that case we may require changes in AMF as well, so that it does not act on a component restart timeout while the controllers are absent (I am not sure whether that would be a deviation from the AMF specification).
>>
>> [Nhat Pham] In headless state, I'm not sure about this either.
>>
>> @Anders: Would you have comments about this?
>>
>> [AndersW3] Ok, first of all I would like to point out that normally, the OpenSAF checkpoint node director should not crash. So we are talking about a situation where multiple faults have occurred: first both the active and the standby system controllers have died, and then shortly afterwards - before we have a new active system controller - the checkpoint node director also crashes. Sure, these may not be totally independent events, but still there are a lot of faults that have happened within a short period of time. We should test the node director and make sure it doesn't crash in this type of scenario.
>>
>> Now, let's consider the case where we have a fault in the node director that causes it to crash during the headless state. The general philosophy of the headless feature is that when things work fine - i.e. in the absence of faults - we should be able to continue running while the system controllers are absent.
>> However, if a fault happens during the headless state, we may not be able to recover from the fault until there is an active system controller. AMF does provide support for restarting components, but as you have pointed out, the node director will be stuck in a TRY_AGAIN loop immediately after it has been restarted. So this means that if the node director crashes during the headless state, we have lost the checkpoint functionality on that node and we will not get it back until there is an active system controller. Other services like IMM will still work for a while, but AMF will, as you say, eventually escalate the checkpoint node director failure to a node restart and then the whole node is gone. The node will not come back until we have an active system controller. So to summarize: there is very limited support for recovering from faults that happen during the headless state. The full recovery will not happen until we have an active system controller.
>>
>> Please do incorporate the current comments (from the design perspective) and republish the patch. I will re-test the V3 patch and provide review comments on functional issues/bugs if I find any.
>>
>> One important note: in the new patch let us not have the complexity of allowing non-collocated checkpoint creation and then documenting that in some scenarios non-collocated checkpoint replicas are not recoverable. The replica is USER private data (not OpenSAF state), and losing USER private data is not acceptable.
>>
>> So let us keep the scope of the CPSV service such that non-collocated checkpoint creation is NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; it is currently not configurable, so there is no chance of a run-time configuration change).
>>
>> We can provide support for non-collocated checkpoints in subsequent enhancements, with a solution such as also creating a replica on the PL with the lowest node ID (at most three replicas in the cluster, regardless of where the non-collocated checkpoint is opened).
>>
>> So for now, regardless of whether the heads (SCs) exist or not, CPSV should return SA_AIS_ERR_NOT_SUPPORTED in an IMMSV_SC_ABSENCE_ALLOWED-enabled cluster, and let us document it as well.
>>
>> [Nhat Pham] The patch is meant to limit the loss of replicas and checkpoints in the headless case.
>>
>> In case both replicas are located on the SCs and they reboot, losing the checkpoint is unpreventable with the current design after the headless state.
>>
>> Even if we implement the proposal "max three replicas in the cluster regardless of where the non-collocated checkpoint is opened", there are still cases where the checkpoint is lost, e.g. the SCs and the PL hosting the replica reboot at the same time.
>>
>> With IMMSV_SC_ABSENCE_ALLOWED disabled, if both SCs reboot, the whole cluster reboots. Then the checkpoint is lost as well.
>>
>> What I mean is that there are cases where the checkpoint is lost; the point is what we can do to limit losing data.
>>
>> As for the proposal to reject creating non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to a backward-compatibility problem.
>>
>> @Anders: What do you think about rejecting creation of non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled?
>>
>> [AndersW3] No, I think we ought to support non-collocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set.
>> The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-collocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must accept the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way as the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.
>>
>> -AVM
>>
>> On 2/24/2016 6:51 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Do you have any further comments?
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, February 22, 2016 10:37 AM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi,
>>
>> >> BTW, have you finished the review and test?
>>
>> I will finish by today.
>>
>> -AVM
>>
>> On 2/22/2016 7:48 AM, Nhat Pham wrote:
>>
>> Hi Mahesh and Anders,
>>
>> Please see my comment below.
>>
>> BTW, have you finished the review and test?
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 19, 2016 2:28 PM
>> To: Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>; 'Minh Chau H' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>>
>> Could you please give more detailed information about the steps to reproduce the problem below? Thanks.
>>
>> Don't see this as a specific bug; we need to look at the issue from the point of view of a CLM-integrated service. Considering Anders Widell's explanation of CLM application behavior during the headless state, we need to re-integrate the CPND with CLM (before this headless-state feature there was no case of the CPND existing in the absence of the CLMD, but now there is).
>>
>> And this needs to be consistent across all services integrated with CLM (you may need some changes in CLM as well).
>>
>> [Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case.
>>
>> @Anders: What do you think?
>>
>> To start with, let us consider the case where the CPND is restarted on a PL during the headless state while an application is running on that PL.
>>
>> [Nhat Pham] Regarding the CPND as a CLM application, I'm not sure what it can do in this case. In case it restarts, it is monitored by AMF.
>>
>> If it blocks for too long, AMF will also trigger a node reboot.
>>
>> In my test case, the CPND gets blocked by CLM. It doesn't get out of saClmInitialize.
>> How did you get the "ER cpnd clm init failed with return value:31"?
>>
>> Following is the cpnd trace:
>>
>> Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
>> Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
>> Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
>> Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
>> Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
>> Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
>> Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
>> Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init
>>
>> -AVM
>>
>> On 2/19/2016 12:28 PM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Could you please give more detailed information about the steps to reproduce the problem below? Thanks.
>>
>> Best regards,
>> Nhat Pham
>>
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 19, 2016 1:06 PM
>> To: Anders Widell <[email protected]>; Nhat Pham <[email protected]>; 'Minh Chau H' <[email protected]>
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Anders Widell,
>> Thanks for the detailed explanation about CLM during the headless state.
>>
>> Hi Nhat Pham,
>>
>> Comment 3:
>> Please see below the problem I was anticipating; I am now seeing it during CLMD absence (during the headless state). The CPND/CLMA now needs to address the case below: currently the cpnd CLM init fails with return value SA_AIS_ERR_UNAVAILABLE, but it should be SA_AIS_ERR_TRY_AGAIN.
>>
>> ==================================================
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
>> Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
>> Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
>> Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed
>> Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
>> Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
>> Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
>> Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
>> Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
>> Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
>> Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
>> ==================================================
>>
>> -AVM
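A minimal sketch of the CPND-side retry that the agreement above ([NhatPham2]: re-initialize CLM on TRY_AGAIN instead of failing cpnd_lib_init) implies. The function name and retry cadence are invented; the call is the standard saClm.h one, here against the B.01.01 API.

/* Sketch only: keep retrying saClmInitialize() while CLM reports
 * SA_AIS_ERR_TRY_AGAIN (e.g. during the headless state), instead of
 * turning the error into a CPND instantiation failure. */
#include <unistd.h>
#include <saClm.h>

static SaAisErrorT cpnd_clm_init_with_retry(SaClmHandleT *clm_hdl,
					    const SaClmCallbacksT *cbs)
{
	SaAisErrorT rc;

	for (;;) {
		SaVersionT ver = { 'B', 1, 1 }; /* may be rewritten on error */
		rc = saClmInitialize(clm_hdl, cbs, &ver);
		if (rc != SA_AIS_ERR_TRY_AGAIN)
			break;	/* SA_AIS_OK, or a real error to escalate */
		sleep(1);	/* no CLM director yet; back off and retry */
	}
	return rc;
}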
On 2/15/2016 5:11 PM, Anders Widell wrote:

>> Hi!
>>
>> Please find my answer inline, marked [AndersW].
>>
>> regards,
>> Anders Widell
>>
>> On 02/15/2016 10:38 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> It's good. Thank you. :)
>>
>> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after headless state; I can provide more on this once I complete my review/testing.)
>>
>> [Nhat] To make cloud resilience work, you need the patches from the other services (log, amf, clm, ntf).
>> @Minh: I heard that you created a tar file which includes all patches. Could you please send it to Mahesh? Thanks.
>>
>> [AVM] I understand that. Before I comment more on this, please allow me to understand; I am still not very clear about the headless design in detail.
>> For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
>>
>> [Nhat] I don't know much about this.
>> @Anders: Could you please comment on this? Thanks.
>>
>> [AndersW] First of all, keep in mind that the "headless" state should ideally not last a very long time. Once we have the spare SC feature in place (ticket [#79]), a new SC should become active within a matter of a few seconds after we have lost both the active and the standby SC.
>>
>> I think you should view the state of the cluster in the headless state in the same way as you view the state of the cluster during a failover between the active and the standby SC. Imagine that the active SC dies. It takes the standby SC 1.5 seconds to detect the failure of the active SC (this is due to the TIPC timeout).
>> If you have configured the PROMOTE_ACTIVE_TIMER, there is an additional delay before the standby takes over as active. What is the state of the cluster during the time after the active SC has failed and before the standby takes over?
>>
>> The state of the cluster while it is headless is very similar. The difference is that this state may last a little bit longer (though not more than a few seconds, until one of the spare SCs becomes active). Another difference is that we may have lost some state. With a "perfect" implementation of the headless feature we should not lose any state at all, but with the current set of patches we do lose state.
>>
>> So specifically, if we talk about cluster membership and ask the question: is a particular PL a member of the cluster or not during the headless state? Well, if you ask CLM about this during the headless state, then you will not know - because CLM doesn't provide any service during the headless state. If you keep retrying your query to CLM, you will eventually get an answer - but you will not get this answer until there is an active SC again and we have exited the headless state. When viewed in this way, the answer to the question about a node's membership is undefined during the headless state, since CLM will not provide you with any answer until there is an active SC.
>>
>> However, if you asked CLM about the node's cluster membership status before the cluster went headless, you probably saved a cached copy of the cluster membership state. Maybe you also installed a CLM track callback and intend to update this cached copy every time the cluster membership status changes. The question then is: can you continue using this cached copy of the cluster membership state during the headless state? The answer is YES: since CLM doesn't provide any service during the headless state, it also means that the cluster membership view cannot change during this time. Nodes can of course reboot or die, but CLM will not notice, and hence the cluster view will not be updated. You can argue that this is bad because the cluster view doesn't reflect reality, but notice that this will always be the case. We can never propagate information instantaneously, and detection of node failures takes 1.5 seconds due to the TIPC timeout. You can never be sure that a node is alive at this very moment just because CLM tells you that it is a member of the cluster. If we are unfortunate enough to lose both system controller nodes simultaneously, updates to the cluster membership view will be delayed a few seconds longer than usual.
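To illustrate the cached-membership pattern Anders describes, here is a minimal sketch against the B.01.01 CLM track API. The cache layout and MAX_NODES are invented, and the saClmSelectionObjectGet()/saClmDispatch() plumbing that actually drives the callback is elided.

/* Sketch: cache the CLM cluster view via a track callback. The cache
 * keeps answering membership queries during the headless state, when
 * CLM itself cannot. */
#include <saClm.h>

#define MAX_NODES 64

static SaClmClusterNodeT node_cache[MAX_NODES];
static SaUint32T cached_nodes;

/* Invoked on every membership change while there is an active SC;
 * silent while headless, so the cache keeps its last-known view. */
static void track_cb(const SaClmClusterNotificationBufferT *buf,
		     SaUint32T numberOfMembers, SaAisErrorT error)
{
	if (error != SA_AIS_OK)
		return;
	cached_nodes = 0;
	for (SaUint32T i = 0; i < buf->numberOfItems && i < MAX_NODES; i++)
		node_cache[cached_nodes++] = buf->notification[i].clusterNode;
}

/* One-time setup: fetch the current view and subscribe to changes. */
static SaAisErrorT start_membership_cache(SaClmHandleT *clm_hdl)
{
	SaVersionT ver = { 'B', 1, 1 };
	const SaClmCallbacksT cbs = { NULL, track_cb };
	SaAisErrorT rc = saClmInitialize(clm_hdl, &cbs, &ver);
	if (rc != SA_AIS_OK)
		return rc;
	return saClmClusterTrack(*clm_hdl,
				 SA_TRACK_CURRENT | SA_TRACK_CHANGES_ONLY,
				 NULL);
}

In a real component the callback runs from saClmDispatch(), driven by the handle's selection object.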
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, February 15, 2016 11:19 AM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> How did your holiday go?
>>
>> Please find my comments below.
>>
>> On 2/15/2016 8:43 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> For comment 1, the patch will be updated accordingly.
>>
>> [AVM] Please hold; I will provide more comments this week, so we can have a consolidated V3.
>>
>> For comment 2, I think the CKPT service will not be backward compatible if scAbsenceAllowed is true: the client can't create a non-collocated checkpoint on the SCs.
>>
>> Furthermore, this solution only protects the CKPT service in the case "the non-collocated checkpoint is created on an SC"; there are still cases where the replicas are completely lost, e.g.:
>>
>> - The non-collocated checkpoint is created on a PL. The PL reboots, so both replicas are now located on the SCs. Then the headless state happens; all replicas are lost.
>> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
>> - The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.
>>
>> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with the SCs not rejoining after headless state; I can provide more on this once I complete my review/testing.)
>>
>> In this case, all replicas are lost and the client has to create the checkpoint again.
>>
>> In case multiple nodes (including the SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the cases where it is possible. What do you think?
>>
>> [AVM] I understand that. Before I comment more on this, please allow me to understand; I am still not very clear about the headless design in detail.
>>
>> For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
>>
>> - If they are considered NON-cluster nodes, the Checkpoint Service API should leverage the SA Forum Cluster Membership Service, and the APIs can fail with SA_AIS_ERR_UNAVAILABLE.
>>
>> - If they are considered cluster nodes, we need to follow all the rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>>
>> So give me some more time to review it completely, so that we can have a consolidated patch V3.
>>
>> -AVM
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, February 12, 2016 11:10 AM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]; Beatriz Brandao <[email protected]>
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Comment 2:
>>
>> After incorporating comment 1, all of the limitations should be prevented based on whether the Hydra configuration is enabled in IMM.
>> For example: if some application tries to create a non-collocated checkpoint whose active replica would be generated/located on an SC, then regardless of whether the heads (SCs) exist or not, the call should return SA_AIS_ERR_NOT_SUPPORTED.
>>
>> In other words, this is better than allowing a non-collocated checkpoint to be created while the heads (SCs) are present, only for it to become unrecoverable after the heads (SCs) rejoin.
>>
>> =============================================================================
>> Limitation: The CKPT service doesn't support recovering checkpoints in the following cases:
>> . The checkpoint was unlinked before the headless state.
>> . The non-collocated checkpoint has its active replica located on an SC.
>> . The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
>> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint in these cases. The client must re-open the checkpoint.
>> =============================================================================
>>
>> -AVM
>>
>> On 2/11/2016 12:52 PM, A V Mahesh wrote:
>>
>> Hi,
>>
>> I just started reviewing the patch; I will give comments as soon as I come across any, to save some time.
>>
>> Comment 1:
>> This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM, attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>>
>> Please see the example of how the LOG/AMF services implemented it.
>>
>> -AVM
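A sketch of what such a check could look like, using the plain IMM OM accessor API. Illustration only: the object DN below is an assumption (the opensafImm configuration object), so verify it against the LOG/AMF implementations before copying, and error handling is minimal.

/* Sketch: read the scAbsenceAllowed attribute via the IMM OM accessor
 * API, in the style LOG/AMF use. The DN is an assumption. */
#include <string.h>
#include <saImmOm.h>

static SaUint32T get_sc_absence_allowed(void)
{
	SaUint32T sc_absence_allowed = 0;
	SaVersionT ver = { 'A', 2, 1 };
	SaImmHandleT om_hdl;
	SaImmAccessorHandleT acc_hdl;
	SaNameT obj;
	SaImmAttrNameT attr = (SaImmAttrNameT)"scAbsenceAllowed";
	SaImmAttrNameT attr_names[] = { attr, NULL };
	SaImmAttrValuesT_2 **attrs;
	const char *dn = "opensafImm=opensafImm,safApp=safImmService";

	obj.length = strlen(dn);
	memcpy(obj.value, dn, obj.length);

	if (saImmOmInitialize(&om_hdl, NULL, &ver) != SA_AIS_OK)
		return 0;
	if (saImmOmAccessorInitialize(om_hdl, &acc_hdl) == SA_AIS_OK) {
		if (saImmOmAccessorGet_2(acc_hdl, &obj, attr_names,
					 &attrs) == SA_AIS_OK &&
		    attrs[0] != NULL && attrs[0]->attrValuesNumber > 0)
			sc_absence_allowed =
			    *(SaUint32T *)attrs[0]->attrValues[0];
		saImmOmAccessorFinalize(acc_hdl);
	}
	saImmOmFinalize(om_hdl);
	return sc_absence_allowed; /* nonzero => headless state allowed */
}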
>> On 1/29/2016 1:02 PM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> As described in the README, the CKPT service returns the SA_AIS_ERR_TRY_AGAIN fault code in this case.
>> I guess it's the same for other services.
>>
>> @Anders: Could you please confirm this?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, January 29, 2016 2:11 PM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi,
>>
>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>
>> - The behavior of the application will be consistent with the behavior of other SAF services like IMM/AMF during the headless state.
>> [Nhat] I'm not clear what you mean by "consistent"?
>>
>> In the absence of the Directors (SCs), what return values should the SAF APIs of all services give when they are not in a position to provide service at that moment?
>>
>> I think all services should return the same SAF errors. I think we currently don't have that; maybe Anders Widell can help us.
>>
>> -AVM
>>
>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>
>> Hi Mahesh,
>>
>> Please see the attachment for the README. Let me know if there is any more information required.
>>
>> Regarding your comments:
>> - During the headless state applications may behave like during a CPND restart.
>> [Nhat] Headless state and CPND restart are different events; thus, the behavior is different. Headless state is a case where both SCs go down.
>>
>> - The behavior of the application will be consistent with the behavior of other SAF services like IMM/AMF during the headless state.
>> [Nhat] I'm not clear what you mean by "consistent"?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, January 29, 2016 11:12 AM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>>
>> Hi Nhat Pham,
>>
>> I started reviewing this patch, so could you please provide a README file with the scope and limitations? That will help to define the testing/reviewing scope.
>>
>> The following are the minimum things we can keep in mind while reviewing/accepting the patch:
>>
>> - Not affecting existing functionality.
>> - During the headless state applications may behave like during a CPND restart.
>> - The minimum functionality of the application works.
>> - The behavior of the application will be consistent with the behavior of other SAF services like IMM/AMF during the headless state.
>>
>> So please do provide any additional details in the README if any of the above is deviated from, to let users know about the limitations/deviations.
>>
>> -AVM
>>
>> On 1/4/2016 3:15 PM, Nhat Pham wrote:
>>
>> Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>> Review request for Trac Ticket(s): #1621
>> Peer Reviewer(s): [email protected]; [email protected]
>> Pull request to: [email protected]
>> Affected branch(es): default
>> Development branch: default
>>
>> --------------------------------
>>  Impacted area       Impact y/n
>> --------------------------------
>>  Docs                    n
>>  Build system            n
>>  RPM/packaging           n
>>  Configuration files     n
>>  Startup scripts         n
>>  SAF services            y
>>  OpenSAF services        n
>>  Core libraries          n
>>  Samples                 n
>>  Tests                   n
>>  Other                   n
>>
>> Comments (indicate scope for each "y" above):
>> ---------------------------------------------
>>
>> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>> Author: Nhat Pham <[email protected]>
>> Date:   Mon, 04 Jan 2016 16:34:33 +0700
>>
>> cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>>
>> Background:
>> ----------
>> This enhancement supports preserving checkpoint replicas in case both SCs go down (headless state) and recovering the replicas when one of the SCs comes up again. If both SCs go down, checkpoint replicas on surviving nodes remain. When an SC is available again, surviving replicas are automatically registered in the SC checkpoint database. The content of surviving replicas is kept intact and synchronized to new replicas.
>>
>> When no SC is available, client API calls that change the checkpoint configuration, which requires SC communication, are rejected. Client API calls reading and writing existing checkpoint replicas still work.
>> Limitation: The CKPT service does not support recovering checkpoints in the following cases:
>> - The checkpoint was unlinked before the headless state.
>> - The non-collocated checkpoint has its active replica located on an SC.
>> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
>> In these cases the checkpoint replica is destroyed, and the fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint. The client must re-open the checkpoint.
>>
>> While in the headless state, accessing checkpoint replicas does not work if the node hosting the active replica goes down. It works again when an SC is available.
>>
>> Solution:
>> ---------
>> The solution for this enhancement includes 2 parts:
>>
>> 1. Destroy the un-recoverable checkpoints described above when both SCs are down: the CPND deletes un-recoverable checkpoint nodes and replicas on the PLs, then requests the CPA to destroy the corresponding checkpoint node using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>>
>> 2. Update the CPD with checkpoint information: when an active SC comes up after the headless state, the CPND updates the CPD with checkpoint information using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of CPD_EVT_ND2D_CKPT_CREATE. This is because with CPD_EVT_ND2D_CKPT_CREATE the CPD would create a new ckpt_id for the checkpoint, which might differ from the current ckpt_id. The CPD collects checkpoint information within 6 s. During this update window the following requests are rejected with fault code SA_AIS_ERR_TRY_AGAIN (see the sketch after this list):
>> - CPD_EVT_ND2D_CKPT_CREATE
>> - CPD_EVT_ND2D_CKPT_UNLINK
>> - CPD_EVT_ND2D_ACTIVE_SET
>> - CPD_EVT_ND2D_CKPT_RDSET
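A sketch of how the CPD-side gating during that 6 s window could look. Everything here is an invented stand-in; only the CPD_EVT_ND2D_* message names and the SA_AIS_ERR_TRY_AGAIN behavior come from the patch description above.

/* Illustrative sketch only of the 6 s update window. The enum mirrors
 * a few cpsv_evt.h message types; the control-block field and helper
 * are hypothetical. */
#include <stdbool.h>

enum cpd_evt_type {
	CPD_EVT_ND2D_CKPT_CREATE,
	CPD_EVT_ND2D_CKPT_UNLINK,
	CPD_EVT_ND2D_ACTIVE_SET,
	CPD_EVT_ND2D_CKPT_RDSET,
	CPD_EVT_ND2D_CKPT_INFO_UPDATE,
};

struct cpd_cb_sketch {
	bool ckpt_update_in_progress; /* set when the SC becomes active,
					 cleared by a 6 s timer */
};

/* Returns true when the event must be answered with
 * SA_AIS_ERR_TRY_AGAIN instead of being processed. */
static bool cpd_evt_rejected_during_update(const struct cpd_cb_sketch *cb,
					   enum cpd_evt_type evt)
{
	if (!cb->ckpt_update_in_progress)
		return false;
	switch (evt) {
	case CPD_EVT_ND2D_CKPT_CREATE:
	case CPD_EVT_ND2D_CKPT_UNLINK:
	case CPD_EVT_ND2D_ACTIVE_SET:
	case CPD_EVT_ND2D_CKPT_RDSET:
		return true;  /* configuration changes wait until the
				 checkpoint info has been collected */
	default:
		return false; /* e.g. CPD_EVT_ND2D_CKPT_INFO_UPDATE runs */
	}
}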
>> Complete diffstat:
>> ------------------
>>  osaf/libs/agents/saf/cpa/cpa_proc.c       |  52 +++++
>>  osaf/libs/common/cpsv/cpsv_edu.c          |  43 ++++
>>  osaf/libs/common/cpsv/include/cpd_cb.h    |   3 +
>>  osaf/libs/common/cpsv/include/cpd_imm.h   |   1 +
>>  osaf/libs/common/cpsv/include/cpd_proc.h  |   7 +
>>  osaf/libs/common/cpsv/include/cpd_tmr.h   |   3 +-
>>  osaf/libs/common/cpsv/include/cpnd_cb.h   |   1 +
>>  osaf/libs/common/cpsv/include/cpnd_init.h |   2 +
>>  osaf/libs/common/cpsv/include/cpsv_evt.h  |  20 ++
>>  osaf/services/saf/cpsv/cpd/Makefile.am    |   3 +-
>>  osaf/services/saf/cpsv/cpd/cpd_evt.c      | 229 ++++++++++++
>>  osaf/services/saf/cpsv/cpd/cpd_imm.c      | 112 ++++++
>>  osaf/services/saf/cpsv/cpd/cpd_init.c     |  20 +-
>>  osaf/services/saf/cpsv/cpd/cpd_proc.c     | 309 ++++++++++++++++
>>  osaf/services/saf/cpsv/cpd/cpd_tmr.c      |   7 +
>>  osaf/services/saf/cpsv/cpnd/cpnd_db.c     |  16 ++
>>  osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |  22 ++
>>  osaf/services/saf/cpsv/cpnd/cpnd_init.c   |  23 +-
>>  osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |  13 +
>>  osaf/services/saf/cpsv/cpnd/cpnd_proc.c   | 314 ++++++++++++++---
>>
>>  20 files changed, 1189 insertions(+), 11 deletions(-)
>>
>> Testing Commands:
>> -----------------
>>  -
>>
>> Testing, Expected Results:
>> --------------------------
>>  -
>>
>> Conditions of Submission:
>> -------------------------
>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>
>> Arch      Built  Started  Linux distro
>> -------------------------------------------
>> mips        n      n
>> mips64      n      n
>> x86         n      n
>> x86_64      n      n
>> powerpc     n      n
>> powerpc64   n      n
>>
>> Reviewer Checklist:
>> -------------------
>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>
>> Your checkin has not passed review because (see checked entries):
>>
>> ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
>>
>> ___ You have failed to nominate the proper persons for review and push.
>>
>> ___ Your patches do not have a proper short+long header.
>>
>> ___ You have grammar/spelling in your header that is unacceptable.
>>
>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>
>> ___ You have failed to put a proper Trac Ticket # into your commits.
>> ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc).
>>
>> ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
>>
>> ___ You have ^M present in some of your files. These have to be removed.
>>
>> ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
>>
>> ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
>>
>> ___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
>>
>> ___ You have extraneous garbage in your review (merge commits etc).
>>
>> ___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
>>
>> ___ You have too many commits attached to an e-mail; resend as threaded commits, or place them in a public tree for a pull.
>>
>> ___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
>>
>> ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
>>
>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc).
>>
>> ___ Your computer has a badly configured date and time, confusing the threaded patch review.
>>
>> ___ Your changes affect an IPC mechanism, and you don't present any results for an in-service upgradability test.
>>
>> ___ Your changes affect the user manual and documentation, and your patch series does not contain the patch that updates the Doxygen manual.
