Hi Nhat Pham, Please find my answers.
-AVM

On 2/26/2016 10:23 AM, Nhat Pham wrote:

Hi Mahesh,

Please see my answers below with [NhatPham5]

Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Friday, February 26, 2016 11:17 AM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

>> [NhatPham4] To be more correct, the application will get SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint because all data was destroyed.
>> [AndersW4] If this is a problem we could re-create the checkpoint with no sections in it.

I also came across this approach: instead of destroying the checkpoint information in the CPNDs of the payloads (as the current patch does) and returning SA_AIS_ERR_BAD_HANDLE to applications on the PLs, the new V3 patch should check the possibility of re-creating the checkpoint with sections (you can send this data from the PL to the SC once CPD is up).

[NhatPham5] By "all data" here I mean the checkpoint node information in the database controlled by cpnd (not the replica). In this case, all replicas were lost. How can the checkpoint be re-created with sections?

[AVM] I know the replicas are lost. I am suggesting to use the `checkpoint node information` available at the PL CPND (only one node would volunteer for this if multiple applications opened the checkpoint) and to try to re-create the checkpoint as if a fresh request had come from a CPA (CPD assumes the request came all the way from CPA --> CPND --> CPD, but it did not), so that CPD creates new replicas whose sections are clean/empty, instead of asking the application to re-create the checkpoint. I think the LOG streams get re-created with empty/fresh data like this, using the data available at the LGA.
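The "only one node volunteers" idea above could be sketched as follows. This is a hypothetical helper under my own naming, not code from the patch; it only illustrates picking one CPND deterministically (here, the lowest node id) among the payloads that still hold the checkpoint node information, so that exactly one of them sends the re-create request towards CPD:

```c
#include <assert.h>

/* Hypothetical sketch: choose a single volunteer among the payload
 * nodes that have the lost checkpoint open. Using the lowest node id
 * is one simple deterministic tie-breaker. */
unsigned pick_volunteer(const unsigned *node_ids, int n)
{
    unsigned best = node_ids[0];
    for (int i = 1; i < n; i++)
        if (node_ids[i] < best)
            best = node_ids[i];
    return best;
}
```

Because every CPND computes the same answer from the same membership information, no extra coordination message is needed to agree on the volunteer.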
For other cases, where the checkpoint replicas survive, the checkpoint is restored when the SC is up again.

Ex: A checkpoint is created on a PL. There are 3 replicas created on the SCs and the PL. The headless state happens. After the SC is up, the checkpoint is recovered.

-AVM

On 2/26/2016 8:11 AM, Nhat Pham wrote:

Hi,

Please see my comment below with [NhatPham4]

Best regards,
Nhat Pham

*From:* Anders Widell [mailto:[email protected]]
*Sent:* Thursday, February 25, 2016 9:25 PM
*To:* A V Mahesh <[email protected]>; Nhat Pham <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi!

See my comments inline, marked [AndersW4].

regards,
Anders Widell

On 02/25/2016 05:26 AM, A V Mahesh wrote:

Hi Nhat Pham,

Please see my comment below.

-AVM

On 2/25/2016 7:54 AM, Nhat Pham wrote:

Hi Mahesh,

Would you agree with the comment below?

To summarize, the comments so far are as follows:

*Comment 1*: This functionality should be guarded by a check of whether the Hydra configuration is enabled in IMM, attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed").

Action: The code will be updated accordingly.

[AndersW4] Just a question here: is this really needed? If the code is already 100% backwards compatible when the headless feature is disabled, what would be the point of reading the configuration and taking different paths in the code depending on it? Maybe the code is not 100% backwards compatible, and then I agree that we need to read the configuration.
The reason why I am asking is that I had the impression that the code would only cause different behaviour in the cases where both system controllers die at the same time, and this cannot happen when the headless feature is disabled (or rather: it can happen, but it would trigger an immediate cluster restart, so any difference in behaviour after that point is irrelevant).

[NhatPham4] The code is backwards compatible when the headless feature is disabled.

For the V2 patch, cpnd will update cpd with recoverable checkpoint data when the SC comes up after the headless state (from an implementation point of view). In the current system, if the headless feature is disabled, the whole cluster reboots, so all data is destroyed.

For the V2 patch plus the scAbsenceAllowed check, cpnd destroys all the checkpoint data, as in the original implementation (from an implementation point of view). In the current system, if the headless feature is disabled, the whole cluster reboots, so all data is destroyed.

So if you ask whether the check is really needed in the current situation, the answer is: not really. The check is just to make sure that all checkpoint data is destroyed in case the headless feature is disabled. What do you think?

*Comment 2*: Keep the scope of the CPSV service as non-collocated checkpoint creation NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; currently it is not configurable, so there is no chance of a run-time configuration change).

Action: No change in code. CPSV will keep supporting non-collocated checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.

>> [AndersW3] No, I think we ought to support non-colocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have "system controllers" is an implementation detail of OpenSAF.
I don't think the CKPT SAF specification implies that non-colocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must allow for the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way to the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.

[AVM] I am not in agreement with either comment. We cannot handle this in a way similar to the section-expiration case here: in the case of section expiration the checkpoint replica still exists, only the section is deleted.

[AndersW4] If this is a problem we could re-create the checkpoint with no sections in it.

The CPSV specification says that if two replicas exist at a certain point in time (in our case only on the SCs), and the nodes hosting both of these replicas are administratively taken out of service, the Checkpoint Service should allocate another replica on another node while those nodes are not available. Please check section `3.1.7.2 Non-Collocated Checkpoints` of the cpsv specification.

[AndersW4] The spec actually says "may" rather than "should" in this section. And the purpose of allocating another replica is to "enhance the availability of checkpoints". When I read this section, I think it is quite clear that the spec does not perceive non-colocated checkpoints as guaranteed to preserve data in the case of node failures:

"The Checkpoint Service may create replicas other than the ones that may be created when opening a checkpoint. These other replicas can be useful to enhance the availability of checkpoints. For example, if two replicas exist at a certain point in time, and the node hosting one of these replicas is administratively taken out of service, the Checkpoint Service may allocate another replica on another node while this node is not available."
So, data can be lost due to (multiple) node failures. There are two other cases where data is lost: automatic deletion of the entire checkpoint if it has not been opened by any process for the duration of the retention time, and automatic deletion of sections within a checkpoint when the sections reach their expiration times. The APIs specify the return code SA_AIS_ERR_NOT_EXIST to signal that a specific section, or the entire checkpoint, doesn't exist. Thus, there is support in the API for reporting loss of checkpoint data (whatever the reason for the loss may be). If the headless feature is disabled, we cannot lose non-colocated checkpoints due to node failures, but when the headless feature is enabled we can.

For example, take the case of an application on a PL that is in the middle of writing to non-collocated checkpoint sections (the physical replicas exist only on the SCs): what will happen to the application on the PL? OK, let us say the user agrees to lose the checkpoint and wants to re-create it: what will happen to the cpnd DB on the PL, and what is the complexity involved in cleaning it up? This will lead to a lot of maintainability issues.

[AndersW4] The thing that will happen (from an application's perspective) is that you will get the SA_AIS_ERR_NOT_EXIST error code from the CKPT API when trying to access the lost checkpoint. I don't know the complexity at the code level for implementing this, but isn't this already supported by the code which is out on review (Nhat, correct me if I am wrong)?

[NhatPham4] To be more correct, the application will get SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint, because all data was destroyed. But for opening the checkpoint (not creating it), it will get SA_AIS_ERR_NOT_EXIST.
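From the application's side, the error handling discussed above boils down to mapping the AIS return code to a recovery action. A minimal sketch of that mapping (the numeric values follow the SAF AIS specifications, e.g. 31 for the SA_AIS_ERR_UNAVAILABLE seen later in the thread; the classifier itself is a hypothetical helper, not part of the patch):

```c
#include <assert.h>

/* Subset of the SAF AIS return codes discussed in this thread,
 * with their spec-defined numeric values. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_BAD_HANDLE = 9,
    SA_AIS_ERR_NOT_EXIST = 12,
} SaAisErrorT;

typedef enum {
    READ_OK,            /* data is there */
    REOPEN_CHECKPOINT,  /* handle died with the checkpoint data */
    RECREATE_SECTION,   /* section (or whole checkpoint) is gone */
    RETRY_LATER         /* e.g. TRY_AGAIN while the cluster is headless */
} RecoveryAction;

/* Map a checkpoint access result to the application's next step, per
 * the behaviour described above: BAD_HANDLE when the checkpoint node
 * data was destroyed, NOT_EXIST when opening a lost checkpoint. */
RecoveryAction classify_access_result(SaAisErrorT rc)
{
    switch (rc) {
    case SA_AIS_OK:             return READ_OK;
    case SA_AIS_ERR_BAD_HANDLE: return REOPEN_CHECKPOINT;
    case SA_AIS_ERR_NOT_EXIST:  return RECREATE_SECTION;
    default:                    return RETRY_LATER;
    }
}
```

The point of the classification is that both loss cases already have defined API signals, so an application written against the spec can recover without any CPSV-specific knowledge.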
On top of that, the CKPT SAF specification only says that a non-collocated checkpoint and all its sections should survive as long as the Checkpoint Service is running on the cluster. The replica is USER private data (not OpenSAF state), and losing any USER private data is not acceptable.

*Comment 3*: This is about the case where the checkpoint node director (cpnd) crashes during the headless state. In this case the cpnd can't finish starting because it can't initialize the CLM service. Then, after a timeout, AMF triggers a restart again. Finally, the node is rebooted. It is expected that this problem should not lead to a node reboot.

Action: No change in code. This is a limitation of the system during the headless state.

[AVM] Code changes are required: the CPSV CLM integration code needs to be revisited to handle TRY_AGAIN.

If you agree with the summary above, I'll update the code and send out V3 for review.

Best regards,
Nhat Pham

*From:* Anders Widell [mailto:[email protected]]
*Sent:* Wednesday, February 24, 2016 9:26 PM
*To:* Nhat Pham <[email protected]>; 'A V Mahesh' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

See my comments inline, marked [AndersW3].

regards,
Anders Widell

On 02/24/2016 07:32 AM, Nhat Pham wrote:

Hi Mahesh and Anders,

Please see my comments below.
Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Wednesday, February 24, 2016 11:06 AM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

If component (CPND) restart is allowed while the controllers are absent, and we change the return value to SA_AIS_ERR_TRY_AGAIN before requesting CLM, then we need to get clarification from the AMF guys on a few things. If CPND is stuck on SA_AIS_ERR_TRY_AGAIN and the component-restart timer expires, AMF will restart the component again (this becomes cyclic), and after the configured saAmfSGCompRestartMax value the node goes for reboot as the next level of escalation. In that case we may need changes in AMF as well, to not act on the component-restart timeout while the controllers are absent (I am not sure whether that would be a deviation from the AMF specification).

[Nhat Pham] In the headless state, I'm not sure about this either.
@Anders: Would you have comments about this?

[AndersW3] OK, first of all I would like to point out that normally, the OpenSAF checkpoint node director should not crash. So we are talking about a situation where multiple faults have occurred: first both the active and the standby system controllers have died, and then shortly afterwards, before we have a new active system controller, the checkpoint node director also crashes. Sure, these may not be totally independent events, but still there are a lot of faults that have happened within a short period of time.
We should test the node director and make sure it doesn't crash in this type of scenario.

Now, let's consider the case where we have a fault in the node director that causes it to crash during the headless state. The general philosophy of the headless feature is that when things work fine, i.e. in the absence of faults, we should be able to continue running while the system controllers are absent. However, if a fault happens during the headless state, we may not be able to recover from the fault until there is an active system controller. AMF does provide support for restarting components, but as you have pointed out, the node director will be stuck in a TRY_AGAIN loop immediately after it has been restarted. So this means that if the node director crashes during the headless state, we have lost the checkpoint functionality on that node, and we will not get it back until there is an active system controller. Other services like IMM will still work for a while, but AMF will, as you say, eventually escalate the checkpoint node director failure to a node restart, and then the whole node is gone. The node will not come back until we have an active system controller. So to summarize: there is very limited support for recovering from faults that happen during the headless state. Full recovery will not happen until we have an active system controller.

[AVM] Please do incorporate the current comments (from a design perspective) and republish the patch; I will re-test the V3 patch and provide review comments on functional issues/bugs if I find any.

One important note: in the new patch, let us not have the complexity of allowing non-collocated checkpoint creation and then documenting that in some scenarios the non-collocated checkpoint replicas are not recoverable, because a replica is USER private data (not OpenSAF state), and losing USER private data is not acceptable.
So let us keep the scope of the CPSV service as non-collocated checkpoint creation NOT_SUPPORTED if the cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the headless-state configuration is enabled at cluster startup; currently it is not configurable, so there is no chance of a run-time configuration change).

We can provide support for non-collocated checkpoints in subsequent enhancements, with a solution such as also creating the non-collocated replica on the PL with the lowest node ID (max three replicas in the cluster, regardless of where the non-collocated checkpoint is opened).

So for now, regardless of whether the heads (the SCs) exist or not, CPSV should return SA_AIS_ERR_NOT_SUPPORTED in a cluster with IMMSV_SC_ABSENCE_ALLOWED enabled, and let us document it as well.

[Nhat Pham] The patch is to limit losing replicas and checkpoints in case of the headless state.

In case both replicas are located on the SCs and they reboot, losing the checkpoint is unpreventable with the current design after the headless state.

Even if we implement the proposal "max three replicas in the cluster regardless of where the non-collocated checkpoint is opened", there is still a case where the checkpoint is lost. Ex: the SCs and the PL which hosts the replica reboot at the same time.

In case IMMSV_SC_ABSENCE_ALLOWED is disabled, if both SCs reboot, this leads to a whole-cluster reboot. Then the checkpoint is lost.

What I mean is that there are cases where the checkpoint is lost. The point is what we can do to limit losing data.

As for the proposal to reject creating non-collocated checkpoints in case IMMSV_SC_ABSENCE_ALLOWED is enabled, I think this will lead to an incompatibility problem.

@Anders: What do you think about rejecting creation of non-collocated checkpoints when IMMSV_SC_ABSENCE_ALLOWED is enabled?

[AndersW3] No, I think we ought to support non-colocated checkpoints also when IMMSV_SC_ABSENCE_ALLOWED is set.
The fact that we have "system controllers" is an implementation detail of OpenSAF. I don't think the CKPT SAF specification implies that non-colocated checkpoints must be fully replicated on all the nodes in the cluster, and thus we must allow for the possibility that all replicas are lost. It is not clear exactly what to expect from the APIs when this happens, but you could handle it in a similar way to the case when all sections have been automatically deleted by the checkpoint service because the sections have expired.

-AVM

On 2/24/2016 6:51 AM, Nhat Pham wrote:

Hi Mahesh,

Do you have any further comments?

Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Monday, February 22, 2016 10:37 AM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>; 'Minh Chau H' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi,

>> BTW, have you finished the review and test?

I will finish by today.

-AVM

On 2/22/2016 7:48 AM, Nhat Pham wrote:

Hi Mahesh and Anders,

Please see my comment below.

BTW, have you finished the review and test?
Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Friday, February 19, 2016 2:28 PM
*To:* Nhat Pham <[email protected]>; 'Anders Widell' <[email protected]>; 'Minh Chau H' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

On 2/19/2016 12:28 PM, Nhat Pham wrote:

Could you please give more detailed information about the steps to reproduce the problem below? Thanks.

Don't see this as a specific bug; we need to look at the issue from the point of view of a CLM-integrated service. Considering Anders Widell's explanation of CLM application behaviour during the headless state, we need to re-integrate CPND with CLM (before this headless-state feature there was no case of CPND existing in the absence of CLMD, but now there is).

And this will have to be consistent across all the services that integrate with CLM (you may need some changes in CLM as well).

[Nhat Pham] I think CLM should return SA_AIS_ERR_TRY_AGAIN in this case. @Anders: What do you think?

To start with, let us consider the case where CPND is restarted on a PL during the headless state and an application is running on the PL.

[Nhat Pham] Regarding CPND as a CLM application, I'm not sure what it can do in this case. If it restarts, it is monitored by AMF. If it blocks for too long, AMF will also trigger a node reboot.

In my test case, CPND gets blocked by CLM. It doesn't get out of saClmInitialize. How do you get the "ER cpnd clm init failed with return value:31"?

Following is the cpnd trace:

Feb 22 8:56:41.188122 osafckptnd [736:cpnd_init.c:0183] >> cpnd_lib_init
Feb 22 8:56:41.188332 osafckptnd [736:cpnd_init.c:0412] >> cpnd_cb_db_init
Feb 22 8:56:41.188600 osafckptnd [736:cpnd_init.c:0437] << cpnd_cb_db_init
Feb 22 8:56:41.188778 osafckptnd [736:clma_api.c:0503] >> saClmInitialize
Feb 22 8:56:41.188945 osafckptnd [736:clma_api.c:0593] >> clmainitialize
Feb 22 8:56:41.190052 osafckptnd [736:clma_util.c:0100] >> clma_startup: clma_use_count: 0
Feb 22 8:56:41.190273 osafckptnd [736:clma_mds.c:1124] >> clma_mds_init
Feb 22 8:56:41.190825 osafckptnd [736:clma_mds.c:1170] << clma_mds_init

-AVM

On 2/19/2016 12:28 PM, Nhat Pham wrote:

Hi Mahesh,

Could you please give more detailed information about the steps to reproduce the problem below? Thanks.

Best regards,
Nhat Pham

*From:* A V Mahesh [mailto:[email protected]]
*Sent:* Friday, February 19, 2016 1:06 PM
*To:* Anders Widell <[email protected]>; Nhat Pham <[email protected]>; 'Minh Chau H' <[email protected]>
*Cc:* [email protected]; 'Beatriz Brandao' <[email protected]>
*Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Anders Widell,
Thanks for the detailed explanation about CLM during the headless state.
HI Nhat Pham,

Comment 3: Please see below the problem I anticipated; I am now seeing it during CLMD absence (during the headless state). So CPND/CLMA now needs to address the case below: currently the cpnd CLM init fails with return value SA_AIS_ERR_UNAVAILABLE, but it should be SA_AIS_ERR_TRY_AGAIN.

==================================================
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 17418
Feb 19 11:18:28 PL-4 osafimmloadd: NO Sync ending normally
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Epoch set to 9 in ImmModel
Feb 19 11:18:28 PL-4 cpsv_app: IN Received PROC_STALE_CLIENTS
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 42 (MsgQueueService132111) <108, 2040f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 43 (MsgQueueService131855) <0, 2030f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 44 (safLogService) <0, 2010f>
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 45 (safClmService) <0, 2010f>
*Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd clm init failed with return value:31
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd init failed
Feb 19 11:18:28 PL-4 osafckptnd[7718]: ER cpnd_lib_req FAILED
Feb 19 11:18:28 PL-4 osafckptnd[7718]: __init_cpnd() failed*
Feb 19 11:18:28 PL-4 osafclmna[5432]: NO safNode=PL-4,safCluster=myClmCluster Joined cluster, nodeid=2040f
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO AVD NEW_ACTIVE, adest:1
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO Sending node up due to NCSMDS_NEW_ACTIVE
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SISU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 1 SU states sent
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 CSICOMP states synced
Feb 19 11:18:28 PL-4 osafamfnd[5441]: NO 7 SU states sent
Feb 19 11:18:28 PL-4 osafimmnd[5422]: NO Implementer connected: 46 (safAmfService) <0, 2010f>
Feb 19 11:18:30 PL-4 osafamfnd[5441]: NO 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Component or SU restart probation timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Instantiation of 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Reason: component registration timer expired
Feb 19 11:18:35 PL-4 osafamfnd[5441]: WA 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' Presence State RESTARTING => INSTANTIATION_FAILED
Feb 19 11:18:35 PL-4 osafamfnd[5441]: NO Component Failover trigerred for 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': Failed component: 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'
Feb 19 11:18:35 PL-4 osafamfnd[5441]: ER 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got Inst failed
Feb 19 11:18:35 PL-4 osafamfnd[5441]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: NCS component Instantiation failed, OwnNodeId = 132111, SupervisionTime = 60
Feb 19 11:18:36 PL-4 opensaf_reboot: Rebooting local node; timeout=60
Feb 19 11:18:39 PL-4 kernel: [ 4877.338518] md: stopping all md devices.
==================================================

-AVM

On 2/15/2016 5:11 PM, Anders Widell wrote:

Hi!

Please find my answer inline, marked [AndersW].

regards,
Anders Widell

On 02/15/2016 10:38 AM, Nhat Pham wrote:

Hi Mahesh,

It's good. Thank you. :)

[AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4.
(Note: this comment is based on your explanation; I have not yet reviewed/tested. Currently I am struggling with the SCs not rejoining after the headless state; I can provide more on this once I complete my review/testing.)

[Nhat] To make cloud resilience work, you need the patches from the other services (log, amf, clm, ntf). @Minh: I heard that you created a tar file which includes all the patches. Could you please send it to Mahesh? Thanks.

[AVM] I understand that. Before I comment more on this, please allow me to understand it; I am still not very clear on the headless design in detail. For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?

[Nhat] I don't know much about this. @Anders: Could you please comment on this? Thanks.

[AndersW] First of all, keep in mind that the "headless" state should ideally not last a very long time. Once we have the spare SC feature in place (ticket [#79]), a new SC should become active within a matter of a few seconds after we have lost both the active and the standby SC.

I think you should view the state of the cluster in the headless state in the same way as you view the state of the cluster during a failover between the active and the standby SC. Imagine that the active SC dies. It takes the standby SC 1.5 seconds to detect the failure of the active SC (this is due to the TIPC timeout). If you have configured the PROMOTE_ACTIVE_TIMER, there is an additional delay before the standby takes over as active. What is the state of the cluster during the time after the active SC failed and before the standby takes over?

The state of the cluster while it is headless is very similar.
The difference is that this state may last a little bit longer (though not more than a few seconds, until one of the spare SCs becomes active). Another difference is that we may have lost some state. With a "perfect" implementation of the headless feature we should not lose any state at all, but with the current set of patches we do lose state.

So specifically, if we talk about cluster membership and ask the question: is a particular PL a member of the cluster or not during the headless state? Well, if you ask CLM about this during the headless state, then you will not know, because CLM doesn't provide any service during the headless state. If you keep retrying your query to CLM, you will eventually get an answer, but you will not get this answer until there is an active SC again and we have exited the headless state. When viewed in this way, the answer to the question about a node's membership is undefined during the headless state, since CLM will not provide you with any answer until there is an active SC.

However, if you asked CLM about the node's cluster membership status before the cluster went headless, you probably saved a cached copy of the cluster membership state. Maybe you also installed a CLM track callback and intend to update this cached copy every time the cluster membership status changes. The question then is: can you continue using this cached copy of the cluster membership state during the headless state? The answer is YES: since CLM doesn't provide any service during the headless state, it also means that the cluster membership view cannot change during this time. Nodes can of course reboot or die, but CLM will not notice, and hence the cluster view will not be updated. You can argue that this is bad because the cluster view doesn't reflect reality, but notice that this will always be the case.
We can never propagate information instantaneously, and detection of node failures will take 1.5 seconds due to the TIPC timeout. You can never be sure that a node is alive at this very moment just because CLM tells you that it is a member of the cluster. If we are unfortunate enough to lose both system controller nodes simultaneously, updates to the cluster membership view will be delayed a few seconds longer than usual.

Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Monday, February 15, 2016 11:19 AM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]; 'Beatriz Brandao' <[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Hi Nhat Pham,

How was your holiday?

Please find my comments below.

On 2/15/2016 8:43 AM, Nhat Pham wrote:

Hi Mahesh,

For comment 1, the patch will be updated accordingly.

[AVM] Please hold; I will provide more comments this week, so we can have a consolidated V3.

For comment 2, I think the CKPT service will not be backward compatible if scAbsenceAllowed is true. The client can't create a non-collocated checkpoint on the SCs.

Furthermore, this solution only protects the CKPT service from the case "the non-collocated checkpoint is created on an SC"; there are still cases where the replicas are completely lost. Ex:

- The non-collocated checkpoint is created on a PL. The PL reboots. Both replicas now locate on the SCs. Then the headless state happens. All replicas are lost.
- The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during the headless state.
- The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.

[AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opens it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested. Currently I am struggling with the SCs not rejoining after the headless state; I can provide more on this once I complete my review/testing.)

In this case, all replicas are lost and the client has to create the checkpoint again.

In case multiple nodes (including the SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the cases where it is possible. What do you think?

[AVM] I understand that. Before I comment more on this, please allow me to understand it; I am still not very clear on the headless design in detail.

For example, the cluster membership of the PLs during the headless state: in the absence of the SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
> - If they are NOT considered cluster nodes: the Checkpoint Service
>   should leverage the SA Forum Cluster Membership Service, and APIs
>   can fail with SA_AIS_ERR_UNAVAILABLE.
> - If they ARE considered cluster nodes: we need to follow all the
>   rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>
> So give me some more time to review it completely, so that we can have
> a consolidated patch V3.
>
> -AVM
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, February 12, 2016 11:10 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]; Beatriz Brandao
> <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Comment 2:
>
> After incorporating comment 1, all the limitations should be prevented
> based on whether the Hydra configuration is enabled in IMM.
>
> For example: if some application tries to create a non-collocated
> checkpoint whose active replica would be generated/located on an SC,
> then, regardless of whether the heads (SCs) exist or not, the call
> should return SA_AIS_ERR_NOT_SUPPORTED.
>
> In other words, rather than allowing a non-collocated checkpoint to be
> created while the heads (SCs) exist, only for it to become
> unrecoverable after the heads (SCs) rejoin.
>
> =============================================================================
>
> Limitation: The CKPT service doesn't support recovering checkpoints in
> the following cases:
> - The checkpoint which is unlinked before headless.
> - The non-collocated checkpoint has its active replica located on an
>   SC.
> - The non-collocated checkpoint has its active replica located on a
>   PL, and this PL restarts during headless state.
>
> In these cases, the checkpoint replica is destroyed. The fault code
> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
> checkpoint, and the client must re-open the checkpoint.
>
> =============================================================================
>
> -AVM
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
> Hi,
>
> I just started reviewing the patch; I will give comments as soon as I
> come across any, to save some time.
>
> Comment 1:
> This functionality should be guarded by a check of whether the Hydra
> configuration is enabled in IMM:
> attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
> Please see how the LOG/AMF services implemented it as an example.
>
> -AVM
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
> Hi Mahesh,
>
> As described in the README, the CKPT service returns the
> SA_AIS_ERR_TRY_AGAIN fault code in this case. I guess it's the same
> for other services.
>
> @Anders: Could you please confirm this?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 2:11 PM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> - The behavior of the application will be consistent with other SAF
>   services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
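The "client must re-open the checkpoint" requirement above can be
sketched as a small client-side wrapper. This is an illustrative sketch
only: the `stub_ckpt_read`/`stub_ckpt_reopen` functions and the local
type/constant definitions below are simplified stand-ins, not the real
`saCkptCheckpointRead`/`saCkptCheckpointOpen` API from
SAI-AIS-CKPT-B.02.02 or the OpenSAF implementation.

```c
#include <stddef.h>

/* Simplified local stand-ins for SAF AIS types and error codes
 * (values as in saAis.h); NOT the real <saCkpt.h> declarations. */
typedef int SaAisErrorT;
typedef long long SaCkptCheckpointHandleT;
enum { SA_AIS_OK = 1, SA_AIS_ERR_BAD_HANDLE = 9 };

/* Stub: reads fail with BAD_HANDLE while the replica is "lost",
 * simulating the limitation described in the thread above. */
static int replica_lost = 1;

static SaAisErrorT stub_ckpt_read(SaCkptCheckpointHandleT hdl)
{
	(void)hdl;
	return replica_lost ? SA_AIS_ERR_BAD_HANDLE : SA_AIS_OK;
}

/* Stub: re-opening re-creates the checkpoint replica. */
static SaAisErrorT stub_ckpt_reopen(SaCkptCheckpointHandleT *hdl)
{
	replica_lost = 0;
	*hdl = 42; /* pretend a fresh handle was returned */
	return SA_AIS_OK;
}

/* Client-side pattern: on SA_AIS_ERR_BAD_HANDLE, re-open the
 * checkpoint once and retry the access. */
SaAisErrorT read_with_reopen(SaCkptCheckpointHandleT *hdl)
{
	SaAisErrorT rc = stub_ckpt_read(*hdl);

	if (rc == SA_AIS_ERR_BAD_HANDLE) {
		rc = stub_ckpt_reopen(hdl);
		if (rc == SA_AIS_OK)
			rc = stub_ckpt_read(*hdl);
	}
	return rc;
}
```

A real client would call the SAF Checkpoint API with the original
creation attributes when re-opening; the point is only that the
application, not the service, owns the recovery step in this case.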
> In the absence of the Directors (SCs), what should the expected
> return values of the SAF APIs be (for all services) when they are not
> in a position to provide service at that moment?
>
> I think all services should return the same SAF errors. I think we
> currently don't have that; maybe Anders Widell will help us.
>
> -AVM
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> Hi Mahesh,
>
> Please see the attachment for the README. Let me know if there is any
> more information required.
>
> Regarding your comments:
> - During headless state applications may behave like during the CPND
>   restart case.
> [Nhat] Headless state and CPND restart are different events. Thus, the
> behavior is different. Headless state is a case where both SCs go
> down.
>
> - The behavior of the application will be consistent with other SAF
>   services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 11:12 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving
> and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> I started reviewing this patch, so can you please provide a README
> file with scope and limitations? That will help to define the
> testing/reviewing scope.
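Regarding the SA_AIS_ERR_TRY_AGAIN convention discussed in this thread,
the usual client-side handling is a bounded retry loop with a delay. A
minimal sketch, assuming a hypothetical `invoke` callback and locally
defined error constants (not real OpenSAF declarations):

```c
#include <unistd.h>

/* Simplified local stand-ins for SAF AIS error codes (values as in
 * saAis.h); NOT real OpenSAF declarations. */
typedef int SaAisErrorT;
enum { SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6 };

/* Retry an AIS call while it reports TRY_AGAIN, sleeping between
 * attempts, up to max_retries extra attempts. 'invoke' is a
 * hypothetical callback standing in for any AIS API call. */
SaAisErrorT retry_while_try_again(SaAisErrorT (*invoke)(void *),
				  void *arg, int max_retries,
				  unsigned delay_ms)
{
	SaAisErrorT rc = invoke(arg);

	while (rc == SA_AIS_ERR_TRY_AGAIN && max_retries-- > 0) {
		usleep(delay_ms * 1000); /* back off, then retry */
		rc = invoke(arg);
	}
	return rc;
}

/* Demo stub: reports TRY_AGAIN twice (e.g. while no SC is available),
 * then succeeds once "an SC is available again". */
static int attempts;

static SaAisErrorT demo_call(void *arg)
{
	(void)arg;
	return (++attempts < 3) ? SA_AIS_ERR_TRY_AGAIN : SA_AIS_OK;
}
```

If all services returned TRY_AGAIN consistently during headless state,
one wrapper like this would cover every blocked API call.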
> Following are the minimum things we can keep in mind while
> reviewing/accepting the patch:
>
> - Not affecting existing functionality.
> - During headless state applications may behave like during the CPND
>   restart case.
> - The minimum functionality of the application works.
> - The behavior of the application will be consistent with other SAF
>   services like IMM/AMF behavior during headless state.
>
> So please do provide any additional details in the README if any of
> the above is deviated from; that allows users to know about the
> limitations/deviations.
>
> -AVM
>
> On 1/4/2016 3:15 PM, Nhat Pham wrote:
> Summary: cpsv: Support preserving and recovering checkpoint replicas
> during headless state [#1621]
> Review request for Trac Ticket(s): #1621
> Peer Reviewer(s): [email protected]; [email protected]
> Pull request to: [email protected]
> Affected branch(es): default
> Development branch: default
>
> --------------------------------
> Impacted area        Impact y/n
> --------------------------------
> Docs                     n
> Build system             n
> RPM/packaging            n
> Configuration files      n
> Startup scripts          n
> SAF services             y
> OpenSAF services         n
> Core libraries           n
> Samples                  n
> Tests                    n
> Other                    n
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
> Author:    Nhat Pham <[email protected]>
> Date:      Mon, 04 Jan 2016 16:34:33 +0700
>
>     cpsv: Support preserving and recovering checkpoint replicas during
>     headless state [#1621]
>
>     Background:
>     ----------
>     This enhancement supports preserving checkpoint replicas in case
>     both SCs go down (headless state) and recovering replicas when one
>     of the SCs comes up
>     again. If both SCs go down, checkpoint replicas on surviving nodes
>     still remain. When an SC is available again, surviving replicas
>     are automatically registered to the SC checkpoint database.
>     Content in surviving replicas is intact and synchronized to new
>     replicas.
>
>     When no SC is available, client API calls changing checkpoint
>     configuration, which require SC communication, are rejected.
>     Client API calls reading and writing existing checkpoint replicas
>     still work.
>
>     Limitation: The CKPT service does not support recovering
>     checkpoints in the following cases:
>     - The checkpoint which is unlinked before headless.
>     - The non-collocated checkpoint has its active replica located on
>       an SC.
>     - The non-collocated checkpoint has its active replica located on
>       a PL, and this PL restarts during headless state.
>     In these cases, the checkpoint replica is destroyed. The fault
>     code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses
>     the checkpoint, and the client must re-open the checkpoint.
>
>     While in headless state, accessing checkpoint replicas does not
>     work if the node which hosts the active replica goes down. It will
>     resume working when an SC is available again.
>
>     Solution:
>     ---------
>     The solution for this enhancement includes 2 parts:
>
>     1. To destroy the un-recoverable checkpoints described above when
>        both SCs are down: when both SCs are down, the CPND deletes
>        un-recoverable checkpoint nodes and replicas on PLs. Then it
>        requests CPA to destroy the corresponding checkpoint node by
>        using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>
>     2.
>        To update CPD with checkpoint information: when an active SC is
>        up after headless, CPND will update CPD with the checkpoint
>        information by using the new message
>        CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
>        CPD_EVT_ND2D_CKPT_CREATE. This is because a new ckpt_id, which
>        might be different from the current ckpt_id, would be created
>        for the checkpoint if CPD_EVT_ND2D_CKPT_CREATE were used. The
>        CPD collects checkpoint information within 6s. During this
>        updating time, the following requests are rejected with fault
>        code SA_AIS_ERR_TRY_AGAIN:
>        - CPD_EVT_ND2D_CKPT_CREATE
>        - CPD_EVT_ND2D_CKPT_UNLINK
>        - CPD_EVT_ND2D_ACTIVE_SET
>        - CPD_EVT_ND2D_CKPT_RDSET
>
>     Complete diffstat:
>     ------------------
>     osaf/libs/agents/saf/cpa/cpa_proc.c        |  52 +++++
>     osaf/libs/common/cpsv/cpsv_edu.c           |  43 ++++
>     osaf/libs/common/cpsv/include/cpd_cb.h     |   3 ++
>     osaf/libs/common/cpsv/include/cpd_imm.h    |   1 +
>     osaf/libs/common/cpsv/include/cpd_proc.h   |   7 ++
>     osaf/libs/common/cpsv/include/cpd_tmr.h    |   3 +-
>     osaf/libs/common/cpsv/include/cpnd_cb.h    |   1 +
>     osaf/libs/common/cpsv/include/cpnd_init.h  |   2 +
>     osaf/libs/common/cpsv/include/cpsv_evt.h   |  20 ++
>     osaf/services/saf/cpsv/cpd/Makefile.am     |   3 +-
>     osaf/services/saf/cpsv/cpd/cpd_evt.c       | 229 ++++++++++++
>     osaf/services/saf/cpsv/cpd/cpd_imm.c       | 112 ++++++
>     osaf/services/saf/cpsv/cpd/cpd_init.c      |  20 +-
>     osaf/services/saf/cpsv/cpd/cpd_proc.c      | 309
>     ++++++++++++++
>     osaf/services/saf/cpsv/cpd/cpd_tmr.c       |   7 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_db.c      |  16 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_evt.c     |  22 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_init.c    |  23 +-
>     osaf/services/saf/cpsv/cpnd/cpnd_mds.c     |  13 ++
>     osaf/services/saf/cpsv/cpnd/cpnd_proc.c    | 314 ++++++++++---
>
>     20 files changed, 1189 insertions(+), 11 deletions(-)
>
>     Testing Commands:
>     -----------------
>     -
>
>     Testing, Expected Results:
>     --------------------------
>     -
>
>     Conditions of Submission:
>     -------------------------
>     <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
>     Arch      Built  Started  Linux distro
>     -------------------------------------------
>     mips        n      n
>     mips64      n      n
>     x86         n      n
>     x86_64      n      n
>     powerpc     n      n
>     powerpc64   n      n
>
>     Reviewer Checklist:
>     -------------------
>     [Submitters: make sure that your review doesn't trigger any
>     checkmarks!]
>
>     Your checkin has not passed review because (see checked entries):
>
>     ___ Your RR template is generally incomplete; it has too many
>         blank entries that need proper data filled in.
>     ___ You have failed to nominate the proper persons for review and
>         push.
>     ___ Your patches do not have proper short+long header.
>     ___ You have grammar/spelling in your header that is unacceptable.
>     ___ You have exceeded a sensible line length in your
>         headers/comments/text.
>     ___ You have failed to put a proper Trac Ticket # into your
>         commits.
>     ___ You have incorrectly put/left internal data in your
>         comments/files (i.e. internal bug tracking tool IDs, product
>         names etc).
>     ___ You have not given any evidence of testing beyond basic build
>         tests. Demonstrate some level of runtime or other sanity
>         testing.
>     ___ You have ^M present in some of your files. These have to be
>         removed.
>     ___ You have needlessly changed whitespace or added whitespace
>         crimes like trailing spaces, or spaces before tabs.
>     ___ You have mixed real technical changes with whitespace and
>         other cosmetic code cleanup changes. These have to be separate
>         commits.
>     ___ You need to refactor your submission into logical chunks;
>         there is too much content in a single commit.
>     ___ You have extraneous garbage in your review (merge commits
>         etc).
>     ___ You have giant attachments which should never have been sent;
>         instead you should place your content in a public tree to be
>         pulled.
>     ___ You have too many commits attached to an e-mail; resend as
>         threaded commits, or place in a public tree for a pull.
>     ___ You have resent this content multiple times without a clear
>         indication of what has changed between each re-send.
>     ___ You have failed to adequately and individually address all of
>         the comments and change requests that were proposed in the
>         initial review.
>     ___ You have a misconfigured ~/.hgrc file (i.e. username, email
>         etc).
>     ___ Your computer has a badly configured date and time, confusing
>         the threaded patch review.
>     ___ Your changes affect IPC mechanism, and you don't present any
>         results for in-service upgradability test.
>     ___ Your changes affect user manual and documentation; your patch
>         series does not contain the patch that updates the Doxygen
>         manual.
