See my comments marked [AndersW6]. regards, Anders Widell
On 02/26/2016 10:31 AM, Nhat Pham wrote: > > Hi Mahesh, > > Please see my comment below with [NhatPham6] > > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 26, 2016 12:31 PM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]> > *Cc:* [email protected]; 'Beatriz Brandao' > <[email protected]>; 'Minh Chau H' <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless state V2 > [#1621] > > Hi Nhat Pham, > > Please find my answers. > > -AVM > > On 2/26/2016 10:23 AM, Nhat Pham wrote: > > Hi Mahesh, > > Please see my answers below with [NhatPham5] > > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 26, 2016 11:17 AM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]> > *Cc:* [email protected]; 'Beatriz Brandao' > <[email protected]>; 'Minh Chau H' > <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless > state V2 [#1621] > > Hi Nhat Pham, > > >>[NhatPham4] To be more correct, the application will get > SA_AIS_ERR_BAD_HANDLE when trying to access the lost checkpoint > because all data was destroyed. > >>[AndersW4] If this is a problem we could re-create the > checkpoint with no sections in it. > > I also came across this approach: instead of destroying the > checkpoint information from the payload CPNDs (as the current patch does) and > returning SA_AIS_ERR_BAD_HANDLE to applications on the PLs, > the new patch V3 could check the possibility of re-creating > the checkpoint with its sections (this data can be sent from the PL to the > SC once the CPD comes up).
> > [NhatPham5] By "all data" here I mean the checkpoint node > information in the database controlled by cpnd (not the replica). In > this case, all replicas were lost. How can the checkpoint be > re-created with sections? > > [AVM] I know the replicas are lost; I am suggesting using the > `checkpoint node information` available at the PL CPND (only one CPND > will volunteer for this if multiple applications have it open) > to try to recreate the checkpoint as if a fresh > request had come from the CPA (the CPD assumes the request came all the way from > CPA-->CPND-->CPD, but it did not), > so that the CPD will create new replicas with clean/empty sections > instead of asking the application to recreate them. > > [Nhat Pham] I think it would be simpler and safer for the application to > re-create the checkpoint in this case. > > I checked the implementation. The cpnd which doesn't host the replica > doesn't maintain a section database. Thus, it can't restore the checkpoint > with sections. > [AndersW6] My suggestion was to re-create the checkpoint without any sections. If the sections were re-created, the application wouldn't know that data has been lost. I think the BAD_HANDLE approach is okay since we have used it in other services, but I see it as something of a hack solution that is not really in line with the specs. The specs never intended BAD_HANDLE to be something that can happen spontaneously on a previously valid handle, unless you are suffering from memory corruption. In the future we could consider the feasibility of avoiding spontaneous BAD_HANDLE where possible, and in CKPT I think it may be possible by re-creating the checkpoints. > > > I think the LOG streams are getting recreated > empty/fresh/with no data like this, using the data available at the LGA. > > For other cases, where the checkpoint replicas survive, the > checkpoint is restored when the SC is up again. > > Ex: A checkpoint is created on a PL. There are 3 replicas, created > on the SCs and the PL. The headless state happens.
After the SC is up, the > checkpoint is recovered. > > > > -AVM > > On 2/26/2016 8:11 AM, Nhat Pham wrote: > > Hi, > > Please see my comment below with [NhatPham4] > > Best regards, > > Nhat Pham > > *From:* Anders Widell [mailto:[email protected]] > *Sent:* Thursday, February 25, 2016 9:25 PM > *To:* A V Mahesh <[email protected]>; Nhat Pham > <[email protected]> > *Cc:* [email protected]; 'Beatriz > Brandao' <[email protected]>; 'Minh Chau H' > <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: Support > preserving and recovering checkpoint replicas during headless > state V2 [#1621] > > Hi! > > See my comments inline, marked [AndersW4]. > > regards, > Anders Widell > > On 02/25/2016 05:26 AM, A V Mahesh wrote: > > Hi Nhat Pham, > > Please see my comment below. > > -AVM > > On 2/25/2016 7:54 AM, Nhat Pham wrote: > > Hi Mahesh, > > Would you agree with the comment below? > > To summarize, the following are the comments so far: > > *Comment 1*: This functionality should be guarded by a check of > whether the Hydra configuration is enabled in IMM (attrName = > const_cast<SaImmAttrNameT>("scAbsenceAllowed")). > > Action: The code will be updated accordingly. > > [AndersW4] Just a question here: is this really needed? If the > code is already 100% backwards compatible when the headless > feature is disabled, what would be the point of reading the > configuration and taking different paths in the code depending > on it? Maybe the code is not 100% backwards compatible, and > in that case I agree that we need to read the configuration.
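For what it's worth, the check under discussion is small either way: cpnd would read the attribute once at start-up and branch when the last SC disappears. A minimal sketch of that gate, with the IMM read stubbed out (in the real code this would come from the IMM OM accessor API reading scAbsenceAllowed; the struct and helper names here are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stub: the real cpnd would fetch the "scAbsenceAllowed" attribute via the
 * IMM OM accessor API (saImmOmAccessorGet_2). Hard-wired here so the
 * sketch is self-contained; 0 means the headless feature is disabled. */
static uint32_t imm_get_sc_absence_allowed(void) { return 900; }

struct cpnd_state {
    bool sc_absence_allowed;  /* configuration, cached at start-up */
    bool replicas_preserved;  /* what we did when the last SC went down */
};

/* Read the configuration once during cpnd initialization. */
static void cpnd_read_config(struct cpnd_state *cb) {
    cb->sc_absence_allowed = (imm_get_sc_absence_allowed() != 0);
}

/* Called when the last system controller disappears: either keep the
 * replica data for recovery when an SC returns (headless enabled), or
 * destroy it as the original implementation does (headless disabled). */
static void cpnd_on_scs_down(struct cpnd_state *cb) {
    cb->replicas_preserved = cb->sc_absence_allowed;
}
```

As noted in the discussion, when scAbsenceAllowed is 0 the loss of both SCs triggers a cluster reboot anyway, so the disabled branch mostly documents intent rather than changing observable behaviour.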
> > The reason why I am asking is that I had the impression that > the code would only cause different behaviour in the cases > where both system controllers die at the same time, and this > cannot happen when the headless feature is disabled (or > rather: it can happen, but it would trigger an immediate > cluster restart, so any difference in behaviour after that > point is irrelevant). > > [NhatPham4] The code is backwards compatible when the headless > feature is disabled. > > With the V2 patch, cpnd will update cpd with recoverable checkpoint > data when an SC comes up after the headless state (from an implementation > point of view). > > In the current system, if the headless feature is disabled, the whole > cluster reboots. Thus all data is destroyed. > > With the V2 patch plus the scAbsenceAllowed check, cpnd destroys all > the checkpoint data (as in the original implementation). (From an > implementation point of view.) > > In the current system, if the headless feature is disabled, the whole > cluster reboots. Thus all data is destroyed. > > So if you ask whether the check is really needed in the current > situation, the answer is: not really. > > The check is just to make sure that all checkpoint data is > destroyed in case the headless feature is disabled. > > What do you think? > > *Comment 2*: To keep the scope of the CPSV service such that > non-collocated checkpoint creation is NOT_SUPPORTED if the > cluster is running with IMMSV_SC_ABSENCE_ALLOWED (the > headless-state configuration is enabled at the time of > cluster startup; currently it is not configurable at run time, so > there is no chance of a run-time configuration change). > > Action: No change in code. The CPSV still keeps > supporting non-collocated checkpoints even if > IMMSV_SC_ABSENCE_ALLOWED is enabled. > > >>[AndersW3] No, I think we ought to support > non-colocated checkpoints also when > IMMSV_SC_ABSENCE_ALLOWED is set. The fact that we have > "system controllers" is an implementation detail of > OpenSAF.
I don't think the CKPT SAF specification implies > that > >>non-colocated checkpoints must be fully replicated on > all the nodes in the cluster, and thus we must allow for the > possibility that all replicas are lost. It is not clear > exactly what to expect from the APIs when this happens, > but you could handle it in a similar way as the case >> > when all sections have been automatically deleted by the > checkpoint service because the sections have expired. > > [AVM] I am not in agreement with either comment; we cannot > handle this in a way similar to the section-expiration case > here. In the case of section expiration, the checkpoint replica > still exists; only the section is deleted. > > [AndersW4] If this is a problem we could re-create the > checkpoint with no sections in it. > > > The CPSV specification says that if two replicas exist > (in our case, only on the SCs) at a certain point in time, > and the nodes hosting both of these replicas are > administratively taken out of service, the > Checkpoint Service should allocate another replica on > another node while this node is not available; > please check section `3.1.7.2 Non-Collocated > Checkpoints` of the cpsv specification. > > [AndersW4] The spec actually says "may" rather than "should" > in this section. And the purpose of allocating another replica > is to "enhance the availability of checkpoints". When I read > this section, I think it is quite clear that the spec does not > perceive non-colocated checkpoints as guaranteed to preserve > data in the case of node failures: > > "The Checkpoint Service may create replicas > other than the ones that may be created when opening a > checkpoint. These other > replicas can be useful to enhance the availability of > checkpoints. For example, if two > replicas exist at a certain point in time, and the node > hosting one of these replicas is > administratively taken out of service, the Checkpoint Service > may allocate another > replica on another node while this node is not available."
> > So, data can be lost due to (multiple) node failures. There > are two other cases where data is lost: automatic deletion of > the entire checkpoint if it has not been opened by any process > for the duration of the retention time, and automatic deletion > of sections within a checkpoint when the sections reach their > expiration times. The APIs specify the return code > SA_AIS_ERR_NOT_EXIST to signal that a specific section, or the > entire checkpoint, doesn't exist. Thus, there is support in the > API for reporting loss of checkpoint data (whatever the reason > for the loss may be). If the headless feature is disabled, we > cannot lose non-colocated checkpoints due to node failures, > but when the headless feature is enabled we can. > > > For example, take the case of an application on a > PL that is in the middle of writing to non-collocated checkpoint > sections (physical replicas exist only on the SCs): > what will happen to the application on the PL? > OK, let us consider that the user agrees to lose the checkpoint > and wants to recreate it: what will happen to the cpnd DB > on the PL, and what about the complexity involved in cleaning it up? > This will lead to a lot of maintainability > issues. > > [AndersW4] The thing that will happen (from an application's > perspective) is that you will get the SA_AIS_ERR_NOT_EXIST > error code from the CKPT API when trying to access the lost > checkpoint. I don't know the complexity at the code level for > implementing this, but isn't this already supported by the > code which is out on review (Nhat, correct me if I am wrong)? > > [NhatPham4] To be more correct, the application will get > SA_AIS_ERR_BAD_HANDLE when trying to access the lost > checkpoint because all data was destroyed. > > But for opening the checkpoint (not creating), it will get > SA_AIS_ERR_NOT_EXIST.
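The two return codes described above suggest one recovery path on the application side: treat both BAD_HANDLE and NOT_EXIST as "the data is gone", re-open the checkpoint with the create flag, and re-populate it. A hedged sketch of that logic, with the SAF calls replaced by stubs so the fragment is self-contained (the SaAisErrorT values are the ones from the AIS specifications; the helper names are invented):

```c
#include <assert.h>

/* SaAisErrorT values involved, as defined by the AIS specifications. */
enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_BAD_HANDLE = 9,
    SA_AIS_ERR_NOT_EXIST = 12,
};

/* Stub for saCkptCheckpointWrite() against a checkpoint whose replicas
 * were lost during the headless state: the first write fails. */
static int stub_write_rc = SA_AIS_ERR_BAD_HANDLE;
static int ckpt_write(int handle) { (void)handle; return stub_write_rc; }

/* Stub for saCkptCheckpointOpen() with SA_CKPT_CHECKPOINT_CREATE:
 * re-creates the checkpoint (empty) and returns a fresh handle. */
static int ckpt_open_create(int *handle) {
    *handle = 42;              /* arbitrary fresh handle */
    stub_write_rc = SA_AIS_OK; /* stub: writes work again after re-create */
    return SA_AIS_OK;
}

/* Application-side recovery: a stale handle (BAD_HANDLE) and a vanished
 * checkpoint (NOT_EXIST) both mean the data is gone, so re-create the
 * checkpoint and write the application's data from scratch. */
static int write_with_recovery(int *handle) {
    int rc = ckpt_write(*handle);
    if (rc == SA_AIS_ERR_BAD_HANDLE || rc == SA_AIS_ERR_NOT_EXIST) {
        if (ckpt_open_create(handle) != SA_AIS_OK)
            return rc;            /* still headless: caller retries later */
        rc = ckpt_write(*handle); /* re-populate the empty checkpoint */
    }
    return rc;
}
```

Either way the application learns that data was lost, which is the property Anders argues the re-created-but-empty checkpoint must preserve.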
> > > On top of that, the CKPT SAF specification says > that a non-collocated checkpoint and all its sections > should survive as long as the Checkpoint Service is running on the > cluster, and the > replica is USER private data (not OpenSAF > state); losing any USER private data is not acceptable. > > *Comment 3*: This is about the case where the checkpoint node > director (cpnd) crashes during the headless state. In this > case the cpnd can't finish starting because it can't > initialize the CLM service. > > Then, after a timeout, the AMF triggers a restart again. > Finally, the node is rebooted. > > It is expected that this problem should not lead to a > node reboot. > > Action: No change in code. This is a limitation of > the system during the headless state. > > > [AVM] Code changes are required: the CPSV CLM integration code > needs to be revisited to handle TRY_AGAIN. > > If you agree with the summary above, I'll update the code > and send out the V3 for review. > > Best regards, > > Nhat Pham > > *From:* Anders Widell [mailto:[email protected]] > *Sent:* Wednesday, February 24, 2016 9:26 PM > *To:* Nhat Pham <[email protected]>; 'A V Mahesh' > <[email protected]> > *Cc:* [email protected]; 'Beatriz > Brandao' <[email protected]>; 'Minh Chau H' > <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for cpsv: > Support preserving and recovering checkpoint replicas > during headless state V2 [#1621] > > See my comments inline, marked [AndersW3]. > > regards, > Anders Widell > > On 02/24/2016 07:32 AM, Nhat Pham wrote: > > Hi Mahesh and Anders, > > Please see my comments below.
> > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Wednesday, February 24, 2016 11:06 AM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]> > *Cc:* [email protected]; > 'Beatriz Brandao' <[email protected]>; 'Minh Chau > H' <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request for > cpsv: Support preserving and recovering checkpoint > replicas during headless state V2 [#1621] > > Hi Nhat Pham, > > If component (CPND) restart is allowed while the > controllers are absent, then before requesting that CLM > change the return value to **SA_AIS_ERR_TRY_AGAIN**, > we need to get clarification from the AMF folks on a few > things. If CPND keeps getting > SA_AIS_ERR_TRY_AGAIN and the component-restart timer expires, > then AMF will restart the component again (this > becomes cyclic), and after the configured saAmfSGCompRestartMax > value the node goes for reboot as the next > level of escalation. In that case we may require changes in AMF as > well, so that it does not act on the component-restart timeout > while the controllers are absent (I am not sure whether that > is a deviation from the AMF specification). > > */[Nhat Pham] In headless state, I'm not sure > about this either. /* > > */@Anders: Would you have comments about this?/* > > [AndersW3] Ok, first of all I would like to point out > that normally, the OpenSAF checkpoint node director > should not crash. So we are talking about a situation > where multiple faults have occurred: first both the > active and the standby system controllers have died, > and then shortly afterwards - before we have a new > active system controller - the checkpoint node > director also crashes. Sure, these may not be totally > independent events, but still there are a lot of > faults that have happened within a short period of > time.
We should test the node director and make sure > it doesn't crash in this type of scenario. > > Now, let's consider the case where we have a fault in > the node director that causes it to crash during the > headless state. The general philosophy of the headless > feature is that when things work fine - i.e. in the > absence of faults - we should be able to continue > running while the system controllers are absent. > However, if a fault happens during the headless state, > we may not be able to recover from the fault until > there is an active system controller. AMF does provide > support for restarting components, but as you have > pointed out, the node director will be stuck in a > TRY_AGAIN loop immediately after it has been > restarted. So this means that if the node director > crashes during the headless state, we have lost the > checkpoint functionality on that node and we will not > get it back until there is an active system > controller. Other services like IMM will still work > for a while, but AMF will as you say eventually > escalate the checkpoint node director failure to a > node restart and then the whole node is gone. The node > will not come back until we have an active system > controller. So to summarize: there is very limited > support for recovering from faults that happen during > the headless state. The full recovery will not happen > until we have an active system controller. > > Please do incorporate the current comments (from a design > perspective) and republish the patch; I will > re-test the V3 patch and provide review comments on > functional issues/bugs if I find any. > > One important note: in the new patch, let us not > have the complexity of allowing non-collocated > checkpoint creation and then documenting that in > some scenarios > non-collocated checkpoint replicas are not > recoverable. Why? Because the replica is USER > private data (not OpenSAF state), and losing USER > private data is not acceptable.
> So let us keep the scope of the CPSV service such that > non-collocated checkpoint creation is NOT_SUPPORTED > if the cluster is running with > IMMSV_SC_ABSENCE_ALLOWED (the headless-state > configuration is enabled at the time of cluster > startup; currently it is not configurable at run time, so > there is no chance of a run-time configuration change). > > We can provide support for non-collocated checkpoints in > subsequent enhancements, with a solution such as: > a replica is also created on the lowest-node-ID PL for > non-collocated checkpoints (a maximum of three replicas in the > cluster, regardless of where the non-collocated checkpoint is opened). > > So for now, regardless of whether the heads (SCs) > exist or not, CPSV should return > SA_AIS_ERR_NOT_SUPPORTED in an > IMMSV_SC_ABSENCE_ALLOWED-enabled cluster, > and let us document it as well. > > */[Nhat Pham] The patch is to limit losing > replicas and checkpoints in the headless case./* > > */In case both replicas are located on the SCs and they > reboot, losing the checkpoint is unpreventable with the > current design after the headless state./* > > */Even if we implement the proposal "/*a maximum of three > replicas in the cluster regardless of where the > non-collocated checkpoint is opened*/", there is still the > case where the checkpoint is lost. Ex.: the SCs and > the PL which hosts the replica reboot at the same time./* > > */In case /*IMMSV_SC_ABSENCE_ALLOWED is disabled, if > both SCs reboot, this leads to a whole-cluster reboot. > Then the checkpoint is lost. > > */What I mean is that there are cases where the > checkpoint is lost. The point is what we can do to > limit losing data./* > > */For the proposal of rejecting creation of > non-collocated checkpoints in case > of/* IMMSV_SC_ABSENCE_ALLOWED being enabled, I think > this will lead to a backward-compatibility problem. > > */@Anders: What do you think about rejecting > creation of non-collocated checkpoints in case of > /*IMMSV_SC_ABSENCE_ALLOWED being enabled? > > [AndersW3] No, I think we ought to support > non-colocated checkpoints also when > IMMSV_SC_ABSENCE_ALLOWED is set.
The fact that we have > "system controllers" is an implementation detail of > OpenSAF. I don't think the CKPT SAF specification > implies that non-colocated checkpoints must be fully > replicated on all the nodes in the cluster, and thus > we must have the possibility that all replicas are > lost. It is not clear exactly what to expect from the > APIs when this happens, but you could handle it in a > similar way as the case when all sections have been > automatically deleted by the checkpoint service > because the sections have expired. > > > -AVM > > On 2/24/2016 6:51 AM, Nhat Pham wrote: > > Hi Mahesh, > > Do you have any further comments? > > Best regards, > > Nhat Pham > > *From:* A V Mahesh > [mailto:[email protected]] > *Sent:* Monday, February 22, 2016 10:37 AM > *To:* Nhat Pham <[email protected]> > <mailto:[email protected]>; 'Anders > Widell' <[email protected]> > <mailto:[email protected]> > *Cc:* [email protected] > <mailto:[email protected]>; > 'Beatriz Brandao' > <[email protected]> > <mailto:[email protected]>; 'Minh > Chau H' <[email protected]> > <mailto:[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review Request > for cpsv: Support preserving and recovering > checkpoint replicas during headless state V2 > [#1621] > > Hi, > > >>BTW, have you finished the review and test? > > I will finish by today. > > -AVM > > On 2/22/2016 7:48 AM, Nhat Pham wrote: > > Hi Mahesh and Anders, > > Please see my comment below. > > BTW, have you finished the review and test? 
> > Best regards, > > Nhat Pham > > *From:* A V Mahesh [mailto:[email protected]] > *Sent:* Friday, February 19, 2016 2:28 PM > *To:* Nhat Pham <[email protected]>; 'Anders Widell' > <[email protected]>; 'Minh Chau H' > <[email protected]> > *Cc:* [email protected]; > 'Beatriz Brandao' <[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review > Request for cpsv: Support preserving and > recovering checkpoint replicas during > headless state V2 [#1621] > > Hi Nhat Pham, > > On 2/19/2016 12:28 PM, Nhat Pham wrote: > > Could you please give more detailed > information about steps to reproduce > the problem below? Thanks. > > > Don't see this as a specific bug; we need > to look at the issue from the point of view of a > CLM-integrated service. > Considering Anders Widell's explanation > of CLM application behavior during the > headless state, > we need to re-integrate CPND with CLM > (before this headless-state feature there was no > case of CPND existing in the absence of > CLMD, but now there is). > > And this will need to be consistent across > all services that are integrated with CLM > (you may need some changes in CLM as well). > > */[Nhat Pham] I think CLM should return > /*SA_AIS_ERR_TRY_AGAIN in this case. > > @Anders: What do you think? > > To start with, let us consider the case where CPND > is restarted on a payload (PL) during the headless state > while an application is running on that PL. > > */[Nhat Pham] Regarding CPND as a CLM > application, I'm not sure what it can do > in this case. In case it restarts, it is > monitored by AMF./* > > */If it blocks for too long, AMF will also > trigger a node reboot./* > > */In my test case, the CPND gets blocked by > CLM. It doesn't get out of the > saClmInitialize.
How do you get the “/ER > cpnd clm init failed with return value:31/”?/* > > */Following is the cpnd trace./* > > Feb 22 8:56:41.188122 osafckptnd > [736:cpnd_init.c:0183] >> cpnd_lib_init > > Feb 22 8:56:41.188332 osafckptnd > [736:cpnd_init.c:0412] >> cpnd_cb_db_init > > Feb 22 8:56:41.188600 osafckptnd > [736:cpnd_init.c:0437] << cpnd_cb_db_init > > Feb 22 8:56:41.188778 osafckptnd > [736:clma_api.c:0503] >> saClmInitialize > > Feb 22 8:56:41.188945 osafckptnd > [736:clma_api.c:0593] >> clmainitialize > > Feb 22 8:56:41.190052 osafckptnd > [736:clma_util.c:0100] >> clma_startup: > clma_use_count: 0 > > Feb 22 8:56:41.190273 osafckptnd > [736:clma_mds.c:1124] >> clma_mds_init > > Feb 22 8:56:41.190825 osafckptnd > [736:clma_mds.c:1170] << clma_mds_init > > -AVM > > On 2/19/2016 12:28 PM, Nhat Pham wrote: > > Hi Mahesh, > > Could you please give more detailed > information about steps to reproduce > the problem below? Thanks. > > Best regards, > > Nhat Pham > > *From:* A V Mahesh > [mailto:[email protected]] > *Sent:* Friday, February 19, 2016 1:06 PM > *To:* Anders Widell > <[email protected]> > <mailto:[email protected]>; > Nhat Pham <[email protected]> > <mailto:[email protected]>; > 'Minh Chau H' > <[email protected]> > <mailto:[email protected]> > *Cc:* > [email protected] > <mailto:[email protected]>; > 'Beatriz Brandao' > <[email protected]> > <mailto:[email protected]> > *Subject:* Re: [PATCH 0 of 1] Review > Request for cpsv: Support preserving > and recovering checkpoint replicas > during headless state V2 [#1621] > > Hi Anders Widell, > Thanks for the detailed explanation > about CLM during headless state. 
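The fix being discussed for comment 3 amounts to a bounded retry loop around CLM initialization in cpnd, instead of failing cpnd_lib_init on the first error. A sketch of such a loop, with saClmInitialize() stubbed so the fragment is self-contained (the retry bound and helper names are assumptions; in practice the bound must stay below the AMF component-registration timeout, or AMF will still escalate to a node reboot):

```c
#include <assert.h>

/* SaAisErrorT values involved; 31 (SA_AIS_ERR_UNAVAILABLE) matches the
 * "return value:31" seen in the syslog below. */
enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_UNAVAILABLE = 31,
};

/* Stub for saClmInitialize(): fails while there is no active CLM
 * director (headless state) and succeeds once an SC is back. */
static int failures_left = 3;
static int clm_initialize(void) {
    return (failures_left-- > 0) ? SA_AIS_ERR_TRY_AGAIN : SA_AIS_OK;
}

/* Bounded retry around CLM initialization at cpnd start-up: retry on
 * TRY_AGAIN (and, until CLM itself is changed, on UNAVAILABLE) instead
 * of failing cpnd_lib_init outright - the immediate failure is what
 * escalates into the node reboot discussed in this thread. */
static int cpnd_clm_init_with_retry(int max_attempts, int *attempts_used) {
    int attempts = 0;
    int rc;
    do {
        rc = clm_initialize();
        attempts++;
        /* a real implementation would sleep/back off between attempts */
    } while ((rc == SA_AIS_ERR_TRY_AGAIN || rc == SA_AIS_ERR_UNAVAILABLE) &&
             attempts < max_attempts);
    *attempts_used = attempts;
    return rc;
}
```

Note that this only helps if CLM service becomes reachable again before the bound is hit; during a long headless period the loop still ends in failure, which is the limited-recovery situation Anders describes above.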
> > Hi Nhat Pham, > > Comment 3: > Please see below the problem I was > anticipating; I am now seeing it during > CLMD absence (during the headless state), > so CPND/CLMA now need to address the > case below. Currently cpnd clm init > fails with return value > SA_AIS_ERR_UNAVAILABLE, > but it should be SA_AIS_ERR_TRY_AGAIN. > > > ================================================== > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO NODE STATE-> > IMM_NODE_FULLY_AVAILABLE 17418 > Feb 19 11:18:28 PL-4 osafimmloadd: NO > Sync ending normally > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Epoch set to 9 in ImmModel > Feb 19 11:18:28 PL-4 cpsv_app: IN > Received PROC_STALE_CLIENTS > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 42 > (MsgQueueService132111) <108, 2040f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 43 > (MsgQueueService131855) <0, 2030f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 44 > (safLogService) <0, 2010f> > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO SERVER STATE: > IMM_SERVER_SYNC_SERVER --> > IMM_SERVER_READY > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 45 > (safClmService) <0, 2010f> > *Feb 19 11:18:28 PL-4 > osafckptnd[7718]: ER cpnd clm init > failed with return value:31 > Feb 19 11:18:28 PL-4 osafckptnd[7718]: > ER cpnd init failed > Feb 19 11:18:28 PL-4 osafckptnd[7718]: > ER cpnd_lib_req FAILED > Feb 19 11:18:28 PL-4 osafckptnd[7718]: > __init_cpnd() failed* > Feb 19 11:18:28 PL-4 osafclmna[5432]: > NO > safNode=PL-4,safCluster=myClmCluster > Joined cluster, nodeid=2040f > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO AVD NEW_ACTIVE, adest:1 > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO Sending node up due to > NCSMDS_NEW_ACTIVE > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 1 SISU states sent > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 1 SU states sent > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 7 CSICOMP states synced > Feb 19 11:18:28 PL-4 osafamfnd[5441]: > NO 7 SU states sent > Feb 19 11:18:28 PL-4 osafimmnd[5422]: > NO Implementer connected: 46 > (safAmfService) <0, 2010f> > Feb 19 11:18:30 PL-4 osafamfnd[5441]: > NO > 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Component > or SU restart probation timer expired > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > NO Instantiation of > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > failed > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > NO Reason: component registration > timer expired > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > WA > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Presence State RESTARTING => > INSTANTIATION_FAILED > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > NO Component Failover trigerred for > 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF': > Failed component: > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > ER > > 'safComp=CPND,safSu=PL-4,safSg=NoRed,safApp=OpenSAF'got > Inst failed > Feb 19 11:18:35 PL-4 osafamfnd[5441]: > Rebooting OpenSAF NodeId = 132111 EE > Name = , Reason: NCS component > Instantiation failed, OwnNodeId = > 132111, SupervisionTime = 60 > Feb 19 11:18:36 PL-4 opensaf_reboot: > Rebooting local node; timeout=60 > Feb 19 11:18:39 PL-4 kernel: [ > 4877.338518] md: stopping all md devices. > > ================================================== > > -AVM > > On 2/15/2016 5:11 PM, Anders Widell wrote: > > Hi! > > Please find my answer inline, > marked [AndersW]. > > regards, > Anders Widell > > On 02/15/2016 10:38 AM, Nhat Pham > wrote: > > Hi Mahesh, > > It's good. Thank you. :) > > [AVM] Upon rejoining of the > SCs, the replica should be > re-created regardless > of whether another application opens > it on PL4.
> (Note: this > comment is based on your > explanation; I have not yet > reviewed/tested. > Currently I > am struggling with the SCs not > rejoining > after the headless state; I can > provide you more on this once > I complete my > review/testing.) > > [Nhat] To make cloud > resilience work, you need the > patches from the other > services (log, amf, clm, ntf). > @Minh: I heard that you > created a tar file which > includes all the patches. Could you > please send it to Mahesh? Thanks > > [AVM] I understand that. > Before I comment more on > this, please allow me to > understand; > I am still > not very clear on the headless > design in detail. > For example, > cluster membership of the PLs > during the headless state: > in the absence > of the SCs (CLMD), are the PLs > considered > cluster nodes or not (cluster > membership)? > > [Nhat] I don't know much about > this. > @Anders: Could you please > comment on this? Thanks > > [AndersW] First of all, keep in > mind that the "headless" state > should ideally not last a very > long time. Once we have the spare > SC feature in place (ticket > [#79]), a new SC should become > active within a matter of a few > seconds after we have lost both > the active and the standby SC. > > I think you should view the state > of the cluster in the headless > state in the same way as you view > the state of the cluster during a > failover between the active and > the standby SC. Imagine that the > active SC dies. It takes the > standby SC 1.5 seconds to detect > the failure of the active SC (this > is due to the TIPC timeout). If > you have configured the > PROMOTE_ACTIVE_TIMER, there is an > additional delay before the > standby takes over as active. What > is the state of the cluster during > the time after the active SC > failed and before the standby > takes over? > > The state of the cluster while it > is headless is very similar.
The > difference is that this state may > last a little bit longer (though > not more than a few seconds, until > one of the spare SCs becomes > active). Another difference is > that we may have lost some state. > With a "perfect" implementation of > the headless feature we should not > lose any state at all, but with > the current set of patches we do > lose state. > > So specifically if we talk about > cluster membership and ask the > question: is a particular PL a > member of the cluster or not > during the headless state? Well, > if you ask CLM about this during > the headless state, then you will > not know - because CLM doesn't > provide any service during the > headless state. If you keep > retrying your query to CLM, you > will eventually get an answer - > but you will not get this answer > until there is an active SC again > and we have exited the headless > state. When viewed in this way, > the answer to the question about a > node's membership is undefined > during the headless state, since > CLM will not provide you with any > answer until there is an active SC. > > However, if you asked CLM about > the node's cluster membership > status before the cluster went > headless, you probably saved a > cached copy of the cluster > membership state. Maybe you also > installed a CLM track callback and > intend to update this cached copy > every time the cluster membership > status changes. The question then > is: can you continue using this > cached copy of the cluster > membership state during the > headless state? The answer is YES: > since CLM doesn't provide any > service during the headless state, > it also means that the cluster > membership view cannot change > during this time. Nodes can of > course reboot or die, but CLM will > not notice and hence the cluster > view will not be updated. You can > argue that this is bad because the > cluster view doesn't reflect > reality, but notice that this will > always be the case.
We can never > propagate information > instantaneously, and detection of > node failures will take 1.5 > seconds due to the TIPC timeout. > You can never be sure that a node > is alive at this very moment just > because CLM tells you that it is a > member of the cluster. If we are > unfortunate enough to lose both > system controller nodes > simultaneously, updates to the > cluster membership view will be > delayed a few seconds longer than > usual. > > > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh > [mailto:[email protected]] > Sent: Monday, February 15, > 2016 11:19 AM > To: Nhat Pham > <[email protected]>; > [email protected] > > Cc: > [email protected]; > 'Beatriz Brandao' > <[email protected]> > > Subject: Re: [PATCH 0 of 1] > Review Request for cpsv: > Support preserving and > recovering checkpoint replicas > during headless state V2 [#1621] > > Hi Nhat Pham, > > How did your holiday go? > > Please find my comments below. > > On 2/15/2016 8:43 AM, Nhat > Pham wrote: > > Hi Mahesh, > > For comment 1, the > patch will be updated > accordingly. > > [AVM] Please hold; I will > provide more comments this week, so we can > have a consolidated V3. > > For comment 2, I think > the CKPT service will not > be backward > compatible if > scAbsenceAllowed is true. > The client can't create a > non-collocated checkpoint > on the SCs. > > Furthermore, this solution > only protects the CKPT > service from the > case "the non-collocated > checkpoint is created on > an SC"; > there are still cases > where the replicas are > completely lost. Ex.: > > - The non-collocated > checkpoint is created on a > PL. The PL reboots. Both > replicas are now located on the > SCs. Then the headless state > happens. All replicas are > lost.
> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during headless state.
> - The non-collocated checkpoint is created on PL3. This checkpoint is also opened on PL4. Then the SCs and PL3 reboot.
>
> [AVM] Upon rejoining of the SCs, the replica should be re-created regardless of whether another application opened it on PL4. (Note: this comment is based on your explanation; I have not yet reviewed/tested it. Currently I am struggling with SCs not rejoining after headless state; I can provide more on this once I complete my review/testing.)
>
> In this case, all replicas are lost and the client has to create it again.
>
> In case multiple nodes (including SCs) reboot, losing replicas is unpreventable. The patch is to recover the checkpoints in the possible cases. What do you think?
>
> [AVM] I understand that. Before I comment more on this, please allow me to understand; I am still not very clear on the headless design in detail.
>
> For example, regarding cluster membership of PLs during headless state: in the absence of SCs (CLMD), are the PLs considered cluster nodes or not (cluster membership)?
> - If they are considered non-cluster nodes, the Checkpoint Service API should leverage the SA Forum Cluster Membership Service, and APIs can fail with SA_AIS_ERR_UNAVAILABLE.
> - If they are considered cluster nodes, we need to follow all the rules defined in the SAI-AIS-CKPT-B.02.02 specification.
>
> So give me some more time to review it completely, so that we can have a consolidated patch V3.
>
> -AVM
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, February 12, 2016 11:10 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]; Beatriz Brandao <[email protected]>
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Comment 2:
>
> After incorporating comment 1, all the limitations should be prevented based on whether the Hydra configuration is enabled in IMM.
>
> For example: if some application tries to create a non-collocated checkpoint whose active replica would be generated/located on a SC, then regardless of whether the heads (SCs) exist, it should return SA_AIS_ERR_NOT_SUPPORTED. In other words, rather than allowing a non-collocated checkpoint to be created while the heads (SCs) exist, only for it to become unrecoverable after the heads (SCs) rejoin.
>
> =============================================================================
> Limitation: The CKPT service doesn't support recovering checkpoints in the following cases:
> - The checkpoint which is unlinked before headless.
> - The non-collocated checkpoint has its active replica located on a SC.
> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during headless state.
> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint, and the client must re-open the checkpoint.
> =============================================================================
>
> -AVM
>
> On 2/11/2016 12:52 PM, A V Mahesh wrote:
> Hi,
>
> I just started reviewing the patch; I will be giving comments as soon as I come across any, to save some time.
>
> Comment 1:
> This functionality should be under a check for whether the Hydra configuration is enabled in IMM: attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
> Please see the example of how the LOG/AMF services implemented it.
>
> -AVM
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
> Hi Mahesh,
>
> As described in the README, the CKPT service returns the SA_AIS_ERR_TRY_AGAIN fault code in this case. I guess it's the same for other services.
>
> @Anders: Could you please confirm this?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 2:11 PM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi,
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> - The behavior of applications will be consistent with other SAF services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
> In the absence of the Directors (SCs), what are the expected return values of the SAF APIs (for all services) which are not in a position to provide service at that moment?
>
> I think all services should return the same SAF errors. I think currently we don't have that; maybe Anders Widell will help us.
>
> -AVM
>
> On 1/29/2016 11:45 AM, Nhat Pham wrote:
> Hi Mahesh,
>
> Please see the attachment for the README. Let me know if there is any more information required.
>
> Regarding your comments:
> - During headless state applications may behave like during the CPND restart case.
> [Nhat] Headless state and CPND restart are different events. Thus, the behavior is different. Headless state is a case where both SCs go down.
>
> - The behavior of applications will be consistent with other SAF services like IMM/AMF behavior during headless state.
> [Nhat] I'm not clear what you mean by "consistent"?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Friday, January 29, 2016 11:12 AM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]
>
> Hi Nhat Pham,
>
> I started reviewing this patch, so could you please provide a README file with scope and limitations? That will help define the testing/reviewing scope.
> Following are the minimum things we can keep in mind while reviewing/accepting the patch:
>
> - Not affecting existing functionality.
> - During headless state applications may behave like during the CPND restart case.
> - The minimum functionality of applications works.
> - The behavior of applications will be consistent with other SAF services like IMM/AMF behavior during headless state.
>
> So please do provide any additional details in the README if any of the above is deviated from, to allow users to know about the limitations/deviations.
>
> -AVM
>
> On 1/4/2016 3:15 PM, Nhat Pham wrote:
> Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
> Review request for Trac Ticket(s): #1621
> Peer Reviewer(s): [email protected]; [email protected]
> Pull request to: [email protected]
> Affected branch(es): default
> Development branch: default
>
> --------------------------------
> Impacted area        Impact y/n
> --------------------------------
> Docs                     n
> Build system             n
> RPM/packaging            n
> Configuration files      n
> Startup scripts          n
> SAF services             y
> OpenSAF services         n
> Core libraries           n
> Samples                  n
> Tests                    n
> Other                    n
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
> Author: Nhat Pham <[email protected]>
> Date: Mon, 04 Jan 2016 16:34:33 +0700
>
> cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>
> Background:
> ----------
> This enhancement supports preserving checkpoint replicas in case both SCs go down (headless state) and
> recovering replicas in case one of the SCs comes up again. If both SCs go down, checkpoint replicas on surviving nodes still remain. When a SC is available again, surviving replicas are automatically registered in the SC checkpoint database. Content in surviving replicas is kept intact and synchronized to new replicas.
>
> When no SC is available, client API calls changing checkpoint configuration, which require SC communication, are rejected. Client API calls reading and writing existing checkpoint replicas still work.
>
> Limitation: The CKPT service does not support recovering checkpoints in the following cases:
> - The checkpoint which is unlinked before headless.
> - The non-collocated checkpoint has its active replica located on a SC.
> - The non-collocated checkpoint has its active replica located on a PL, and this PL restarts during headless state.
> In these cases, the checkpoint replica is destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint, and the client must re-open the checkpoint.
>
> While in headless state, accessing checkpoint replicas does not work if the node which hosts the active replica goes down. It will resume working when a SC is available again.
>
> Solution:
> ---------
> The solution for this enhancement includes 2 parts:
>
> 1. To destroy the un-recoverable checkpoints described above when both SCs are down: when both SCs are down, the CPND deletes un-recoverable checkpoint nodes and replicas on PLs. Then it requests the CPA to destroy the corresponding checkpoint node by using the new message CPA_EVT_ND2A_CKPT_DESTROY.
>
> 2.
> To update the CPD with checkpoint information: when an active SC is up after headless, the CPND will update the CPD with checkpoint information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND would create a new ckpt_id for the checkpoint, which might differ from the current ckpt_id, if CPD_EVT_ND2D_CKPT_CREATE were used. The CPD collects checkpoint information within 6 s. During this updating time, the following requests are rejected with fault code SA_AIS_ERR_TRY_AGAIN:
>
> - CPD_EVT_ND2D_CKPT_CREATE
> - CPD_EVT_ND2D_CKPT_UNLINK
> - CPD_EVT_ND2D_ACTIVE_SET
> - CPD_EVT_ND2D_CKPT_RDSET
>
> Complete diffstat:
> ------------------
> osaf/libs/agents/saf/cpa/cpa_proc.c        |  52 +
> osaf/libs/common/cpsv/cpsv_edu.c           |  43 +
> osaf/libs/common/cpsv/include/cpd_cb.h     |   3 +
> osaf/libs/common/cpsv/include/cpd_imm.h    |   1 +
> osaf/libs/common/cpsv/include/cpd_proc.h   |   7 +
> osaf/libs/common/cpsv/include/cpd_tmr.h    |   3 +-
> osaf/libs/common/cpsv/include/cpnd_cb.h    |   1 +
> osaf/libs/common/cpsv/include/cpnd_init.h  |   2 +
> osaf/libs/common/cpsv/include/cpsv_evt.h   |  20 +
> osaf/services/saf/cpsv/cpd/Makefile.am     |   3 +-
> osaf/services/saf/cpsv/cpd/cpd_evt.c       | 229 +
> osaf/services/saf/cpsv/cpd/cpd_imm.c       | 112 +
> osaf/services/saf/cpsv/cpd/cpd_init.c      |  20 +-
> osaf/services/saf/cpsv/cpd/cpd_proc.c      | 309 +
> osaf/services/saf/cpsv/cpd/cpd_tmr.c       |   7 +
> osaf/services/saf/cpsv/cpnd/cpnd_db.c      |  16 +
> osaf/services/saf/cpsv/cpnd/cpnd_evt.c     |  22 +
> osaf/services/saf/cpsv/cpnd/cpnd_init.c    |  23 +-
> osaf/services/saf/cpsv/cpnd/cpnd_mds.c     |  13 +
> osaf/services/saf/cpsv/cpnd/cpnd_proc.c    | 314 +-
> 20 files changed, 1189 insertions(+), 11 deletions(-)
>
> Testing Commands:
> -----------------
> -
>
> Testing, Expected Results:
> --------------------------
> -
>
> Conditions of Submission:
> -------------------------
> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>
> Arch        Built  Started  Linux distro
> -------------------------------------------
> mips          n      n
> mips64        n      n
> x86           n      n
> x86_64        n      n
> powerpc       n      n
> powerpc64     n      n
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries that need proper data filled in.
> ___ You have failed to nominate the proper persons for review and push.
> ___ Your patches do not have a proper short+long header.
> ___ You have grammar/spelling in your header that is unacceptable.
> ___ You have exceeded a sensible line length in your headers/comments/text.
> ___ You have failed to put a proper Trac Ticket # into your commits.
> ___ You have incorrectly put/left internal data in your comments/files (i.e. internal bug tracking tool IDs, product names etc).
> ___ You have not given any evidence of testing beyond basic build tests. Demonstrate some level of runtime or other sanity testing.
> ___ You have ^M present in some of your files. These have to be removed.
> ___ You have needlessly changed whitespace or added whitespace crimes like trailing spaces, or spaces before tabs.
> ___ You have mixed real technical changes with whitespace and other cosmetic code cleanup changes. These have to be separate commits.
> ___ You need to refactor your submission into logical chunks; there is too much content in a single commit.
> ___ You have extraneous garbage in your review (merge commits etc).
> ___ You have giant attachments which should never have been sent; instead you should place your content in a public tree to be pulled.
> ___ You have too many commits attached to an e-mail; resend as threaded commits, or place them in a public tree for a pull.
> ___ You have resent this content multiple times without a clear indication of what has changed between each re-send.
> ___ You have failed to adequately and individually address all of the comments and change requests that were proposed in the initial review.
> ___ You have a misconfigured ~/.hgrc file (i.e.
> username, email etc).
> ___ Your computer has a badly configured date and time, confusing the threaded patch review.
> ___ Your changes affect the IPC mechanism, and you don't present any results for an in-service upgradability test.
> ___ Your changes affect the user manual and documentation, and your patch series does not contain the patch that updates the Doxygen manual.

_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
