Hi Mahesh, Have you reviewed the patch?
Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Monday, February 29, 2016 1:15 PM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
 preserving and recovering checkpoint replicas during headless state V3
 [#1621]

Hi Nhat Pham,

I will review the V3 patch, do the final functional testing, and get back
to you soon. (It may take some time, as I also need to work on my published
MDS enhancements.)

-AVM

On 2/29/2016 9:39 AM, Nhat Pham wrote:
> Hi,
>
> Following is a summary of the updates in V3:
>
> Comment 1: This functionality should be enabled only if the Hydra
> configuration is enabled in IMM (attrName =
> const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
>
> Status: Included in V3
>
> Comment 2: Limit the scope of the CPSV service by making non-collocated
> checkpoint creation NOT_SUPPORTED if the cluster is running with
> IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration is enabled at
> cluster startup and is currently not configurable afterwards, so there is
> no chance of a run-time configuration change).
>
> Status: No change in code. CPSV still supports non-collocated
> checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>
> Comment 3: This is about the case where the checkpoint node director
> (cpnd) crashes during headless state. In this case the cpnd can't finish
> starting because it can't initialize the CLM service.
> Then, after a timeout, AMF triggers a restart again. Finally, the
> node is rebooted.
> It is expected that this problem should not lead to a node reboot.
>
> Status: Included in V3. CPND reinitializes the CLM service if the fault
> TRY_AGAIN is returned.
>
> Comment 4: The suggestion was to re-create the checkpoint without any
> sections in case all replicas are lost. If the sections were
> re-created, the application wouldn't know that data has been lost.
> I think the BAD_HANDLE approach is okay since we have used it in other
> services, but I see it as kind of a hack solution that is not really in
> line with the specs.
> The specs never intended BAD_HANDLE to be something that can happen
> spontaneously on a previously valid handle, unless you are suffering
> from memory corruption. In the future we could consider the
> feasibility of avoiding spontaneous BAD_HANDLE where possible, and in
> CKPT I think it may be possible by re-creating the checkpoints.
>
> Status: NOT included in V3.
> This change is quite large and requires a detailed design covering
> different scenarios. I would suggest creating an enhancement ticket for
> this. What do you think?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: Nhat Pham [mailto:[email protected]]
> Sent: Monday, February 29, 2016 11:06 AM
> To: [email protected]; [email protected]
> Cc: [email protected]
> Subject: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless state V3
> [#1621]
>
> Summary: cpsv: Support preserving and recovering checkpoint replicas
> during headless state V3 [#1621]
> Review request for Trac Ticket(s): 1621
> Peer Reviewer(s): [email protected]; [email protected]
> Pull request to: [email protected]
> Affected branch(es): default
> Development branch: default
>
> --------------------------------
> Impacted area       Impact y/n
> --------------------------------
>  Docs                    n
>  Build system            n
>  RPM/packaging           n
>  Configuration files     n
>  Startup scripts         n
>  SAF services            y
>  OpenSAF services        n
>  Core libraries          n
>  Samples                 n
>  Tests                   n
>  Other                   n
>
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset 8559fe4cea27efc8234f7cf779f3c7413efcd40f
> Author: Nhat Pham <[email protected]>
> Date:   Mon, 29 Feb 2016 11:02:15 +0700
>
> cpsv: Support preserving and recovering checkpoint replicas during
> headless state V3 [#1621]
>
> Background:
> ----------
> This enhancement preserves checkpoint replicas when both SCs are down
> (headless state) and recovers the replicas when one of the SCs is up
> again. If both SCs go down, checkpoint replicas on surviving nodes
> remain. When an SC is available again, surviving replicas are
> automatically registered in the SC checkpoint database. The content of
> surviving replicas is kept intact and synchronized to new replicas.
>
> When no SC is available, client API calls that change checkpoint
> configuration, which requires SC communication, are rejected. Client API
> calls that read and write existing checkpoint replicas still work.
>
> Limitation: The CKPT service does not support recovering checkpoints in
> the following cases:
> - The checkpoint was unlinked before headless state.
> - The non-collocated checkpoint has its active replica located on an SC.
> - The non-collocated checkpoint has its active replica located on a PL
>   and this PL restarts during headless state.
> In these cases, the checkpoint replica is destroyed. The fault code
> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
> checkpoint. The client must re-open the checkpoint.
>
> While in headless state, accessing checkpoint replicas does not work if
> the node which hosts the active replica goes down. Access works again
> once an SC is available.
>
> Solution:
> ---------
> The solution for this enhancement includes 2 parts:
>
> 1. Destroy the unrecoverable checkpoints described above when both SCs
> are down: When both SCs are down, the CPND deletes unrecoverable
> checkpoint nodes and replicas on PLs. Then it requests the CPA to
> destroy the corresponding checkpoint node by using the new message
> CPA_EVT_ND2A_CKPT_DESTROY.
>
> 2. Update CPD with checkpoint information: When an active SC is up after
> headless state, CPND updates CPD with checkpoint information by using
> the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
> CPD_EVT_ND2D_CKPT_CREATE. This is because, if CPD_EVT_ND2D_CKPT_CREATE
> were used, the CPND would create a new ckpt_id for the checkpoint, which
> might differ from the current ckpt id. The CPD collects checkpoint
> information within 6s. During this update period, the following requests
> are rejected with fault code SA_AIS_ERR_TRY_AGAIN:
> - CPD_EVT_ND2D_CKPT_CREATE
> - CPD_EVT_ND2D_CKPT_UNLINK
> - CPD_EVT_ND2D_ACTIVE_SET
> - CPD_EVT_ND2D_CKPT_RDSET
>
>
> Complete diffstat:
> ------------------
>  osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 ++++++++++++++++++++++++++
>  osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++
>  osaf/libs/common/cpsv/include/cpd_cb.h    |    4 ++
>  osaf/libs/common/cpsv/include/cpd_imm.h   |    2 +
>  osaf/libs/common/cpsv/include/cpd_proc.h  |    7 +++
>  osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>  osaf/libs/common/cpsv/include/cpnd_cb.h   |    3 +
>  osaf/libs/common/cpsv/include/cpnd_init.h |    3 +
>  osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 ++++++++++
>  osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>  osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  osaf/services/saf/cpsv/cpd/cpd_imm.c      |  202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  osaf/services/saf/cpsv/cpd/cpd_init.c     |   26 ++++++++++++-
>  osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 +++
>  osaf/services/saf/cpsv/cpnd/Makefile.am   |    6 ++-
>  osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++
>  osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   24 ++++++++++++
>  osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   34 ++++++++++++++++-
>  osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++
>  osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  429 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  21 files changed, 1423 insertions(+), 12 deletions(-)
>
>
> Testing Commands:
> -----------------
> -
>
> Testing, Expected Results:
> --------------------------
> -
>
>
> Conditions of Submission:
> -------------------------
> -
>
>
> Arch      Built     Started    Linux distro
> -------------------------------------------
> mips        n          n
> mips64      n          n
> x86         n          n
> x86_64      y          y
> powerpc     n          n
> powerpc64   n          n
>
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any
> checkmarks!]
>
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries
>     that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>     (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>     Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>     like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>     cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>     too much content in a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>     Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>     commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>     of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>     comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer has a badly configured date and time, confusing the
>     threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>     for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>     do not contain the patch that updates the Doxygen manual.
>
>
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> _______________________________________________
> Opensaf-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
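
For readers following the thread, the request handling described under "Solution", part 2, can be sketched roughly as below. This is only an illustrative model in Python, not the code in cpd_evt.c: the class CpdModel, its method handle(), the injected timestamps and the choice to accept CPD_EVT_ND2D_CKPT_INFO_UPDATE at any time are assumptions made for the sketch. Only the event names, the 6s collection period and the SA_AIS_ERR_TRY_AGAIN fault code come from the patch description.

```python
from enum import Enum


class AisErr(Enum):
    """Stand-ins for the SAF return codes named in the description."""
    OK = "SA_AIS_OK"
    TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"


# "The CPD collects checkpoint information within 6s."
UPDATE_WINDOW_SECONDS = 6.0

# Requests the description lists as rejected while information is collected.
REJECTED_DURING_UPDATE = {
    "CPD_EVT_ND2D_CKPT_CREATE",
    "CPD_EVT_ND2D_CKPT_UNLINK",
    "CPD_EVT_ND2D_ACTIVE_SET",
    "CPD_EVT_ND2D_CKPT_RDSET",
}


class CpdModel:
    """Hypothetical model of a director that became active after headless state."""

    def __init__(self, became_active_at):
        # End of the period in which node directors report surviving replicas.
        self.window_ends_at = became_active_at + UPDATE_WINDOW_SECONDS
        # checkpoint name -> ckpt_id as reported by a node director
        self.ckpt_ids = {}

    def handle(self, event, now, ckpt_name=None, ckpt_id=None):
        if event == "CPD_EVT_ND2D_CKPT_INFO_UPDATE":
            # Register the surviving replica under its existing ckpt_id
            # instead of allocating a new one (assumption: always accepted).
            self.ckpt_ids[ckpt_name] = ckpt_id
            return AisErr.OK
        if now < self.window_ends_at and event in REJECTED_DURING_UPDATE:
            # Configuration-changing requests must be retried later.
            return AisErr.TRY_AGAIN
        return AisErr.OK
```

With this model, a caller that receives TRY_AGAIN for CPD_EVT_ND2D_CKPT_CREATE simply retries once the collection period has elapsed, while the info-update path keeps the existing ckpt_id stable.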
