Hi Mahesh,

For comment 1, the patch will be updated accordingly.
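
As a rough illustration of the kind of check comment 1 asks for, here is a minimal sketch. It assumes the value of the IMM attribute scAbsenceAllowed has already been read and cached, and that 0 means the feature is disabled. The names ckpt_cb_t, sc_absence_allowed, ckpt_headless_enabled and ckpt_on_both_sc_down are hypothetical and are not taken from the patch or from the LOG/AMF code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block; the real CPD/CPND structures are not shown
 * in this thread and will differ. */
typedef struct {
    /* Assumed: cached value of the IMM attribute "scAbsenceAllowed",
     * where 0 means the feature is disabled. */
    uint32_t sc_absence_allowed;
} ckpt_cb_t;

typedef enum {
    CKPT_LEGACY_BEHAVIOUR,  /* feature off: keep the existing behaviour */
    CKPT_PRESERVE_REPLICAS  /* feature on: keep replicas for later recovery */
} ckpt_headless_action_t;

/* True when the headless-state handling should be active at all. */
static bool ckpt_headless_enabled(const ckpt_cb_t *cb)
{
    return cb != NULL && cb->sc_absence_allowed != 0;
}

/* Decide what to do when both SCs are reported down. */
static ckpt_headless_action_t ckpt_on_both_sc_down(const ckpt_cb_t *cb)
{
    return ckpt_headless_enabled(cb) ? CKPT_PRESERVE_REPLICAS
                                     : CKPT_LEGACY_BEHAVIOUR;
}
```

With a guard like this, every new code path for preserving and recovering replicas would be entered only when the configuration is enabled, so the existing behaviour stays untouched otherwise.
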
For comment 2, I think the CKPT service will not be backward compatible if
scAbsenceAllowed is true: the client can't create non-collocated checkpoints
on SCs.

Furthermore, this solution only protects the CKPT service from the case "The
non-collocated checkpoint is created on an SC". There are still cases where
the replicas are completely lost. For example:

- The non-collocated checkpoint is created on a PL. The PL reboots. Both
  replicas are now located on SCs. Then headless state happens. All replicas
  are lost.
- The non-collocated checkpoint has its active replica located on a PL, and
  this PL restarts during headless state.
- The non-collocated checkpoint is created on PL3. This checkpoint is also
  opened on PL4. Then the SCs and PL3 reboot. In this case, all replicas are
  lost and the client has to create the checkpoint again.

If multiple nodes (including the SCs) reboot, losing replicas is unavoidable.
The patch is meant to recover the checkpoints in the cases where recovery is
possible.

What do you think?

Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Friday, February 12, 2016 11:10 AM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]; Beatriz Brandao <[email protected]>
Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V2 [#1621]

Comment 2:
After incorporating comment 1, all the limitations should be prevented
whenever the Hydra configuration is enabled in IMM.

For example: if an application tries to create a non-collocated checkpoint
whose active replica would be located on an SC, then, regardless of whether
the heads (SCs) exist or not, the call should return
SA_AIS_ERR_NOT_SUPPORTED.

In other words, this is preferable to allowing a non-collocated checkpoint to
be created while the heads (SCs) exist and then having that checkpoint become
unrecoverable after the heads (SCs) rejoin.
=============================================================================================
> Limitation: The CKPT service doesn't support recovering checkpoints in the
> following cases:
> . The checkpoint is unlinked before headless state.
> . The non-collocated checkpoint has its active replica located on an SC.
> . The non-collocated checkpoint has its active replica located on a PL and
>   this PL restarts during headless state.
> In these cases, the checkpoint replica is destroyed. The fault code
> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the checkpoint
> in these cases. The client must re-open the checkpoint.
=============================================================================================

-AVM

On 2/11/2016 12:52 PM, A V Mahesh wrote:
> Hi,
>
> I just started reviewing the patch. I will give comments as soon as
> I come across any, to save some time.
>
> Comment 1:
> This functionality should be under a check that the Hydra configuration is
> enabled in IMM:
> attrName = const_cast<SaImmAttrNameT>("scAbsenceAllowed")
>
> Please see how the LOG/AMF services implemented it, as an example.
>
> -AVM
>
>
> On 1/29/2016 1:02 PM, Nhat Pham wrote:
>> Hi Mahesh,
>>
>> As described in the README, the CKPT service returns the
>> SA_AIS_ERR_TRY_AGAIN fault code in this case.
>> I guess it's the same for other services.
>>
>> @Anders: Could you please confirm this?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Friday, January 29, 2016 2:11 PM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
>> preserving and recovering checkpoint replicas during headless state
>> V2 [#1621]
>>
>> Hi,
>>
>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>> - The behavior of application will be consistent with other saf
>>>   services like imm/amf behavior during headless state.
>>> [Nhat] I'm not clear what you mean about "consistent"?
>> In the absence of the Director (SCs), what are the expected return values
>> of the SAF APIs (for all services) that are not in a position to provide
>> service at that moment?
>>
>> I think all services should return the same SAF errors. I think
>> currently we don't have that; maybe Anders Widel will help us.
>>
>> -AVM
>>
>>
>> On 1/29/2016 11:45 AM, Nhat Pham wrote:
>>> Hi Mahesh,
>>>
>>> Please see the attachment for the README. Let me know if there is
>>> any more information required.
>>>
>>> Regarding your comments:
>>> - during headless state applications may behave like during
>>>   CPND restart case
>>>   [Nhat] Headless state and CPND restart are different events. Thus,
>>>   the behavior is different. Headless state is a case where both SCs
>>>   go down.
>>>
>>> - The behavior of application will be consistent with other saf
>>>   services like imm/amf behavior during headless state.
>>>   [Nhat] I'm not clear what you mean about "consistent"?
>>>
>>> Best regards,
>>> Nhat Pham
>>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:[email protected]]
>>> Sent: Friday, January 29, 2016 11:12 AM
>>> To: Nhat Pham <[email protected]>; [email protected]
>>> Cc: [email protected]
>>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support
>>> preserving and recovering checkpoint replicas during headless state
>>> V2 [#1621]
>>>
>>> Hi Nhat Pham,
>>>
>>> I started reviewing this patch, so can you please provide a README file
>>> with scope and limitations? That will help to define the
>>> testing/reviewing scope.
>>>
>>> Following are the minimum things we can keep in mind while
>>> reviewing/accepting the patch:
>>>
>>> - Not affecting existing functionality
>>> - during headless state applications may behave like during
>>>   CPND restart case
>>> - The minimum functionality of the application works
>>> - The behavior of application will be consistent with
>>>   other saf services like imm/amf behavior during headless state.
>>>
>>> So please do provide any additional details in the README if any of the
>>> above is deviated from; that allows users to know about the
>>> limitations/deviations.
>>>
>>> -AVM
>>>
>>> On 1/4/2016 3:15 PM, Nhat Pham wrote:
>>>> Summary: cpsv: Support preserving and recovering checkpoint replicas during headless state [#1621]
>>>> Review request for Trac Ticket(s): #1621
>>>> Peer Reviewer(s): [email protected]; [email protected]
>>>> Pull request to: [email protected]
>>>> Affected branch(es): default
>>>> Development branch: default
>>>>
>>>> --------------------------------
>>>> Impacted area       Impact y/n
>>>> --------------------------------
>>>>  Docs                    n
>>>>  Build system            n
>>>>  RPM/packaging           n
>>>>  Configuration files     n
>>>>  Startup scripts         n
>>>>  SAF services            y
>>>>  OpenSAF services        n
>>>>  Core libraries          n
>>>>  Samples                 n
>>>>  Tests                   n
>>>>  Other                   n
>>>>
>>>>
>>>> Comments (indicate scope for each "y" above):
>>>> ---------------------------------------------
>>>>
>>>> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d
>>>> Author: Nhat Pham <[email protected]>
>>>> Date: Mon, 04 Jan 2016 16:34:33 +0700
>>>>
>>>> cpsv: Support preserving and recovering checkpoint replicas
>>>> during headless state [#1621]
>>>>
>>>> Background:
>>>> ----------
>>>> This enhancement supports preserving checkpoint replicas in case
>>>> both SCs are down (headless state) and recovering replicas when one
>>>> of the SCs is up again. If both SCs go down, checkpoint replicas on
>>>> surviving nodes still remain. When an SC is available again,
>>>> surviving replicas are automatically registered to the SC checkpoint
>>>> database. Content in surviving replicas is intact and synchronized
>>>> to new replicas.
>>>>
>>>> When no SC is available, client API calls changing checkpoint
>>>> configuration, which require SC communication, are rejected. Client
>>>> API calls reading and writing existing checkpoint replicas still work.
>>>>
>>>> Limitation: The CKPT service does not support recovering checkpoints
>>>> in the following cases:
>>>> - The checkpoint is unlinked before headless state.
>>>> - The non-collocated checkpoint has its active replica located on an
>>>>   SC.
>>>> - The non-collocated checkpoint has its active replica located on a
>>>>   PL and this PL restarts during headless state.
>>>> In these cases, the checkpoint replica is destroyed. The fault code
>>>> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
>>>> checkpoint in these cases. The client must re-open the checkpoint.
>>>>
>>>> While in headless state, accessing checkpoint replicas does not work
>>>> if the node which hosts the active replica goes down. It will work
>>>> again when an SC is available again.
>>>>
>>>> Solution:
>>>> ---------
>>>> The solution for this enhancement includes 2 parts:
>>>>
>>>> 1. To destroy the un-recoverable checkpoints described above when
>>>>    both SCs are down:
>>>>    When both SCs are down, the CPND deletes un-recoverable checkpoint
>>>>    nodes and replicas on PLs. Then it requests CPA to destroy the
>>>>    corresponding checkpoint node by using the new message
>>>>    CPA_EVT_ND2A_CKPT_DESTROY.
>>>>
>>>> 2. To update CPD with checkpoint information:
>>>>    When an active SC is up after headless state, CPND will update CPD
>>>>    with checkpoint information by using the new message
>>>>    CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of using
>>>>    CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND will create a
>>>>    new ckpt_id for the checkpoint, which might be different from the
>>>>    current ckpt id, if CPD_EVT_ND2D_CKPT_CREATE is used. The CPD
>>>>    collects checkpoint information within 6s.
>>>>    During this updating time, the following requests are rejected
>>>>    with fault code SA_AIS_ERR_TRY_AGAIN:
>>>>    - CPD_EVT_ND2D_CKPT_CREATE
>>>>    - CPD_EVT_ND2D_CKPT_UNLINK
>>>>    - CPD_EVT_ND2D_ACTIVE_SET
>>>>    - CPD_EVT_ND2D_CKPT_RDSET
>>>>
>>>>
>>>> Complete diffstat:
>>>> ------------------
>>>> osaf/libs/agents/saf/cpa/cpa_proc.c       |  52 +++++++++++++++++++++++++++++++++++
>>>> osaf/libs/common/cpsv/cpsv_edu.c          |  43 +++++++++++++++++++++++++++++
>>>> osaf/libs/common/cpsv/include/cpd_cb.h    |   3 ++
>>>> osaf/libs/common/cpsv/include/cpd_imm.h   |   1 +
>>>> osaf/libs/common/cpsv/include/cpd_proc.h  |   7 ++++
>>>> osaf/libs/common/cpsv/include/cpd_tmr.h   |   3 +-
>>>> osaf/libs/common/cpsv/include/cpnd_cb.h   |   1 +
>>>> osaf/libs/common/cpsv/include/cpnd_init.h |   2 +
>>>> osaf/libs/common/cpsv/include/cpsv_evt.h  |  20 +++++++++++++
>>>> osaf/services/saf/cpsv/cpd/Makefile.am    |   3 +-
>>>> osaf/services/saf/cpsv/cpd/cpd_evt.c      | 229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> osaf/services/saf/cpsv/cpd/cpd_imm.c      | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> osaf/services/saf/cpsv/cpd/cpd_init.c     |  20 ++++++++++++-
>>>> osaf/services/saf/cpsv/cpd/cpd_proc.c     | 309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> osaf/services/saf/cpsv/cpd/cpd_tmr.c      |   7 ++++
>>>> osaf/services/saf/cpsv/cpnd/cpnd_db.c     |  16 ++++++++++
>>>> osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |  22 +++++++++++++++
>>>> osaf/services/saf/cpsv/cpnd/cpnd_init.c   |  23 ++++++++++++++-
>>>> osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |  13 ++++++++
>>>> osaf/services/saf/cpsv/cpnd/cpnd_proc.c   | 314 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>> 20 files changed, 1189 insertions(+), 11 deletions(-)
>>>>
>>>>
>>>> Testing Commands:
>>>> -----------------
>>>> -
>>>>
>>>> Testing, Expected Results:
>>>> --------------------------
>>>> -
>>>>
>>>>
>>>> Conditions of Submission:
>>>> -------------------------
>>>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>>>
>>>>
>>>> Arch        Built    Started    Linux distro
>>>> -------------------------------------------
>>>> mips          n         n
>>>> mips64        n         n
>>>> x86           n         n
>>>> x86_64        n         n
>>>> powerpc       n         n
>>>> powerpc64     n         n
>>>>
>>>>
>>>> Reviewer Checklist:
>>>> -------------------
>>>> [Submitters: make sure that your review doesn't trigger any
>>>> checkmarks!]
>>>>
>>>>
>>>> Your checkin has not passed review because (see checked entries):
>>>>
>>>> ___ Your RR template is generally incomplete; it has too many blank
>>>>     entries that need proper data filled in.
>>>>
>>>> ___ You have failed to nominate the proper persons for review and
>>>>     push.
>>>>
>>>> ___ Your patches do not have proper short+long header
>>>>
>>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>>
>>>> ___ You have exceeded a sensible line length in your
>>>>     headers/comments/text.
>>>>
>>>> ___ You have failed to put in a proper Trac Ticket # into your
>>>>     commits.
>>>>
>>>> ___ You have incorrectly put/left internal data in your comments/files
>>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>>
>>>> ___ You have not given any evidence of testing beyond basic build
>>>>     tests. Demonstrate some level of runtime or other sanity testing.
>>>>
>>>> ___ You have ^M present in some of your files. These have to be
>>>>     removed.
>>>>
>>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>>     like trailing spaces, or spaces before tabs.
>>>>
>>>> ___ You have mixed real technical changes with whitespace and other
>>>>     cosmetic code cleanup changes. These have to be separate
>>>>     commits.
>>>>
>>>> ___ You need to refactor your submission into logical chunks; there is
>>>>     too much content into a single commit.
>>>>
>>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>>
>>>> ___ You have giant attachments which should never have been sent;
>>>>     Instead you should place your content in a public tree to be
>>>>     pulled.
>>>>
>>>> ___ You have too many commits attached to an e-mail; resend as
>>>>     threaded commits, or place in a public tree for a pull.
>>>>
>>>> ___ You have resent this content multiple times without a clear
>>>>     indication of what has changed between each re-send.
>>>>
>>>> ___ You have failed to adequately and individually address all of the
>>>>     comments and change requests that were proposed in the initial
>>>>     review.
>>>>
>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email
>>>>     etc)
>>>>
>>>> ___ Your computer have a badly configured date and time; confusing the
>>>>     the threaded patch review.
>>>>
>>>> ___ Your changes affect IPC mechanism, and you don't present any
>>>>     results for in-service upgradability test.
>>>>
>>>> ___ Your changes affect user manual and documentation, your patch
>>>>     series do not contain the patch that updates the Doxygen manual.
>>>>
>>
>
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
