Hi! Please find my answer inline, marked [AndersW].
regards, Anders Widell On 02/15/2016 10:38 AM, Nhat Pham wrote: > Hi Mahesh, > > It's good. Thank you. :) > > [AVM] Up on rejoining of the SC`s The replica should be re-created regardless > of another application opens it on PL4. > ( Note : this comment is based on your explanation have not yet > reviewed/tested , > currently i am struggling with SC`s not rejoining > after headless state , i can provide you more on this once i complte my > review/testing) > > [Nhat] To make cloud resilience works, you need the patches from other > services (log, amf, clm, ntf). > @Minh: I heard that you created tar file which includes all patches. Could you > please send it to Mahesh? Thanks > > [AVM] I understand that , before I comment more on this please allow me to > understand > I am not still not very clear of the headless design in detail. > For example cluster membership of PL`s during headless state , > In the absence of SC`s (CLMD) dose the PLs is considered as > cluster nodes or not (cluster membership) ? > > [Nhat] I don't know much about this. > @ Anders: Could you please have comment about this? Thanks [AndersW] First of all, keep in mind that the "headless" state should ideally not last a very long time. Once we have the spare SC feature in place (ticket [#79]), a new SC should become active within a matter of a few seconds after we have lost both the active and the standby SC. I think you should view the state of the cluster in the headless state in the same way as you view the state of the cluster during a failover between the active and the standby SC. Imagine that the active SC dies. It takes the standby SC 1.5 seconds to detect the failure of the active SC (this is due to the TIPC timeout). If you have configured the PROMOTE_ACTIVE_TIMER, there is an additional delay before the standby takes over as active. What is the state of the cluster during the time after the active SC failed and before the standby takes over? The state of the cluster while it is headless is very similar. The difference is that this state may last a little bit longer (though not more than a few seconds, until one of the spare SCs becomes active). Another difference is that we may have lost some state. With a "perfect" implementation of the headless feature we should not lose any state at all, but with the current set of patches we do lose state. So specifically if we talk about cluster membership and ask the question: is a particular PL a member of the cluster or not during the headless state? Well, if you ask CLM about this during the headless state, then you will not know - because CLM doesn't provide any service during the headless state. If you keep retrying you query to CLM, you will eventually get an answer - but you will not get this answer until there is an active SC again and we have exited the headless state. When viewed in this way, the answer to the question about a node's membership is undefined during the headless state, since CLM will not provide you with any answer until there is an active SC. However, if you asked CLM about the node's cluster membership status before the cluster went headless, you probably saved a cached copy of the cluster membership state. Maybe you also installed a CLM track callback and intend to update this cached copy every time the cluster membership status changes. The question then is: can you continue using this cached copy of the cluster membership state during the headless state? The answer is YES: since CLM doesn't provide any service during the headless state, it also means that the cluster membership view cannot change during this time. Nodes can of course reboot or die, but CLM will not notice and hence the cluster view will not be updated. You can argue that this is bad because the cluster view doesn't reflect reality, but notice that this will always be the case. We can never propagate information instantaneously, and detection of node failures will take 1.5 seconds due to the TIPC timeout. You can never be sure that a node is alive at this very moment just because CLM tells you that it is a member of the cluster. If we are unfortunate enough to lose both system controller nodes simultaneously, updates to the cluster membership view will be delayed a few seconds longer than usual. > > Best regards, > Nhat Pham > > -----Original Message----- > From: A V Mahesh [mailto:[email protected]] > Sent: Monday, February 15, 2016 11:19 AM > To: Nhat Pham <[email protected]>; [email protected] > Cc: [email protected]; 'Beatriz Brandao' > <[email protected]> > Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support preserving and > recovering checkpoint replicas during headless state V2 [#1621] > > Hi Nhat Pham, > > How is your holiday went > > Please find my comments below > > On 2/15/2016 8:43 AM, Nhat Pham wrote: >> Hi Mahesh, >> >> For the comment 1, the patch will be updated accordingly. > [AVM] Please hold , I will provide more comments in this week , so we can > have consolidated V3 >> For the comment 2, I think the CKPT service will not be backward >> compatible if the scAbsenceAllowed is true. >> The client can't create non-collocated checkpoint on SCs. >> >> Furthermore, this solution only protects the CKPT service from the >> case "The non-collocated checkpoint is created on a SC" >> there are still the cases where the replicas are completely lost. Ex: >> >> - The non-collocated checkpoint created on a PL. The PL reboots. Both >> replicas now locate on SCs. Then, headless state happens. All replicas are >> lost. >> - The non-collocated checkpoint has active replica locating on a PL >> and this PL restarts during headless state >> - The non-collocated checkpoint is created on PL3. This checkpoint is >> also opened on PL4. Then SCs and PL3 reboot. > [AVM] Up on rejoining of the SC`s The replica should be re-created regardless > of another application opens it on PL4. > ( Note : this comment is based on your explanation have not yet > reviewed/tested , > currently i am struggling with SC`s not rejoining > after headless state , i can provide you more on this once i complte my > review/testing) >> In this case, all replicas are lost and the client has to create it again. >> >> In case multiple nodes (which including SCs) reboot, losing replicas >> is unpreventable. The patch is to recover the checkpoints in possible cases. >> How do you think? > [AVM] I understand that , before I comment more on this please allow > me to understand > I am not still not very clear of the headless design in detail. > > For example cluster membership of PL`s during headless > state , > In the absence of SC`s (CLMD) dose the PLs is considered as > cluster nodes or not (cluster membership) ? > > - if not consider as NON cluster nodes Checkpoint > Service > API should leverage the SA Forum Cluster > Membership Service and API's can fail with > SA_AIS_ERR_UNAVAILABLE > > - if considers as cluster nodes we need to follow all > the > defined rules which are defined in SAI-AIS-CKPT-B.02.02 specification > > so give me some more time to review it completely , so that we > can have consolidated patch V3 > > -AVM >> Best regards, >> Nhat Pham >> >> -----Original Message----- >> From: A V Mahesh [mailto:[email protected]] >> Sent: Friday, February 12, 2016 11:10 AM >> To: Nhat Pham <[email protected]>; [email protected] >> Cc: [email protected]; Beatriz Brandao >> <[email protected]> >> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support >> preserving and recovering checkpoint replicas during headless state V2 >> [#1621] >> >> >> Comment 2 : >> >> After incorporating the comment one all the Limitations should be >> prevented based on Hydra configuration is enabled in IMM status. >> >> Foe example : if some application is trying to create >> >> non-collocated checkpoint active replica getting generated/locating on >> SC then ,regardless of the heads (SC`s) status exist not exist should >> return SA_AIS_ERR_NOT_SUPPORTED >> >> In other words, rather that allowing to created non-collocated >> checkpoint when >> heads(SC`s) are exit , and non-collocated checkpoint getting >> unrecoverable after heads(SC`s) rejoins. >> >> ====================================================================== >> ======================= >>> Limitation: The CKPT service doesn't support recovering checkpoints in >>> following cases: >>> . The checkpoint which is unlinked before headless. >>> . The non-collocated checkpoint has active replica locating on SC. >>> . The non-collocated checkpoint has active replica locating on a PL >>> and this PL >>> restarts during headless state. In this cases, the checkpoint replica is >>> destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned when the >>> client >>> accesses the checkpoint in these cases. The client must re-open the >>> checkpoint. >> ====================================================================== >> ======================= >> >> -AVM >> >> >> On 2/11/2016 12:52 PM, A V Mahesh wrote: >>> Hi, >>> >>> I jut starred reviewing patch , I will be giving comments as soon as >>> I crossover any , to save some time. >>> >>> Comment 1 : >>> This functionality should be under checks if Hydra configuration is >>> enabled in IMM attrName = >>> const_cast<SaImmAttrNameT>("scAbsenceAllowed") >>> >>> Please see example how LOG/AMF services implemented it. >>> >>> -AVM >>> >>> >>> On 1/29/2016 1:02 PM, Nhat Pham wrote: >>>> Hi Mahesh, >>>> >>>> As described in the README, the CKPT service returns >>>> SA_AIS_ERR_TRY_AGAIN fault code in this case. >>>> I guess it's same for other services. >>>> >>>> @Anders: Could you please confirm this? >>>> >>>> Best regards, >>>> Nhat Pham >>>> >>>> -----Original Message----- >>>> From: A V Mahesh [mailto:[email protected]] >>>> Sent: Friday, January 29, 2016 2:11 PM >>>> To: Nhat Pham <[email protected]>; [email protected] >>>> Cc: [email protected] >>>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support >>>> preserving and recovering checkpoint replicas during headless state >>>> V2 [#1621] >>>> >>>> Hi, >>>> >>>> On 1/29/2016 11:45 AM, Nhat Pham wrote: >>>>> - The behavior of application will be consistent with other >>>>> saf services like imm/amf behavior during headless state. >>>>> [Nhat] I'm not clear what you mean about "consistent"? >>>> In the obscene of Director (SC's) , what is expected return values >>>> of SAF API should ( all services ) , >>>> which are not in aposition to provide service at that moment. >>>> >>>> I think all services should return same SAF ERRS., I thinks >>>> currently we don't have it , may be Anders Widel will help us. >>>> >>>> -AVM >>>> >>>> >>>> On 1/29/2016 11:45 AM, Nhat Pham wrote: >>>>> Hi Mahesh, >>>>> >>>>> Please see the attachment for the README. Let me know if there is >>>>> any more information required. >>>>> >>>>> Regarding your comments: >>>>> - during headless state applications may behave like during >>>>> CPND restart case [Nhat] Headless state and CPND restart are >>>>> different events. Thus, the behavior is different. >>>>> Headless state is a case where both SCs go down. >>>>> >>>>> - The behavior of application will be consistent with other >>>>> saf services like imm/amf behavior during headless state. >>>>> [Nhat] I'm not clear what you mean about "consistent"? >>>>> >>>>> Best regards, >>>>> Nhat Pham >>>>> >>>>> -----Original Message----- >>>>> From: A V Mahesh [mailto:[email protected]] >>>>> Sent: Friday, January 29, 2016 11:12 AM >>>>> To: Nhat Pham <[email protected]>; >>>>> [email protected] >>>>> Cc: [email protected] >>>>> Subject: Re: [PATCH 0 of 1] Review Request for cpsv: Support >>>>> preserving and recovering checkpoint replicas during headless state >>>>> V2 [#1621] >>>>> >>>>> Hi Nhat Pham, >>>>> >>>>> I stared reviewing this patch , so can please provide README file >>>>> with scope and limitations , that will help to define >>>>> testing/reviewing scope . >>>>> >>>>> Following are minimum things we can keep in mind while >>>>> reviewing/accepting patch , >>>>> >>>>> - Not effecting existing functionality >>>>> - during headless state applications may behave like during >>>>> CPND restart case >>>>> - The minimum functionally of application works >>>>> - The behavior of application will be consistent with >>>>> other saf services like imm/amf behavior during headless state. >>>>> >>>>> So please do provide any additional detailed in README if any of >>>>> the above is deviated , that allow users to know about the >>>>> limitations/deviation. >>>>> >>>>> -AVM >>>>> >>>>> On 1/4/2016 3:15 PM, Nhat Pham wrote: >>>>>> Summary: cpsv: Support preserving and recovering checkpoint >>>>>> replicas during headless state [#1621] Review request for Trac >>>>>> Ticket(s): >>>>>> #1621 Peer Reviewer(s): [email protected]; >>>>>> [email protected] Pull request to: >>>>>> [email protected] Affected branch(es): default Development >>>>>> branch: default >>>>>> >>>>>> -------------------------------- >>>>>> Impacted area Impact y/n >>>>>> -------------------------------- >>>>>> Docs n >>>>>> Build system n >>>>>> RPM/packaging n >>>>>> Configuration files n >>>>>> Startup scripts n >>>>>> SAF services y >>>>>> OpenSAF services n >>>>>> Core libraries n >>>>>> Samples n >>>>>> Tests n >>>>>> Other n >>>>>> >>>>>> >>>>>> Comments (indicate scope for each "y" above): >>>>>> --------------------------------------------- >>>>>> >>>>>> changeset faec4a4445a4c23e8f630857b19aabb43b5af18d >>>>>> Author: Nhat Pham <[email protected]> >>>>>> Date: Mon, 04 Jan 2016 16:34:33 +0700 >>>>>> >>>>>> cpsv: Support preserving and recovering checkpoint replicas >>>>>> during headless state [#1621] >>>>>> >>>>>> Background: >>>>>> ---------- This enhancement supports to preserve checkpoint >>>>>> replicas >>>>> in case >>>>>> both SCs down (headless state) and recover replicas in case >>>>>> one of >>>>> SCs up >>>>>> again. If both SCs goes down, checkpoint replicas on >>>>>> surviving nodes >>>>> still >>>>>> remain. When a SC is available again, surviving replicas are >>>>> automatically >>>>>> registered to the SC checkpoint database. Content in >>>>>> surviving >>>>> replicas are >>>>>> intacted and synchronized to new replicas. >>>>>> >>>>>> When no SC is available, client API calls changing checkpoint >>>>> configuration >>>>>> which requires SC communication, are rejected. Client API >>>>>> calls >>>>> reading and >>>>>> writing existing checkpoint replicas still work. >>>>>> >>>>>> Limitation: The CKPT service does not support recovering >>>>>> checkpoints >>>>> in >>>>>> following cases: >>>>>> - The checkpoint which is unlinked before headless. >>>>>> - The non-collocated checkpoint has active replica locating >>>>>> on SC. >>>>>> - The non-collocated checkpoint has active replica locating >>>>>> on a PL >>>>> and this >>>>>> PL restarts during headless state. In this cases, the >>>>>> checkpoint >>>>> replica is >>>>>> destroyed. The fault code SA_AIS_ERR_BAD_HANDLE is returned >>>>>> when the >>>>> client >>>>>> accesses the checkpoint in these cases. The client must >>>>>> re-open the >>>>>> checkpoint. >>>>>> >>>>>> While in headless state, accessing checkpoint replicas does >>>>>> not work >>>>> if the >>>>>> node which hosts the active replica goes down. It will back >>>>>> working >>>>> when a >>>>>> SC available again. >>>>>> >>>>>> Solution: >>>>>> --------- The solution for this enhancement includes 2 parts: >>>>>> >>>>>> 1. To destroy un-recoverable checkpoint described above when >>>>>> both >>>>> SCs are >>>>>> down: When both SCs are down, the CPND deletes un-recoverable >>>>> checkpoint >>>>>> nodes and replicas on PLs. Then it requests CPA to destroy >>>>> corresponding >>>>>> checkpoint node by using new message >>>>>> CPA_EVT_ND2A_CKPT_DESTROY >>>>>> >>>>>> 2. To update CPD with checkpoint information When an active >>>>>> SC is up >>>>> after >>>>>> headless, CPND will update CPD with checkpoint information by >>>>>> using >>>>> new >>>>>> message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of using >>>>>> CPD_EVT_ND2D_CKPT_CREATE. This is because the CPND will >>>>>> create new >>>>> ckpt_id >>>>>> for the checkpoint which might be different with the current >>>>>> ckpt id >>>>> if the >>>>>> CPD_EVT_ND2D_CKPT_CREATE is used. The CPD collects checkpoint >>>>> information >>>>>> within 6s. During this updating time, following requests is >>>>>> rejected >>>>> with >>>>>> fault code SA_AIS_ERR_TRY_AGAIN: >>>>>> - CPD_EVT_ND2D_CKPT_CREATE >>>>>> - CPD_EVT_ND2D_CKPT_UNLINK >>>>>> - CPD_EVT_ND2D_ACTIVE_SET >>>>>> - CPD_EVT_ND2D_CKPT_RDSET >>>>>> >>>>>> >>>>>> Complete diffstat: >>>>>> ------------------ >>>>>> osaf/libs/agents/saf/cpa/cpa_proc.c | 52 >>>>> +++++++++++++++++++++++++++++++++++ >>>>>> osaf/libs/common/cpsv/cpsv_edu.c | 43 >>>>> +++++++++++++++++++++++++++++ >>>>>> osaf/libs/common/cpsv/include/cpd_cb.h | 3 ++ >>>>>> osaf/libs/common/cpsv/include/cpd_imm.h | 1 + >>>>>> osaf/libs/common/cpsv/include/cpd_proc.h | 7 ++++ >>>>>> osaf/libs/common/cpsv/include/cpd_tmr.h | 3 +- >>>>>> osaf/libs/common/cpsv/include/cpnd_cb.h | 1 + >>>>>> osaf/libs/common/cpsv/include/cpnd_init.h | 2 + >>>>>> osaf/libs/common/cpsv/include/cpsv_evt.h | 20 +++++++++++++ >>>>>> osaf/services/saf/cpsv/cpd/Makefile.am | 3 +- >>>>>> osaf/services/saf/cpsv/cpd/cpd_evt.c | 229 >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> ++++ >>>>>> osaf/services/saf/cpsv/cpd/cpd_imm.c | 112 >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>>> osaf/services/saf/cpsv/cpd/cpd_init.c | 20 ++++++++++++- >>>>>> osaf/services/saf/cpsv/cpd/cpd_proc.c | 309 >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> osaf/services/saf/cpsv/cpd/cpd_tmr.c | 7 ++++ >>>>>> osaf/services/saf/cpsv/cpnd/cpnd_db.c | 16 ++++++++++ >>>>>> osaf/services/saf/cpsv/cpnd/cpnd_evt.c | 22 +++++++++++++++ >>>>>> osaf/services/saf/cpsv/cpnd/cpnd_init.c | 23 ++++++++++++++- >>>>>> osaf/services/saf/cpsv/cpnd/cpnd_mds.c | 13 ++++++++ >>>>>> osaf/services/saf/cpsv/cpnd/cpnd_proc.c | 314 >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- >>>>>> 20 files changed, 1189 insertions(+), 11 deletions(-) >>>>>> >>>>>> >>>>>> Testing Commands: >>>>>> ----------------- >>>>>> - >>>>>> >>>>>> Testing, Expected Results: >>>>>> -------------------------- >>>>>> - >>>>>> >>>>>> >>>>>> Conditions of Submission: >>>>>> ------------------------- >>>>>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>> >>>>>> >>>>>> >>>>>> Arch Built Started Linux distro >>>>>> ------------------------------------------- >>>>>> mips n n >>>>>> mips64 n n >>>>>> x86 n n >>>>>> x86_64 n n >>>>>> powerpc n n >>>>>> powerpc64 n n >>>>>> >>>>>> >>>>>> Reviewer Checklist: >>>>>> ------------------- >>>>>> [Submitters: make sure that your review doesn't trigger any >>>>>> checkmarks!] >>>>>> >>>>>> >>>>>> Your checkin has not passed review because (see checked entries): >>>>>> >>>>>> ___ Your RR template is generally incomplete; it has too many >>>>>> blank >>>>> entries >>>>>> that need proper data filled in. >>>>>> >>>>>> ___ You have failed to nominate the proper persons for review and >>>>>> push. >>>>>> >>>>>> ___ Your patches do not have proper short+long header >>>>>> >>>>>> ___ You have grammar/spelling in your header that is unacceptable. >>>>>> >>>>>> ___ You have exceeded a sensible line length in your >>>>> headers/comments/text. >>>>>> ___ You have failed to put in a proper Trac Ticket # into your >>>>>> commits. >>>>>> >>>>>> ___ You have incorrectly put/left internal data in your comments/files >>>>>> (i.e. internal bug tracking tool IDs, product names etc) >>>>>> >>>>>> ___ You have not given any evidence of testing beyond basic build >>>>>> tests. >>>>>> Demonstrate some level of runtime or other sanity testing. >>>>>> >>>>>> ___ You have ^M present in some of your files. These have to be >>>>>> removed. >>>>>> >>>>>> ___ You have needlessly changed whitespace or added whitespace crimes >>>>>> like trailing spaces, or spaces before tabs. >>>>>> >>>>>> ___ You have mixed real technical changes with whitespace and other >>>>>> cosmetic code cleanup changes. These have to be separate >>>>>> commits. >>>>>> >>>>>> ___ You need to refactor your submission into logical chunks; there is >>>>>> too much content into a single commit. >>>>>> >>>>>> ___ You have extraneous garbage in your review (merge commits etc) >>>>>> >>>>>> ___ You have giant attachments which should never have been sent; >>>>>> Instead you should place your content in a public tree to >>>>>> be pulled. >>>>>> >>>>>> ___ You have too many commits attached to an e-mail; resend as >>>>>> threaded >>>>>> commits, or place in a public tree for a pull. >>>>>> >>>>>> ___ You have resent this content multiple times without a clear >>>>>> indication >>>>>> of what has changed between each re-send. >>>>>> >>>>>> ___ You have failed to adequately and individually address all of the >>>>>> comments and change requests that were proposed in the >>>>>> initial >>>>> review. >>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email >>>>>> etc) >>>>>> >>>>>> ___ Your computer have a badly configured date and time; confusing the >>>>>> the threaded patch review. >>>>>> >>>>>> ___ Your changes affect IPC mechanism, and you don't present any >>>>>> results >>>>>> for in-service upgradability test. >>>>>> >>>>>> ___ Your changes affect user manual and documentation, your patch >>>>>> series >>>>>> do not contain the patch that updates the Doxygen manual. >>>>>> > ------------------------------------------------------------------------------ Site24x7 APM Insight: Get Deep Visibility into Application Performance APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month Monitor end-to-end web transactions and take corrective actions now Troubleshoot faster and improve end-user experience. Signup Now! http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140 _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
