I will bring up this issue on our next TLC meeting. The immediate concern here is the prolonged startup time, which can be regarded as a performance regression in AMF. The long-term concern is the alignment of the solution with future the direction of OpenSAF (removal of FM has been discussed).
regards, Anders Widell On 09/11/2015 11:50 AM, Nagendra Kumar wrote: > Hi Anders, > Since you are talking about making FMS redundant and its > removal and moving from NID to system and since this problem is existing in > other pre Amf components(Amfnd actually), so I am being indecisive to > response. May be you can talk in TLC and conclude and raise a ticket and > prepare a roadmap for doing so. As a maintainer, respective people will fix > this issue in their services. > > Thanks > -Nagu >> -----Original Message----- >> From: Anders Widell [mailto:[email protected]] >> Sent: 10 September 2015 20:53 >> To: Nagendra Kumar; [email protected]; Praveen Malviya; >> Mathivanan Naickan Palanivelu >> Cc: [email protected] >> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid >> only after initialization is completed [#1334] >> >> Hi >> >> See my replies marked with [AndersW] inline. >> >> thanks, >> Anders Widell >> >> On 09/10/2015 01:27 PM, Nagendra Kumar wrote: >>> Hi Anders, >>> Thanks for pointing out. Please find my responses inlined with >> [Nagu]. >>> Thanks >>> -Nagu >>>> -----Original Message----- >>>> From: Anders Widell [mailto:[email protected]] >>>> Sent: 08 September 2015 15:44 >>>> To: Nagendra Kumar; [email protected]; Praveen Malviya; >>>> Mathivanan Naickan Palanivelu >>>> Cc: [email protected] >>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to >> nid >>>> only after initialization is completed [#1334] >>>> >>>> Hi! >>>> >>>> I have a couple of questing regarding this ticket: >>>> >>>> 1) Why did you chose a solution where you delay the reply to NID until >>>> amfd has completed its cold sync, instead of a solution where amfd would >>>> itself initiate a node reboot if a failover happens before it has >>>> completed the cold sync? The chosen solution will delay the startup time >>>> of the standby controller, so there is a clear (and visible, since we >>>> are measuring startup time) disadvantage with this solution. Is there >>>> any advantage with it, or disadvantage with the other solution? >>> [Nagu]: If you could read my comment in #1334 ticket, it will point out a >> flaw >>> in handling failover by FMS in case of cold sync in progress(Amfd). >>> FMS collects evidence of Cold sync complete by checking whether FMS has >>> got CSI assignment or not. Since Amfd has acknowledged NID that he has >> completed >>> its initialization(without Cold sync completed) and hence NID started Amfnd >>> and Amfnd took configuration from Act Amfd and assigned CSI to FMS. Till >> here Amfd is >>> not synced. So, when failover is triggered FMS couldn't take any decision. >>> So, #1334 was reported. >> [AndersW] I agree that the issue reported in ticket [#1334] is a bug. >> The observed problem is that the standby SC does not go for a reboot >> immediately in the case when an SC failover happens before the cold sync >> is complete. It should reboot immediately. What I am questioning is the >> solution to this problem. The implemented solution delays the ack from >> amfd to NID until the cold sync is complete, so that the logic in FM >> will work as intended and reboot the node. My proposal is to instead >> solve this problem by initiating the reboot directly from AMFD. >>> Now the question was >>> 1. whether Amfd can ack NID that it is initialized and ready for >>> Assignment/failover without syncing. >> [AndersW] The question here is what an ack to NID actually means (or >> should mean). Ideally, I think we should aim to move away from using NID >> for serializing the OpenSAF startup and instead solve dependencies >> between OpenSAF services using TRY_AGAIN loops or similar constructs. >> I.e. the same philosophy as used in systemd. By delaying the reply to >> NID we are moving in the opposite direction. >>> 2. Whether FMS logic can be changed. >> [AndersW] My proposal wasn't really to change the logic in FM, but >> rather to order the reboot from the OpenSAF director (amfd in this case) >> when a failover happens before the cold sync is complete. The logic in >> FM doesn't have to be changed, but if all directors have been fixed in >> this way then the logic in FM would be redundant and could be removed. >>> I thought Amfd need to ack NID when it completes sync. >>> >>> Whether 1. it increased Startuptime of Standby controller or >>> 2. it actually corrected the errata of measurement, >>> is subject to interpretation. >> [AndersW] I think the startup time is delayed by the current solution to >> this ticket. Look at it from the perspective of an application component >> which is started by AMF on the standby SC, and measure how long time it >> takes until this component gets an assignment after a reboot. >>> In my opinion, #2 is correct. If Amfd is not ready, it should not have sent >>> the >> Ack. >>> So, it was a bug. >>> >>>> 2) Isn't this problem applicable to all OpenSAF directors, and not just >>>> AMFD? Has anyone analyzed the other services to determine if they also >>>> need a similar fix? Since NID starts the services in serial and doesn't >>>> continue with the next service before it has received a reply from the >>>> previous one, it would mean that with the chosen solution all OpenSAF >>>> directors will be cold synced in serial. With the other solution, they >>>> can be cold synced in parallel. >>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs correction >> and hence startup time >>> will increase more. >> [AndersW] Ok, thanks for checking! So then we will eventually have to >> fix them as well. But again I would like to point out that if we go for >> the solution that has been implemented in AMFD then the OpenSAF cold >> sync will be serialized. With the solution that I propose, the cold sync >> can be done in parallel. By syncing the OpenSAF directors in parallel we >> get a shorter time window when the system is without a fully synced >> standby SC, and hence improve the availability of the system. >>>> / Anders Widell >>>> >>>> On 04/27/2015 11:19 AM, [email protected] wrote: >>>>> Summary: amfd: respond to nid only after initialization is completed >> [#1334] >>>>> Review request for Trac Ticket(s): #1334 >>>>> Peer Reviewer(s): Mathi, Hans N, Praveen >>>>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>> >>>>> Affected branch(es): All >>>>> Development branch: Default >>>>> >>>>> -------------------------------- >>>>> Impacted area Impact y/n >>>>> -------------------------------- >>>>> Docs n >>>>> Build system n >>>>> RPM/packaging n >>>>> Configuration files n >>>>> Startup scripts n >>>>> SAF services y >>>>> OpenSAF services n >>>>> Core libraries n >>>>> Samples n >>>>> Tests n >>>>> Other n >>>>> >>>>> >>>>> Comments (indicate scope for each "y" above): >>>>> --------------------------------------------- >>>>> <<EXPLAIN/COMMENT THE PATCH SERIES HERE>> >>>>> >>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda >>>>> Author: Nagendra Kumar<[email protected]> >>>>> Date: Mon, 27 Apr 2015 14:41:02 +0530 >>>>> >>>>> amfd: respond to nid only after initialization is completed [#1334] Act >>>> Amfd >>>>> initialization is said to be completed when it completes its >>>> initialization >>>>> with imm. Apart from initializing with imm, Standby Amfd also need >>>> to get >>>>> run time data from Act Amfd using cold sync. So, Standby Amfd >>>> initialization >>>>> is said to be completed when it completes its initialization with imm >>>> and it >>>>> completes its cold sync with Act Amfd. In the present code, Standby is >>>>> sending response to nid without cold sync complete. So, code has >>>> been added >>>>> to send nid response only when Amfd completes its initialization. >>>>> >>>>> >>>>> Complete diffstat: >>>>> ------------------ >>>>> osaf/services/saf/amf/amfd/chkop.cc | 4 ++++ >>>>> osaf/services/saf/amf/amfd/main.cc | 6 +++++- >>>>> 2 files changed, 9 insertions(+), 1 deletions(-) >>>>> >>>>> >>>>> Testing Commands: >>>>> ----------------- >>>>> 1. Start Act controller SC-1. >>>>> Start Standby controller SC-2 and stop SC-1 when >>>>> Standby Amfd is in the middle of cold sync. >>>>> 2. Start Act controller SC-1. >>>>> Start Standby controller SC-2 and stop SC-1 when >>>>> Standby Amfd has completed cold sync but Amfnd has not asigned >>>>> role to Fmd at Standby controller. >>>>> >>>>> >>>>> Testing, Expected Results: >>>>> -------------------------- >>>>> 1. Fmd reboots the node : >>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF NodeId = 0 >> EE >>>> Name = No EE Mapped, Reason: Failover occurred, but this node is not yet >>>> ready, OwnNodeId = 131599, SupervisionTime = 60 >>>>> 2. Same as above. >>>>> >>>>> Conditions of Submission: >>>>> ------------------------- >>>>> Ack from peer reviewers. >>>>> >>>>> Arch Built Started Linux distro >>>>> ------------------------------------------- >>>>> mips n n >>>>> mips64 n n >>>>> x86 n n >>>>> x86_64 y y >>>>> powerpc n n >>>>> powerpc64 n n >>>>> >>>>> >>>>> Reviewer Checklist: >>>>> ------------------- >>>>> [Submitters: make sure that your review doesn't trigger any checkmarks!] >>>>> >>>>> >>>>> Your checkin has not passed review because (see checked entries): >>>>> >>>>> ___ Your RR template is generally incomplete; it has too many blank >> entries >>>>> that need proper data filled in. >>>>> >>>>> ___ You have failed to nominate the proper persons for review and push. >>>>> >>>>> ___ Your patches do not have proper short+long header >>>>> >>>>> ___ You have grammar/spelling in your header that is unacceptable. >>>>> >>>>> ___ You have exceeded a sensible line length in your >>>> headers/comments/text. >>>>> ___ You have failed to put in a proper Trac Ticket # into your commits. >>>>> >>>>> ___ You have incorrectly put/left internal data in your comments/files >>>>> (i.e. internal bug tracking tool IDs, product names etc) >>>>> >>>>> ___ You have not given any evidence of testing beyond basic build tests. >>>>> Demonstrate some level of runtime or other sanity testing. >>>>> >>>>> ___ You have ^M present in some of your files. These have to be >> removed. >>>>> ___ You have needlessly changed whitespace or added whitespace crimes >>>>> like trailing spaces, or spaces before tabs. >>>>> >>>>> ___ You have mixed real technical changes with whitespace and other >>>>> cosmetic code cleanup changes. These have to be separate commits. >>>>> >>>>> ___ You need to refactor your submission into logical chunks; there is >>>>> too much content into a single commit. >>>>> >>>>> ___ You have extraneous garbage in your review (merge commits etc) >>>>> >>>>> ___ You have giant attachments which should never have been sent; >>>>> Instead you should place your content in a public tree to be >>>>> pulled. >>>>> >>>>> ___ You have too many commits attached to an e-mail; resend as >> threaded >>>>> commits, or place in a public tree for a pull. >>>>> >>>>> ___ You have resent this content multiple times without a clear >> indication >>>>> of what has changed between each re-send. >>>>> >>>>> ___ You have failed to adequately and individually address all of the >>>>> comments and change requests that were proposed in the initial >> review. >>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc) >>>>> >>>>> ___ Your computer have a badly configured date and time; confusing the >>>>> the threaded patch review. >>>>> >>>>> ___ Your changes affect IPC mechanism, and you don't present any results >>>>> for in-service upgradability test. >>>>> >>>>> ___ Your changes affect user manual and documentation, your patch >> series >>>>> do not contain the patch that updates the Doxygen manual. >>>>> >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> One dashboard for servers and applications across Physical-Virtual-Cloud >>>>> Widest out-of-the-box monitoring support with 50+ applications >>>>> Performance metrics, stats and reports that give you Actionable Insights >>>>> Deep dive visibility with transaction tracing using APM Insight. >>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>>> _______________________________________________ >>>>> Opensaf-devel mailing list >>>>> [email protected] >>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel >>>>> >>>>> ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
