See inline. / Anders Widell
On 09/11/2015 12:45 PM, Nagendra Kumar wrote: > Hi Anders, > Please check the comment on #1334 : "Amf can provide a reboot > in this case if fms don't care to handle it.". [AndersW] Yes, I saw this comment and that's why I was wondering why you ended up with a different solution. > But as per discussion with FMS maintainer, Mathi, we have taken this decision > to allow handing FMS on system level faults and Amf should mind its business > with internal affairs. So, the solution was proposed in the ticket #1334 on > 2015-04-27. > > Let's hear Mathi's opinion as well. > > Just a question: is there any reason which enforces inclusion of Standby's > startup time in Cluster startup time, when failover can't work? [AndersW] There can be applications running on the system controllers, and if you by "cluster startup time" mean the time until applications are up and running on all nodes in the cluster, then this time has increased with the current solution to ticket [#1334]. Also, my concern is that if we go for the same solution in the other OpenSAF directors where we have the same fault, then we will serialize the cold sync of OpenSAF directors and hence increase the repair time of the OpenSAF system controllers. > > Thanks > -Nagu >> -----Original Message----- >> From: Anders Widell [mailto:[email protected]] >> Sent: 11 September 2015 15:50 >> To: Nagendra Kumar; [email protected]; Praveen Malviya; >> Mathivanan Naickan Palanivelu >> Cc: [email protected] >> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid >> only after initialization is completed [#1334] >> >> I will bring up this issue on our next TLC meeting. The immediate concern >> here is the prolonged startup time, which can be regarded as a performance >> regression in AMF. The long-term concern is the alignment of the solution >> with future the direction of OpenSAF (removal of FM has been discussed). >> >> regards, >> Anders Widell >> >> On 09/11/2015 11:50 AM, Nagendra Kumar wrote: >>> Hi Anders, >>> Since you are talking about making FMS redundant and its >> removal and moving from NID to system and since this problem is existing in >> other pre Amf components(Amfnd actually), so I am being indecisive to >> response. May be you can talk in TLC and conclude and raise a ticket and >> prepare a roadmap for doing so. As a maintainer, respective people will fix >> this issue in their services. >>> Thanks >>> -Nagu >>>> -----Original Message----- >>>> From: Anders Widell [mailto:[email protected]] >>>> Sent: 10 September 2015 20:53 >>>> To: Nagendra Kumar; [email protected]; Praveen Malviya; >>>> Mathivanan Naickan Palanivelu >>>> Cc: [email protected] >>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond >>>> to nid only after initialization is completed [#1334] >>>> >>>> Hi >>>> >>>> See my replies marked with [AndersW] inline. >>>> >>>> thanks, >>>> Anders Widell >>>> >>>> On 09/10/2015 01:27 PM, Nagendra Kumar wrote: >>>>> Hi Anders, >>>>> Thanks for pointing out. Please find my responses inlined with >>>> [Nagu]. >>>>> Thanks >>>>> -Nagu >>>>>> -----Original Message----- >>>>>> From: Anders Widell [mailto:[email protected]] >>>>>> Sent: 08 September 2015 15:44 >>>>>> To: Nagendra Kumar; [email protected]; Praveen Malviya; >>>>>> Mathivanan Naickan Palanivelu >>>>>> Cc: [email protected] >>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: >>>>>> respond to >>>> nid >>>>>> only after initialization is completed [#1334] >>>>>> >>>>>> Hi! >>>>>> >>>>>> I have a couple of questing regarding this ticket: >>>>>> >>>>>> 1) Why did you chose a solution where you delay the reply to NID >>>>>> until amfd has completed its cold sync, instead of a solution where >>>>>> amfd would itself initiate a node reboot if a failover happens >>>>>> before it has completed the cold sync? The chosen solution will >>>>>> delay the startup time of the standby controller, so there is a >>>>>> clear (and visible, since we are measuring startup time) >>>>>> disadvantage with this solution. Is there any advantage with it, or >> disadvantage with the other solution? >>>>> [Nagu]: If you could read my comment in #1334 ticket, it will point >>>>> out a >>>> flaw >>>>> in handling failover by FMS in case of cold sync in progress(Amfd). >>>>> FMS collects evidence of Cold sync complete by checking whether FMS >>>>> has got CSI assignment or not. Since Amfd has acknowledged NID that >>>>> he has >>>> completed >>>>> its initialization(without Cold sync completed) and hence NID >>>>> started Amfnd and Amfnd took configuration from Act Amfd and >>>>> assigned CSI to FMS. Till >>>> here Amfd is >>>>> not synced. So, when failover is triggered FMS couldn't take any decision. >>>>> So, #1334 was reported. >>>> [AndersW] I agree that the issue reported in ticket [#1334] is a bug. >>>> The observed problem is that the standby SC does not go for a reboot >>>> immediately in the case when an SC failover happens before the cold >>>> sync is complete. It should reboot immediately. What I am questioning >>>> is the solution to this problem. The implemented solution delays the >>>> ack from amfd to NID until the cold sync is complete, so that the >>>> logic in FM will work as intended and reboot the node. My proposal is >>>> to instead solve this problem by initiating the reboot directly from AMFD. >>>>> Now the question was >>>>> 1. whether Amfd can ack NID that it is initialized and ready for >>>>> Assignment/failover without syncing. >>>> [AndersW] The question here is what an ack to NID actually means (or >>>> should mean). Ideally, I think we should aim to move away from using >>>> NID for serializing the OpenSAF startup and instead solve >>>> dependencies between OpenSAF services using TRY_AGAIN loops or similar >> constructs. >>>> I.e. the same philosophy as used in systemd. By delaying the reply to >>>> NID we are moving in the opposite direction. >>>>> 2. Whether FMS logic can be changed. >>>> [AndersW] My proposal wasn't really to change the logic in FM, but >>>> rather to order the reboot from the OpenSAF director (amfd in this >>>> case) when a failover happens before the cold sync is complete. The >>>> logic in FM doesn't have to be changed, but if all directors have >>>> been fixed in this way then the logic in FM would be redundant and could >> be removed. >>>>> I thought Amfd need to ack NID when it completes sync. >>>>> >>>>> Whether 1. it increased Startuptime of Standby controller or 2. it >>>>> actually corrected the errata of measurement, is subject to >>>>> interpretation. >>>> [AndersW] I think the startup time is delayed by the current solution >>>> to this ticket. Look at it from the perspective of an application >>>> component which is started by AMF on the standby SC, and measure how >>>> long time it takes until this component gets an assignment after a reboot. >>>>> In my opinion, #2 is correct. If Amfd is not ready, it should not >>>>> have sent the >>>> Ack. >>>>> So, it was a bug. >>>>> >>>>>> 2) Isn't this problem applicable to all OpenSAF directors, and not >>>>>> just AMFD? Has anyone analyzed the other services to determine if >>>>>> they also need a similar fix? Since NID starts the services in >>>>>> serial and doesn't continue with the next service before it has >>>>>> received a reply from the previous one, it would mean that with the >>>>>> chosen solution all OpenSAF directors will be cold synced in >>>>>> serial. With the other solution, they can be cold synced in parallel. >>>>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs >>>>> correction >>>> and hence startup time >>>>> will increase more. >>>> [AndersW] Ok, thanks for checking! So then we will eventually have to >>>> fix them as well. But again I would like to point out that if we go >>>> for the solution that has been implemented in AMFD then the OpenSAF >>>> cold sync will be serialized. With the solution that I propose, the >>>> cold sync can be done in parallel. By syncing the OpenSAF directors >>>> in parallel we get a shorter time window when the system is without a >>>> fully synced standby SC, and hence improve the availability of the system. >>>>>> / Anders Widell >>>>>> >>>>>> On 04/27/2015 11:19 AM, [email protected] wrote: >>>>>>> Summary: amfd: respond to nid only after initialization is >>>>>>> completed >>>> [#1334] >>>>>>> Review request for Trac Ticket(s): #1334 Peer Reviewer(s): Mathi, >>>>>>> Hans N, Praveen Pull request to: <<LIST THE PERSON WITH PUSH >>>>>>> ACCESS HERE>> Affected branch(es): All Development branch: Default >>>>>>> >>>>>>> -------------------------------- >>>>>>> Impacted area Impact y/n >>>>>>> -------------------------------- >>>>>>> Docs n >>>>>>> Build system n >>>>>>> RPM/packaging n >>>>>>> Configuration files n >>>>>>> Startup scripts n >>>>>>> SAF services y >>>>>>> OpenSAF services n >>>>>>> Core libraries n >>>>>>> Samples n >>>>>>> Tests n >>>>>>> Other n >>>>>>> >>>>>>> >>>>>>> Comments (indicate scope for each "y" above): >>>>>>> --------------------------------------------- >>>>>>> <<EXPLAIN/COMMENT THE PATCH SERIES HERE>> >>>>>>> >>>>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda >>>>>>> Author: Nagendra Kumar<[email protected]> >>>>>>> Date: Mon, 27 Apr 2015 14:41:02 +0530 >>>>>>> >>>>>>> amfd: respond to nid only after initialization is completed >>>>>>> [#1334] Act >>>>>> Amfd >>>>>>> initialization is said to be completed when it completes its >>>>>> initialization >>>>>>> with imm. Apart from initializing with imm, Standby Amfd also >>>>>>> need >>>>>> to get >>>>>>> run time data from Act Amfd using cold sync. So, Standby Amfd >>>>>> initialization >>>>>>> is said to be completed when it completes its initialization >>>>>>> with >>>>>>> imm >>>>>> and it >>>>>>> completes its cold sync with Act Amfd. In the present code, >>>>>>> Standby is >>>>>>> sending response to nid without cold sync complete. So, code has >>>>>> been added >>>>>>> to send nid response only when Amfd completes its >>>>>>> initialization. >>>>>>> >>>>>>> >>>>>>> Complete diffstat: >>>>>>> ------------------ >>>>>>> osaf/services/saf/amf/amfd/chkop.cc | 4 ++++ >>>>>>> osaf/services/saf/amf/amfd/main.cc | 6 +++++- >>>>>>> 2 files changed, 9 insertions(+), 1 deletions(-) >>>>>>> >>>>>>> >>>>>>> Testing Commands: >>>>>>> ----------------- >>>>>>> 1. Start Act controller SC-1. >>>>>>> Start Standby controller SC-2 and stop SC-1 when >>>>>>> Standby Amfd is in the middle of cold sync. >>>>>>> 2. Start Act controller SC-1. >>>>>>> Start Standby controller SC-2 and stop SC-1 when >>>>>>> Standby Amfd has completed cold sync but Amfnd has not asigned >>>>>>> role to Fmd at Standby controller. >>>>>>> >>>>>>> >>>>>>> Testing, Expected Results: >>>>>>> -------------------------- >>>>>>> 1. Fmd reboots the node : >>>>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF NodeId = >>>>>>> 0 >>>> EE >>>>>> Name = No EE Mapped, Reason: Failover occurred, but this node is >>>>>> not yet ready, OwnNodeId = 131599, SupervisionTime = 60 >>>>>>> 2. Same as above. >>>>>>> >>>>>>> Conditions of Submission: >>>>>>> ------------------------- >>>>>>> Ack from peer reviewers. >>>>>>> >>>>>>> Arch Built Started Linux distro >>>>>>> ------------------------------------------- >>>>>>> mips n n >>>>>>> mips64 n n >>>>>>> x86 n n >>>>>>> x86_64 y y >>>>>>> powerpc n n >>>>>>> powerpc64 n n >>>>>>> >>>>>>> >>>>>>> Reviewer Checklist: >>>>>>> ------------------- >>>>>>> [Submitters: make sure that your review doesn't trigger any >>>>>>> checkmarks!] >>>>>>> >>>>>>> >>>>>>> Your checkin has not passed review because (see checked entries): >>>>>>> >>>>>>> ___ Your RR template is generally incomplete; it has too many >>>>>>> blank >>>> entries >>>>>>> that need proper data filled in. >>>>>>> >>>>>>> ___ You have failed to nominate the proper persons for review and >> push. >>>>>>> ___ Your patches do not have proper short+long header >>>>>>> >>>>>>> ___ You have grammar/spelling in your header that is unacceptable. >>>>>>> >>>>>>> ___ You have exceeded a sensible line length in your >>>>>> headers/comments/text. >>>>>>> ___ You have failed to put in a proper Trac Ticket # into your commits. >>>>>>> >>>>>>> ___ You have incorrectly put/left internal data in your comments/files >>>>>>> (i.e. internal bug tracking tool IDs, product names etc) >>>>>>> >>>>>>> ___ You have not given any evidence of testing beyond basic build >> tests. >>>>>>> Demonstrate some level of runtime or other sanity testing. >>>>>>> >>>>>>> ___ You have ^M present in some of your files. These have to be >>>> removed. >>>>>>> ___ You have needlessly changed whitespace or added whitespace >> crimes >>>>>>> like trailing spaces, or spaces before tabs. >>>>>>> >>>>>>> ___ You have mixed real technical changes with whitespace and other >>>>>>> cosmetic code cleanup changes. These have to be separate >> commits. >>>>>>> ___ You need to refactor your submission into logical chunks; there is >>>>>>> too much content into a single commit. >>>>>>> >>>>>>> ___ You have extraneous garbage in your review (merge commits etc) >>>>>>> >>>>>>> ___ You have giant attachments which should never have been sent; >>>>>>> Instead you should place your content in a public tree to be >>>>>>> pulled. >>>>>>> >>>>>>> ___ You have too many commits attached to an e-mail; resend as >>>> threaded >>>>>>> commits, or place in a public tree for a pull. >>>>>>> >>>>>>> ___ You have resent this content multiple times without a clear >>>> indication >>>>>>> of what has changed between each re-send. >>>>>>> >>>>>>> ___ You have failed to adequately and individually address all of the >>>>>>> comments and change requests that were proposed in the >>>>>>> initial >>>> review. >>>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email >>>>>>> etc) >>>>>>> >>>>>>> ___ Your computer have a badly configured date and time; confusing >> the >>>>>>> the threaded patch review. >>>>>>> >>>>>>> ___ Your changes affect IPC mechanism, and you don't present any >> results >>>>>>> for in-service upgradability test. >>>>>>> >>>>>>> ___ Your changes affect user manual and documentation, your patch >>>> series >>>>>>> do not contain the patch that updates the Doxygen manual. >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------ >>>>>>> ------------ One dashboard for servers and applications across >>>>>>> Physical-Virtual-Cloud Widest out-of-the-box monitoring support >>>>>>> with 50+ applications Performance metrics, stats and reports that >>>>>>> give you Actionable Insights Deep dive visibility with transaction >>>>>>> tracing using APM Insight. >>>>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>>>>> _______________________________________________ >>>>>>> Opensaf-devel mailing list >>>>>>> [email protected] >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel >>>>>>> >>>>>>> ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
