Thanks, we will try it out! Btw I checked the patch and it only changes the function avd_role_failover(), but there is no corresponding change in the function avd_role_failover_qsd_actv(). Doesn't the latter function also have to be modified?
regards, Anders Widell On 09/11/2015 03:08 PM, Nagendra Kumar wrote: > Hi Anders, > Please find the patch(1334_modified.patch) attached and confirm > if it works for you. Of course, the commit is subject to discussion among us > (including Mathi). > > Thanks > -Nagu > >> -----Original Message----- >> From: Anders Widell [mailto:[email protected]] >> Sent: 11 September 2015 18:16 >> To: Nagendra Kumar; [email protected]; Praveen Malviya; >> Mathivanan Naickan Palanivelu >> Cc: [email protected] >> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid >> only after initialization is completed [#1334] >> >> See inline. >> >> thanks, >> Anders Widell >> >> On 09/11/2015 02:02 PM, Nagendra Kumar wrote: >>> Thanks for the response. >>> If I could, I will exclude the applications timing coming up on Standby >> controller because I could install only management applications on it and >> that >> too in Standby mode. There has been many instances (I had seen), where >> Standby controller of some systems takes hours to come up, but it is not >> considered in Cluster startup time because of Act controller is present and >> it >> is working. >> [AndersW] The time it takes for the standby to come up is also important, >> since it counts towards your repair time and you wish to minimize the time >> you are running without (or with reduced) redundancy. >>> But anyway, it looks to me that your test case is very much affected with >> this change, so what I can provide help is: I can send a patch which will >> revert >> the changes in #1334 and add code for rebooting in Amfd in case of failover >> during cold sync. Will that work for you? >> [AndersW] That would be great! >>> If all(Mathi is on vacation till Tuesday) agree then we can push that patch, >> else take some other decision or continue with old(as old is already pushed). >>> Thanks >>> -Nagu >>>> -----Original Message----- >>>> From: Anders Widell [mailto:[email protected]] >>>> Sent: 11 September 2015 16:39 >>>> To: Nagendra Kumar; [email protected]; Praveen Malviya; >>>> Mathivanan Naickan Palanivelu >>>> Cc: [email protected] >>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond >>>> to nid only after initialization is completed [#1334] >>>> >>>> See inline. >>>> >>>> / Anders Widell >>>> >>>> On 09/11/2015 12:45 PM, Nagendra Kumar wrote: >>>>> Hi Anders, >>>>> Please check the comment on #1334 : "Amf can provide a >>>> reboot in this case if fms don't care to handle it.". >>>> [AndersW] Yes, I saw this comment and that's why I was wondering why >>>> you ended up with a different solution. >>>>> But as per discussion with FMS maintainer, Mathi, we have taken this >>>> decision to allow handing FMS on system level faults and Amf should >>>> mind its business with internal affairs. So, the solution was >>>> proposed in the ticket >>>> #1334 on 2015-04-27. >>>>> Let's hear Mathi's opinion as well. >>>>> >>>>> Just a question: is there any reason which enforces inclusion of >>>>> Standby's >>>> startup time in Cluster startup time, when failover can't work? >>>> [AndersW] There can be applications running on the system >>>> controllers, and if you by "cluster startup time" mean the time until >>>> applications are up and running on all nodes in the cluster, then >>>> this time has increased with the current solution to ticket [#1334]. >>>> Also, my concern is that if we go for the same solution in the other >>>> OpenSAF directors where we have the same fault, then we will >>>> serialize the cold sync of OpenSAF directors and hence increase the >>>> repair time of the OpenSAF system controllers. >>>>> Thanks >>>>> -Nagu >>>>>> -----Original Message----- >>>>>> From: Anders Widell [mailto:[email protected]] >>>>>> Sent: 11 September 2015 15:50 >>>>>> To: Nagendra Kumar; [email protected]; Praveen Malviya; >>>>>> Mathivanan Naickan Palanivelu >>>>>> Cc: [email protected] >>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: >>>>>> respond to >>>> nid >>>>>> only after initialization is completed [#1334] >>>>>> >>>>>> I will bring up this issue on our next TLC meeting. The immediate >>>>>> concern here is the prolonged startup time, which can be regarded >>>>>> as a >>>> performance >>>>>> regression in AMF. The long-term concern is the alignment of the >>>>>> solution with future the direction of OpenSAF (removal of FM has been >> discussed). >>>>>> regards, >>>>>> Anders Widell >>>>>> >>>>>> On 09/11/2015 11:50 AM, Nagendra Kumar wrote: >>>>>>> Hi Anders, >>>>>>> Since you are talking about making FMS redundant and its >>>>>> removal and moving from NID to system and since this problem is >>>>>> existing >>>> in >>>>>> other pre Amf components(Amfnd actually), so I am being indecisive >>>>>> to response. May be you can talk in TLC and conclude and raise a >>>>>> ticket and prepare a roadmap for doing so. As a maintainer, >>>>>> respective people will fix this issue in their services. >>>>>>> Thanks >>>>>>> -Nagu >>>>>>>> -----Original Message----- >>>>>>>> From: Anders Widell [mailto:[email protected]] >>>>>>>> Sent: 10 September 2015 20:53 >>>>>>>> To: Nagendra Kumar; [email protected]; Praveen >> Malviya; >>>>>>>> Mathivanan Naickan Palanivelu >>>>>>>> Cc: [email protected] >>>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: >>>>>>>> respond to nid only after initialization is completed [#1334] >>>>>>>> >>>>>>>> Hi >>>>>>>> >>>>>>>> See my replies marked with [AndersW] inline. >>>>>>>> >>>>>>>> thanks, >>>>>>>> Anders Widell >>>>>>>> >>>>>>>> On 09/10/2015 01:27 PM, Nagendra Kumar wrote: >>>>>>>>> Hi Anders, >>>>>>>>> Thanks for pointing out. Please find my responses >> inlined with >>>>>>>> [Nagu]. >>>>>>>>> Thanks >>>>>>>>> -Nagu >>>>>>>>>> -----Original Message----- >>>>>>>>>> From: Anders Widell [mailto:[email protected]] >>>>>>>>>> Sent: 08 September 2015 15:44 >>>>>>>>>> To: Nagendra Kumar; [email protected]; Praveen >>>> Malviya; >>>>>>>>>> Mathivanan Naickan Palanivelu >>>>>>>>>> Cc: [email protected] >>>>>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: >>>>>>>>>> respond to >>>>>>>> nid >>>>>>>>>> only after initialization is completed [#1334] >>>>>>>>>> >>>>>>>>>> Hi! >>>>>>>>>> >>>>>>>>>> I have a couple of questing regarding this ticket: >>>>>>>>>> >>>>>>>>>> 1) Why did you chose a solution where you delay the reply to >>>>>>>>>> NID until amfd has completed its cold sync, instead of a >>>>>>>>>> solution where amfd would itself initiate a node reboot if a >>>>>>>>>> failover happens before it has completed the cold sync? The >>>>>>>>>> chosen solution will delay the startup time of the standby >>>>>>>>>> controller, so there is a clear (and visible, since we are >>>>>>>>>> measuring startup time) disadvantage with this solution. Is >>>>>>>>>> there any advantage with it, or >>>>>> disadvantage with the other solution? >>>>>>>>> [Nagu]: If you could read my comment in #1334 ticket, it will >>>>>>>>> point out a >>>>>>>> flaw >>>>>>>>> in handling failover by FMS in case of cold sync in progress(Amfd). >>>>>>>>> FMS collects evidence of Cold sync complete by checking whether >>>>>>>>> FMS has got CSI assignment or not. Since Amfd has acknowledged >>>>>>>>> NID that he has >>>>>>>> completed >>>>>>>>> its initialization(without Cold sync completed) and hence NID >>>>>>>>> started Amfnd and Amfnd took configuration from Act Amfd and >>>>>>>>> assigned CSI to FMS. Till >>>>>>>> here Amfd is >>>>>>>>> not synced. So, when failover is triggered FMS couldn't take any >>>> decision. >>>>>>>>> So, #1334 was reported. >>>>>>>> [AndersW] I agree that the issue reported in ticket [#1334] is a bug. >>>>>>>> The observed problem is that the standby SC does not go for a >>>>>>>> reboot immediately in the case when an SC failover happens before >>>>>>>> the cold sync is complete. It should reboot immediately. What I >>>>>>>> am questioning is the solution to this problem. The implemented >>>>>>>> solution delays the ack from amfd to NID until the cold sync is >>>>>>>> complete, so that the logic in FM will work as intended and >>>>>>>> reboot the node. My proposal is to instead solve this problem by >>>>>>>> initiating the reboot directly from >>>> AMFD. >>>>>>>>> Now the question was >>>>>>>>> 1. whether Amfd can ack NID that it is initialized and ready for >>>>>>>>> Assignment/failover without syncing. >>>>>>>> [AndersW] The question here is what an ack to NID actually means >>>>>>>> (or should mean). Ideally, I think we should aim to move away >>>>>>>> from using NID for serializing the OpenSAF startup and instead >>>>>>>> solve dependencies between OpenSAF services using TRY_AGAIN >> loops >>>>>>>> or >>>> similar >>>>>> constructs. >>>>>>>> I.e. the same philosophy as used in systemd. By delaying the >>>>>>>> reply to NID we are moving in the opposite direction. >>>>>>>>> 2. Whether FMS logic can be changed. >>>>>>>> [AndersW] My proposal wasn't really to change the logic in FM, >>>>>>>> but rather to order the reboot from the OpenSAF director (amfd in >>>>>>>> this >>>>>>>> case) when a failover happens before the cold sync is complete. >>>>>>>> The logic in FM doesn't have to be changed, but if all directors >>>>>>>> have been fixed in this way then the logic in FM would be >>>>>>>> redundant and >>>> could >>>>>> be removed. >>>>>>>>> I thought Amfd need to ack NID when it completes sync. >>>>>>>>> >>>>>>>>> Whether 1. it increased Startuptime of Standby controller or 2. >>>>>>>>> it actually corrected the errata of measurement, is subject to >>>>>>>>> interpretation. >>>>>>>> [AndersW] I think the startup time is delayed by the current >>>>>>>> solution to this ticket. Look at it from the perspective of an >>>>>>>> application component which is started by AMF on the standby SC, >>>>>>>> and measure >>>> how >>>>>>>> long time it takes until this component gets an assignment after >>>>>>>> a >>>> reboot. >>>>>>>>> In my opinion, #2 is correct. If Amfd is not ready, it should >>>>>>>>> not have sent the >>>>>>>> Ack. >>>>>>>>> So, it was a bug. >>>>>>>>> >>>>>>>>>> 2) Isn't this problem applicable to all OpenSAF directors, and >>>>>>>>>> not just AMFD? Has anyone analyzed the other services to >>>>>>>>>> determine if they also need a similar fix? Since NID starts the >>>>>>>>>> services in serial and doesn't continue with the next service >>>>>>>>>> before it has received a reply from the previous one, it would >>>>>>>>>> mean that with the chosen solution all OpenSAF directors will >>>>>>>>>> be cold synced in serial. With the other solution, they can be cold >> synced in parallel. >>>>>>>>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs >>>>>>>>> correction >>>>>>>> and hence startup time >>>>>>>>> will increase more. >>>>>>>> [AndersW] Ok, thanks for checking! So then we will eventually >>>>>>>> have to fix them as well. But again I would like to point out >>>>>>>> that if we go for the solution that has been implemented in AMFD >>>>>>>> then the OpenSAF cold sync will be serialized. With the solution >>>>>>>> that I propose, the cold sync can be done in parallel. By syncing >>>>>>>> the OpenSAF directors in parallel we get a shorter time window >>>>>>>> when the system is without a fully synced standby SC, and hence >>>>>>>> improve the availability of the >>>> system. >>>>>>>>>> / Anders Widell >>>>>>>>>> >>>>>>>>>> On 04/27/2015 11:19 AM, [email protected] wrote: >>>>>>>>>>> Summary: amfd: respond to nid only after initialization is >>>>>>>>>>> completed >>>>>>>> [#1334] >>>>>>>>>>> Review request for Trac Ticket(s): #1334 Peer Reviewer(s): >>>>>>>>>>> Mathi, Hans N, Praveen Pull request to: <<LIST THE PERSON WITH >>>>>>>>>>> PUSH ACCESS HERE>> Affected branch(es): All Development >> branch: >>>> Default >>>>>>>>>>> -------------------------------- >>>>>>>>>>> Impacted area Impact y/n >>>>>>>>>>> -------------------------------- >>>>>>>>>>> Docs n >>>>>>>>>>> Build system n >>>>>>>>>>> RPM/packaging n >>>>>>>>>>> Configuration files n >>>>>>>>>>> Startup scripts n >>>>>>>>>>> SAF services y >>>>>>>>>>> OpenSAF services n >>>>>>>>>>> Core libraries n >>>>>>>>>>> Samples n >>>>>>>>>>> Tests n >>>>>>>>>>> Other n >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Comments (indicate scope for each "y" above): >>>>>>>>>>> --------------------------------------------- >>>>>>>>>>> <<EXPLAIN/COMMENT THE PATCH SERIES HERE>> >>>>>>>>>>> >>>>>>>>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda >>>>>>>>>>> Author: Nagendra Kumar<[email protected]> >>>>>>>>>>> Date: Mon, 27 Apr 2015 14:41:02 +0530 >>>>>>>>>>> >>>>>>>>>>> amfd: respond to nid only after initialization is completed >>>>>>>>>>> [#1334] Act >>>>>>>>>> Amfd >>>>>>>>>>> initialization is said to be completed when it completes its >>>>>>>>>> initialization >>>>>>>>>>> with imm. Apart from initializing with imm, Standby Amfd >>>> also >>>>>>>>>>> need >>>>>>>>>> to get >>>>>>>>>>> run time data from Act Amfd using cold sync. So, Standby >>>> Amfd >>>>>>>>>> initialization >>>>>>>>>>> is said to be completed when it completes its initialization >>>> with >>>>>>>>>>> imm >>>>>>>>>> and it >>>>>>>>>>> completes its cold sync with Act Amfd. In the present code, >>>> Standby is >>>>>>>>>>> sending response to nid without cold sync complete. So, code >>>> has >>>>>>>>>> been added >>>>>>>>>>> to send nid response only when Amfd completes its >>>> initialization. >>>>>>>>>>> Complete diffstat: >>>>>>>>>>> ------------------ >>>>>>>>>>> osaf/services/saf/amf/amfd/chkop.cc | 4 ++++ >>>>>>>>>>> osaf/services/saf/amf/amfd/main.cc | 6 +++++- >>>>>>>>>>> 2 files changed, 9 insertions(+), 1 deletions(-) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Testing Commands: >>>>>>>>>>> ----------------- >>>>>>>>>>> 1. Start Act controller SC-1. >>>>>>>>>>> Start Standby controller SC-2 and stop SC-1 when >>>>>>>>>>> Standby Amfd is in the middle of cold sync. >>>>>>>>>>> 2. Start Act controller SC-1. >>>>>>>>>>> Start Standby controller SC-2 and stop SC-1 when >>>>>>>>>>> Standby Amfd has completed cold sync but Amfnd has not >>>> asigned >>>>>>>>>>> role to Fmd at Standby controller. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Testing, Expected Results: >>>>>>>>>>> -------------------------- >>>>>>>>>>> 1. Fmd reboots the node : >>>>>>>>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF >>>> NodeId = >>>>>>>>>>> 0 >>>>>>>> EE >>>>>>>>>> Name = No EE Mapped, Reason: Failover occurred, but this node >>>>>>>>>> is not yet ready, OwnNodeId = 131599, SupervisionTime = 60 >>>>>>>>>>> 2. Same as above. >>>>>>>>>>> >>>>>>>>>>> Conditions of Submission: >>>>>>>>>>> ------------------------- >>>>>>>>>>> Ack from peer reviewers. >>>>>>>>>>> >>>>>>>>>>> Arch Built Started Linux distro >>>>>>>>>>> ------------------------------------------- >>>>>>>>>>> mips n n >>>>>>>>>>> mips64 n n >>>>>>>>>>> x86 n n >>>>>>>>>>> x86_64 y y >>>>>>>>>>> powerpc n n >>>>>>>>>>> powerpc64 n n >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Reviewer Checklist: >>>>>>>>>>> ------------------- >>>>>>>>>>> [Submitters: make sure that your review doesn't trigger any >>>>>>>>>>> checkmarks!] >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Your checkin has not passed review because (see checked entries): >>>>>>>>>>> >>>>>>>>>>> ___ Your RR template is generally incomplete; it has too many >>>>>>>>>>> blank >>>>>>>> entries >>>>>>>>>>> that need proper data filled in. >>>>>>>>>>> >>>>>>>>>>> ___ You have failed to nominate the proper persons for review >>>>>>>>>>> and >>>>>> push. >>>>>>>>>>> ___ Your patches do not have proper short+long header >>>>>>>>>>> >>>>>>>>>>> ___ You have grammar/spelling in your header that is >> unacceptable. >>>>>>>>>>> ___ You have exceeded a sensible line length in your >>>>>>>>>> headers/comments/text. >>>>>>>>>>> ___ You have failed to put in a proper Trac Ticket # into your >>>> commits. >>>>>>>>>>> ___ You have incorrectly put/left internal data in your >>>> comments/files >>>>>>>>>>> (i.e. internal bug tracking tool IDs, product names >>>>>>>>>>> etc) >>>>>>>>>>> >>>>>>>>>>> ___ You have not given any evidence of testing beyond basic >>>>>>>>>>> build >>>>>> tests. >>>>>>>>>>> Demonstrate some level of runtime or other sanity testing. >>>>>>>>>>> >>>>>>>>>>> ___ You have ^M present in some of your files. These have to >>>>>>>>>>> be >>>>>>>> removed. >>>>>>>>>>> ___ You have needlessly changed whitespace or added whitespace >>>>>> crimes >>>>>>>>>>> like trailing spaces, or spaces before tabs. >>>>>>>>>>> >>>>>>>>>>> ___ You have mixed real technical changes with whitespace and >>>> other >>>>>>>>>>> cosmetic code cleanup changes. These have to be >>>>>>>>>>> separate >>>>>> commits. >>>>>>>>>>> ___ You need to refactor your submission into logical chunks; >>>>>>>>>>> there >>>> is >>>>>>>>>>> too much content into a single commit. >>>>>>>>>>> >>>>>>>>>>> ___ You have extraneous garbage in your review (merge commits >>>>>>>>>>> etc) >>>>>>>>>>> >>>>>>>>>>> ___ You have giant attachments which should never have been >> sent; >>>>>>>>>>> Instead you should place your content in a public >>>>>>>>>>> tree to be >>>> pulled. >>>>>>>>>>> ___ You have too many commits attached to an e-mail; resend as >>>>>>>> threaded >>>>>>>>>>> commits, or place in a public tree for a pull. >>>>>>>>>>> >>>>>>>>>>> ___ You have resent this content multiple times without a >>>>>>>>>>> clear >>>>>>>> indication >>>>>>>>>>> of what has changed between each re-send. >>>>>>>>>>> >>>>>>>>>>> ___ You have failed to adequately and individually address all of >> the >>>>>>>>>>> comments and change requests that were proposed in >>>>>>>>>>> the initial >>>>>>>> review. >>>>>>>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, >>>>>>>>>>> email >>>>>>>>>>> etc) >>>>>>>>>>> >>>>>>>>>>> ___ Your computer have a badly configured date and time; >>>>>>>>>>> confusing >>>>>> the >>>>>>>>>>> the threaded patch review. >>>>>>>>>>> >>>>>>>>>>> ___ Your changes affect IPC mechanism, and you don't present >>>>>>>>>>> any >>>>>> results >>>>>>>>>>> for in-service upgradability test. >>>>>>>>>>> >>>>>>>>>>> ___ Your changes affect user manual and documentation, your >>>>>>>>>>> patch >>>>>>>> series >>>>>>>>>>> do not contain the patch that updates the Doxygen manual. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>> ---- >>>>>>>>>>> ------------ One dashboard for servers and applications across >>>>>>>>>>> Physical-Virtual-Cloud Widest out-of-the-box monitoring >>>>>>>>>>> support with 50+ applications Performance metrics, stats and >>>>>>>>>>> reports that give you Actionable Insights Deep dive visibility >>>>>>>>>>> with transaction tracing using APM Insight. >>>>>>>>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Opensaf-devel mailing list >>>>>>>>>>> [email protected] >>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel >>>>>>>>>>> >>>>>>>>>>> ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
