Hi Anders,
Please find the patch(1334_modified.patch) attached and confirm
if it works for you. Of course, the commit is subject to discussion among us
(including Mathi).Thanks -Nagu > -----Original Message----- > From: Anders Widell [mailto:[email protected]] > Sent: 11 September 2015 18:16 > To: Nagendra Kumar; [email protected]; Praveen Malviya; > Mathivanan Naickan Palanivelu > Cc: [email protected] > Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid > only after initialization is completed [#1334] > > See inline. > > thanks, > Anders Widell > > On 09/11/2015 02:02 PM, Nagendra Kumar wrote: > > Thanks for the response. > > If I could, I will exclude the applications timing coming up on Standby > controller because I could install only management applications on it and that > too in Standby mode. There has been many instances (I had seen), where > Standby controller of some systems takes hours to come up, but it is not > considered in Cluster startup time because of Act controller is present and it > is working. > [AndersW] The time it takes for the standby to come up is also important, > since it counts towards your repair time and you wish to minimize the time > you are running without (or with reduced) redundancy. > > > > But anyway, it looks to me that your test case is very much affected with > this change, so what I can provide help is: I can send a patch which will > revert > the changes in #1334 and add code for rebooting in Amfd in case of failover > during cold sync. Will that work for you? > [AndersW] That would be great! > > > > If all(Mathi is on vacation till Tuesday) agree then we can push that patch, > else take some other decision or continue with old(as old is already pushed). > > > > Thanks > > -Nagu > >> -----Original Message----- > >> From: Anders Widell [mailto:[email protected]] > >> Sent: 11 September 2015 16:39 > >> To: Nagendra Kumar; [email protected]; Praveen Malviya; > >> Mathivanan Naickan Palanivelu > >> Cc: [email protected] > >> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond > >> to nid only after initialization is completed [#1334] > >> > >> See inline. > >> > >> / Anders Widell > >> > >> On 09/11/2015 12:45 PM, Nagendra Kumar wrote: > >>> Hi Anders, > >>> Please check the comment on #1334 : "Amf can provide a > >> reboot in this case if fms don't care to handle it.". > >> [AndersW] Yes, I saw this comment and that's why I was wondering why > >> you ended up with a different solution. > >>> But as per discussion with FMS maintainer, Mathi, we have taken this > >> decision to allow handing FMS on system level faults and Amf should > >> mind its business with internal affairs. So, the solution was > >> proposed in the ticket > >> #1334 on 2015-04-27. > >>> Let's hear Mathi's opinion as well. > >>> > >>> Just a question: is there any reason which enforces inclusion of > >>> Standby's > >> startup time in Cluster startup time, when failover can't work? > >> [AndersW] There can be applications running on the system > >> controllers, and if you by "cluster startup time" mean the time until > >> applications are up and running on all nodes in the cluster, then > >> this time has increased with the current solution to ticket [#1334]. > >> Also, my concern is that if we go for the same solution in the other > >> OpenSAF directors where we have the same fault, then we will > >> serialize the cold sync of OpenSAF directors and hence increase the > >> repair time of the OpenSAF system controllers. > >>> Thanks > >>> -Nagu > >>>> -----Original Message----- > >>>> From: Anders Widell [mailto:[email protected]] > >>>> Sent: 11 September 2015 15:50 > >>>> To: Nagendra Kumar; [email protected]; Praveen Malviya; > >>>> Mathivanan Naickan Palanivelu > >>>> Cc: [email protected] > >>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: > >>>> respond to > >> nid > >>>> only after initialization is completed [#1334] > >>>> > >>>> I will bring up this issue on our next TLC meeting. The immediate > >>>> concern here is the prolonged startup time, which can be regarded > >>>> as a > >> performance > >>>> regression in AMF. The long-term concern is the alignment of the > >>>> solution with future the direction of OpenSAF (removal of FM has been > discussed). > >>>> > >>>> regards, > >>>> Anders Widell > >>>> > >>>> On 09/11/2015 11:50 AM, Nagendra Kumar wrote: > >>>>> Hi Anders, > >>>>> Since you are talking about making FMS redundant and its > >>>> removal and moving from NID to system and since this problem is > >>>> existing > >> in > >>>> other pre Amf components(Amfnd actually), so I am being indecisive > >>>> to response. May be you can talk in TLC and conclude and raise a > >>>> ticket and prepare a roadmap for doing so. As a maintainer, > >>>> respective people will fix this issue in their services. > >>>>> Thanks > >>>>> -Nagu > >>>>>> -----Original Message----- > >>>>>> From: Anders Widell [mailto:[email protected]] > >>>>>> Sent: 10 September 2015 20:53 > >>>>>> To: Nagendra Kumar; [email protected]; Praveen > Malviya; > >>>>>> Mathivanan Naickan Palanivelu > >>>>>> Cc: [email protected] > >>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: > >>>>>> respond to nid only after initialization is completed [#1334] > >>>>>> > >>>>>> Hi > >>>>>> > >>>>>> See my replies marked with [AndersW] inline. > >>>>>> > >>>>>> thanks, > >>>>>> Anders Widell > >>>>>> > >>>>>> On 09/10/2015 01:27 PM, Nagendra Kumar wrote: > >>>>>>> Hi Anders, > >>>>>>> Thanks for pointing out. Please find my responses > inlined with > >>>>>> [Nagu]. > >>>>>>> Thanks > >>>>>>> -Nagu > >>>>>>>> -----Original Message----- > >>>>>>>> From: Anders Widell [mailto:[email protected]] > >>>>>>>> Sent: 08 September 2015 15:44 > >>>>>>>> To: Nagendra Kumar; [email protected]; Praveen > >> Malviya; > >>>>>>>> Mathivanan Naickan Palanivelu > >>>>>>>> Cc: [email protected] > >>>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: > >>>>>>>> respond to > >>>>>> nid > >>>>>>>> only after initialization is completed [#1334] > >>>>>>>> > >>>>>>>> Hi! > >>>>>>>> > >>>>>>>> I have a couple of questing regarding this ticket: > >>>>>>>> > >>>>>>>> 1) Why did you chose a solution where you delay the reply to > >>>>>>>> NID until amfd has completed its cold sync, instead of a > >>>>>>>> solution where amfd would itself initiate a node reboot if a > >>>>>>>> failover happens before it has completed the cold sync? The > >>>>>>>> chosen solution will delay the startup time of the standby > >>>>>>>> controller, so there is a clear (and visible, since we are > >>>>>>>> measuring startup time) disadvantage with this solution. Is > >>>>>>>> there any advantage with it, or > >>>> disadvantage with the other solution? > >>>>>>> [Nagu]: If you could read my comment in #1334 ticket, it will > >>>>>>> point out a > >>>>>> flaw > >>>>>>> in handling failover by FMS in case of cold sync in progress(Amfd). > >>>>>>> FMS collects evidence of Cold sync complete by checking whether > >>>>>>> FMS has got CSI assignment or not. Since Amfd has acknowledged > >>>>>>> NID that he has > >>>>>> completed > >>>>>>> its initialization(without Cold sync completed) and hence NID > >>>>>>> started Amfnd and Amfnd took configuration from Act Amfd and > >>>>>>> assigned CSI to FMS. Till > >>>>>> here Amfd is > >>>>>>> not synced. So, when failover is triggered FMS couldn't take any > >> decision. > >>>>>>> So, #1334 was reported. > >>>>>> [AndersW] I agree that the issue reported in ticket [#1334] is a bug. > >>>>>> The observed problem is that the standby SC does not go for a > >>>>>> reboot immediately in the case when an SC failover happens before > >>>>>> the cold sync is complete. It should reboot immediately. What I > >>>>>> am questioning is the solution to this problem. The implemented > >>>>>> solution delays the ack from amfd to NID until the cold sync is > >>>>>> complete, so that the logic in FM will work as intended and > >>>>>> reboot the node. My proposal is to instead solve this problem by > >>>>>> initiating the reboot directly from > >> AMFD. > >>>>>>> Now the question was > >>>>>>> 1. whether Amfd can ack NID that it is initialized and ready for > >>>>>>> Assignment/failover without syncing. > >>>>>> [AndersW] The question here is what an ack to NID actually means > >>>>>> (or should mean). Ideally, I think we should aim to move away > >>>>>> from using NID for serializing the OpenSAF startup and instead > >>>>>> solve dependencies between OpenSAF services using TRY_AGAIN > loops > >>>>>> or > >> similar > >>>> constructs. > >>>>>> I.e. the same philosophy as used in systemd. By delaying the > >>>>>> reply to NID we are moving in the opposite direction. > >>>>>>> 2. Whether FMS logic can be changed. > >>>>>> [AndersW] My proposal wasn't really to change the logic in FM, > >>>>>> but rather to order the reboot from the OpenSAF director (amfd in > >>>>>> this > >>>>>> case) when a failover happens before the cold sync is complete. > >>>>>> The logic in FM doesn't have to be changed, but if all directors > >>>>>> have been fixed in this way then the logic in FM would be > >>>>>> redundant and > >> could > >>>> be removed. > >>>>>>> I thought Amfd need to ack NID when it completes sync. > >>>>>>> > >>>>>>> Whether 1. it increased Startuptime of Standby controller or 2. > >>>>>>> it actually corrected the errata of measurement, is subject to > >>>>>>> interpretation. > >>>>>> [AndersW] I think the startup time is delayed by the current > >>>>>> solution to this ticket. Look at it from the perspective of an > >>>>>> application component which is started by AMF on the standby SC, > >>>>>> and measure > >> how > >>>>>> long time it takes until this component gets an assignment after > >>>>>> a > >> reboot. > >>>>>>> In my opinion, #2 is correct. If Amfd is not ready, it should > >>>>>>> not have sent the > >>>>>> Ack. > >>>>>>> So, it was a bug. > >>>>>>> > >>>>>>>> 2) Isn't this problem applicable to all OpenSAF directors, and > >>>>>>>> not just AMFD? Has anyone analyzed the other services to > >>>>>>>> determine if they also need a similar fix? Since NID starts the > >>>>>>>> services in serial and doesn't continue with the next service > >>>>>>>> before it has received a reply from the previous one, it would > >>>>>>>> mean that with the chosen solution all OpenSAF directors will > >>>>>>>> be cold synced in serial. With the other solution, they can be cold > synced in parallel. > >>>>>>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs > >>>>>>> correction > >>>>>> and hence startup time > >>>>>>> will increase more. > >>>>>> [AndersW] Ok, thanks for checking! So then we will eventually > >>>>>> have to fix them as well. But again I would like to point out > >>>>>> that if we go for the solution that has been implemented in AMFD > >>>>>> then the OpenSAF cold sync will be serialized. With the solution > >>>>>> that I propose, the cold sync can be done in parallel. By syncing > >>>>>> the OpenSAF directors in parallel we get a shorter time window > >>>>>> when the system is without a fully synced standby SC, and hence > >>>>>> improve the availability of the > >> system. > >>>>>>>> / Anders Widell > >>>>>>>> > >>>>>>>> On 04/27/2015 11:19 AM, [email protected] wrote: > >>>>>>>>> Summary: amfd: respond to nid only after initialization is > >>>>>>>>> completed > >>>>>> [#1334] > >>>>>>>>> Review request for Trac Ticket(s): #1334 Peer Reviewer(s): > >>>>>>>>> Mathi, Hans N, Praveen Pull request to: <<LIST THE PERSON WITH > >>>>>>>>> PUSH ACCESS HERE>> Affected branch(es): All Development > branch: > >> Default > >>>>>>>>> -------------------------------- > >>>>>>>>> Impacted area Impact y/n > >>>>>>>>> -------------------------------- > >>>>>>>>> Docs n > >>>>>>>>> Build system n > >>>>>>>>> RPM/packaging n > >>>>>>>>> Configuration files n > >>>>>>>>> Startup scripts n > >>>>>>>>> SAF services y > >>>>>>>>> OpenSAF services n > >>>>>>>>> Core libraries n > >>>>>>>>> Samples n > >>>>>>>>> Tests n > >>>>>>>>> Other n > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Comments (indicate scope for each "y" above): > >>>>>>>>> --------------------------------------------- > >>>>>>>>> <<EXPLAIN/COMMENT THE PATCH SERIES HERE>> > >>>>>>>>> > >>>>>>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda > >>>>>>>>> Author: Nagendra Kumar<[email protected]> > >>>>>>>>> Date: Mon, 27 Apr 2015 14:41:02 +0530 > >>>>>>>>> > >>>>>>>>> amfd: respond to nid only after initialization is completed > >>>>>>>>> [#1334] Act > >>>>>>>> Amfd > >>>>>>>>> initialization is said to be completed when it completes its > >>>>>>>> initialization > >>>>>>>>> with imm. Apart from initializing with imm, Standby Amfd > >> also > >>>>>>>>> need > >>>>>>>> to get > >>>>>>>>> run time data from Act Amfd using cold sync. So, Standby > >> Amfd > >>>>>>>> initialization > >>>>>>>>> is said to be completed when it completes its initialization > >> with > >>>>>>>>> imm > >>>>>>>> and it > >>>>>>>>> completes its cold sync with Act Amfd. In the present code, > >> Standby is > >>>>>>>>> sending response to nid without cold sync complete. So, code > >> has > >>>>>>>> been added > >>>>>>>>> to send nid response only when Amfd completes its > >> initialization. > >>>>>>>>> > >>>>>>>>> Complete diffstat: > >>>>>>>>> ------------------ > >>>>>>>>> osaf/services/saf/amf/amfd/chkop.cc | 4 ++++ > >>>>>>>>> osaf/services/saf/amf/amfd/main.cc | 6 +++++- > >>>>>>>>> 2 files changed, 9 insertions(+), 1 deletions(-) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Testing Commands: > >>>>>>>>> ----------------- > >>>>>>>>> 1. Start Act controller SC-1. > >>>>>>>>> Start Standby controller SC-2 and stop SC-1 when > >>>>>>>>> Standby Amfd is in the middle of cold sync. > >>>>>>>>> 2. Start Act controller SC-1. > >>>>>>>>> Start Standby controller SC-2 and stop SC-1 when > >>>>>>>>> Standby Amfd has completed cold sync but Amfnd has not > >> asigned > >>>>>>>>> role to Fmd at Standby controller. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Testing, Expected Results: > >>>>>>>>> -------------------------- > >>>>>>>>> 1. Fmd reboots the node : > >>>>>>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF > >> NodeId = > >>>>>>>>> 0 > >>>>>> EE > >>>>>>>> Name = No EE Mapped, Reason: Failover occurred, but this node > >>>>>>>> is not yet ready, OwnNodeId = 131599, SupervisionTime = 60 > >>>>>>>>> 2. Same as above. > >>>>>>>>> > >>>>>>>>> Conditions of Submission: > >>>>>>>>> ------------------------- > >>>>>>>>> Ack from peer reviewers. > >>>>>>>>> > >>>>>>>>> Arch Built Started Linux distro > >>>>>>>>> ------------------------------------------- > >>>>>>>>> mips n n > >>>>>>>>> mips64 n n > >>>>>>>>> x86 n n > >>>>>>>>> x86_64 y y > >>>>>>>>> powerpc n n > >>>>>>>>> powerpc64 n n > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Reviewer Checklist: > >>>>>>>>> ------------------- > >>>>>>>>> [Submitters: make sure that your review doesn't trigger any > >>>>>>>>> checkmarks!] > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Your checkin has not passed review because (see checked entries): > >>>>>>>>> > >>>>>>>>> ___ Your RR template is generally incomplete; it has too many > >>>>>>>>> blank > >>>>>> entries > >>>>>>>>> that need proper data filled in. > >>>>>>>>> > >>>>>>>>> ___ You have failed to nominate the proper persons for review > >>>>>>>>> and > >>>> push. > >>>>>>>>> ___ Your patches do not have proper short+long header > >>>>>>>>> > >>>>>>>>> ___ You have grammar/spelling in your header that is > unacceptable. > >>>>>>>>> > >>>>>>>>> ___ You have exceeded a sensible line length in your > >>>>>>>> headers/comments/text. > >>>>>>>>> ___ You have failed to put in a proper Trac Ticket # into your > >> commits. > >>>>>>>>> ___ You have incorrectly put/left internal data in your > >> comments/files > >>>>>>>>> (i.e. internal bug tracking tool IDs, product names > >>>>>>>>> etc) > >>>>>>>>> > >>>>>>>>> ___ You have not given any evidence of testing beyond basic > >>>>>>>>> build > >>>> tests. > >>>>>>>>> Demonstrate some level of runtime or other sanity testing. > >>>>>>>>> > >>>>>>>>> ___ You have ^M present in some of your files. These have to > >>>>>>>>> be > >>>>>> removed. > >>>>>>>>> ___ You have needlessly changed whitespace or added whitespace > >>>> crimes > >>>>>>>>> like trailing spaces, or spaces before tabs. > >>>>>>>>> > >>>>>>>>> ___ You have mixed real technical changes with whitespace and > >> other > >>>>>>>>> cosmetic code cleanup changes. These have to be > >>>>>>>>> separate > >>>> commits. > >>>>>>>>> ___ You need to refactor your submission into logical chunks; > >>>>>>>>> there > >> is > >>>>>>>>> too much content into a single commit. > >>>>>>>>> > >>>>>>>>> ___ You have extraneous garbage in your review (merge commits > >>>>>>>>> etc) > >>>>>>>>> > >>>>>>>>> ___ You have giant attachments which should never have been > sent; > >>>>>>>>> Instead you should place your content in a public > >>>>>>>>> tree to be > >> pulled. > >>>>>>>>> ___ You have too many commits attached to an e-mail; resend as > >>>>>> threaded > >>>>>>>>> commits, or place in a public tree for a pull. > >>>>>>>>> > >>>>>>>>> ___ You have resent this content multiple times without a > >>>>>>>>> clear > >>>>>> indication > >>>>>>>>> of what has changed between each re-send. > >>>>>>>>> > >>>>>>>>> ___ You have failed to adequately and individually address all of > the > >>>>>>>>> comments and change requests that were proposed in > >>>>>>>>> the initial > >>>>>> review. > >>>>>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, > >>>>>>>>> email > >>>>>>>>> etc) > >>>>>>>>> > >>>>>>>>> ___ Your computer have a badly configured date and time; > >>>>>>>>> confusing > >>>> the > >>>>>>>>> the threaded patch review. > >>>>>>>>> > >>>>>>>>> ___ Your changes affect IPC mechanism, and you don't present > >>>>>>>>> any > >>>> results > >>>>>>>>> for in-service upgradability test. > >>>>>>>>> > >>>>>>>>> ___ Your changes affect user manual and documentation, your > >>>>>>>>> patch > >>>>>> series > >>>>>>>>> do not contain the patch that updates the Doxygen manual. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -------------------------------------------------------------- > >>>>>>>>> ---- > >>>>>>>>> ------------ One dashboard for servers and applications across > >>>>>>>>> Physical-Virtual-Cloud Widest out-of-the-box monitoring > >>>>>>>>> support with 50+ applications Performance metrics, stats and > >>>>>>>>> reports that give you Actionable Insights Deep dive visibility > >>>>>>>>> with transaction tracing using APM Insight. > >>>>>>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > >>>>>>>>> _______________________________________________ > >>>>>>>>> Opensaf-devel mailing list > >>>>>>>>> [email protected] > >>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel > >>>>>>>>> > >>>>>>>>> >
1334_modified.patch
Description: Binary data
------------------------------------------------------------------------------
_______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
