Hi Anders,
                Please find the patch(1334_modified.patch) attached and confirm 
if it works for you. Of course, the commit is subject to discussion among us 
(including Mathi).

Thanks
-Nagu

> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: 11 September 2015 18:16
> To: Nagendra Kumar; [email protected]; Praveen Malviya;
> Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid
> only after initialization is completed [#1334]
> 
> See inline.
> 
> thanks,
> Anders Widell
> 
> On 09/11/2015 02:02 PM, Nagendra Kumar wrote:
> > Thanks for the response.
> > If I could, I will exclude the applications timing coming up on Standby
> controller because I could install only management applications on it and that
> too in Standby mode. There has been many instances (I had seen), where
> Standby controller of some systems takes hours to come up, but it is not
> considered in Cluster startup time because of Act controller is present and it
> is working.
> [AndersW] The time it takes for the standby to come up is also important,
> since it counts towards your repair time and you wish to minimize the time
> you are running without (or with reduced) redundancy.
> >
> > But anyway, it looks to me that your test case is very much affected with
> this change, so what I can provide help is: I can send a patch which will 
> revert
> the changes in #1334 and add code for rebooting in Amfd in case of failover
> during cold sync. Will that work for you?
> [AndersW] That would be great!
> >
> > If all(Mathi is on vacation till Tuesday) agree then we can push that patch,
> else take some other decision or continue with old(as old is already pushed).
> >
> > Thanks
> > -Nagu
> >> -----Original Message-----
> >> From: Anders Widell [mailto:[email protected]]
> >> Sent: 11 September 2015 16:39
> >> To: Nagendra Kumar; [email protected]; Praveen Malviya;
> >> Mathivanan Naickan Palanivelu
> >> Cc: [email protected]
> >> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond
> >> to nid only after initialization is completed [#1334]
> >>
> >> See inline.
> >>
> >> / Anders Widell
> >>
> >> On 09/11/2015 12:45 PM, Nagendra Kumar wrote:
> >>> Hi Anders,
> >>>           Please check the comment on #1334 : "Amf can provide a
> >> reboot in this case if fms don't care to handle it.".
> >> [AndersW] Yes, I saw this comment and that's why I was wondering why
> >> you ended up with a different solution.
> >>> But as per discussion with FMS maintainer, Mathi, we have taken this
> >> decision to allow handing FMS on system level faults and Amf should
> >> mind its business with internal affairs. So, the solution was
> >> proposed in the ticket
> >> #1334 on 2015-04-27.
> >>> Let's hear Mathi's opinion as well.
> >>>
> >>> Just a question: is there any reason which enforces inclusion of
> >>> Standby's
> >> startup time in Cluster startup time, when failover can't work?
> >> [AndersW] There can be applications running on the system
> >> controllers, and if you by "cluster startup time" mean the time until
> >> applications are up and running on all nodes in the cluster, then
> >> this time has increased with the current solution to ticket [#1334].
> >> Also, my concern is that if we go for the same solution in the other
> >> OpenSAF directors where we have the same fault, then we will
> >> serialize the cold sync of OpenSAF directors and hence increase the
> >> repair time of the OpenSAF system controllers.
> >>> Thanks
> >>> -Nagu
> >>>> -----Original Message-----
> >>>> From: Anders Widell [mailto:[email protected]]
> >>>> Sent: 11 September 2015 15:50
> >>>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
> >>>> Mathivanan Naickan Palanivelu
> >>>> Cc: [email protected]
> >>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd:
> >>>> respond to
> >> nid
> >>>> only after initialization is completed [#1334]
> >>>>
> >>>> I will bring up this issue on our next TLC meeting. The immediate
> >>>> concern here is the prolonged startup time, which can be regarded
> >>>> as a
> >> performance
> >>>> regression in AMF. The long-term concern is the alignment of the
> >>>> solution with future the direction of OpenSAF (removal of FM has been
> discussed).
> >>>>
> >>>> regards,
> >>>> Anders Widell
> >>>>
> >>>> On 09/11/2015 11:50 AM, Nagendra Kumar wrote:
> >>>>> Hi Anders,
> >>>>>                 Since you are talking about making FMS redundant and its
> >>>> removal and moving from NID to system and since this problem is
> >>>> existing
> >> in
> >>>> other pre Amf components(Amfnd actually), so I am being indecisive
> >>>> to response. May be you can talk in TLC and conclude and raise a
> >>>> ticket and prepare a roadmap for doing so. As a maintainer,
> >>>> respective people will fix this issue in their services.
> >>>>> Thanks
> >>>>> -Nagu
> >>>>>> -----Original Message-----
> >>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>> Sent: 10 September 2015 20:53
> >>>>>> To: Nagendra Kumar; [email protected]; Praveen
> Malviya;
> >>>>>> Mathivanan Naickan Palanivelu
> >>>>>> Cc: [email protected]
> >>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd:
> >>>>>> respond to nid only after initialization is completed [#1334]
> >>>>>>
> >>>>>> Hi
> >>>>>>
> >>>>>> See my replies marked with [AndersW] inline.
> >>>>>>
> >>>>>> thanks,
> >>>>>> Anders Widell
> >>>>>>
> >>>>>> On 09/10/2015 01:27 PM, Nagendra Kumar wrote:
> >>>>>>> Hi Anders,
> >>>>>>>               Thanks for pointing out. Please find my responses
> inlined with
> >>>>>> [Nagu].
> >>>>>>> Thanks
> >>>>>>> -Nagu
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>>>> Sent: 08 September 2015 15:44
> >>>>>>>> To: Nagendra Kumar; [email protected]; Praveen
> >> Malviya;
> >>>>>>>> Mathivanan Naickan Palanivelu
> >>>>>>>> Cc: [email protected]
> >>>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd:
> >>>>>>>> respond to
> >>>>>> nid
> >>>>>>>> only after initialization is completed [#1334]
> >>>>>>>>
> >>>>>>>> Hi!
> >>>>>>>>
> >>>>>>>> I have a couple of questing regarding this ticket:
> >>>>>>>>
> >>>>>>>> 1) Why did you chose a solution where you delay the reply to
> >>>>>>>> NID until amfd has completed its cold sync, instead of a
> >>>>>>>> solution where amfd would itself initiate a node reboot if a
> >>>>>>>> failover happens before it has completed the cold sync? The
> >>>>>>>> chosen solution will delay the startup time of the standby
> >>>>>>>> controller, so there is a clear (and visible, since we are
> >>>>>>>> measuring startup time) disadvantage with this solution. Is
> >>>>>>>> there any advantage with it, or
> >>>> disadvantage with the other solution?
> >>>>>>> [Nagu]: If you could read my comment in #1334 ticket, it will
> >>>>>>> point out a
> >>>>>> flaw
> >>>>>>> in handling failover by FMS in case of cold sync in progress(Amfd).
> >>>>>>> FMS collects evidence of Cold sync complete by checking whether
> >>>>>>> FMS has got CSI assignment or not. Since Amfd has acknowledged
> >>>>>>> NID that he has
> >>>>>> completed
> >>>>>>> its initialization(without Cold sync completed) and hence NID
> >>>>>>> started Amfnd and Amfnd took configuration from Act Amfd and
> >>>>>>> assigned CSI to FMS. Till
> >>>>>> here Amfd is
> >>>>>>> not synced. So, when failover is triggered FMS couldn't take any
> >> decision.
> >>>>>>> So, #1334 was reported.
> >>>>>> [AndersW] I agree that the issue reported in ticket [#1334] is a bug.
> >>>>>> The observed problem is that the standby SC does not go for a
> >>>>>> reboot immediately in the case when an SC failover happens before
> >>>>>> the cold sync is complete. It should reboot immediately. What I
> >>>>>> am questioning is the solution to this problem. The implemented
> >>>>>> solution delays the ack from amfd to NID until the cold sync is
> >>>>>> complete, so that the logic in FM will work as intended and
> >>>>>> reboot the node. My proposal is to instead solve this problem by
> >>>>>> initiating the reboot directly from
> >> AMFD.
> >>>>>>> Now the question was
> >>>>>>> 1. whether Amfd can ack NID that it is initialized and ready for
> >>>>>>> Assignment/failover without syncing.
> >>>>>> [AndersW] The question here is what an ack to NID actually means
> >>>>>> (or should mean). Ideally, I think we should aim to move away
> >>>>>> from using NID for serializing the OpenSAF startup and instead
> >>>>>> solve dependencies between OpenSAF services using TRY_AGAIN
> loops
> >>>>>> or
> >> similar
> >>>> constructs.
> >>>>>> I.e. the same philosophy as used in systemd. By delaying the
> >>>>>> reply to NID we are moving in the opposite direction.
> >>>>>>> 2. Whether FMS logic can be changed.
> >>>>>> [AndersW] My proposal wasn't really to change the logic in FM,
> >>>>>> but rather to order the reboot from the OpenSAF director (amfd in
> >>>>>> this
> >>>>>> case) when a failover happens before the cold sync is complete.
> >>>>>> The logic in FM doesn't have to be changed, but if all directors
> >>>>>> have been fixed in this way then the logic in FM would be
> >>>>>> redundant and
> >> could
> >>>> be removed.
> >>>>>>> I thought Amfd need to ack NID when it completes sync.
> >>>>>>>
> >>>>>>> Whether 1. it increased Startuptime of Standby controller or 2.
> >>>>>>> it actually corrected the errata of measurement, is subject to
> >>>>>>> interpretation.
> >>>>>> [AndersW] I think the startup time is delayed by the current
> >>>>>> solution to this ticket. Look at it from the perspective of an
> >>>>>> application component which is started by AMF on the standby SC,
> >>>>>> and measure
> >> how
> >>>>>> long time it takes until this component gets an assignment after
> >>>>>> a
> >> reboot.
> >>>>>>> In my opinion, #2 is correct. If Amfd is not ready, it should
> >>>>>>> not have sent the
> >>>>>> Ack.
> >>>>>>> So, it was a bug.
> >>>>>>>
> >>>>>>>> 2) Isn't this problem applicable to all OpenSAF directors, and
> >>>>>>>> not just AMFD? Has anyone analyzed the other services to
> >>>>>>>> determine if they also need a similar fix? Since NID starts the
> >>>>>>>> services in serial and doesn't continue with the next service
> >>>>>>>> before it has received a reply from the previous one, it would
> >>>>>>>> mean that with the chosen solution all OpenSAF directors will
> >>>>>>>> be cold synced in serial. With the other solution, they can be cold
> synced in parallel.
> >>>>>>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs
> >>>>>>> correction
> >>>>>> and hence startup time
> >>>>>>> will increase more.
> >>>>>> [AndersW] Ok, thanks for checking! So then we will eventually
> >>>>>> have to fix them as well. But again I would like to point out
> >>>>>> that if we go for the solution that has been implemented in AMFD
> >>>>>> then the OpenSAF cold sync will be serialized. With the solution
> >>>>>> that I propose, the cold sync can be done in parallel. By syncing
> >>>>>> the OpenSAF directors in parallel we get a shorter time window
> >>>>>> when the system is without a fully synced standby SC, and hence
> >>>>>> improve the availability of the
> >> system.
> >>>>>>>> / Anders Widell
> >>>>>>>>
> >>>>>>>> On 04/27/2015 11:19 AM, [email protected] wrote:
> >>>>>>>>> Summary: amfd: respond to nid only after initialization is
> >>>>>>>>> completed
> >>>>>> [#1334]
> >>>>>>>>> Review request for Trac Ticket(s): #1334 Peer Reviewer(s):
> >>>>>>>>> Mathi, Hans N, Praveen Pull request to: <<LIST THE PERSON WITH
> >>>>>>>>> PUSH ACCESS HERE>> Affected branch(es): All Development
> branch:
> >> Default
> >>>>>>>>> --------------------------------
> >>>>>>>>> Impacted area       Impact y/n
> >>>>>>>>> --------------------------------
> >>>>>>>>>       Docs                    n
> >>>>>>>>>       Build system            n
> >>>>>>>>>       RPM/packaging           n
> >>>>>>>>>       Configuration files     n
> >>>>>>>>>       Startup scripts         n
> >>>>>>>>>       SAF services            y
> >>>>>>>>>       OpenSAF services        n
> >>>>>>>>>       Core libraries          n
> >>>>>>>>>       Samples                 n
> >>>>>>>>>       Tests                   n
> >>>>>>>>>       Other                   n
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Comments (indicate scope for each "y" above):
> >>>>>>>>> ---------------------------------------------
> >>>>>>>>>       <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
> >>>>>>>>>
> >>>>>>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda
> >>>>>>>>> Author:     Nagendra Kumar<[email protected]>
> >>>>>>>>> Date:       Mon, 27 Apr 2015 14:41:02 +0530
> >>>>>>>>>
> >>>>>>>>>     amfd: respond to nid only after initialization is completed
> >>>>>>>>> [#1334] Act
> >>>>>>>> Amfd
> >>>>>>>>>     initialization is said to be completed when it completes its
> >>>>>>>> initialization
> >>>>>>>>>     with imm. Apart from initializing with imm, Standby Amfd
> >> also
> >>>>>>>>> need
> >>>>>>>> to get
> >>>>>>>>>     run time data from Act Amfd using cold sync. So, Standby
> >> Amfd
> >>>>>>>> initialization
> >>>>>>>>>     is said to be completed when it completes its initialization
> >> with
> >>>>>>>>> imm
> >>>>>>>> and it
> >>>>>>>>>     completes its cold sync with Act Amfd. In the present code,
> >> Standby is
> >>>>>>>>>     sending response to nid without cold sync complete. So, code
> >> has
> >>>>>>>> been added
> >>>>>>>>>     to send nid response only when Amfd completes its
> >> initialization.
> >>>>>>>>>
> >>>>>>>>> Complete diffstat:
> >>>>>>>>> ------------------
> >>>>>>>>>       osaf/services/saf/amf/amfd/chkop.cc |  4 ++++
> >>>>>>>>>       osaf/services/saf/amf/amfd/main.cc  |  6 +++++-
> >>>>>>>>>       2 files changed, 9 insertions(+), 1 deletions(-)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Testing Commands:
> >>>>>>>>> -----------------
> >>>>>>>>> 1. Start Act controller SC-1.
> >>>>>>>>>         Start Standby controller SC-2 and stop SC-1 when
> >>>>>>>>>         Standby Amfd is in the middle of cold sync.
> >>>>>>>>> 2. Start Act controller SC-1.
> >>>>>>>>>         Start Standby controller SC-2 and stop SC-1 when
> >>>>>>>>>         Standby Amfd has completed cold sync but Amfnd has not
> >> asigned
> >>>>>>>>>         role to Fmd at Standby controller.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Testing, Expected Results:
> >>>>>>>>> --------------------------
> >>>>>>>>> 1. Fmd reboots the node :
> >>>>>>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF
> >> NodeId =
> >>>>>>>>> 0
> >>>>>> EE
> >>>>>>>> Name = No EE Mapped, Reason: Failover occurred, but this node
> >>>>>>>> is not yet ready, OwnNodeId = 131599, SupervisionTime = 60
> >>>>>>>>> 2. Same as above.
> >>>>>>>>>
> >>>>>>>>> Conditions of Submission:
> >>>>>>>>> -------------------------
> >>>>>>>>> Ack from peer reviewers.
> >>>>>>>>>
> >>>>>>>>> Arch      Built     Started    Linux distro
> >>>>>>>>> -------------------------------------------
> >>>>>>>>> mips        n          n
> >>>>>>>>> mips64      n          n
> >>>>>>>>> x86         n          n
> >>>>>>>>> x86_64      y          y
> >>>>>>>>> powerpc     n          n
> >>>>>>>>> powerpc64   n          n
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Reviewer Checklist:
> >>>>>>>>> -------------------
> >>>>>>>>> [Submitters: make sure that your review doesn't trigger any
> >>>>>>>>> checkmarks!]
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Your checkin has not passed review because (see checked entries):
> >>>>>>>>>
> >>>>>>>>> ___ Your RR template is generally incomplete; it has too many
> >>>>>>>>> blank
> >>>>>> entries
> >>>>>>>>>          that need proper data filled in.
> >>>>>>>>>
> >>>>>>>>> ___ You have failed to nominate the proper persons for review
> >>>>>>>>> and
> >>>> push.
> >>>>>>>>> ___ Your patches do not have proper short+long header
> >>>>>>>>>
> >>>>>>>>> ___ You have grammar/spelling in your header that is
> unacceptable.
> >>>>>>>>>
> >>>>>>>>> ___ You have exceeded a sensible line length in your
> >>>>>>>> headers/comments/text.
> >>>>>>>>> ___ You have failed to put in a proper Trac Ticket # into your
> >> commits.
> >>>>>>>>> ___ You have incorrectly put/left internal data in your
> >> comments/files
> >>>>>>>>>          (i.e. internal bug tracking tool IDs, product names
> >>>>>>>>> etc)
> >>>>>>>>>
> >>>>>>>>> ___ You have not given any evidence of testing beyond basic
> >>>>>>>>> build
> >>>> tests.
> >>>>>>>>>          Demonstrate some level of runtime or other sanity testing.
> >>>>>>>>>
> >>>>>>>>> ___ You have ^M present in some of your files. These have to
> >>>>>>>>> be
> >>>>>> removed.
> >>>>>>>>> ___ You have needlessly changed whitespace or added whitespace
> >>>> crimes
> >>>>>>>>>          like trailing spaces, or spaces before tabs.
> >>>>>>>>>
> >>>>>>>>> ___ You have mixed real technical changes with whitespace and
> >> other
> >>>>>>>>>          cosmetic code cleanup changes. These have to be
> >>>>>>>>> separate
> >>>> commits.
> >>>>>>>>> ___ You need to refactor your submission into logical chunks;
> >>>>>>>>> there
> >> is
> >>>>>>>>>          too much content into a single commit.
> >>>>>>>>>
> >>>>>>>>> ___ You have extraneous garbage in your review (merge commits
> >>>>>>>>> etc)
> >>>>>>>>>
> >>>>>>>>> ___ You have giant attachments which should never have been
> sent;
> >>>>>>>>>          Instead you should place your content in a public
> >>>>>>>>> tree to be
> >> pulled.
> >>>>>>>>> ___ You have too many commits attached to an e-mail; resend as
> >>>>>> threaded
> >>>>>>>>>          commits, or place in a public tree for a pull.
> >>>>>>>>>
> >>>>>>>>> ___ You have resent this content multiple times without a
> >>>>>>>>> clear
> >>>>>> indication
> >>>>>>>>>          of what has changed between each re-send.
> >>>>>>>>>
> >>>>>>>>> ___ You have failed to adequately and individually address all of
> the
> >>>>>>>>>          comments and change requests that were proposed in
> >>>>>>>>> the initial
> >>>>>> review.
> >>>>>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username,
> >>>>>>>>> email
> >>>>>>>>> etc)
> >>>>>>>>>
> >>>>>>>>> ___ Your computer have a badly configured date and time;
> >>>>>>>>> confusing
> >>>> the
> >>>>>>>>>          the threaded patch review.
> >>>>>>>>>
> >>>>>>>>> ___ Your changes affect IPC mechanism, and you don't present
> >>>>>>>>> any
> >>>> results
> >>>>>>>>>          for in-service upgradability test.
> >>>>>>>>>
> >>>>>>>>> ___ Your changes affect user manual and documentation, your
> >>>>>>>>> patch
> >>>>>> series
> >>>>>>>>>          do not contain the patch that updates the Doxygen manual.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --------------------------------------------------------------
> >>>>>>>>> ----
> >>>>>>>>> ------------ One dashboard for servers and applications across
> >>>>>>>>> Physical-Virtual-Cloud Widest out-of-the-box monitoring
> >>>>>>>>> support with 50+ applications Performance metrics, stats and
> >>>>>>>>> reports that give you Actionable Insights Deep dive visibility
> >>>>>>>>> with transaction tracing using APM Insight.
> >>>>>>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Opensaf-devel mailing list
> >>>>>>>>> [email protected]
> >>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
> >>>>>>>>>
> >>>>>>>>>
> 

Attachment: 1334_modified.patch
Description: Binary data

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to