Hi Anders,
                Since you are talking about making FMS redundant and its 
removal and moving from NID to system and since this problem is existing in 
other pre Amf components(Amfnd actually), so I am being indecisive to response. 
May be you can talk in TLC and conclude and raise a ticket and prepare a 
roadmap for doing so. As a maintainer, respective people will fix this issue in 
their services.

Thanks
-Nagu
> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: 10 September 2015 20:53
> To: Nagendra Kumar; [email protected]; Praveen Malviya;
> Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid
> only after initialization is completed [#1334]
> 
> Hi
> 
> See my replies marked with [AndersW] inline.
> 
> thanks,
> Anders Widell
> 
> On 09/10/2015 01:27 PM, Nagendra Kumar wrote:
> > Hi Anders,
> >             Thanks for pointing out. Please find my responses inlined with
> [Nagu].
> >
> > Thanks
> > -Nagu
> >> -----Original Message-----
> >> From: Anders Widell [mailto:[email protected]]
> >> Sent: 08 September 2015 15:44
> >> To: Nagendra Kumar; [email protected]; Praveen Malviya;
> >> Mathivanan Naickan Palanivelu
> >> Cc: [email protected]
> >> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to
> nid
> >> only after initialization is completed [#1334]
> >>
> >> Hi!
> >>
> >> I have a couple of questing regarding this ticket:
> >>
> >> 1) Why did you chose a solution where you delay the reply to NID until
> >> amfd has completed its cold sync, instead of a solution where amfd would
> >> itself initiate a node reboot if a failover happens before it has
> >> completed the cold sync? The chosen solution will delay the startup time
> >> of the standby controller, so there is a clear (and visible, since we
> >> are measuring startup time) disadvantage with this solution. Is there
> >> any advantage with it, or disadvantage with the other solution?
> > [Nagu]: If you could read my comment in #1334 ticket, it will point out a
> flaw
> > in handling failover by FMS in case of cold sync in progress(Amfd).
> > FMS collects evidence of Cold sync complete by checking whether FMS has
> > got CSI assignment or not. Since Amfd has acknowledged NID that he has
> completed
> > its initialization(without Cold sync completed) and hence NID started Amfnd
> > and Amfnd took configuration from Act Amfd and assigned CSI to FMS. Till
> here Amfd is
> > not synced. So, when failover is triggered FMS couldn't take any decision.
> > So, #1334 was reported.
> [AndersW] I agree that the issue reported in ticket [#1334] is a bug.
> The observed problem is that the standby SC does not go for a reboot
> immediately in the case when an SC failover happens before the cold sync
> is complete. It should reboot immediately. What I am questioning is the
> solution to this problem. The implemented solution delays the ack from
> amfd to NID until the cold sync is complete, so that the logic in FM
> will work as intended and reboot the node. My proposal is to instead
> solve this problem by initiating the reboot directly from AMFD.
> > Now the question was
> > 1. whether Amfd can ack NID that it is initialized and ready for
> > Assignment/failover without syncing.
> [AndersW] The question here is what an ack to NID actually means (or
> should mean). Ideally, I think we should aim to move away from using NID
> for serializing the OpenSAF startup and instead solve dependencies
> between OpenSAF services using TRY_AGAIN loops or similar constructs.
> I.e. the same philosophy as used in systemd. By delaying the reply to
> NID we are moving in the opposite direction.
> > 2. Whether FMS logic can be changed.
> [AndersW] My proposal wasn't really to change the logic in FM, but
> rather to order the reboot from the OpenSAF director (amfd in this case)
> when a failover happens before the cold sync is complete. The logic in
> FM doesn't have to be changed, but if all directors have been fixed in
> this way then the logic in FM would be redundant and could be removed.
> >
> > I thought Amfd need to ack NID when it completes sync.
> >
> > Whether 1. it increased Startuptime of Standby controller or
> > 2. it actually corrected the errata of measurement,
> > is subject to interpretation.
> [AndersW] I think the startup time is delayed by the current solution to
> this ticket. Look at it from the perspective of an application component
> which is started by AMF on the standby SC, and measure how long time it
> takes until this component gets an assignment after a reboot.
> >
> > In my opinion, #2 is correct. If Amfd is not ready, it should not have sent 
> > the
> Ack.
> > So, it was a bug.
> >
> >> 2) Isn't this problem applicable to all OpenSAF directors, and not just
> >> AMFD? Has anyone analyzed the other services to determine if they also
> >> need a similar fix? Since NID starts the services in serial and doesn't
> >> continue with the next service before it has received a reply from the
> >> previous one, it would mean that with the chosen solution all OpenSAF
> >> directors will be cold synced in serial. With the other solution, they
> >> can be cold synced in parallel.
> > [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs correction
> and hence startup time
> > will increase more.
> [AndersW] Ok, thanks for checking! So then we will eventually have to
> fix them as well. But again I would like to point out that if we go for
> the solution that has been implemented in AMFD then the OpenSAF cold
> sync will be serialized. With the solution that I propose, the cold sync
> can be done in parallel. By syncing the OpenSAF directors in parallel we
> get a shorter time window when the system is without a fully synced
> standby SC, and hence improve the availability of the system.
> >
> >> / Anders Widell
> >>
> >> On 04/27/2015 11:19 AM, [email protected] wrote:
> >>> Summary: amfd: respond to nid only after initialization is completed
> [#1334]
> >>> Review request for Trac Ticket(s): #1334
> >>> Peer Reviewer(s): Mathi, Hans N, Praveen
> >>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
> >>> Affected branch(es): All
> >>> Development branch: Default
> >>>
> >>> --------------------------------
> >>> Impacted area       Impact y/n
> >>> --------------------------------
> >>>    Docs                    n
> >>>    Build system            n
> >>>    RPM/packaging           n
> >>>    Configuration files     n
> >>>    Startup scripts         n
> >>>    SAF services            y
> >>>    OpenSAF services        n
> >>>    Core libraries          n
> >>>    Samples                 n
> >>>    Tests                   n
> >>>    Other                   n
> >>>
> >>>
> >>> Comments (indicate scope for each "y" above):
> >>> ---------------------------------------------
> >>>    <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
> >>>
> >>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda
> >>> Author:   Nagendra Kumar<[email protected]>
> >>> Date:     Mon, 27 Apr 2015 14:41:02 +0530
> >>>
> >>>   amfd: respond to nid only after initialization is completed [#1334] Act
> >> Amfd
> >>>   initialization is said to be completed when it completes its
> >> initialization
> >>>   with imm. Apart from initializing with imm, Standby Amfd also need
> >> to get
> >>>   run time data from Act Amfd using cold sync. So, Standby Amfd
> >> initialization
> >>>   is said to be completed when it completes its initialization with imm
> >> and it
> >>>   completes its cold sync with Act Amfd. In the present code, Standby is
> >>>   sending response to nid without cold sync complete. So, code has
> >> been added
> >>>   to send nid response only when Amfd completes its initialization.
> >>>
> >>>
> >>> Complete diffstat:
> >>> ------------------
> >>>    osaf/services/saf/amf/amfd/chkop.cc |  4 ++++
> >>>    osaf/services/saf/amf/amfd/main.cc  |  6 +++++-
> >>>    2 files changed, 9 insertions(+), 1 deletions(-)
> >>>
> >>>
> >>> Testing Commands:
> >>> -----------------
> >>> 1. Start Act controller SC-1.
> >>>      Start Standby controller SC-2 and stop SC-1 when
> >>>      Standby Amfd is in the middle of cold sync.
> >>> 2. Start Act controller SC-1.
> >>>      Start Standby controller SC-2 and stop SC-1 when
> >>>      Standby Amfd has completed cold sync but Amfnd has not asigned
> >>>      role to Fmd at Standby controller.
> >>>
> >>>
> >>> Testing, Expected Results:
> >>> --------------------------
> >>> 1. Fmd reboots the node :
> >>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF NodeId = 0
> EE
> >> Name = No EE Mapped, Reason: Failover occurred, but this node is not yet
> >> ready, OwnNodeId = 131599, SupervisionTime = 60
> >>> 2. Same as above.
> >>>
> >>> Conditions of Submission:
> >>> -------------------------
> >>> Ack from peer reviewers.
> >>>
> >>> Arch      Built     Started    Linux distro
> >>> -------------------------------------------
> >>> mips        n          n
> >>> mips64      n          n
> >>> x86         n          n
> >>> x86_64      y          y
> >>> powerpc     n          n
> >>> powerpc64   n          n
> >>>
> >>>
> >>> Reviewer Checklist:
> >>> -------------------
> >>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
> >>>
> >>>
> >>> Your checkin has not passed review because (see checked entries):
> >>>
> >>> ___ Your RR template is generally incomplete; it has too many blank
> entries
> >>>       that need proper data filled in.
> >>>
> >>> ___ You have failed to nominate the proper persons for review and push.
> >>>
> >>> ___ Your patches do not have proper short+long header
> >>>
> >>> ___ You have grammar/spelling in your header that is unacceptable.
> >>>
> >>> ___ You have exceeded a sensible line length in your
> >> headers/comments/text.
> >>> ___ You have failed to put in a proper Trac Ticket # into your commits.
> >>>
> >>> ___ You have incorrectly put/left internal data in your comments/files
> >>>       (i.e. internal bug tracking tool IDs, product names etc)
> >>>
> >>> ___ You have not given any evidence of testing beyond basic build tests.
> >>>       Demonstrate some level of runtime or other sanity testing.
> >>>
> >>> ___ You have ^M present in some of your files. These have to be
> removed.
> >>>
> >>> ___ You have needlessly changed whitespace or added whitespace crimes
> >>>       like trailing spaces, or spaces before tabs.
> >>>
> >>> ___ You have mixed real technical changes with whitespace and other
> >>>       cosmetic code cleanup changes. These have to be separate commits.
> >>>
> >>> ___ You need to refactor your submission into logical chunks; there is
> >>>       too much content into a single commit.
> >>>
> >>> ___ You have extraneous garbage in your review (merge commits etc)
> >>>
> >>> ___ You have giant attachments which should never have been sent;
> >>>       Instead you should place your content in a public tree to be pulled.
> >>>
> >>> ___ You have too many commits attached to an e-mail; resend as
> threaded
> >>>       commits, or place in a public tree for a pull.
> >>>
> >>> ___ You have resent this content multiple times without a clear
> indication
> >>>       of what has changed between each re-send.
> >>>
> >>> ___ You have failed to adequately and individually address all of the
> >>>       comments and change requests that were proposed in the initial
> review.
> >>>
> >>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
> >>>
> >>> ___ Your computer have a badly configured date and time; confusing the
> >>>       the threaded patch review.
> >>>
> >>> ___ Your changes affect IPC mechanism, and you don't present any results
> >>>       for in-service upgradability test.
> >>>
> >>> ___ Your changes affect user manual and documentation, your patch
> series
> >>>       do not contain the patch that updates the Doxygen manual.
> >>>
> >>>
> >>> ------------------------------------------------------------------------------
> >>> One dashboard for servers and applications across Physical-Virtual-Cloud
> >>> Widest out-of-the-box monitoring support with 50+ applications
> >>> Performance metrics, stats and reports that give you Actionable Insights
> >>> Deep dive visibility with transaction tracing using APM Insight.
> >>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> >>> _______________________________________________
> >>> Opensaf-devel mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
> >>>
> >>>
> 

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to