I will bring up this issue on our next TLC meeting. The immediate 
concern here is the prolonged startup time, which can be regarded as a 
performance regression in AMF. The long-term concern is the alignment of 
the solution with future the direction of OpenSAF (removal of FM has 
been discussed).

regards,
Anders Widell

On 09/11/2015 11:50 AM, Nagendra Kumar wrote:
> Hi Anders,
>               Since you are talking about making FMS redundant and its 
> removal and moving from NID to system and since this problem is existing in 
> other pre Amf components(Amfnd actually), so I am being indecisive to 
> response. May be you can talk in TLC and conclude and raise a ticket and 
> prepare a roadmap for doing so. As a maintainer, respective people will fix 
> this issue in their services.
>
> Thanks
> -Nagu
>> -----Original Message-----
>> From: Anders Widell [mailto:[email protected]]
>> Sent: 10 September 2015 20:53
>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
>> Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid
>> only after initialization is completed [#1334]
>>
>> Hi
>>
>> See my replies marked with [AndersW] inline.
>>
>> thanks,
>> Anders Widell
>>
>> On 09/10/2015 01:27 PM, Nagendra Kumar wrote:
>>> Hi Anders,
>>>             Thanks for pointing out. Please find my responses inlined with
>> [Nagu].
>>> Thanks
>>> -Nagu
>>>> -----Original Message-----
>>>> From: Anders Widell [mailto:[email protected]]
>>>> Sent: 08 September 2015 15:44
>>>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
>>>> Mathivanan Naickan Palanivelu
>>>> Cc: [email protected]
>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to
>> nid
>>>> only after initialization is completed [#1334]
>>>>
>>>> Hi!
>>>>
>>>> I have a couple of questing regarding this ticket:
>>>>
>>>> 1) Why did you chose a solution where you delay the reply to NID until
>>>> amfd has completed its cold sync, instead of a solution where amfd would
>>>> itself initiate a node reboot if a failover happens before it has
>>>> completed the cold sync? The chosen solution will delay the startup time
>>>> of the standby controller, so there is a clear (and visible, since we
>>>> are measuring startup time) disadvantage with this solution. Is there
>>>> any advantage with it, or disadvantage with the other solution?
>>> [Nagu]: If you could read my comment in #1334 ticket, it will point out a
>> flaw
>>> in handling failover by FMS in case of cold sync in progress(Amfd).
>>> FMS collects evidence of Cold sync complete by checking whether FMS has
>>> got CSI assignment or not. Since Amfd has acknowledged NID that he has
>> completed
>>> its initialization(without Cold sync completed) and hence NID started Amfnd
>>> and Amfnd took configuration from Act Amfd and assigned CSI to FMS. Till
>> here Amfd is
>>> not synced. So, when failover is triggered FMS couldn't take any decision.
>>> So, #1334 was reported.
>> [AndersW] I agree that the issue reported in ticket [#1334] is a bug.
>> The observed problem is that the standby SC does not go for a reboot
>> immediately in the case when an SC failover happens before the cold sync
>> is complete. It should reboot immediately. What I am questioning is the
>> solution to this problem. The implemented solution delays the ack from
>> amfd to NID until the cold sync is complete, so that the logic in FM
>> will work as intended and reboot the node. My proposal is to instead
>> solve this problem by initiating the reboot directly from AMFD.
>>> Now the question was
>>> 1. whether Amfd can ack NID that it is initialized and ready for
>>> Assignment/failover without syncing.
>> [AndersW] The question here is what an ack to NID actually means (or
>> should mean). Ideally, I think we should aim to move away from using NID
>> for serializing the OpenSAF startup and instead solve dependencies
>> between OpenSAF services using TRY_AGAIN loops or similar constructs.
>> I.e. the same philosophy as used in systemd. By delaying the reply to
>> NID we are moving in the opposite direction.
>>> 2. Whether FMS logic can be changed.
>> [AndersW] My proposal wasn't really to change the logic in FM, but
>> rather to order the reboot from the OpenSAF director (amfd in this case)
>> when a failover happens before the cold sync is complete. The logic in
>> FM doesn't have to be changed, but if all directors have been fixed in
>> this way then the logic in FM would be redundant and could be removed.
>>> I thought Amfd need to ack NID when it completes sync.
>>>
>>> Whether 1. it increased Startuptime of Standby controller or
>>> 2. it actually corrected the errata of measurement,
>>> is subject to interpretation.
>> [AndersW] I think the startup time is delayed by the current solution to
>> this ticket. Look at it from the perspective of an application component
>> which is started by AMF on the standby SC, and measure how long time it
>> takes until this component gets an assignment after a reboot.
>>> In my opinion, #2 is correct. If Amfd is not ready, it should not have sent 
>>> the
>> Ack.
>>> So, it was a bug.
>>>
>>>> 2) Isn't this problem applicable to all OpenSAF directors, and not just
>>>> AMFD? Has anyone analyzed the other services to determine if they also
>>>> need a similar fix? Since NID starts the services in serial and doesn't
>>>> continue with the next service before it has received a reply from the
>>>> previous one, it would mean that with the chosen solution all OpenSAF
>>>> directors will be cold synced in serial. With the other solution, they
>>>> can be cold synced in parallel.
>>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs correction
>> and hence startup time
>>> will increase more.
>> [AndersW] Ok, thanks for checking! So then we will eventually have to
>> fix them as well. But again I would like to point out that if we go for
>> the solution that has been implemented in AMFD then the OpenSAF cold
>> sync will be serialized. With the solution that I propose, the cold sync
>> can be done in parallel. By syncing the OpenSAF directors in parallel we
>> get a shorter time window when the system is without a fully synced
>> standby SC, and hence improve the availability of the system.
>>>> / Anders Widell
>>>>
>>>> On 04/27/2015 11:19 AM, [email protected] wrote:
>>>>> Summary: amfd: respond to nid only after initialization is completed
>> [#1334]
>>>>> Review request for Trac Ticket(s): #1334
>>>>> Peer Reviewer(s): Mathi, Hans N, Praveen
>>>>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
>>>>> Affected branch(es): All
>>>>> Development branch: Default
>>>>>
>>>>> --------------------------------
>>>>> Impacted area       Impact y/n
>>>>> --------------------------------
>>>>>     Docs                    n
>>>>>     Build system            n
>>>>>     RPM/packaging           n
>>>>>     Configuration files     n
>>>>>     Startup scripts         n
>>>>>     SAF services            y
>>>>>     OpenSAF services        n
>>>>>     Core libraries          n
>>>>>     Samples                 n
>>>>>     Tests                   n
>>>>>     Other                   n
>>>>>
>>>>>
>>>>> Comments (indicate scope for each "y" above):
>>>>> ---------------------------------------------
>>>>>     <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
>>>>>
>>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda
>>>>> Author:   Nagendra Kumar<[email protected]>
>>>>> Date:     Mon, 27 Apr 2015 14:41:02 +0530
>>>>>
>>>>>   amfd: respond to nid only after initialization is completed [#1334] Act
>>>> Amfd
>>>>>   initialization is said to be completed when it completes its
>>>> initialization
>>>>>   with imm. Apart from initializing with imm, Standby Amfd also need
>>>> to get
>>>>>   run time data from Act Amfd using cold sync. So, Standby Amfd
>>>> initialization
>>>>>   is said to be completed when it completes its initialization with imm
>>>> and it
>>>>>   completes its cold sync with Act Amfd. In the present code, Standby is
>>>>>   sending response to nid without cold sync complete. So, code has
>>>> been added
>>>>>   to send nid response only when Amfd completes its initialization.
>>>>>
>>>>>
>>>>> Complete diffstat:
>>>>> ------------------
>>>>>     osaf/services/saf/amf/amfd/chkop.cc |  4 ++++
>>>>>     osaf/services/saf/amf/amfd/main.cc  |  6 +++++-
>>>>>     2 files changed, 9 insertions(+), 1 deletions(-)
>>>>>
>>>>>
>>>>> Testing Commands:
>>>>> -----------------
>>>>> 1. Start Act controller SC-1.
>>>>>       Start Standby controller SC-2 and stop SC-1 when
>>>>>       Standby Amfd is in the middle of cold sync.
>>>>> 2. Start Act controller SC-1.
>>>>>       Start Standby controller SC-2 and stop SC-1 when
>>>>>       Standby Amfd has completed cold sync but Amfnd has not asigned
>>>>>       role to Fmd at Standby controller.
>>>>>
>>>>>
>>>>> Testing, Expected Results:
>>>>> --------------------------
>>>>> 1. Fmd reboots the node :
>>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF NodeId = 0
>> EE
>>>> Name = No EE Mapped, Reason: Failover occurred, but this node is not yet
>>>> ready, OwnNodeId = 131599, SupervisionTime = 60
>>>>> 2. Same as above.
>>>>>
>>>>> Conditions of Submission:
>>>>> -------------------------
>>>>> Ack from peer reviewers.
>>>>>
>>>>> Arch      Built     Started    Linux distro
>>>>> -------------------------------------------
>>>>> mips        n          n
>>>>> mips64      n          n
>>>>> x86         n          n
>>>>> x86_64      y          y
>>>>> powerpc     n          n
>>>>> powerpc64   n          n
>>>>>
>>>>>
>>>>> Reviewer Checklist:
>>>>> -------------------
>>>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>>>
>>>>>
>>>>> Your checkin has not passed review because (see checked entries):
>>>>>
>>>>> ___ Your RR template is generally incomplete; it has too many blank
>> entries
>>>>>        that need proper data filled in.
>>>>>
>>>>> ___ You have failed to nominate the proper persons for review and push.
>>>>>
>>>>> ___ Your patches do not have proper short+long header
>>>>>
>>>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>>>
>>>>> ___ You have exceeded a sensible line length in your
>>>> headers/comments/text.
>>>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>>>
>>>>> ___ You have incorrectly put/left internal data in your comments/files
>>>>>        (i.e. internal bug tracking tool IDs, product names etc)
>>>>>
>>>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>>>        Demonstrate some level of runtime or other sanity testing.
>>>>>
>>>>> ___ You have ^M present in some of your files. These have to be
>> removed.
>>>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>>>        like trailing spaces, or spaces before tabs.
>>>>>
>>>>> ___ You have mixed real technical changes with whitespace and other
>>>>>        cosmetic code cleanup changes. These have to be separate commits.
>>>>>
>>>>> ___ You need to refactor your submission into logical chunks; there is
>>>>>        too much content into a single commit.
>>>>>
>>>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>>>
>>>>> ___ You have giant attachments which should never have been sent;
>>>>>        Instead you should place your content in a public tree to be 
>>>>> pulled.
>>>>>
>>>>> ___ You have too many commits attached to an e-mail; resend as
>> threaded
>>>>>        commits, or place in a public tree for a pull.
>>>>>
>>>>> ___ You have resent this content multiple times without a clear
>> indication
>>>>>        of what has changed between each re-send.
>>>>>
>>>>> ___ You have failed to adequately and individually address all of the
>>>>>        comments and change requests that were proposed in the initial
>> review.
>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>>>
>>>>> ___ Your computer have a badly configured date and time; confusing the
>>>>>        the threaded patch review.
>>>>>
>>>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>>>        for in-service upgradability test.
>>>>>
>>>>> ___ Your changes affect user manual and documentation, your patch
>> series
>>>>>        do not contain the patch that updates the Doxygen manual.
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> One dashboard for servers and applications across Physical-Virtual-Cloud
>>>>> Widest out-of-the-box monitoring support with 50+ applications
>>>>> Performance metrics, stats and reports that give you Actionable Insights
>>>>> Deep dive visibility with transaction tracing using APM Insight.
>>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>>>>> _______________________________________________
>>>>> Opensaf-devel mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>>
>>>>>


------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to