Hi

See my replies marked with [AndersW] inline.

thanks,
Anders Widell

On 09/10/2015 01:27 PM, Nagendra Kumar wrote:
> Hi Anders,
>               Thanks for pointing out. Please find my responses inlined with 
> [Nagu].
>
> Thanks
> -Nagu
>> -----Original Message-----
>> From: Anders Widell [mailto:[email protected]]
>> Sent: 08 September 2015 15:44
>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
>> Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid
>> only after initialization is completed [#1334]
>>
>> Hi!
>>
>> I have a couple of questing regarding this ticket:
>>
>> 1) Why did you chose a solution where you delay the reply to NID until
>> amfd has completed its cold sync, instead of a solution where amfd would
>> itself initiate a node reboot if a failover happens before it has
>> completed the cold sync? The chosen solution will delay the startup time
>> of the standby controller, so there is a clear (and visible, since we
>> are measuring startup time) disadvantage with this solution. Is there
>> any advantage with it, or disadvantage with the other solution?
> [Nagu]: If you could read my comment in #1334 ticket, it will point out a flaw
> in handling failover by FMS in case of cold sync in progress(Amfd).
> FMS collects evidence of Cold sync complete by checking whether FMS has
> got CSI assignment or not. Since Amfd has acknowledged NID that he has 
> completed
> its initialization(without Cold sync completed) and hence NID started Amfnd
> and Amfnd took configuration from Act Amfd and assigned CSI to FMS. Till here 
> Amfd is
> not synced. So, when failover is triggered FMS couldn't take any decision.
> So, #1334 was reported.
[AndersW] I agree that the issue reported in ticket [#1334] is a bug. 
The observed problem is that the standby SC does not go for a reboot 
immediately in the case when an SC failover happens before the cold sync 
is complete. It should reboot immediately. What I am questioning is the 
solution to this problem. The implemented solution delays the ack from 
amfd to NID until the cold sync is complete, so that the logic in FM 
will work as intended and reboot the node. My proposal is to instead 
solve this problem by initiating the reboot directly from AMFD.
> Now the question was
> 1. whether Amfd can ack NID that it is initialized and ready for
> Assignment/failover without syncing.
[AndersW] The question here is what an ack to NID actually means (or 
should mean). Ideally, I think we should aim to move away from using NID 
for serializing the OpenSAF startup and instead solve dependencies 
between OpenSAF services using TRY_AGAIN loops or similar constructs. 
I.e. the same philosophy as used in systemd. By delaying the reply to 
NID we are moving in the opposite direction.
> 2. Whether FMS logic can be changed.
[AndersW] My proposal wasn't really to change the logic in FM, but 
rather to order the reboot from the OpenSAF director (amfd in this case) 
when a failover happens before the cold sync is complete. The logic in 
FM doesn't have to be changed, but if all directors have been fixed in 
this way then the logic in FM would be redundant and could be removed.
>
> I thought Amfd need to ack NID when it completes sync.
>
> Whether 1. it increased Startuptime of Standby controller or
> 2. it actually corrected the errata of measurement,
> is subject to interpretation.
[AndersW] I think the startup time is delayed by the current solution to 
this ticket. Look at it from the perspective of an application component 
which is started by AMF on the standby SC, and measure how long time it 
takes until this component gets an assignment after a reboot.
>
> In my opinion, #2 is correct. If Amfd is not ready, it should not have sent 
> the Ack.
> So, it was a bug.
>
>> 2) Isn't this problem applicable to all OpenSAF directors, and not just
>> AMFD? Has anyone analyzed the other services to determine if they also
>> need a similar fix? Since NID starts the services in serial and doesn't
>> continue with the next service before it has received a reply from the
>> previous one, it would mean that with the chosen solution all OpenSAF
>> directors will be cold synced in serial. With the other solution, they
>> can be cold synced in parallel.
> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs correction and 
> hence startup time
> will increase more.
[AndersW] Ok, thanks for checking! So then we will eventually have to 
fix them as well. But again I would like to point out that if we go for 
the solution that has been implemented in AMFD then the OpenSAF cold 
sync will be serialized. With the solution that I propose, the cold sync 
can be done in parallel. By syncing the OpenSAF directors in parallel we 
get a shorter time window when the system is without a fully synced 
standby SC, and hence improve the availability of the system.
>
>> / Anders Widell
>>
>> On 04/27/2015 11:19 AM, [email protected] wrote:
>>> Summary: amfd: respond to nid only after initialization is completed [#1334]
>>> Review request for Trac Ticket(s): #1334
>>> Peer Reviewer(s): Mathi, Hans N, Praveen
>>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
>>> Affected branch(es): All
>>> Development branch: Default
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>    Docs                    n
>>>    Build system            n
>>>    RPM/packaging           n
>>>    Configuration files     n
>>>    Startup scripts         n
>>>    SAF services            y
>>>    OpenSAF services        n
>>>    Core libraries          n
>>>    Samples                 n
>>>    Tests                   n
>>>    Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>>    <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
>>>
>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda
>>> Author:     Nagendra Kumar<[email protected]>
>>> Date:       Mon, 27 Apr 2015 14:41:02 +0530
>>>
>>>     amfd: respond to nid only after initialization is completed [#1334] Act
>> Amfd
>>>     initialization is said to be completed when it completes its
>> initialization
>>>     with imm. Apart from initializing with imm, Standby Amfd also need
>> to get
>>>     run time data from Act Amfd using cold sync. So, Standby Amfd
>> initialization
>>>     is said to be completed when it completes its initialization with imm
>> and it
>>>     completes its cold sync with Act Amfd. In the present code, Standby is
>>>     sending response to nid without cold sync complete. So, code has
>> been added
>>>     to send nid response only when Amfd completes its initialization.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>    osaf/services/saf/amf/amfd/chkop.cc |  4 ++++
>>>    osaf/services/saf/amf/amfd/main.cc  |  6 +++++-
>>>    2 files changed, 9 insertions(+), 1 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> 1. Start Act controller SC-1.
>>>      Start Standby controller SC-2 and stop SC-1 when
>>>      Standby Amfd is in the middle of cold sync.
>>> 2. Start Act controller SC-1.
>>>      Start Standby controller SC-2 and stop SC-1 when
>>>      Standby Amfd has completed cold sync but Amfnd has not asigned
>>>      role to Fmd at Standby controller.
>>>
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> 1. Fmd reboots the node :
>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF NodeId = 0 EE
>> Name = No EE Mapped, Reason: Failover occurred, but this node is not yet
>> ready, OwnNodeId = 131599, SupervisionTime = 60
>>> 2. Same as above.
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> Ack from peer reviewers.
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      y          y
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>       that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your
>> headers/comments/text.
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>       (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>       Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>       like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>       cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>       too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>       Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>       commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>       of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>       comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer have a badly configured date and time; confusing the
>>>       the threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>       for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>>       do not contain the patch that updates the Doxygen manual.
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> One dashboard for servers and applications across Physical-Virtual-Cloud
>>> Widest out-of-the-box monitoring support with 50+ applications
>>> Performance metrics, stats and reports that give you Actionable Insights
>>> Deep dive visibility with transaction tracing using APM Insight.
>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>>> _______________________________________________
>>> Opensaf-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>
>>>


------------------------------------------------------------------------------
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to