Thanks, we will try it out!

Btw I checked the patch and it only changes the function 
avd_role_failover(), but there is no corresponding change in the 
function avd_role_failover_qsd_actv(). Doesn't the latter function also 
have to be modified?

regards,
Anders Widell

On 09/11/2015 03:08 PM, Nagendra Kumar wrote:
> Hi Anders,
>               Please find the patch(1334_modified.patch) attached and confirm 
> if it works for you. Of course, the commit is subject to discussion among us 
> (including Mathi).
>
> Thanks
> -Nagu
>
>> -----Original Message-----
>> From: Anders Widell [mailto:[email protected]]
>> Sent: 11 September 2015 18:16
>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
>> Mathivanan Naickan Palanivelu
>> Cc: [email protected]
>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond to nid
>> only after initialization is completed [#1334]
>>
>> See inline.
>>
>> thanks,
>> Anders Widell
>>
>> On 09/11/2015 02:02 PM, Nagendra Kumar wrote:
>>> Thanks for the response.
>>> If I could, I will exclude the applications timing coming up on Standby
>> controller because I could install only management applications on it and 
>> that
>> too in Standby mode. There has been many instances (I had seen), where
>> Standby controller of some systems takes hours to come up, but it is not
>> considered in Cluster startup time because of Act controller is present and 
>> it
>> is working.
>> [AndersW] The time it takes for the standby to come up is also important,
>> since it counts towards your repair time and you wish to minimize the time
>> you are running without (or with reduced) redundancy.
>>> But anyway, it looks to me that your test case is very much affected with
>> this change, so what I can provide help is: I can send a patch which will 
>> revert
>> the changes in #1334 and add code for rebooting in Amfd in case of failover
>> during cold sync. Will that work for you?
>> [AndersW] That would be great!
>>> If all(Mathi is on vacation till Tuesday) agree then we can push that patch,
>> else take some other decision or continue with old(as old is already pushed).
>>> Thanks
>>> -Nagu
>>>> -----Original Message-----
>>>> From: Anders Widell [mailto:[email protected]]
>>>> Sent: 11 September 2015 16:39
>>>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
>>>> Mathivanan Naickan Palanivelu
>>>> Cc: [email protected]
>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: respond
>>>> to nid only after initialization is completed [#1334]
>>>>
>>>> See inline.
>>>>
>>>> / Anders Widell
>>>>
>>>> On 09/11/2015 12:45 PM, Nagendra Kumar wrote:
>>>>> Hi Anders,
>>>>>           Please check the comment on #1334 : "Amf can provide a
>>>> reboot in this case if fms don't care to handle it.".
>>>> [AndersW] Yes, I saw this comment and that's why I was wondering why
>>>> you ended up with a different solution.
>>>>> But as per discussion with FMS maintainer, Mathi, we have taken this
>>>> decision to allow handing FMS on system level faults and Amf should
>>>> mind its business with internal affairs. So, the solution was
>>>> proposed in the ticket
>>>> #1334 on 2015-04-27.
>>>>> Let's hear Mathi's opinion as well.
>>>>>
>>>>> Just a question: is there any reason which enforces inclusion of
>>>>> Standby's
>>>> startup time in Cluster startup time, when failover can't work?
>>>> [AndersW] There can be applications running on the system
>>>> controllers, and if you by "cluster startup time" mean the time until
>>>> applications are up and running on all nodes in the cluster, then
>>>> this time has increased with the current solution to ticket [#1334].
>>>> Also, my concern is that if we go for the same solution in the other
>>>> OpenSAF directors where we have the same fault, then we will
>>>> serialize the cold sync of OpenSAF directors and hence increase the
>>>> repair time of the OpenSAF system controllers.
>>>>> Thanks
>>>>> -Nagu
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>> Sent: 11 September 2015 15:50
>>>>>> To: Nagendra Kumar; [email protected]; Praveen Malviya;
>>>>>> Mathivanan Naickan Palanivelu
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd:
>>>>>> respond to
>>>> nid
>>>>>> only after initialization is completed [#1334]
>>>>>>
>>>>>> I will bring up this issue on our next TLC meeting. The immediate
>>>>>> concern here is the prolonged startup time, which can be regarded
>>>>>> as a
>>>> performance
>>>>>> regression in AMF. The long-term concern is the alignment of the
>>>>>> solution with future the direction of OpenSAF (removal of FM has been
>> discussed).
>>>>>> regards,
>>>>>> Anders Widell
>>>>>>
>>>>>> On 09/11/2015 11:50 AM, Nagendra Kumar wrote:
>>>>>>> Hi Anders,
>>>>>>>                 Since you are talking about making FMS redundant and its
>>>>>> removal and moving from NID to system and since this problem is
>>>>>> existing
>>>> in
>>>>>> other pre Amf components(Amfnd actually), so I am being indecisive
>>>>>> to response. May be you can talk in TLC and conclude and raise a
>>>>>> ticket and prepare a roadmap for doing so. As a maintainer,
>>>>>> respective people will fix this issue in their services.
>>>>>>> Thanks
>>>>>>> -Nagu
>>>>>>>> -----Original Message-----
>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>> Sent: 10 September 2015 20:53
>>>>>>>> To: Nagendra Kumar; [email protected]; Praveen
>> Malviya;
>>>>>>>> Mathivanan Naickan Palanivelu
>>>>>>>> Cc: [email protected]
>>>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd:
>>>>>>>> respond to nid only after initialization is completed [#1334]
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> See my replies marked with [AndersW] inline.
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> Anders Widell
>>>>>>>>
>>>>>>>> On 09/10/2015 01:27 PM, Nagendra Kumar wrote:
>>>>>>>>> Hi Anders,
>>>>>>>>>               Thanks for pointing out. Please find my responses
>> inlined with
>>>>>>>> [Nagu].
>>>>>>>>> Thanks
>>>>>>>>> -Nagu
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>>>> Sent: 08 September 2015 15:44
>>>>>>>>>> To: Nagendra Kumar; [email protected]; Praveen
>>>> Malviya;
>>>>>>>>>> Mathivanan Naickan Palanivelu
>>>>>>>>>> Cc: [email protected]
>>>>>>>>>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd:
>>>>>>>>>> respond to
>>>>>>>> nid
>>>>>>>>>> only after initialization is completed [#1334]
>>>>>>>>>>
>>>>>>>>>> Hi!
>>>>>>>>>>
>>>>>>>>>> I have a couple of questing regarding this ticket:
>>>>>>>>>>
>>>>>>>>>> 1) Why did you chose a solution where you delay the reply to
>>>>>>>>>> NID until amfd has completed its cold sync, instead of a
>>>>>>>>>> solution where amfd would itself initiate a node reboot if a
>>>>>>>>>> failover happens before it has completed the cold sync? The
>>>>>>>>>> chosen solution will delay the startup time of the standby
>>>>>>>>>> controller, so there is a clear (and visible, since we are
>>>>>>>>>> measuring startup time) disadvantage with this solution. Is
>>>>>>>>>> there any advantage with it, or
>>>>>> disadvantage with the other solution?
>>>>>>>>> [Nagu]: If you could read my comment in #1334 ticket, it will
>>>>>>>>> point out a
>>>>>>>> flaw
>>>>>>>>> in handling failover by FMS in case of cold sync in progress(Amfd).
>>>>>>>>> FMS collects evidence of Cold sync complete by checking whether
>>>>>>>>> FMS has got CSI assignment or not. Since Amfd has acknowledged
>>>>>>>>> NID that he has
>>>>>>>> completed
>>>>>>>>> its initialization(without Cold sync completed) and hence NID
>>>>>>>>> started Amfnd and Amfnd took configuration from Act Amfd and
>>>>>>>>> assigned CSI to FMS. Till
>>>>>>>> here Amfd is
>>>>>>>>> not synced. So, when failover is triggered FMS couldn't take any
>>>> decision.
>>>>>>>>> So, #1334 was reported.
>>>>>>>> [AndersW] I agree that the issue reported in ticket [#1334] is a bug.
>>>>>>>> The observed problem is that the standby SC does not go for a
>>>>>>>> reboot immediately in the case when an SC failover happens before
>>>>>>>> the cold sync is complete. It should reboot immediately. What I
>>>>>>>> am questioning is the solution to this problem. The implemented
>>>>>>>> solution delays the ack from amfd to NID until the cold sync is
>>>>>>>> complete, so that the logic in FM will work as intended and
>>>>>>>> reboot the node. My proposal is to instead solve this problem by
>>>>>>>> initiating the reboot directly from
>>>> AMFD.
>>>>>>>>> Now the question was
>>>>>>>>> 1. whether Amfd can ack NID that it is initialized and ready for
>>>>>>>>> Assignment/failover without syncing.
>>>>>>>> [AndersW] The question here is what an ack to NID actually means
>>>>>>>> (or should mean). Ideally, I think we should aim to move away
>>>>>>>> from using NID for serializing the OpenSAF startup and instead
>>>>>>>> solve dependencies between OpenSAF services using TRY_AGAIN
>> loops
>>>>>>>> or
>>>> similar
>>>>>> constructs.
>>>>>>>> I.e. the same philosophy as used in systemd. By delaying the
>>>>>>>> reply to NID we are moving in the opposite direction.
>>>>>>>>> 2. Whether FMS logic can be changed.
>>>>>>>> [AndersW] My proposal wasn't really to change the logic in FM,
>>>>>>>> but rather to order the reboot from the OpenSAF director (amfd in
>>>>>>>> this
>>>>>>>> case) when a failover happens before the cold sync is complete.
>>>>>>>> The logic in FM doesn't have to be changed, but if all directors
>>>>>>>> have been fixed in this way then the logic in FM would be
>>>>>>>> redundant and
>>>> could
>>>>>> be removed.
>>>>>>>>> I thought Amfd need to ack NID when it completes sync.
>>>>>>>>>
>>>>>>>>> Whether 1. it increased Startuptime of Standby controller or 2.
>>>>>>>>> it actually corrected the errata of measurement, is subject to
>>>>>>>>> interpretation.
>>>>>>>> [AndersW] I think the startup time is delayed by the current
>>>>>>>> solution to this ticket. Look at it from the perspective of an
>>>>>>>> application component which is started by AMF on the standby SC,
>>>>>>>> and measure
>>>> how
>>>>>>>> long time it takes until this component gets an assignment after
>>>>>>>> a
>>>> reboot.
>>>>>>>>> In my opinion, #2 is correct. If Amfd is not ready, it should
>>>>>>>>> not have sent the
>>>>>>>> Ack.
>>>>>>>>> So, it was a bug.
>>>>>>>>>
>>>>>>>>>> 2) Isn't this problem applicable to all OpenSAF directors, and
>>>>>>>>>> not just AMFD? Has anyone analyzed the other services to
>>>>>>>>>> determine if they also need a similar fix? Since NID starts the
>>>>>>>>>> services in serial and doesn't continue with the next service
>>>>>>>>>> before it has received a reply from the previous one, it would
>>>>>>>>>> mean that with the chosen solution all OpenSAF directors will
>>>>>>>>>> be cold synced in serial. With the other solution, they can be cold
>> synced in parallel.
>>>>>>>>> [Nagu]: I checked now in Clmd, Ntfd, Logd, Immd and all needs
>>>>>>>>> correction
>>>>>>>> and hence startup time
>>>>>>>>> will increase more.
>>>>>>>> [AndersW] Ok, thanks for checking! So then we will eventually
>>>>>>>> have to fix them as well. But again I would like to point out
>>>>>>>> that if we go for the solution that has been implemented in AMFD
>>>>>>>> then the OpenSAF cold sync will be serialized. With the solution
>>>>>>>> that I propose, the cold sync can be done in parallel. By syncing
>>>>>>>> the OpenSAF directors in parallel we get a shorter time window
>>>>>>>> when the system is without a fully synced standby SC, and hence
>>>>>>>> improve the availability of the
>>>> system.
>>>>>>>>>> / Anders Widell
>>>>>>>>>>
>>>>>>>>>> On 04/27/2015 11:19 AM, [email protected] wrote:
>>>>>>>>>>> Summary: amfd: respond to nid only after initialization is
>>>>>>>>>>> completed
>>>>>>>> [#1334]
>>>>>>>>>>> Review request for Trac Ticket(s): #1334 Peer Reviewer(s):
>>>>>>>>>>> Mathi, Hans N, Praveen Pull request to: <<LIST THE PERSON WITH
>>>>>>>>>>> PUSH ACCESS HERE>> Affected branch(es): All Development
>> branch:
>>>> Default
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> Impacted area       Impact y/n
>>>>>>>>>>> --------------------------------
>>>>>>>>>>>        Docs                    n
>>>>>>>>>>>        Build system            n
>>>>>>>>>>>        RPM/packaging           n
>>>>>>>>>>>        Configuration files     n
>>>>>>>>>>>        Startup scripts         n
>>>>>>>>>>>        SAF services            y
>>>>>>>>>>>        OpenSAF services        n
>>>>>>>>>>>        Core libraries          n
>>>>>>>>>>>        Samples                 n
>>>>>>>>>>>        Tests                   n
>>>>>>>>>>>        Other                   n
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Comments (indicate scope for each "y" above):
>>>>>>>>>>> ---------------------------------------------
>>>>>>>>>>>        <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
>>>>>>>>>>>
>>>>>>>>>>> changeset eeba7fe22afb8f4c40bb393d45a108cc59061eda
>>>>>>>>>>> Author:     Nagendra Kumar<[email protected]>
>>>>>>>>>>> Date:       Mon, 27 Apr 2015 14:41:02 +0530
>>>>>>>>>>>
>>>>>>>>>>>     amfd: respond to nid only after initialization is completed
>>>>>>>>>>> [#1334] Act
>>>>>>>>>> Amfd
>>>>>>>>>>>     initialization is said to be completed when it completes its
>>>>>>>>>> initialization
>>>>>>>>>>>     with imm. Apart from initializing with imm, Standby Amfd
>>>> also
>>>>>>>>>>> need
>>>>>>>>>> to get
>>>>>>>>>>>     run time data from Act Amfd using cold sync. So, Standby
>>>> Amfd
>>>>>>>>>> initialization
>>>>>>>>>>>     is said to be completed when it completes its initialization
>>>> with
>>>>>>>>>>> imm
>>>>>>>>>> and it
>>>>>>>>>>>     completes its cold sync with Act Amfd. In the present code,
>>>> Standby is
>>>>>>>>>>>     sending response to nid without cold sync complete. So, code
>>>> has
>>>>>>>>>> been added
>>>>>>>>>>>     to send nid response only when Amfd completes its
>>>> initialization.
>>>>>>>>>>> Complete diffstat:
>>>>>>>>>>> ------------------
>>>>>>>>>>>        osaf/services/saf/amf/amfd/chkop.cc |  4 ++++
>>>>>>>>>>>        osaf/services/saf/amf/amfd/main.cc  |  6 +++++-
>>>>>>>>>>>        2 files changed, 9 insertions(+), 1 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Testing Commands:
>>>>>>>>>>> -----------------
>>>>>>>>>>> 1. Start Act controller SC-1.
>>>>>>>>>>>          Start Standby controller SC-2 and stop SC-1 when
>>>>>>>>>>>          Standby Amfd is in the middle of cold sync.
>>>>>>>>>>> 2. Start Act controller SC-1.
>>>>>>>>>>>          Start Standby controller SC-2 and stop SC-1 when
>>>>>>>>>>>          Standby Amfd has completed cold sync but Amfnd has not
>>>> asigned
>>>>>>>>>>>          role to Fmd at Standby controller.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Testing, Expected Results:
>>>>>>>>>>> --------------------------
>>>>>>>>>>> 1. Fmd reboots the node :
>>>>>>>>>>> Apr 27 12:56:32 PM_SC-2 osaffmd[11792]: Rebooting OpenSAF
>>>> NodeId =
>>>>>>>>>>> 0
>>>>>>>> EE
>>>>>>>>>> Name = No EE Mapped, Reason: Failover occurred, but this node
>>>>>>>>>> is not yet ready, OwnNodeId = 131599, SupervisionTime = 60
>>>>>>>>>>> 2. Same as above.
>>>>>>>>>>>
>>>>>>>>>>> Conditions of Submission:
>>>>>>>>>>> -------------------------
>>>>>>>>>>> Ack from peer reviewers.
>>>>>>>>>>>
>>>>>>>>>>> Arch      Built     Started    Linux distro
>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>> mips        n          n
>>>>>>>>>>> mips64      n          n
>>>>>>>>>>> x86         n          n
>>>>>>>>>>> x86_64      y          y
>>>>>>>>>>> powerpc     n          n
>>>>>>>>>>> powerpc64   n          n
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Reviewer Checklist:
>>>>>>>>>>> -------------------
>>>>>>>>>>> [Submitters: make sure that your review doesn't trigger any
>>>>>>>>>>> checkmarks!]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Your checkin has not passed review because (see checked entries):
>>>>>>>>>>>
>>>>>>>>>>> ___ Your RR template is generally incomplete; it has too many
>>>>>>>>>>> blank
>>>>>>>> entries
>>>>>>>>>>>           that need proper data filled in.
>>>>>>>>>>>
>>>>>>>>>>> ___ You have failed to nominate the proper persons for review
>>>>>>>>>>> and
>>>>>> push.
>>>>>>>>>>> ___ Your patches do not have proper short+long header
>>>>>>>>>>>
>>>>>>>>>>> ___ You have grammar/spelling in your header that is
>> unacceptable.
>>>>>>>>>>> ___ You have exceeded a sensible line length in your
>>>>>>>>>> headers/comments/text.
>>>>>>>>>>> ___ You have failed to put in a proper Trac Ticket # into your
>>>> commits.
>>>>>>>>>>> ___ You have incorrectly put/left internal data in your
>>>> comments/files
>>>>>>>>>>>           (i.e. internal bug tracking tool IDs, product names
>>>>>>>>>>> etc)
>>>>>>>>>>>
>>>>>>>>>>> ___ You have not given any evidence of testing beyond basic
>>>>>>>>>>> build
>>>>>> tests.
>>>>>>>>>>>           Demonstrate some level of runtime or other sanity testing.
>>>>>>>>>>>
>>>>>>>>>>> ___ You have ^M present in some of your files. These have to
>>>>>>>>>>> be
>>>>>>>> removed.
>>>>>>>>>>> ___ You have needlessly changed whitespace or added whitespace
>>>>>> crimes
>>>>>>>>>>>           like trailing spaces, or spaces before tabs.
>>>>>>>>>>>
>>>>>>>>>>> ___ You have mixed real technical changes with whitespace and
>>>> other
>>>>>>>>>>>           cosmetic code cleanup changes. These have to be
>>>>>>>>>>> separate
>>>>>> commits.
>>>>>>>>>>> ___ You need to refactor your submission into logical chunks;
>>>>>>>>>>> there
>>>> is
>>>>>>>>>>>           too much content into a single commit.
>>>>>>>>>>>
>>>>>>>>>>> ___ You have extraneous garbage in your review (merge commits
>>>>>>>>>>> etc)
>>>>>>>>>>>
>>>>>>>>>>> ___ You have giant attachments which should never have been
>> sent;
>>>>>>>>>>>           Instead you should place your content in a public
>>>>>>>>>>> tree to be
>>>> pulled.
>>>>>>>>>>> ___ You have too many commits attached to an e-mail; resend as
>>>>>>>> threaded
>>>>>>>>>>>           commits, or place in a public tree for a pull.
>>>>>>>>>>>
>>>>>>>>>>> ___ You have resent this content multiple times without a
>>>>>>>>>>> clear
>>>>>>>> indication
>>>>>>>>>>>           of what has changed between each re-send.
>>>>>>>>>>>
>>>>>>>>>>> ___ You have failed to adequately and individually address all of
>> the
>>>>>>>>>>>           comments and change requests that were proposed in
>>>>>>>>>>> the initial
>>>>>>>> review.
>>>>>>>>>>> ___ You have a misconfigured ~/.hgrc file (i.e. username,
>>>>>>>>>>> email
>>>>>>>>>>> etc)
>>>>>>>>>>>
>>>>>>>>>>> ___ Your computer have a badly configured date and time;
>>>>>>>>>>> confusing
>>>>>> the
>>>>>>>>>>>           the threaded patch review.
>>>>>>>>>>>
>>>>>>>>>>> ___ Your changes affect IPC mechanism, and you don't present
>>>>>>>>>>> any
>>>>>> results
>>>>>>>>>>>           for in-service upgradability test.
>>>>>>>>>>>
>>>>>>>>>>> ___ Your changes affect user manual and documentation, your
>>>>>>>>>>> patch
>>>>>>>> series
>>>>>>>>>>>           do not contain the patch that updates the Doxygen manual.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>>> ----
>>>>>>>>>>> ------------ One dashboard for servers and applications across
>>>>>>>>>>> Physical-Virtual-Cloud Widest out-of-the-box monitoring
>>>>>>>>>>> support with 50+ applications Performance metrics, stats and
>>>>>>>>>>> reports that give you Actionable Insights Deep dive visibility
>>>>>>>>>>> with transaction tracing using APM Insight.
>>>>>>>>>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Opensaf-devel mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>>>>>>>>
>>>>>>>>>>>


------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to