Quote: "This new controller will not delete SUSI in IMM as it does not have any clue of their existence in IMM as they are not in AMF database".
That must be a design error.
Runtime data in IMM is NEVER the original/only data.
Runtime data in IMM is ALWAYS a COPY of the original residing in the service.
Cached runtime data can always be more or less out of date with the original.
Non-cached runtime data is only fetched on demand and resides in the IMM transiently.

So if you have a design where the AMF creates runtime data/objects in the IMM and on occasion FORGETS that it has created such data, then the AMF has a problem in need of fixing.

If the standby/new-active is not informed of some cached runtime data created by old-active, then at failover the new-active MUST scan the relevant class/subtree for garbage and clean it up.

Any design relying on old-active doing cleanup is faulty.

/AndersBj

-----Original Message-----
From: praveen malviya [mailto:praveen.malv...@oracle.com]
Sent: den 7 maj 2014 13:31
To: Hans Feldt
Cc: Anders Björnerstedt; nagendr...@oracle.com; opensaf-devel@lists.sourceforge.net
Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: update RT objects before node-failover of active controller [#494].

The job queue contains not only updates of runtime objects but also their deletion and creation. In the #494 case, the active AMF deletes some SUSIs and queues their deletion in IMM to be done later. At the same time the standby AMF also deletes them (MBCSv checkpointing). Now the active controller reboots and the standby becomes active. This new controller will not delete SUSI in IMM as it does not have any clue of their existence in IMM as they are not in AMF database.

Regarding the delay of AMF reboot, it can still be maintained by first sending the reboot message and then clearing the job queue as much as possible.
In that case the patch will look like this:

diff --git a/osaf/services/saf/amf/amfd/sgproc.cc b/osaf/services/saf/amf/amfd/sgproc.cc
--- a/osaf/services/saf/amf/amfd/sgproc.cc
+++ b/osaf/services/saf/amf/amfd/sgproc.cc
@@ -529,6 +529,14 @@ void avd_su_oper_state_evh(AVD_CL_CB *cb
 					" repair action", node->name.value);
 			avd_d2n_reboot_snd(node);
+
+			AvdJobDequeueResultT job_res = JOB_EXECUTED;
+			/* Finish as many IMM jobs as possible because active
+			   controller is rebooting.
+			 */
+			while (job_res == JOB_EXECUTED)
+				job_res = Fifo::execute(cb->immOiHandle);
+
 			goto done;
 		} else {
 			avd_pg_node_csi_del_all(avd_cb, node);

Ideally the new active should handle all types of IMM activities (deletion/creation/update).

Thanks,
Praveen

On 07-May-14 2:35 PM, Hans Feldt wrote:
> avd_imm_update_runtime_attrs() cannot handle this case but it should be
> enhanced to.
> The problem is the runtime association objects (SI/CSI assignments). They
> need to be in sync.
> /Hans
>
>> -----Original Message-----
>> From: Hans Feldt [mailto:hans.fe...@ericsson.com]
>> Sent: den 7 maj 2014 10:59
>> To: Anders Björnerstedt; praveen.malv...@oracle.com;
>> nagendr...@oracle.com
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for amfd: update RT
>> objects before node-failover of active controller [#494].
>>
>> Agree. Amfd/imm.cc contains avd_imm_update_runtime_attrs() that
>> should be called on the new active. It will sync the IMM attributes with
>> AMF's view.
>> /Hans
>>
>>> -----Original Message-----
>>> From: Anders Björnerstedt
>>> Sent: den 7 maj 2014 10:44
>>> To: praveen.malv...@oracle.com; Hans Feldt; nagendr...@oracle.com
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: RE: [devel] [PATCH 0 of 1] Review Request for amfd: update RT
>>> objects before node-failover of active controller [#494].
>>>
>>> Hi Praveen
>>>
>>> I normally don't get involved in AMF patch reviews but this ticket and the
>>> fix caught my attention.
>>> There is a general issue that bothers me about the approach, if I have not
>>> misunderstood it.
>>>
>>> I understand this is a node failover of active controller.
>>> That is inherently an event that is not fully under control.
>>> It is also an event that really is time critical.
>>> A failover may occur in several ways.
>>>
>>> Here it seems that one kind of failover is "semi-controllable" and
>>> old active is in essence trying to "clean up" its backlog in a job queue
>>> before it triggers the failover.
>>>
>>> There will be other failover cases, such as a crash of the IMMD
>>> where it will not be able to do this. So any cleanup (if necessary) must
>>> anyway be covered by new active.
>>>
>>> In addition, updates to cached runtime data are a secondary duty of the AMF.
>>> Cached runtime data is CACHED and not absolutely obligated to
>>> reflect the original state (which is in the AMF) in real time. So
>>> updates of cached runtime data should not really be a reason for delaying
>>> a failover.
>>>
>>> /AndersBj
>>>
>>>
>>> -----Original Message-----
>>> From: praveen.malv...@oracle.com [mailto:praveen.malv...@oracle.com]
>>> Sent: den 7 maj 2014 10:26
>>> To: Hans Feldt; nagendr...@oracle.com
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: [devel] [PATCH 0 of 1] Review Request for amfd: update RT objects
>>> before node-failover of active controller [#494].
>>>
>>> Summary: amfd: update RT objects before node-failover of active controller
>>> [#494].
>>> Review request for Trac Ticket(s): #494 (its duplicates #853 and #858)
>>> Peer Reviewer(s): Hans F., Nagendra.
>>> Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
>>> Affected branch(es): All
>>> Development branch: <<IF ANY GIVE THE REPO URL>>
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>  Docs                    n
>>>  Build system            n
>>>  RPM/packaging           n
>>>  Configuration files     n
>>>  Startup scripts         n
>>>  SAF services            n
>>>  OpenSAF services        y
>>>  Core libraries          n
>>>  Samples                 n
>>>  Tests                   n
>>>  Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>> Please see the analysis of tickets and commit log below.
>>>
>>> changeset bcf6eda79102f83c6940d75dd13073a9130026d0
>>> Author: praveen.malv...@oracle.com
>>> Date:   Wed, 07 May 2014 13:43:33 +0530
>>>
>>>     amfd: update RT objects before node-failover of active controller
>>>     [#494].
>>>
>>>     Problem: Runtime objects and attributes are not updated when
>>>     node-failover gets escalated for the active controller and the standby
>>>     controller takes the active role.
>>>
>>>     Reason: Activities related to updating runtime objects and certain
>>>     attributes in IMM are given low priority and are pushed onto the job
>>>     queue by AMF. These jobs are completed when AMF is not busy with any
>>>     other high-priority activity. When node-failover is escalated, AMFD
>>>     sends a reboot message to AMFND to reboot the node. In case
>>>     node-failover is escalated for the active controller, it will send the
>>>     reboot message to AMFND, which will reboot the controller. In such a
>>>     case, some IMM-related activities in the job queue will remain
>>>     uncompleted. All such activities should be completed before rebooting
>>>     the active controller when node-failover is escalated for it.
>>>
>>>     Fix: The fix will finish all IMM-related jobs before sending the reboot
>>>     message to AMFND when node-failover is escalated for the active
>>>     controller.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>  osaf/services/saf/amf/amfd/sgproc.cc |  6 ++++++
>>>  1 files changed, 6 insertions(+), 0 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> Tested the duplicate bug #858.
>>> This is easy to reproduce.
>>> After reproducing, observed the states:
>>> safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
>>>         saAmfSUAdminState=UNLOCKED(1)
>>>         saAmfSUOperState=ENABLED(1)
>>>         saAmfSUPresenceState=UNINSTANTIATED(1)
>>>         saAmfSUReadinessState=IN-SERVICE(2)
>>>
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> Pass: observed the states:
>>> safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
>>>         saAmfSUAdminState=UNLOCKED(1)
>>>         saAmfSUOperState=DISABLED(2)
>>>         saAmfSUPresenceState=UNINSTANTIATED(1)
>>>         saAmfSUReadinessState=OUT-OF-SERVICE(1)
>>> AMFD logs:
>>> May 7 12:05:47.624746 osafamfd [26472:imm.cc:0143] >> exec: Update 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' saAmfSUReadinessState
>>> May 7 12:05:47.624799 osafamfd [26472:imma_oi_api.c:2270] >> saImmOiRtObjectUpdate_2
>>> May 7 12:05:47.626863 osafamfd [26472:mds_dt_trans.c:0671] >> mdtm_process_poll_recv_data_tcp
>>> May 7 12:05:47.627392 osafamfd [26472:imma_oi_api.c:2554] << saImmOiRtObjectUpdate_2
>>> May 7 12:05:47.627419 osafamfd [26472:imm.cc:0172] << exec
>>>
>>> May 7 12:05:47.634134 osafamfd [26472:util.cc:1681] TR Sending REBOOT MSG to 2010f
>>> May 7 12:05:47.634372 osafamfd [26472:sgproc.cc:0715] << avd_su_oper_state_evh
>>>
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> Ack from one of the reviewers.
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      y          y
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any
>>> checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>     that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>     Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>     like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>     cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>     too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>     Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>     commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>     of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>     comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time; confusing the
>>>     threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>     for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>>     do not contain the patch that updates the Doxygen manual.
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Is your legacy SCM system holding you back? Join Perforce
>>> May 7 to find out:
>>> • 3 signs your SCM is hindering your productivity
>>> • Requirements for releasing software faster
>>> • Expert tips and advice for migrating your SCM now
>>> http://p.sf.net/sfu/perforce
>>> _______________________________________________
>>> Opensaf-devel mailing list
>>> Opensaf-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
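
For illustration only, here is a minimal sketch of the failover-time reconciliation argued for at the top of this thread: the new active treats the AMF database as the original, scans the IMM copy of one class/subtree, deletes runtime objects it does not know about, and recreates any that are missing. Everything in it is an assumption made for the example: DnSet, ReconcileResult and reconcile_runtime_objects are hypothetical names, the two sets only model the AMF database and the IMM contents, and no real amfd or IMM API is called.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: the DNs of the cached runtime objects of one
// class/subtree. This is NOT an OpenSAF type.
using DnSet = std::set<std::string>;

struct ReconcileResult {
  std::vector<std::string> deleted;  // present in IMM, unknown to AMF (garbage)
  std::vector<std::string> created;  // known to AMF, missing from IMM
};

// Bring the IMM copy in line with the AMF database, which holds the original.
// Intended to run once on the new active after failover, so that nothing
// depends on the old active having emptied its job queue.
ReconcileResult reconcile_runtime_objects(const DnSet& amf_db, DnSet& imm) {
  ReconcileResult result;

  // std::set iterates in sorted order, as std::set_difference requires.
  std::set_difference(imm.begin(), imm.end(), amf_db.begin(), amf_db.end(),
                      std::back_inserter(result.deleted));
  std::set_difference(amf_db.begin(), amf_db.end(), imm.begin(), imm.end(),
                      std::back_inserter(result.created));

  // In a real service these two loops would issue the delete/create
  // requests towards IMM; here they only update the model.
  for (const auto& dn : result.deleted) imm.erase(dn);
  for (const auto& dn : result.created) imm.insert(dn);

  assert(imm == amf_db);  // postcondition: the copy now matches the original
  return result;
}
```

With an AMF database of {"safSISU=a", "safSISU=b"} and IMM contents of {"safSISU=b", "safSISU=stale"} (made-up DNs), the call reports "safSISU=stale" as deleted and "safSISU=a" as created. The point of this design is that it covers both the #494 case (deletions that the old active queued but never executed) and uncontrolled failovers where the old active gets no chance to drain its queue.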