Summary: AMF: Add support for cloud resilience [#1620] V4
Review request for Trac Ticket(s): 1620
Peer Reviewer(s): Hans N, Gary, Nagu, Praveen
Pull request to: AMF maintainers
Affected branch(es): default
Development branch: default

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------
This V4 splits delayed_failover, delayed_sidep, and the partial support for
comp/su failover during headless into separate patches.
The following patches fix most of the issues found by Nagu, except TC27:
 1620_amfd_adjust_intermediate_adminstate_and_assignment.diff
 1620_pg_try_again.diff
 1620_amfnd_resend_pg.diff
 1620_amfnd_fix_coredump.diff
 1620_amfnd_dont_disabled_healthy_su.diff
 1620_amfd_data_inconsistency.diff
 1620_amfd_su_fault_if_inconsistent_si.diff

changeset 8c96f8cb52f1608911eeaa8977c64c740ce76af3
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfd: Add support for cloud resilience at common libs [#1620]

    Outlined changes:
    . Introduce the messages sisu_state_info and csicomp_state_info, which
      carry the sync information sent to amfd to recover from headless
    . Add encode/decode functions for these two new messages

changeset ab04edb95e84d6034869ad5973fe7d02648a217e
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfd: Add saAmfUnassignedAlarmStatus attribute to memorize the alarm_sent
    status [#1620]

    If the SI Unassigned alarm is raised before headless (for instance, by
    locking an SU), the alarm is not cleared after the cluster recovers from
    headless and the SU is unlocked. Because the application can reside on PL
    nodes, it is reasonable to expect the previously raised alarm to be
    cleared once the SI gets its assignments back.

    The patch adds a new attribute, saAmfUnassignedAlarmStatus, to the
    SaAmfSI class to remember the value of alarm_sent across headless.

changeset 8195c1b2890a16610b4d66549e9125c028041a56
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfd: Add support for cloud resilience at director [#1620]

    Outlined changes:
    . node_up_msg event handling is changed so that amfd can collect the
      sync information sent from amfnd
    . A node sync timer is introduced as the window in which amfnd can sync
      after headless

changeset 01c17e2baf841a886b9d3fc8dd6245c4a29cd5d4
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfnd: Add support for cloud resilience at node director [#1620]

    Outlined changes:
    . amfnd does not reboot the node if amfd is down
    . componentRestart and suRestart are supported; the node reboots on any
      escalation to component/su failover
    . An SC absence timer is introduced; the node reboots if it expires
    . amfnd sends sync information when amfd is up again after headless

changeset 6e71fd7f5bea7391eacc0d407a5327539f2880fe
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfnd: Support component/su failover [#1620]

    If any error escalates to component/su failover during headless, amfnd
    reboots the node. The problem is that other healthy SUs are affected by
    this reboot, which degrades the availability that AMF is meant to
    provide.

    The patch allows component/su failover during headless, but supports it
    only partially (the comp/su is marked as failed), since failing over to
    another comp/su requires amfd's presence.

changeset 4d235262e10306804d83c681a99fb3eab3f902f7
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfd: Support delayed failover [#1620]

    After the SC comes back from headless, the assignments of SU(s) can be
    left in inappropriate states:
    . Under the admin command shutdown on a node, a component can configure
      a csi callback timeout large enough that it can keep working for a
      long time after it receives the csi callback for QUIESCING.
      If both SCs go down during that time and the component then responds
      with QuiescingComplete, then when the SC comes back the assignment of
      one SU will be QUIESCING and the other STANDBY (for 2N). Many other
      admin commands can leave the assignment state of SU(s) in
      inappropriate states
    . In another scenario, the node hosting the SU with the ACTIVE
      assignment reboots (due to an error) during headless; when the SC
      comes back, the other SU is still assigned STANDBY and no SU has the
      ACTIVE assignment

    The reason is that the current implementations of the various realign()
    functions do not balance the assignments of SUs.

    The patch introduces the delayed_failover function to solve this
    problem. It is implemented separately from realign() (although it could
    be integrated into realign()) so that maintenance of the non-headless
    code does not become complicated. delayed_failover should also comply
    with the AMF specification B.04.01 (figure 3, page 83) when moving
    uncompleted assignments to appropriate states.

    In this patch version, delayed failover does not support NpM and
    NWayActive.

changeset 913216d3b6475c960217204774458c445740086a
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfd: Support delayed si_dep [#1620]

    An si_dep configured in the cluster can be broken if the sponsor SI
    becomes unassigned during headless. When the SC comes back, the current
    non-headless implementation starts the tolerance timer for the dependent
    SI before removing it. The problem is that the dependent SI would then
    be tolerated for longer than configured, because the sponsor SI has
    already been unassigned during headless.

    Since AMF cannot start another tolerance timer once the SC comes back,
    the patch removes the dependent SI as soon as AMF finds its sponsor SI
    unassigned. This is a limitation for now. Ideally, AMF would work out
    how much tolerance time is left for the dependent SI, counted from when
    the sponsor SI actually became unassigned during headless, so that the
    tolerance timer could be started with the correct remaining time.

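As a rough illustration of the ideal behaviour described for delayed si_dep
above (not what the patch does, since it removes the dependent SI at once),
the remaining tolerance could be computed as below. remaining_tolerance and
its parameters are hypothetical names, and how amfd would learn when the
sponsor SI became unassigned during headless is left open here.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not part of the patch series.
// configured_tolerance:   tolerance time configured for the dependent SI.
// sponsor_unassigned_for: how long the sponsor SI has already been
//                         unassigned, including the time spent headless.
// Returns the time the tolerance timer should still run; zero means the
// dependent SI should be removed immediately.
std::chrono::milliseconds remaining_tolerance(
    std::chrono::milliseconds configured_tolerance,
    std::chrono::milliseconds sponsor_unassigned_for) {
  if (sponsor_unassigned_for >= configured_tolerance) {
    return std::chrono::milliseconds::zero();
  }
  return configured_tolerance - sponsor_unassigned_for;
}
```

With such a helper, a zero result would correspond to the behaviour of the
patch (remove the dependent SI right away), and any other result would be
the duration for a shortened tolerance timer.
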
changeset 88a493266b93767618e01ea6ce85e52325268ca5
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

    amfd: Adjust uncompleted admin command after headless [#1620]

    The adjustment for an uncompleted admin command has been implemented for
    the 2N SG, but it is applicable to all other SGs as well. The patch
    makes this adjustment common to all SGs and adds support for an
    uncompleted admin command on a nodegroup.

changeset 946ddaceb319cb5bd769beb92b7c3f7bf0299a18
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

    amfnd: Return TRY_AGAIN for saAmfProtectionGroupTrack and
    saAmfProtectionGroupTrackStop [#1620]

    The patch returns TRY_AGAIN for saAmfProtectionGroupTrack and
    saAmfProtectionGroupTrackStop during headless, since protection group
    tracking requires amfd's presence.

changeset d16eebad23c866444a4d293cf60f9475b138fb26
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

    amfnd: Resend pg information after headless [#1620]

    When the SC comes back from headless, the protection group information
    is currently lost at amfd. The patch resends the protection group
    information, similar to what is done at failover.

changeset ad13083e6440c1cfc148c822971d5aa0842a42e4
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

    amf: Fix various amfnd coredump and mapping SU [#1620]

changeset 7c8731792c967b09811ff57006c68ea1a31a3027
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

    amfd: Don't disable healthy SU [#1620]

    This scenario happens if unlock-in is performed on an SU before going
    headless. After headless, amfnd sends the SU oper state DISABLED in the
    recovery data. The patch comments out the suspicious code that sets the
    SU's oper state to DISABLED even though the SU is known not to be
    FAILED.

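For applications affected by the TRY_AGAIN change to
saAmfProtectionGroupTrack and saAmfProtectionGroupTrackStop above, a caller
would typically retry for a bounded time. The sketch below shows one way to
do that. It is only an illustration: Status is a local stand-in for
SaAisErrorT, retry_while_try_again is a hypothetical helper, and the real
AMF call would be wrapped in the callable passed as op.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Local stand-in for SaAisErrorT. A real application would use the values
// from saAis.h (SA_AIS_OK, SA_AIS_ERR_TRY_AGAIN, ...).
enum class Status { kOk, kTryAgain, kOtherError };

// Hypothetical helper: call `op` until it returns something other than
// kTryAgain or `max_attempts` calls have been made, sleeping `delay`
// between attempts. Returns the last status seen (kTryAgain if
// `max_attempts` is not positive, in which case `op` is never called).
Status retry_while_try_again(const std::function<Status()>& op,
                             int max_attempts,
                             std::chrono::milliseconds delay) {
  Status rc = Status::kTryAgain;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    rc = op();
    if (rc != Status::kTryAgain) {
      return rc;
    }
    if (attempt + 1 < max_attempts) {
      std::this_thread::sleep_for(delay);
    }
  }
  return rc;
}
```

Bounding the number of attempts matters here because the headless period
can last until the SC absence timer expires; an unbounded loop would block
the application for that whole time.
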
changeset 23e7c25a75106a0a498f21048dc607a93f80601a
Author: Hans Nordeback <hans.nordeb...@ericsson.com>
Date:   Mon, 22 Feb 2016 13:40:54 +0100

    amfd: Reboot cluster at data inconsistency [#1620]

    The cluster is preferably configured with one payload, without PBE.
    After going headless twice, IMM reloads from xml. This causes amfd to
    lose all objects that were created before headless, and the data becomes
    inconsistent between amfnd and amfd/IMM.

    The patch broadcasts a reboot message to all nodes.

changeset 6263d98a891b5c9dc863e5b99ecc9444fb235f5e
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

    amfd: Treat su fault if inconsistency of csi between amfd and amfnd
    [#1620]

    The problem happens if a csi is deleted and the component delays the
    csi_remove_callback until after the SC comes back from headless. At the
    standby SU, this csi has not been removed. This is because the standby
    SU still sends its assignment info as recovery data while the component
    in the active SU has the csi_remove_callback pending.

    Logically, amfnd should verify every csi it sends to amfd as recovery
    data. If a csi has been deleted, amfnd would issue the remove callback
    and not send the deleted csi. However, verifying a csi requires
    initializing an IMM handle, which could hang amfnd (if IMMND dies) and
    eventually cause a node sync timeout.

    The patch treats this scenario as an inconsistency of csi between amfd
    and amfnd: the assignment of the standby SU is removed (including the
    deleted csi) and the SU is re-assigned the standby assignment (excluding
    the deleted csi).

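The csi inconsistency described above amounts to amfnd reporting a csi that
amfd no longer knows about. A minimal sketch of such a check follows. It is
an assumption about the shape of the comparison, not the code in the patch:
stale_csis and its parameter names are hypothetical, and CSI names are
modelled as plain strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not taken from the patch: return the CSI names that
// amfnd reported in its recovery data but that amfd no longer has
// configured. A non-empty result would be treated as an inconsistency.
std::vector<std::string> stale_csis(
    const std::set<std::string>& known_to_amfd,
    const std::vector<std::string>& reported_by_amfnd) {
  std::vector<std::string> stale;
  for (const auto& name : reported_by_amfnd) {
    if (known_to_amfd.count(name) == 0) {
      stale.push_back(name);
    }
  }
  return stale;
}
```

In terms of the changeset above, a non-empty result for a standby SU would
lead to removing its assignment and re-assigning standby without the stale
csi.
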
changeset f7650482e58bc460350d6aa53af224ba8542f301
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:33:36 +1100

    imported patch 1620_README_V4.diff


Complete diffstat:
------------------
 osaf/libs/common/amf/d2nedu.c | 311 ++++++++++++++++++++++++++--
 osaf/libs/common/amf/d2nmsg.c | 266 +++++++++++++++++++++++++
 osaf/libs/common/amf/include/Makefile.am | 1 +
 osaf/libs/common/amf/include/amf_d2nedu.h | 16 +
 osaf/libs/common/amf/include/amf_d2nmsg.h | 61 +++++
 osaf/libs/common/amf/include/amf_defs.h | 3 +
 osaf/libs/common/amf/include/amf_si_assign.h | 49 ++++
 osaf/services/saf/amf/README_HEADLESS | 172 ++++++++++++++++
 osaf/services/saf/amf/amfd/cluster.cc | 75 ++++++-
 osaf/services/saf/amf/amfd/comp.cc | 8 +-
 osaf/services/saf/amf/amfd/csi.cc | 117 +++++++++++
 osaf/services/saf/amf/amfd/imm.cc | 58 +++++
 osaf/services/saf/amf/amfd/include/cb.h | 5 +
 osaf/services/saf/amf/amfd/include/cluster.h | 1 +
 osaf/services/saf/amf/amfd/include/csi.h | 2 +
 osaf/services/saf/amf/amfd/include/db_template.h | 1 +
 osaf/services/saf/amf/amfd/include/evt.h | 3 +
 osaf/services/saf/amf/amfd/include/mds.h | 7 +-
 osaf/services/saf/amf/amfd/include/msg.h | 2 +-
 osaf/services/saf/amf/amfd/include/node.h | 9 +-
 osaf/services/saf/amf/amfd/include/proc.h | 7 +
 osaf/services/saf/amf/amfd/include/sg.h | 18 +-
 osaf/services/saf/amf/amfd/include/si.h | 1 +
 osaf/services/saf/amf/amfd/include/su.h | 2 +-
 osaf/services/saf/amf/amfd/include/susi.h | 3 +
 osaf/services/saf/amf/amfd/include/timer.h | 1 +
 osaf/services/saf/amf/amfd/include/util.h | 1 +
 osaf/services/saf/amf/amfd/main.cc | 24 ++
 osaf/services/saf/amf/amfd/mds.cc | 4 +-
 osaf/services/saf/amf/amfd/ndfsm.cc | 325 +++++++++++++++++++++++++++++-
 osaf/services/saf/amf/amfd/ndmsg.cc | 18 +-
 osaf/services/saf/amf/amfd/ndproc.cc | 103 +++++++++-
 osaf/services/saf/amf/amfd/node.cc | 52 ++++-
 osaf/services/saf/amf/amfd/nodegroup.cc | 16 +
 osaf/services/saf/amf/amfd/role.cc | 10 +-
 osaf/services/saf/amf/amfd/sg.cc | 155 ++++++++++++++
 osaf/services/saf/amf/amfd/sg_2n_fsm.cc | 74 ++++++
 osaf/services/saf/amf/amfd/sg_nored_fsm.cc | 6 +
 osaf/services/saf/amf/amfd/sg_npm_fsm.cc | 24 ++
 osaf/services/saf/amf/amfd/sg_nway_fsm.cc | 24 ++
 osaf/services/saf/amf/amfd/sg_nwayact_fsm.cc | 6 +
 osaf/services/saf/amf/amfd/sgproc.cc | 49 +++-
 osaf/services/saf/amf/amfd/si.cc | 43 +++-
 osaf/services/saf/amf/amfd/siass.cc | 130 ++++++++++++
 osaf/services/saf/amf/amfd/su.cc | 20 +-
 osaf/services/saf/amf/amfd/util.cc | 22 ++
 osaf/services/saf/amf/amfnd/amfnd.cc | 3 +-
 osaf/services/saf/amf/amfnd/clc.cc | 100 ++++++---
 osaf/services/saf/amf/amfnd/clm.cc | 11 +-
 osaf/services/saf/amf/amfnd/comp.cc | 42 +++-
 osaf/services/saf/amf/amfnd/compdb.cc | 45 +++-
 osaf/services/saf/amf/amfnd/di.cc | 455 ++++++++++++++++++++++++++++++++++++++++++-
 osaf/services/saf/amf/amfnd/err.cc | 105 ++++++++-
 osaf/services/saf/amf/amfnd/evt.cc | 2 +
 osaf/services/saf/amf/amfnd/hcdb.cc | 8 +-
 osaf/services/saf/amf/amfnd/include/avnd_cb.h | 13 +-
 osaf/services/saf/amf/amfnd/include/avnd_comp.h | 17 +-
 osaf/services/saf/amf/amfnd/include/avnd_di.h | 5 +
 osaf/services/saf/amf/amfnd/include/avnd_evt.h | 2 +
 osaf/services/saf/amf/amfnd/include/avnd_mds.h | 4 +-
 osaf/services/saf/amf/amfnd/include/avnd_proc.h | 1 +
 osaf/services/saf/amf/amfnd/include/avnd_su.h | 4 +-
 osaf/services/saf/amf/amfnd/include/avnd_tmr.h | 1 +
 osaf/services/saf/amf/amfnd/include/avnd_util.h | 4 +
 osaf/services/saf/amf/amfnd/main.cc | 103 +++++++++-
 osaf/services/saf/amf/amfnd/mds.cc | 22 +-
 osaf/services/saf/amf/amfnd/pg.cc | 18 +
 osaf/services/saf/amf/amfnd/sidb.cc | 9 +-
 osaf/services/saf/amf/amfnd/su.cc | 47 ++-
 osaf/services/saf/amf/amfnd/susm.cc | 123 ++++++----
 osaf/services/saf/amf/amfnd/tmr.cc | 1 +
 osaf/services/saf/amf/amfnd/util.cc | 153 ++++++++++++++-
 osaf/services/saf/amf/amfnd/verify.cc | 38 +---
 osaf/services/saf/amf/config/amf_classes.xml | 8 +
 74 files changed, 3341 insertions(+), 308 deletions(-)


Testing Commands:
-----------------
Repeat Nagu's failed tests


Testing, Expected Results:
--------------------------
Tests pass


Conditions of Submission:
-------------------------
ack from reviewers


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time; confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.