Summary: AMF: Add support for cloud resilience [#1620] V4
Review request for Trac Ticket(s): 1620
Peer Reviewer(s): Hans N, Gary, Nagu, Praveen
Pull request to: AMF maintainers
Affected branch(es): default
Development branch: default

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------
This V4 splits delayed_failover, delayed_sidep and the partial support for
comp/su failover during headless into separate patches. The following
patches fix most of the issues found by Nagu, except TC27:

1620_amfd_adjust_intermediate_adminstate_and_assignment.diff
1620_pg_try_again.diff
1620_amfnd_resend_pg.diff
1620_amfnd_fix_coredump.diff
1620_amfnd_dont_disabled_healthy_su.diff
1620_amfd_data_inconsistency.diff
1620_amfd_su_fault_if_inconsistent_si.diff

changeset 8c96f8cb52f1608911eeaa8977c64c740ce76af3
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfd: Add support for cloud resilience at common libs [#1620]

        Outlined changes:
        . Introduce messages sisu_state_info and csicomp_state_info to carry
          sync information which is sent to amfd to recover from headless
        . Some encode/decode functions for these 2 new messages

changeset ab04edb95e84d6034869ad5973fe7d02648a217e
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfd: Add saAmfUnassignedAlarmStatus attribute to memorize the
        alarm_sent status [#1620]

        If the SI Unassigned Alarm is raised before headless, for instance by
        locking an SU, then the alarm is not cleared after the cluster
        recovers from headless and the SU is unlocked. Since the application
        can reside on PL nodes, it is right to expect that the previously
        raised alarm is cleared once the SI gets its assignments back. The
        patch adds the new attribute saAmfUnassignedAlarmStatus to the
        SaAmfSI class to memorize the variable alarm_sent across headless.

changeset 8195c1b2890a16610b4d66549e9125c028041a56
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfd: Add support for cloud resilience at director [#1620]

        Outlined changes:
        . node_up_msg event handling has changed so that amfd can collect the
          sync information sent from amfnd
        . Node Sync timer is introduced as a window for amfnd sync from
          headless

changeset 01c17e2baf841a886b9d3fc8dd6245c4a29cd5d4
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfnd: Add support for cloud resilience at node director [#1620]

        Outlined changes:
        . amfnd does not reboot if amfd is down
        . componentRestart and suRestart are supported; the node reboots on
          any escalation to component/su failover
        . SC absence timer is introduced; the node reboots if it times out
        . amfnd sends sync information when amfd is up after headless

changeset 6e71fd7f5bea7391eacc0d407a5327539f2880fe
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfnd: Support component/su failover [#1620]

        If any error escalates to component/su failover during headless,
        amfnd reboots the node.

        The issue is that other healthy SUs are affected by this reboot, which
        degrades the availability characteristic that AMF supports.

        The patch allows component/su failover during headless, but supports
        it only partially (it marks the comp/su as failed), since failover to
        another comp/su requires amfd's presence.
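
The decision described above can be sketched as follows. This is only an
illustration under stated assumptions: the names Recovery, Action and decide
are hypothetical and are not taken from the amfnd sources.

```cpp
// Hypothetical names for illustration; these are not the amfnd types.
enum class Recovery { ComponentRestart, SuRestart, ComponentFailover, SuFailover };
enum class Action { Restart, MarkFailedAndWait, ReportToDirector, RebootNode };

// Decide how a node director could react to an escalated error.
// `headless` is true while no director (amfd) is reachable.
// `partial_failover` selects the behaviour with this patch (true)
// or the behaviour before it (false).
inline Action decide(Recovery r, bool headless, bool partial_failover) {
  if (r == Recovery::ComponentRestart || r == Recovery::SuRestart)
    return Action::Restart;           // restarts are handled locally
  if (!headless)
    return Action::ReportToDirector;  // the director performs the failover
  // A failover was requested but no director is present.
  return partial_failover ? Action::MarkFailedAndWait : Action::RebootNode;
}
```

With partial_failover set, a failover request during headless only marks the
comp/su as failed, so healthy SUs on the same node keep running.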

changeset 4d235262e10306804d83c681a99fb3eab3f902f7
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfd: Support delayed failover [#1620]

        After SC comes back from headless, the assignments of SU(s) can be
        left in inappropriate states:
        . Under the admin command shutdown on a node, a component can
          configure a csi callback timeout large enough that it can continue
          its task for longer after it receives the csi callback for
          QUIESCING. If both SCs go down at that time and the component then
          responds with QuiescingComplete, then when SC comes back the
          assignment of one SU is QUIESCING and the other is STANDBY (for
          2N). Many other admin commands can also leave the assignment
          states of SU(s) inappropriate.
        . In another scenario, if the node hosting the SU with the ACTIVE
          assignment reboots (due to an error) during headless, then when SC
          comes back the other SU is still assigned STANDBY and no SU has
          the ACTIVE assignment.

        The reason is that the current implementation of the various
        realign() functions does not support balancing up the assignments of
        SUs.

        The patch introduces the delayed_failover function to solve this
        problem. It is implemented separately from realign() (although it
        could be integrated into realign()) so that maintenance of the
        non-headless code does not become complicated. delayed_failover
        should also comply with the AMF specification B.04.01 (figure 3,
        page 83) to move uncompleted assignments to appropriate states. In
        this patch version, delayed failover does not support NpM and
        NWayActive.
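
A minimal sketch of the second scenario above (the ACTIVE node rebooted
during headless), assuming a 2N group. HaState and standby_to_promote are
hypothetical names, not the delayed_failover code from the patch, and the
QUIESCING scenario is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical HA states of the SUs in one 2N service group.
enum class HaState { Active, Standby, Quiescing, Quiesced, None };

// Returns the index of a STANDBY assignment that could be promoted because
// no SU holds ACTIVE, or -1 if no promotion applies. Groups that still have
// a QUIESCING or QUIESCED assignment are left to other handling.
inline int standby_to_promote(const std::vector<HaState>& su_states) {
  int standby = -1;
  for (std::size_t i = 0; i < su_states.size(); ++i) {
    const HaState s = su_states[i];
    if (s == HaState::Active || s == HaState::Quiescing ||
        s == HaState::Quiesced)
      return -1;  // the SI is still served or is mid-transition
    if (s == HaState::Standby && standby < 0)
      standby = static_cast<int>(i);  // remember the first STANDBY
  }
  return standby;
}
```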

changeset 913216d3b6475c960217204774458c445740086a
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfd: Support delayed si_dep [#1620]

        si_dep is configured in the cluster and it can be broken if a sponsor
        SI gets unassigned during headless. When SC comes back, the current
        non-headless implementation starts the tolerance timer for the
        dependent SI before removing it. The issue is that the dependent SI
        has by then been tolerated for longer than configured, since the
        sponsor SI was already unassigned during headless.

        Since AMF cannot start another tolerance timer once SC comes back,
        the patch removes the dependent SI as soon as AMF scans through the
        unassigned sponsor SI. This is a limitation for now. Ideally, AMF
        would figure out how much tolerance time is left for the dependent
        SI, given when the sponsor SI actually got unassigned during
        headless, so that the real tolerance timer could be started
        accurately.
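
The ideal behaviour mentioned above could look roughly like the sketch below.
It assumes that the time at which the sponsor SI lost its assignment is
available to amfd, which this patch does not provide; remaining_tolerance is
a hypothetical helper, not an existing amfd function.

```cpp
#include <chrono>

using std::chrono::milliseconds;
using Clock = std::chrono::steady_clock;

// Hypothetical helper: tolerance time still left for a dependent SI.
// `tolerance` is the configured tolerance time; `sponsor_unassigned_at` is
// when the sponsor SI lost its assignment (assumed to be known).
inline milliseconds remaining_tolerance(milliseconds tolerance,
                                        Clock::time_point sponsor_unassigned_at,
                                        Clock::time_point now) {
  auto elapsed =
      std::chrono::duration_cast<milliseconds>(now - sponsor_unassigned_at);
  if (elapsed < milliseconds::zero())
    elapsed = milliseconds::zero();  // guard against an inconsistent timestamp
  if (elapsed >= tolerance)
    return milliseconds::zero();  // already expired: remove the SI at once
  return tolerance - elapsed;     // start the timer with what is left
}
```

A result of zero corresponds to what the patch does today, that is, removing
the dependent SI without starting a timer.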

changeset 88a493266b93767618e01ea6ce85e52325268ca5
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:55 +1100

        amfd: Adjust uncompleted admin command after headless [#1620]

        The adjustment for uncompleted admin commands has been implemented
        for 2N SG, but it is applicable to all other SGs.

        The patch makes this adjustment common for all SGs, and adds support
        for uncompleted admin commands on nodegroup.

changeset 946ddaceb319cb5bd769beb92b7c3f7bf0299a18
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

        amfnd: Return TRY_AGAIN for saAmfProtectionGroupTrack and
        saAmfProtectionGroupTrackStop [#1620]

        The patch returns TRY_AGAIN for saAmfProtectionGroupTrack and
        saAmfProtectionGroupTrackStop during headless, since protection group
        tracking requires amfd's presence.
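
As an illustration of this behaviour, the sketch below rejects tracking
requests while no director is reachable. NodeState, pg_track_start and
pg_track_stop are hypothetical names; the enum is a local stand-in whose
values mirror SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN, which a real build would
take from the SAF AIS headers.

```cpp
// Local stand-in for the two SaAisErrorT values used in this sketch.
enum ErrorCode { kOk = 1, kTryAgain = 6 };

// Hypothetical node director state.
struct NodeState {
  bool director_up = false;  // false while the cluster is headless
  int tracked_groups = 0;    // number of active protection group tracks
};

inline ErrorCode pg_track_start(NodeState& st) {
  if (!st.director_up) return kTryAgain;  // headless: caller retries later
  ++st.tracked_groups;  // the request could now be forwarded to amfd
  return kOk;
}

inline ErrorCode pg_track_stop(NodeState& st) {
  if (!st.director_up) return kTryAgain;  // same rule for stopping a track
  if (st.tracked_groups > 0) --st.tracked_groups;
  return kOk;
}
```

An application that receives TRY_AGAIN is expected to retry the call after a
delay, so the request is served once amfd is back.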

changeset d16eebad23c866444a4d293cf60f9475b138fb26
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

        amfnd: Resend pg information after headless [#1620]

        When SC comes back from headless, protection group information is
        currently lost at amfd.

        The patch resends protection group information, which is similar to
        failover.

changeset ad13083e6440c1cfc148c822971d5aa0842a42e4
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

        amf: Fix various amfnd coredump and mapping SU [#1620]

changeset 7c8731792c967b09811ff57006c68ea1a31a3027
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

        amfd: Don't disable healthy SU [#1620]

        This scenario happens if unlock-in is performed on the SU before going
        headless. After headless, amfnd sends SU oper state DISABLED in the
        recovery data.

        The patch comments out the suspicious setting of the SU's oper state
        to DISABLED when the SU is known not to be FAILED.

changeset 23e7c25a75106a0a498f21048dc607a93f80601a
Author: Hans Nordeback <hans.nordeb...@ericsson.com>
Date:   Mon, 22 Feb 2016 13:40:54 +0100

        amfd: Reboot cluster at data inconsistency [#1620]

        The cluster is preferably configured with one payload without PBE.
        After two occurrences of headless, IMM will reload from xml.

        That causes amfd to lose all objects which were created before
        headless, and a data inconsistency arises between amfnd and amfd/IMM.

        The patch broadcasts a reboot message to all nodes.

changeset 6263d98a891b5c9dc863e5b99ecc9444fb235f5e
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:18:56 +1100

        amfd: Treat su fault if inconsistency of csi between amfd and amfnd
        [#1620]

        The problem happens if a csi is deleted and the component delays the
        csi_remove_callback until after SC comes back from headless. At the
        standby SU, this csi has not been removed.

        This is because the standby SU still sends assignment info as
        recovery data, since the component in the active SU has the
        csi_remove_callback pending.

        Logically, amfnd should verify all csi being sent to amfd as recovery
        data. If a csi is deleted, amfnd would issue the remove callback and
        not send the deleted csi. However, verifying csi requires
        initializing an IMM handle, which could hang amfnd (if IMMND dies)
        and eventually cause a node sync timeout. The patch views this
        scenario as an inconsistency of csi between amfd and amfnd, thus the
        standby SU has its assignment removed (including the deleted csi)
        and is re-assigned the standby assignment (excluding the deleted
        csi).

changeset f7650482e58bc460350d6aa53af224ba8542f301
Author: Minh Hon Chau <minh.c...@dektech.com.au>
Date:   Thu, 25 Feb 2016 19:33:36 +1100

        imported patch 1620_README_V4.diff


Complete diffstat:
------------------
 osaf/libs/common/amf/d2nedu.c                    |  311 ++++++++++++++++++++++++++--
 osaf/libs/common/amf/d2nmsg.c                    |  266 +++++++++++++++++++++++++
 osaf/libs/common/amf/include/Makefile.am         |    1 +
 osaf/libs/common/amf/include/amf_d2nedu.h        |   16 +
 osaf/libs/common/amf/include/amf_d2nmsg.h        |   61 +++++
 osaf/libs/common/amf/include/amf_defs.h          |    3 +
 osaf/libs/common/amf/include/amf_si_assign.h     |   49 ++++
 osaf/services/saf/amf/README_HEADLESS            |  172 ++++++++++++++++
 osaf/services/saf/amf/amfd/cluster.cc            |   75 ++++++-
 osaf/services/saf/amf/amfd/comp.cc               |    8 +-
 osaf/services/saf/amf/amfd/csi.cc                |  117 +++++++++++
 osaf/services/saf/amf/amfd/imm.cc                |   58 +++++
 osaf/services/saf/amf/amfd/include/cb.h          |    5 +
 osaf/services/saf/amf/amfd/include/cluster.h     |    1 +
 osaf/services/saf/amf/amfd/include/csi.h         |    2 +
 osaf/services/saf/amf/amfd/include/db_template.h |    1 +
 osaf/services/saf/amf/amfd/include/evt.h         |    3 +
 osaf/services/saf/amf/amfd/include/mds.h         |    7 +-
 osaf/services/saf/amf/amfd/include/msg.h         |    2 +-
 osaf/services/saf/amf/amfd/include/node.h        |    9 +-
 osaf/services/saf/amf/amfd/include/proc.h        |    7 +
 osaf/services/saf/amf/amfd/include/sg.h          |   18 +-
 osaf/services/saf/amf/amfd/include/si.h          |    1 +
 osaf/services/saf/amf/amfd/include/su.h          |    2 +-
 osaf/services/saf/amf/amfd/include/susi.h        |    3 +
 osaf/services/saf/amf/amfd/include/timer.h       |    1 +
 osaf/services/saf/amf/amfd/include/util.h        |    1 +
 osaf/services/saf/amf/amfd/main.cc               |   24 ++
 osaf/services/saf/amf/amfd/mds.cc                |    4 +-
 osaf/services/saf/amf/amfd/ndfsm.cc              |  325 +++++++++++++++++++++++++++++-
 osaf/services/saf/amf/amfd/ndmsg.cc              |   18 +-
 osaf/services/saf/amf/amfd/ndproc.cc             |  103 +++++++++-
 osaf/services/saf/amf/amfd/node.cc               |   52 ++++-
 osaf/services/saf/amf/amfd/nodegroup.cc          |   16 +
 osaf/services/saf/amf/amfd/role.cc               |   10 +-
 osaf/services/saf/amf/amfd/sg.cc                 |  155 ++++++++++++++
 osaf/services/saf/amf/amfd/sg_2n_fsm.cc          |   74 ++++++
 osaf/services/saf/amf/amfd/sg_nored_fsm.cc       |    6 +
 osaf/services/saf/amf/amfd/sg_npm_fsm.cc         |   24 ++
 osaf/services/saf/amf/amfd/sg_nway_fsm.cc        |   24 ++
 osaf/services/saf/amf/amfd/sg_nwayact_fsm.cc     |    6 +
 osaf/services/saf/amf/amfd/sgproc.cc             |   49 +++-
 osaf/services/saf/amf/amfd/si.cc                 |   43 +++-
 osaf/services/saf/amf/amfd/siass.cc              |  130 ++++++++++++
 osaf/services/saf/amf/amfd/su.cc                 |   20 +-
 osaf/services/saf/amf/amfd/util.cc               |   22 ++
 osaf/services/saf/amf/amfnd/amfnd.cc             |    3 +-
 osaf/services/saf/amf/amfnd/clc.cc               |  100 ++++++---
 osaf/services/saf/amf/amfnd/clm.cc               |   11 +-
 osaf/services/saf/amf/amfnd/comp.cc              |   42 +++-
 osaf/services/saf/amf/amfnd/compdb.cc            |   45 +++-
 osaf/services/saf/amf/amfnd/di.cc                |  455 ++++++++++++++++++++++++++++++++++++++++++-
 osaf/services/saf/amf/amfnd/err.cc               |  105 ++++++++-
 osaf/services/saf/amf/amfnd/evt.cc               |    2 +
 osaf/services/saf/amf/amfnd/hcdb.cc              |    8 +-
 osaf/services/saf/amf/amfnd/include/avnd_cb.h    |   13 +-
 osaf/services/saf/amf/amfnd/include/avnd_comp.h  |   17 +-
 osaf/services/saf/amf/amfnd/include/avnd_di.h    |    5 +
 osaf/services/saf/amf/amfnd/include/avnd_evt.h   |    2 +
 osaf/services/saf/amf/amfnd/include/avnd_mds.h   |    4 +-
 osaf/services/saf/amf/amfnd/include/avnd_proc.h  |    1 +
 osaf/services/saf/amf/amfnd/include/avnd_su.h    |    4 +-
 osaf/services/saf/amf/amfnd/include/avnd_tmr.h   |    1 +
 osaf/services/saf/amf/amfnd/include/avnd_util.h  |    4 +
 osaf/services/saf/amf/amfnd/main.cc              |  103 +++++++++-
 osaf/services/saf/amf/amfnd/mds.cc               |   22 +-
 osaf/services/saf/amf/amfnd/pg.cc                |   18 +
 osaf/services/saf/amf/amfnd/sidb.cc              |    9 +-
 osaf/services/saf/amf/amfnd/su.cc                |   47 ++-
 osaf/services/saf/amf/amfnd/susm.cc              |  123 ++++++----
 osaf/services/saf/amf/amfnd/tmr.cc               |    1 +
 osaf/services/saf/amf/amfnd/util.cc              |  153 ++++++++++++++-
 osaf/services/saf/amf/amfnd/verify.cc            |   38 +---
 osaf/services/saf/amf/config/amf_classes.xml     |    8 +
 74 files changed, 3341 insertions(+), 308 deletions(-)


Testing Commands:
-----------------
 Repeat Nagu's failed tests

Testing, Expected Results:
--------------------------
 Tests pass


Conditions of Submission:
-------------------------
 ack from reviewers


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content in a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time, confusing
    the threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
