Summary: AMF: Recover transient SUSIs from headless (admin continuation, node restart) [#1725] V3
Review request for Trac Ticket(s): 1725
Peer Reviewer(s): AMF devs
Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
Affected branch(es): 5.0, default
Development branch: default

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n


Comments (indicate scope for each "y" above):
---------------------------------------------
 Additions in V3:
 - Add a patch to recover from node restart during headless
 - Add a patch to validate cached RTA read from IMM

changeset d921dfed678b396087c46cb3af1249e4f3f5b7ab
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:19 +1000

        AMFD: Introduce new RTA states for admin operation continuation after
        headless [#1725 part 1] V3

        If an admin operation is running when the cluster goes into the headless
        stage, the normal admin operation sequence is interrupted. Since both SCs
        are down, the SI assignments at AMFND could be ongoing or completed
        during the headless period. After headless, this admin operation should
        be continued. This patch series supports admin operation continuation
        after headless.

        To resume the admin operation after headless, the states that need to be
        restored are: SUSI fsm states, SG fsm states, SI Dependency states (not
        supported in this patch), SU Switch toggle, and the SU operation list in
        the SG at the time the cluster goes headless.

        At this moment, the SG fsm states are set differently in each specific SG
        model. Also, the rule for adding an SU to the SG's operation list is not
        consistent. In most places, an SU is added to the operation list after
        AMFD sends a su_si_assign event on this SU. However, there are some
        scenarios in which an SU is added to the list for other purposes
        (failover). These difficulties make the state deduction logic hard to
        implement.

        This patch introduces new RTA states: osafAmfSGSuOperationList,
        osafAmfSGFsmState, osafAmfSISUFsmState and osafAmfSUSwitch, which capture
        the SU operation list of the SG, the SG fsm state, the SUSI fsm state,
        and the SU Switch from AMFD's memory to IMM during AMFD's lifetime. When
        the cluster comes back from headless, these RTAs are read from IMM to
        restore the states in AMFD's memory. The patch also adds a field to the
        state_info (headless synchronization) message which indicates the
        current SUSI fsm states. Together, the synced and the IMM SUSI fsm
        states help to validate the new RTA states read from IMM after headless.
        Example: if the IMM SUSI fsm state is ASGN and the synced SUSI fsm state
        is ASGND, then the HA state must be ACTIVE or STANDBY. Such validation is
        necessary since a headless interruption is unplanned and the recovery
        depends heavily on the RTAs read from IMM.
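
The validation example above can be sketched as a small predicate. This is
only an illustration under assumptions: the enum types and the function name
IsCachedSusiConsistent are hypothetical and are not the identifiers used in
AMFD, and only the single rule quoted in the description is modelled.

```cpp
#include <cassert>

// Hypothetical enums for illustration; the real AMFD types differ.
enum class SusiFsm { kAssigning, kAssigned, kModifying, kUnassigning };
enum class HaState { kActive, kStandby, kQuiesced, kQuiescing };

// Returns true if the SUSI fsm state cached in IMM is consistent with the
// SUSI fsm state synced from AMFND in the state_info message.
bool IsCachedSusiConsistent(SusiFsm imm_fsm, SusiFsm synced_fsm, HaState ha) {
  // Rule from the description: IMM says ASGN, the synced state says ASGND,
  // so the HA state must be ACTIVE or STANDBY.
  if (imm_fsm == SusiFsm::kAssigning && synced_fsm == SusiFsm::kAssigned) {
    return ha == HaState::kActive || ha == HaState::kStandby;
  }
  // No other rule is modelled in this sketch; accept by default.
  return true;
}
```

Further rules would be added as additional branches; what to do when the
predicate returns false is left open here.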

changeset 30e3871ace1c014efab53e7428c33cc6ce4aece6
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:23 +1000

        AMFND: Admin operation continuation if csi completes during headless
        [#1725 part 1] V1

        There are basically two options for AMFD to continue an admin operation
        with completed csi(s):

        First: AMFD uses the synced SUSI fsm state as the latest. AMFD then has
        to examine its SUSI assignments together with the adminStates of the
        relevant entities to determine on which SU susi_success() should be
        called. CSI addition needs an even deeper level of examination. This
        option also depends on the SG fsm state, which is used differently in
        different SG types.

        Second: AMFD uses the SUSI fsm state read from IMM as the latest, and
        AMFND resends the susi_resp messages that were deferred during headless
        so that AMFD can continue the admin operation sequence. Both cases of csi
        completion (during or after headless) can run in the same code flow.

        The patch buffers susi_resp_msg during the headless stage and resends it
        to AMFD after headless. There is a chance that AMFND sent out the susi
        response message but AMFD could not receive or process it. This case can
        be seen as a defect, which can be fixed by securing the result of sending
        the susi_resp message from AMFND to AMFD.
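
The buffer-and-resend behaviour can be sketched as follows. This is a minimal
model, not the AMFND code: SusiResp, RespBuffer and the Sender callback are
hypothetical names introduced for illustration, and the real message carries
more fields than the two shown.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a susi_resp message.
struct SusiResp {
  std::string su;
  std::string si;
};

// Defers responses while no director is reachable and resends them, in
// their original order, once headless ends.
class RespBuffer {
 public:
  using Sender = std::function<void(const SusiResp&)>;
  explicit RespBuffer(Sender send) : send_(std::move(send)) {}

  // Leaving headless flushes everything that was deferred.
  void SetHeadless(bool headless) {
    headless_ = headless;
    if (!headless_) Flush();
  }

  // Sends immediately when a director is reachable, otherwise buffers.
  void Respond(const SusiResp& resp) {
    if (headless_) {
      pending_.push_back(resp);
    } else {
      send_(resp);
    }
  }

  std::size_t pending() const { return pending_.size(); }

 private:
  void Flush() {
    while (!pending_.empty()) {
      send_(pending_.front());
      pending_.pop_front();
    }
  }

  Sender send_;
  bool headless_ = false;
  std::deque<SusiResp> pending_;
};
```

Because deferred and immediate responses go through the same Sender, the
receiving side sees one code flow whether the csi completed during or after
headless. The sketch does not cover the case where a message was sent but
never processed by AMFD.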

changeset 683d8522ee2175539f4aa63f2200513fcc6b0022
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:31 +1000

        AMFD: Failover absent assignment due to node restart or powered off while
        headless [#1725 part 2]

        When a payload restarts or is powered off during headless, the SUSI
        assignments on this payload are removed, which breaks the HA
        characteristics of the SUSI assignments after headless.

        This patch treats the SUSI assignments removed during headless as ABSENT
        SUSIs, and reuses node_fail() to perform a failover on any SU having
        ABSENT SUSIs, so that the HA state of the SUSI assignments becomes
        STABLE, which means no QUIESCED/QUIESCING SUSIs, etc. Inside node_fail(),
        any su_si assignment event toward AMFND on an ABSENT SUSI, such as
        modification or deletion, will be ignored.
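
A rough sketch of the two decisions described above: which SUs need a
failover, and which events are suppressed. All names here (Susi, SusiFsm,
ShouldSendToNode, SusNeedingFailover) are hypothetical and chosen for
illustration; the actual patch reuses node_fail() rather than helpers like
these.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a SUSI and its fsm state.
enum class SusiFsm { kAbsent, kAssigning, kAssigned, kModifying, kUnassigning };

struct Susi {
  std::string su;
  std::string si;
  SusiFsm fsm;
};

// An assignment event (modification, deletion) is only sent toward the node
// director if the SUSI is not ABSENT.
bool ShouldSendToNode(const Susi& susi) { return susi.fsm != SusiFsm::kAbsent; }

// Collects the distinct SUs that own at least one ABSENT SUSI. These are the
// SUs on which a failover would be performed.
std::vector<std::string> SusNeedingFailover(const std::vector<Susi>& susis) {
  std::vector<std::string> out;
  for (const Susi& s : susis) {
    if (s.fsm != SusiFsm::kAbsent) continue;
    if (std::find(out.begin(), out.end(), s.su) == out.end()) {
      out.push_back(s.su);
    }
  }
  return out;
}
```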

changeset cfdffd52354ba8d00a2fb53de94f28b5f58bbdf5
Author: minh-chau <minh.c...@dektech.com.au>
Date:   Thu, 18 Aug 2016 09:56:35 +1000

        AMFD: Validate headless cached RTA read from IMM [#1725]

        A headless interruption is an unplanned event, and writing RTAs to IMM
        is currently queued up in the AMFD implementation. This can result in
        inappropriate values of the SG fsm state, SUSI fsm state, HA state,
        SUOperationList, etc. Eventually, AMFD will run into an unstable SG, a
        false assertion, or even SUSIs that become permanently PARTIALLY
        assigned, which is hard to debug (even harder without trace).

        This patch adds a validation routine to check the headless cached RTAs
        read from IMM; more validation rules are to be added. Also, a TODO is
        left for discussion about what action should be taken if validation
        fails.


Complete diffstat:
------------------
 osaf/libs/common/amf/d2nedu.c                  |    5 +-
 osaf/libs/common/amf/include/amf_d2nmsg.h      |    4 +
 osaf/libs/common/amf/include/amf_si_assign.h   |    2 +-
 osaf/services/saf/amf/amfd/cluster.cc          |    9 +
 osaf/services/saf/amf/amfd/csi.cc              |   93 +++++++-----
 osaf/services/saf/amf/amfd/imm.cc              |    5 +-
 osaf/services/saf/amf/amfd/include/csi.h       |    3 +-
 osaf/services/saf/amf/amfd/include/imm.h       |    5 +-
 osaf/services/saf/amf/amfd/include/mds.h       |    7 +-
 osaf/services/saf/amf/amfd/include/proc.h      |    4 +-
 osaf/services/saf/amf/amfd/include/sg.h        |   11 +-
 osaf/services/saf/amf/amfd/include/su.h        |    6 +-
 osaf/services/saf/amf/amfd/include/susi.h      |   13 +-
 osaf/services/saf/amf/amfd/include/util.h      |    2 +
 osaf/services/saf/amf/amfd/mds.cc              |    7 +-
 osaf/services/saf/amf/amfd/ndfsm.cc            |   24 +++-
 osaf/services/saf/amf/amfd/role.cc             |    6 -
 osaf/services/saf/amf/amfd/sg.cc               |  180 +++++++++++++++++++++++++-
 osaf/services/saf/amf/amfd/sg_2n_fsm.cc        |   21 +-
 osaf/services/saf/amf/amfd/sg_npm_fsm.cc       |    2 +-
 osaf/services/saf/amf/amfd/sg_nwayact_fsm.cc   |    2 +-
 osaf/services/saf/amf/amfd/sgproc.cc           |  168 ++++++++++++++----------
 osaf/services/saf/amf/amfd/siass.cc            |  317 ++++++++++++++++++++++++++++++++++++---------
 osaf/services/saf/amf/amfd/su.cc               |  128 ++++++++++++++---
 osaf/services/saf/amf/amfnd/di.cc              |  213 +++++++++++++++++++++---------
 osaf/services/saf/amf/amfnd/include/avnd_di.h  |    1 +
 osaf/services/saf/amf/amfnd/include/avnd_mds.h |    4 +-
 osaf/services/saf/amf/amfnd/mds.cc             |    6 +-
 osaf/services/saf/amf/config/amf_classes.xml   |   28 ++++
 29 files changed, 962 insertions(+), 314 deletions(-)


Testing Commands:
-----------------
 Execute the test list attached to ticket #1725 that covers admin
 continuation and node restart. This series still relies on immediate
 escalation while headless, which means a node will reboot on any kind of
 failover or switchover.


Testing, Expected Results:
--------------------------
 Some test cases in non-headless already fail without #1725. Tickets were
 raised for these failing cases, but they have not been fixed yet. So, if a
 test fails with #1725, please rerun the same test in non-headless.


Conditions of Submission:
-------------------------
 ack


Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)

___ Your computer has a badly configured date and time, confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.

