From: Anand Sundararaj <s.an...@gethighavailability.com>

Summary: amf: implement node repair admin command [#3204]
Review request for Ticket(s): 3204
Peer Reviewer(s): Minh, Thang, Nagendra, Paul
Pull request to: Amf Maintainers
Affected branch(es): develop
Development branch: ticket-3204
Base revision: 59ded7cdf6a431e522229afd5ecb989e4a61c7d8
Personal repository: git://git.code.sf.net/u/s-anand-has/review

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

NOTE: Patch(es) contain lines longer than 80 characters

Comments (indicate scope for each "y" above):
---------------------------------------------
*** EXPLAIN/COMMENT THE PATCH SERIES HERE ***

revision 3f86a0aefe7dd17e78b2d156178f9dff670e59b8
Author: Anand Sundararaj <s.an...@gethighavailability.com>
Date:   Fri, 24 Jul 2020 04:28:51 +0530

amf: implement node repair admin command [#3204]



Complete diffstat:
------------------
 src/amf/amfd/node.cc    | 56 +++++++++++++++++++++++++++++++++++++++++++-
 src/amf/amfd/node.h     |  2 ++
 src/amf/amfd/sgproc.cc  | 18 ++++++++++++++
 src/amf/amfnd/avnd_su.h |  1 +
 src/amf/amfnd/di.cc     | 62 +++++++++++++++++++++++++++++++++++++++++++++++++
 src/amf/amfnd/err.cc    |  2 +-
 src/amf/amfnd/su.cc     |  2 +-
 7 files changed, 140 insertions(+), 3 deletions(-)


Testing Commands:
-----------------
Configure two demo applications (App1 and App2, available in 
samples/amf/sa_aware) on SC-1 and PL-3.
Set saAmfNodeAutoRepair to false on PL-3.
Set saAmfCtDefRecoveryOnError to 5 (node failover) for the App2 demo application.
Unlock all 4 SUs: two running on SC-1 (Standby) and two running on PL-3 (Active).
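
The node auto-repair setting above could be scripted roughly as follows. This
is only a sketch: it assumes the standard OpenSAF immcfg tool is on the PATH
and that saAmfNodeAutoRepair can be modified at runtime on this build. The
set_node_autorepair helper and the IMMCFG variable are hypothetical names used
for illustration. The saAmfCtDefRecoveryOnError change is left out because the
component type DN is not given in this mail.

```shell
#!/bin/sh
# IMMCFG can be overridden to dry-run the helper (assumption: immcfg is on PATH).
IMMCFG="${IMMCFG:-immcfg}"

# Hypothetical helper: set saAmfNodeAutoRepair (0 = false, 1 = true) on a node.
set_node_autorepair() {
    node_dn="$1"
    value="$2"
    "$IMMCFG" -a "saAmfNodeAutoRepair=$value" "$node_dn"
}
```

Example call for this test setup:
set_node_autorepair safAmfNode=PL-3,safAmfCluster=myAmfCluster 0
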
1. Kill the App2 demo application on PL-3. Node failover happens and the SUs 
running on SC-1 become Active.
osafamfd[10367]: NO NodeAutorepair disabled for 
'safAmfNode=PL-3,safAmfCluster=myAmfCluster', no reboot ordered

Repair node PL-3 using:
amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster

The PL-3 node state becomes enabled and the 2 SUs running on PL-3 get Standby 
assignments.
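
The repair and the follow-up check could be run from a shell along these
lines. A sketch only: amf-adm repaired is the command shown above, while the
amf-state node oper and amf-state siass invocations assume the usual OpenSAF
amf-state tool is installed. The repair_and_show helper and the AMF_ADM and
AMF_STATE variables are hypothetical names.

```shell
#!/bin/sh
# Both tools can be overridden for a dry run (assumption: they are on PATH).
AMF_ADM="${AMF_ADM:-amf-adm}"
AMF_STATE="${AMF_STATE:-amf-state}"

# Hypothetical helper: repair a node, then print its operational state and
# the current SI assignments so the Standby assignments can be confirmed.
repair_and_show() {
    node_dn="$1"
    "$AMF_ADM" repaired "$node_dn" || return 1
    "$AMF_STATE" node oper "$node_dn"
    "$AMF_STATE" siass
}
```

Example call for this test setup:
repair_and_show safAmfNode=PL-3,safAmfCluster=myAmfCluster
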

2. Repeat test case #1 to confirm the behaviour is reproducible.
3. Repeat #1 up to, but not including, the repair. Then delete both SUs running 
on PL-3 and repair PL-3.
   Now add both SUs back and unlock them. They are given Standby assignments.
4. Repeat #1 up to, but not including, the repair.
When the repair command is issued, hold it at amfnd on PL-3 using gdb and 
reboot the machine.
The repair command will return SA_AIS_ERR_REPAIR_PENDING:

amf-adm repaired safAmfNode=PL-3,safAmfCluster=myAmfCluster
error - saImmOmAdminOperationInvoke_2 admin-op RETURNED: 
SA_AIS_ERR_REPAIR_PENDING (29)
error-string: node failure

When the node starts again, its 2 SUs get Standby assignments.
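
A test script could tell the pending case apart from other failures roughly
as sketched here. It assumes that amf-adm exits non-zero on error and that the
error text contains the SA_AIS_ERR_REPAIR_PENDING string shown above. The
try_repair helper and the AMF_ADM variable are hypothetical names.

```shell
#!/bin/sh
# AMF_ADM can be overridden for a dry run (assumption: amf-adm is on PATH).
AMF_ADM="${AMF_ADM:-amf-adm}"

# Hypothetical helper: run the repair and classify the outcome.
# Prints "repaired", "pending" or "failed"; returns 0 only for "repaired".
try_repair() {
    node_dn="$1"
    # Capture stdout and stderr; the assignment keeps the command's exit status.
    if out=$("$AMF_ADM" repaired "$node_dn" 2>&1); then
        echo "repaired"
        return 0
    fi
    case "$out" in
        *SA_AIS_ERR_REPAIR_PENDING*) echo "pending" ;;
        *) echo "failed" ;;
    esac
    return 1
}
```

In test case 4 the expected classification is "pending", after which the node
reboot is awaited instead of treating the run as a failure.
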

5. Repeat #1 up to, but not including, the repair.
   Issue node lock/lock-in and then issue repair, followed by unlock-in/unlock. 
The 2 SUs on PL-3 get Standby assignments.
6. Repeat #5 for SU/SG/node-group/SI
7. Make the following changes for App1, with SUFailover set to false:
                <attr>
                        <name>saAmfSgtDefCompRestartProb</name>
                        <value>40000000000</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefCompRestartMax</name>
                        <value>2</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefSuRestartProb</name>
                        <value>40000000000</value>
                </attr>
                <attr>
                        <name>saAmfSgtDefSuRestartMax</name>
                        <value>1</value>
                </attr>

 Kill the demo component of App1 until recovery escalates to node failover.
osafamfnd[11249]: NO SU failovers have reached configured limit of 2
osafamfnd[11249]: NO SU failover probation timer stopped
osafamfnd[11249]: NO 'safComp=AmfDemo,safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' 
recovery action escalated from 'componentRestart' to 'nodeFailover'
osafamfnd[11249]: NO 'safComp=AmfDemo,safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1' 
faulted due to 'avaDown' : Recovery is 'nodeFailover'
osafamfnd[11249]: NO Informing director of node fail-over
Then repair the node. The 2 SUs get Standby assignments.

8. Repeat all the above test cases with saAmfCtDefRecoveryOnError set to 4 
(node switchover) for the App2 demo application.

Testing, Expected Results:
--------------------------
After the node repair command, all eligible SUs get Standby assignments.

Conditions of Submission:
-------------------------
Ack from amf maintainers, timeout in 3 days.

Arch      Built     Started    Linux distro
-------------------------------------------
mips        n          n
mips64      n          n
x86         n          n
x86_64      y          y
powerpc     n          n
powerpc64   n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]


Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.

___ You have failed to nominate the proper persons for review and push.

___ Your patches do not have proper short+long header

___ You have grammar/spelling in your header that is unacceptable.

___ You have exceeded a sensible line length in your headers/comments/text.

___ You have failed to put in a proper Trac Ticket # into your commits.

___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)

___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.

___ You have ^M present in some of your files. These have to be removed.

___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.

___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.

___ You need to refactor your submission into logical chunks; there is
    too much content into a single commit.

___ You have extraneous garbage in your review (merge commits etc)

___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.

___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.

___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.

___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.

___ You have a misconfigured ~/.gitconfig file (i.e. user.name, user.email etc)

___ Your computer has a badly configured date and time, confusing the
    threaded patch review.

___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.

___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.



_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
