Summary: smfd: handle failed middleware si-swap
Review request for Trac Ticket(s): 1605
Peer Reviewer(s): mathi, neel, lennart, rafael
Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
Affected branch(es): default, 5.1, 5.0
Development branch: <<IF ANY GIVE THE REPO URL>>

--------------------------------
Impacted area       Impact y/n
--------------------------------
 Docs                    n
 Build system            n
 RPM/packaging           n
 Configuration files     n
 Startup scripts         n
 SAF services            y
 OpenSAF services        n
 Core libraries          n
 Samples                 n
 Tests                   n
 Other                   n

Comments (indicate scope for each "y" above):
---------------------------------------------

changeset adebe71f5cb378d300ac63f1989a3fa03056d889
Author: Alex Jones <ajo...@genband.com>
Date:   Wed, 05 Oct 2016 16:10:56 -0400

    smfd: handle failed middleware si-swap [#1605]

    Sep 27 00:34:14 q50-s1 osafsmfd[6667]: NO SA_AMF_ADMIN_SI_SWAP [rc=1] successfully initiated
    Sep 27 00:34:15 q50-s1 osafimmnd[6571]: NO ERR_BAD_OPERATION: Mismatch on administrative owner '' != 'SMFSERVICE'
    Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO Fail to invoke admin operation, rc=SA_AIS_ERR_BAD_OPERATION (20). dn=[safSi=SC-2N,safApp=OpenSAF], opId=[7]
    Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO Admin op SA_AMF_ADMIN_SI_SWAP fail [rc = 20]
    Sep 27 00:34:17 q50-s1 osafsmfd[6667]: NO CAMP: Procedure safSmfProc=RollingUpgrade returned FAILED
    Sep 27 00:36:14 q50-s1 osafsmfd[6667]: NO Campaign thread does not disappear within 120 seconds after SA_AMF_ADMIN_SI_SWAP, the operation was assumed failed.
    Sep 27 00:36:14 q50-s1 kernel: [14934029.531187] osafsmfd[32024]: segfault at 4 ip 00000000004425b6 sp 00007f67f7ffe1c0 error 4 in osafsmfd[400000+9a000]
    Sep 27 00:36:14 q50-s1 osafamfnd[6649]: NO 'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
    Sep 27 00:36:14 q50-s1 osafamfnd[6649]: ER safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast

    There are two problems here. One is that the SmfSwapThread is pointing
    to a deleted procedure when the original active controller is reassigned
    active. The second problem is that a new SmfSwapThread is created when
    the original active controller is reassigned active, so now there are
    two running.
    The first thread tries to use its proc pointer (which has been deleted
    when the original active goes to quiesced) and causes the segfault.

    The proposed solution is a little different from that proposed in the
    ticket description. This solution proposes to use the existence of the
    SmfSwapThread as a test. When the original active controller is
    reassigned active because the si-swap failed, it will still remove the
    RestartIndicator as it does now. But, if the SmfSwapThread is still
    running, it will not create a new one, but update it with the recreated
    procedure pointer, and let it handle the si-swap timeout. Then it will
    report the error.

    I believe this solution is backwards compatible because no IMM changes
    are made like the ones proposed in the ticket.


Complete diffstat:
------------------
 osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.cc |  69 +++++++++++++++++---
 osaf/services/saf/smfsv/smfd/SmfUpgradeProcedure.hh |   6 +
 2 files changed, 64 insertions(+), 11 deletions(-)


Testing Commands:
-----------------
(0) you must have manual code to fail the active assignment on the current
    standby controller. I use PLM, but others should work, too.
(1) do upgrade using smf that requires si-swap of middleware


Testing, Expected Results:
--------------------------
(1) current active controller should initiate si-swap
(2) current active controller should go to QUIESCED state
(3) current standby should get ACTIVE assignment
(4) current standby should fail ACTIVE assignment and reboot
(5) original active controller should get active assignment again
(6) SmfRestartIndicator should be removed
(7) smfd should detect that SmfSwapThread is still running, and not create
    a new one, but update it with recreated procedure
(8) SmfSwapThread should see that Campaign thread is never destroyed, and
    should set procedure state to StepUndone, and set cmpgError to
    "si-swap of middleware failed"


Conditions of Submission:
-------------------------
<<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>


Arch        Built     Started    Linux distro
-------------------------------------------
mips          n          n
mips64        n          n
x86           n          n
x86_64        y          y
powerpc       n          n
powerpc64     n          n


Reviewer Checklist:
-------------------
[Submitters: make sure that your review doesn't trigger any checkmarks!]

Your checkin has not passed review because (see checked entries):

___ Your RR template is generally incomplete; it has too many blank entries
    that need proper data filled in.
___ You have failed to nominate the proper persons for review and push.
___ Your patches do not have proper short+long header
___ You have grammar/spelling in your header that is unacceptable.
___ You have exceeded a sensible line length in your headers/comments/text.
___ You have failed to put in a proper Trac Ticket # into your commits.
___ You have incorrectly put/left internal data in your comments/files
    (i.e. internal bug tracking tool IDs, product names etc)
___ You have not given any evidence of testing beyond basic build tests.
    Demonstrate some level of runtime or other sanity testing.
___ You have ^M present in some of your files. These have to be removed.
___ You have needlessly changed whitespace or added whitespace crimes
    like trailing spaces, or spaces before tabs.
___ You have mixed real technical changes with whitespace and other
    cosmetic code cleanup changes. These have to be separate commits.
___ You need to refactor your submission into logical chunks; there is
    too much content in a single commit.
___ You have extraneous garbage in your review (merge commits etc)
___ You have giant attachments which should never have been sent;
    Instead you should place your content in a public tree to be pulled.
___ You have too many commits attached to an e-mail; resend as threaded
    commits, or place in a public tree for a pull.
___ You have resent this content multiple times without a clear indication
    of what has changed between each re-send.
___ You have failed to adequately and individually address all of the
    comments and change requests that were proposed in the initial review.
___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
___ Your computer has a badly configured date and time, confusing the
    threaded patch review.
___ Your changes affect IPC mechanism, and you don't present any results
    for in-service upgradability test.
___ Your changes affect user manual and documentation, your patch series
    do not contain the patch that updates the Doxygen manual.
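
Illustration only: the approach described in the changeset comment (if the swap thread is still running, do not create a second one, but hand it the recreated procedure pointer and let it handle the timeout) can be sketched roughly as below. This is not the code in the patch. Procedure, SwapMonitor, start_or_update() and on_timeout() are hypothetical names invented for this sketch, and a plain flag stands in for the real SmfSwapThread.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for an upgrade procedure object.
struct Procedure {
  bool step_undone = false;  // models the "StepUndone" procedure state
  std::string error;         // models the campaign error text
};

// Hypothetical stand-in for the single swap-monitoring thread.
class SwapMonitor {
 public:
  // Returns true if a new monitor was started, false if one was already
  // running and was only re-pointed at the (recreated) procedure.
  bool start_or_update(Procedure* proc) {
    assert(proc != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    proc_ = proc;  // always refresh, so a stale pointer is never kept
    if (running_) return false;
    running_ = true;
    return true;
  }

  // Models the si-swap timeout expiring: report the failure on whatever
  // procedure the monitor currently points at, then stop.
  void on_timeout() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!running_ || proc_ == nullptr) return;
    proc_->step_undone = true;
    proc_->error = "si-swap of middleware failed";
    running_ = false;
    proc_ = nullptr;
  }

  bool running() {
    std::lock_guard<std::mutex> lock(mutex_);
    return running_;
  }

 private:
  std::mutex mutex_;
  bool running_ = false;
  Procedure* proc_ = nullptr;
};
```

In this sketch the second call to start_or_update() models the original active controller being reassigned active: it returns false, and the later on_timeout() marks the recreated procedure rather than the deleted one.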