Analysis: Active controller SC-1 successfully performed failover of SU1 of SG-C by making SU2 active. However, AMFD could not update this new HA state of SU2 in IMM because the local IMMND was already down by that time; AMFD at SC-1 kept getting TRY_AGAIN from IMM.
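
For context, the TRY_AGAIN handling that the job queue performs boils down to the pattern sketched below. This is only an illustration against the standard SAF IMM OI C API, not the actual AMFD code: AMFD re-executes queued updates from its main loop rather than blocking, and the helper name update_sisu_ha_state, the blocking sleep/retry loop and the plain SaUint32T HA value (1=ACTIVE, 2=STANDBY) are simplifications of mine.
-------------------------------------------------
#include <string.h>
#include <unistd.h>

#include <saAis.h>
#include <saImm.h>
#include <saImmOi.h>

/* Publish saAmfSISUHAState for one SISU runtime object, retrying while the
 * local IMMND is unavailable (SA_AIS_ERR_TRY_AGAIN). Sketch only; the DN is
 * assumed to be shorter than SA_MAX_NAME_LENGTH. */
static SaAisErrorT update_sisu_ha_state(SaImmOiHandleT oi_handle,
                                        const char *sisu_dn,
                                        SaUint32T ha_state) /* 1=ACTIVE, 2=STANDBY */
{
    SaNameT object_name;
    object_name.length = (SaUint16T)strlen(sisu_dn);
    memcpy(object_name.value, sisu_dn, object_name.length);

    char attr_name[] = "saAmfSISUHAState";
    SaImmAttrValueT values[] = { &ha_state };

    SaImmAttrModificationT_2 mod = {
        .modType = SA_IMM_ATTR_VALUES_REPLACE,
        .modAttr = {
            .attrName = attr_name,
            .attrValueType = SA_IMM_ATTR_SAUINT32T,
            .attrValuesNumber = 1,
            .attrValues = values
        }
    };
    const SaImmAttrModificationT_2 *mods[] = { &mod, NULL };

    SaAisErrorT rc;
    do {
        rc = saImmOiRtObjectUpdate_2(oi_handle, &object_name, mods);
        if (rc == SA_AIS_ERR_TRY_AGAIN)
            sleep(1); /* local IMMND down or restarting; retry later */
    } while (rc == SA_AIS_ERR_TRY_AGAIN);

    return rc;
}
-------------------------------------------------
The point is simply that the update cannot succeed while the local IMMND is down; it stays queued and keeps returning TRY_AGAIN, which is exactly what the trace in point 5 below shows.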
Since this is an OpenSAF 4.6 release, does it include the fix for #1141? Ticket #1141 was pushed to the 4.6 branch after 4.6 GA. With the fix for #1141, the standby AMFD updates the HA state after becoming active. Please also include the fix for #2009 when it is available (it will be on top of #1141).

Messages from the AMFD trace of SC-1 which confirm this:
-------------------------------------------------
1) Active AMFD at SC-1 starts failover of node PL-5 at:
Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'

2) It sends the active assignment to SU2:
Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1

3) It seems the local IMMND (also LGS and NTFS) went down before AMFD got a successful response:
Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN

4) AMFD at SC-1 also gets the response for this successful assignment:
Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >> avd_su_si_assign_evh: id:33, node:2060f, act:5, 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0

AMFD adds the IMM updates to its job queue:
Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' saAmfCSICompHAState
Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate

5) But AMFD at SC-1 could not update the HA state in IMM:
Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
----------------------------------------------------------------

Thanks,
Praveen

On 19-Oct-16 12:28 AM, David Hoyt wrote:
> Attached are the amfd logs from both controllers.
>
> Server-1: has 3 VMs for SC-1, PL-3 and PL-5. All the SUs within a 2N redundancy are Active.
> Server-2: has 3 VMs for SC-2, PL-4 and PL-6. All the SUs within a 2N redundancy are Standby.
>
> SC-1: opensaf SU1 and SG-A, SU1
> SC-2: opensaf SU2 and SG-A, SU2
> PL-3: SG-B, SU1
> PL-4: SG-B, SU2
> PL-5: SG-C, SU1
> PL-6: SG-C, SU2
>
> Initial conditions: All the SUs within a 2N redundancy are Active on Server-1.
>
> I issued a reboot of Server-1 around Oct 18 17:23.
> From the amfd logs of SC-1, it appears PL-5 began the failover around 17:23:19.
>
> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>
> I waited to see if SG-C's HA state would correct itself but it did not:
>
> [root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C
> Tue Oct 18 17:36:26 UTC 2016
> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
> saAmfSISUHAState=STANDBY(2)
> --
> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
> saAmfSISUHAState=STANDBY(2)
> [root@sc-2 ~]#
>
> Even though the controller indicates that both SUs are standby, the logs from the original standby SG-C, SU2 show it going active:
>
> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253
> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS message:51732
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for message type 57 - ignoring
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS message:51733
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for message type 57 - ignoring
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node received for nodeId:2010f pid:1739
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 253 <0, 2010f(down)> (safAmfService)
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 251 <0, 2010f(down)> (safClmService)
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 250 <0, 2010f(down)> (safLogService)
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 248 <0, 2010f(down)> (@OpenSafImmPBE)
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 249 <0, 2010f(down)> (OsafImmPbeRt_B)
> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'
>
> Thanks,
> Dave
>
> *From:* praveen malviya [mailto:praveen.malv...@oracle.com]
> *Sent:* Tuesday, October 18, 2016 8:30 AM
> *To:* David Hoyt <david.h...@genband.com>; opensaf-users@lists.sourceforge.net
> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>
> ------------------------------------------------------------------------
> NOTICE: This email was received from an EXTERNAL sender
> ------------------------------------------------------------------------
>
> Please see response inline with [Praveen].
>
> Thanks,
> Praveen
>
> On 17-Oct-16 11:46 PM, David Hoyt wrote:
>> Hi all,
>>
>> I'm encountering a scenario where opensaf shows the HA state of both SUs within a 2N redundancy Service Group as standby.
>>
>> Setup:
>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP
>> - 2 controllers, 4 payloads
>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)
>> - PL-3 & PL-4 have SU1 & SU2 from SG-A (2N redundancy)
>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)
>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5
>> - Likewise, server-2 has SC-2, PL-4 and PL-6
>>
>> I reboot server-1 and shortly afterwards, the SG-A SUs begin to fail over. SU2 on PL-4 goes active.
>> Around the same time, the opensaf 2N SUs fail over.
>> After the dust has settled, and server-1 comes back as well as the VMs, all appears fine except the SG-A SUs. They both have a standby HA state.
>>
>> Is there any way to correct this?
>
> [Praveen] I think there is no issue from the callback perspective, as SU2 on PL-4 was made active (the components received callbacks). The only problem is the output of "amf-state siass".
> Please share the AMFD traces from both controllers.
>
>> Is there some audit that periodically checks the validity of the HA states?
>>
>> Now, when SG-A, SU1 recovers, I did swact the SUs and it corrected the HA state. However, if server-1 goes down for an extended period, the HA state of SG-A, SU2 will appear as Standby, when it's actually running as active.
>>
>> Before the reboot:
>>
>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=ACTIVE(1)
>> --
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=STANDBY(2)
>> [root@jenga-56-sysvm-1 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=ACTIVE(1)
>> [root@sc-2 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=ACTIVE(1)
>> [root@sc-2 ~]#
>>
>> After the reboot:
>>
>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=ACTIVE(1)
>> [root@sc-2 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=STANDBY(2)
>> [root@sc-2 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=ACTIVE(1)
>> --
>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=STANDBY(2)
>> [root@sc-2 ~]#

_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users