Hi Dave,

In ticket #1141, there are two patches in 4.6, with the following changesets:
changeset:   6839:964e043fa545
branch:      opensaf-4.6.x
parent:      6836:39806387b4c3
user:        praveen.malv...@oracle.com
date:        Thu Sep 17 10:33:32 2015 +0530
summary:     amfd: maintain runtime updates for su, comp, si and csi at standby [#1141]

and

changeset:   6896:18b4458e357a
branch:      opensaf-4.6.x
tag:         tip
parent:      6886:22d7d85287e6
user:        praveen.malv...@oracle.com
date:        Mon Sep 28 11:51:34 2015 +0530
summary:     amfd: include comp and su oper state at standby amfd [#1141]

Please share AMFD traces if you are still observing the issue after applying the above two patches.

Thanks,
Praveen

On 21-Oct-16 4:24 AM, David Hoyt wrote:
> Hi Praveen,
>
> Since I’m using OpenSAF 4.6.0, I did not have the fix for #1141.
> It looks like it was included as part of the 4.6.1 maintenance release.
>
> So, I looked at the file diff for #1141, made the same code changes,
> rebuilt the rpms, and replaced them in my lab.
> However, I’m still seeing the same behavior.
>
> Are there other changes I should look at adding?
>
> Thanks,
> Dave
>
> *From:* praveen malviya [mailto:praveen.malv...@oracle.com]
> *Sent:* Wednesday, October 19, 2016 4:30 AM
> *To:* David Hoyt <david.h...@genband.com>; opensaf-users@lists.sourceforge.net
> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>
> Analysis:
> The active controller SC-1 successfully performed fail-over of SU1 of SG-C
> by making SU2 active. But AMFD could not update this new HA state of SU2
> in IMM, as IMMND was already down by that time; AMFD at SC-1 got
> TRY_AGAIN from IMM.
>
> Since this is the OpenSAF 4.6 release, it does not include the fix for #1141.
> Ticket #1141 was pushed to the 4.6 branch after 4.6 GA. In the fix for #1141,
> the standby AMFD updates the HA state after becoming active.
>
> Also please include the fix for #2009 when available (which will be on
> top of #1141).
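For illustration, here is a minimal C sketch of the pattern the trace excerpts below show: AMFD queues a runtime-attribute update for saAmfSISUHAState and keeps retrying while saImmOiRtObjectUpdate_2() returns SA_AIS_ERR_TRY_AGAIN, which the IMM OI agent reports for as long as the local IMMND is down. This is not the actual imm.cc job code; the function names and the retry policy here are illustrative only.

/* sketch.c -- illustrative only, not OpenSAF source.  Shows one queued
 * runtime update of saAmfSISUHAState and how a TRY_AGAIN result leaves
 * the job on the queue for a later retry. */
#include <saAis.h>
#include <saImmOi.h>

/* Set saAmfSISUHAState on one SaAmfSIAssignment runtime object. */
SaAisErrorT update_ha_state_job(SaImmOiHandleT oi_handle,
                                const SaNameT *si_assignment_dn,
                                SaUint32T ha_state)
{
    void *values[] = { &ha_state };

    SaImmAttrValuesT_2 attr = {
        .attrName = (char *)"saAmfSISUHAState",
        .attrValueType = SA_IMM_ATTR_SAUINT32T,
        .attrValuesNumber = 1,
        .attrValues = values,
    };
    SaImmAttrModificationT_2 mod = {
        .modType = SA_IMM_ATTR_VALUES_REPLACE,
        .modAttr = attr,
    };
    const SaImmAttrModificationT_2 *mods[] = { &mod, NULL };

    /* Returns SA_AIS_ERR_TRY_AGAIN while the local IMMND is down. */
    return saImmOiRtObjectUpdate_2(oi_handle, si_assignment_dn, mods);
}

/* Job-queue behaviour as seen in the trace: keep the job and retry later. */
void execute_job(SaImmOiHandleT oi_handle, const SaNameT *dn)
{
    SaAisErrorT rc = update_ha_state_job(oi_handle, dn, 1 /* SA_AMF_HA_ACTIVE */);
    if (rc == SA_AIS_ERR_TRY_AGAIN) {
        /* leave the job queued; the main loop retries after a short delay */
    } else if (rc != SA_AIS_OK) {
        /* log the error and decide whether to re-queue or drop the job */
    }
}

In the reported case the old active AMFD never got past the TRY_AGAIN retries before the node went down, so the cached value in IMM stayed STANDBY; that is the gap the #1141 change closes by having the newly active AMFD push the state again.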
> Messages from the AMFD trace of SC-1 which confirm this:
> -------------------------------------------------
> 1) Active AMFD at SC-1 starts failover of node PL-5 at:
> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>
> 2) It sends an active assignment to SU2:
> Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1
>
> 3) It seems the local IMMND (also LGS and NTFS) went down before AMFD got a successful response:
> Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN
>
> 4) AMFD at SC-1 also gets the response for this successful assignment:
> Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >> avd_su_si_assign_evh: id:33, node:2060f, act:5, 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
>
> AMFD adds the IMM updates to its job queue:
> Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
> Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
> Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' saAmfCSICompHAState
> Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
>
> 5) But AMFD at SC-1 could not update the HA state in IMM:
> Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
> Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
> Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
> Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
> Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
> Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
> Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
> Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
> Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
> Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
> Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
> Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
> Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
> ----------------------------------------------------------------
>
> Thanks,
> Praveen
>
> On 19-Oct-16 12:28 AM, David Hoyt wrote:
>> Attached are the amfd logs from both controllers.
>>
>> Server-1 has 3 VMs: SC-1, PL-3 and PL-5. All the SUs within a 2N
>> redundancy are Active.
>> Server-2 has 3 VMs: SC-2, PL-4 and PL-6. All the SUs within a 2N
>> redundancy are Standby.
>>
>> SC-1: opensaf SU1 and SG-A, SU1
>> SC-2: opensaf SU2 and SG-A, SU2
>> PL-3: SG-B, SU1
>> PL-4: SG-B, SU2
>> PL-5: SG-C, SU1
>> PL-6: SG-C, SU2
>>
>> Initial conditions: All the SUs within a 2N redundancy are Active on
>> Server-1.
>>
>> I issued a reboot of Server-1 around Oct 18 17:23. From the amfd logs of
>> SC-1, it appears PL-5 began the failover around 17:23:19.
>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>>
>> I waited to see if SG-C’s HA state would correct itself, but it did not:
>>
>> [root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C
>> Tue Oct 18 17:36:26 UTC 2016
>> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>> saAmfSISUHAState=STANDBY(2)
>> [root@sc-2 ~]#
>>
>> Even though the controller indicates that both SUs are standby, the logs
>> from the original standby SG-C, SU2 show it going active:
>>
>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS message:51732
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for message type 57 - ignoring
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS message:51733
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for message type 57 - ignoring
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node received for nodeId:2010f pid:1739
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 253 <0, 2010f(down)> (safAmfService)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 251 <0, 2010f(down)> (safClmService)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 250 <0, 2010f(down)> (safLogService)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 248 <0, 2010f(down)> (@OpenSafImmPBE)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 249 <0, 2010f(down)> (OsafImmPbeRt_B)
>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'
>>
>> Thanks,
>> Dave
>>
>> *From:* praveen malviya [mailto:praveen.malv...@oracle.com]
>> *Sent:* Tuesday, October 18, 2016 8:30 AM
>> *To:* David Hoyt <david.h...@genband.com>; opensaf-users@lists.sourceforge.net
>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>>
>> Please see responses inline, marked [Praveen].
>>
>> Thanks,
>> Praveen
>>
>> On 17-Oct-16 11:46 PM, David Hoyt wrote:
>>> Hi all,
>>>
>>> I'm encountering a scenario where opensaf shows the HA state of both
>>> SUs within a 2N redundancy Service Group as standby.
>>>
>>> Setup:
>>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP
>>> - 2 controllers, 4 payloads
>>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)
>>> - PL-3 & PL-4 have SU1 & SU2 from SG-A (2N redundancy)
>>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)
>>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5
>>> - Likewise, server-2 has SC-2, PL-4 and PL-6
>>>
>>> I reboot server-1 and shortly afterwards, the SG-A SUs begin to fail
>>> over. SU2 on PL-4 goes active.
>>> Around the same time, the opensaf 2N SUs fail over.
>>> After the dust has settled, and server-1 comes back as well as the
>>> VMs, all appears fine except the SG-A SUs. They both have a standby HA
>>> state.
>>>
>>> Is there any way to correct this?
>>
>> [Praveen] I think there is no issue from the callback perspective, as
>> SU2 on PL-4 was made active (comps received callbacks). The only problem
>> is the output of "amf-state siass".
>> Please share AMFD traces from both the controllers.
>>
>>> Is there some audit that periodically checks the validity of the HA
>>> states?
>>>
>>> Now, when SG-A, SU1 recovers, I did swact the SUs and it corrected the
>>> HA state. However, if server-1 goes down for an extended period, the HA
>>> state of SG-A, SU2 will appear as Standby, when it's actually running as
>>> active.
>>> Before the reboot:
>>>
>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=ACTIVE(1)
>>> --
>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=STANDBY(2)
>>> [root@jenga-56-sysvm-1 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=ACTIVE(1)
>>> [root@sc-2 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=ACTIVE(1)
>>> [root@sc-2 ~]#
>>>
>>> After the reboot:
>>>
>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=ACTIVE(1)
>>> [root@sc-2 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=STANDBY(2)
>>> [root@sc-2 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=ACTIVE(1)
>>> --
>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=STANDBY(2)
>>> [root@sc-2 ~]#
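For completeness: the value that "amf-state siass" prints is the runtime attribute cached in IMM, so any reader of that attribute sees the same stale STANDBY until the (newly) active AMFD manages to push the update, which is what the #1141 fix addresses. Below is a minimal, illustrative C reader, not OpenSAF code; the DN is just the SG-C assignment from the outputs above, and on an OpenSAF installation it would typically be linked against the IMM OM library (e.g. -lSaImmOm).

/* check_ha_state.c -- illustrative sketch: read the cached
 * saAmfSISUHAState of one SI assignment directly from IMM. */
#include <stdio.h>
#include <string.h>
#include <saAis.h>
#include <saImmOm.h>

int main(void)
{
    const char *dn_str =
        "safSISU=safSu=SU2\\,safSg=SG-C\\,safApp=SG-C,safSi=SG-C,safApp=SG-C";
    SaNameT dn;
    dn.length = (SaUint16T)strlen(dn_str);
    memcpy(dn.value, dn_str, dn.length);

    SaVersionT ver = { 'A', 2, 11 };
    SaImmHandleT om_handle;
    SaImmAccessorHandleT acc_handle;
    SaAisErrorT rc;

    if ((rc = saImmOmInitialize(&om_handle, NULL, &ver)) != SA_AIS_OK ||
        (rc = saImmOmAccessorInitialize(om_handle, &acc_handle)) != SA_AIS_OK) {
        fprintf(stderr, "IMM OM init failed: %d\n", rc);
        return 1;
    }

    SaImmAttrNameT attr_names[] = { (char *)"saAmfSISUHAState", NULL };
    SaImmAttrValuesT_2 **attrs = NULL;

    rc = saImmOmAccessorGet_2(acc_handle, &dn, attr_names, &attrs);
    if (rc == SA_AIS_OK && attrs[0] != NULL && attrs[0]->attrValuesNumber > 0) {
        SaUint32T ha = *(SaUint32T *)attrs[0]->attrValues[0];
        printf("saAmfSISUHAState=%u (1=ACTIVE, 2=STANDBY, 3=QUIESCED, 4=QUIESCING)\n", ha);
    } else {
        fprintf(stderr, "saImmOmAccessorGet_2 failed: %d\n", rc);
    }

    saImmOmAccessorFinalize(acc_handle);
    saImmOmFinalize(om_handle);
    return rc == SA_AIS_OK ? 0 : 1;
}

Once the AMFD that took over (SC-2 in this scenario, with the #1141 fix applied) successfully executes its queued saImmOiRtObjectUpdate jobs, the same read should return ACTIVE(1) for SU2, matching what osafamfnd on the node already reported.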