Analysis:
The active controller SC-1 successfully performed the failover of SU1 of 
SG-C by making SU2 active. But AMFD could not update this new HA state of 
SU2 in IMM, as the local IMMND was already down by that time; AMFD at SC-1 
got TRY_AGAIN from IMM.

Since this is an OpenSAF 4.6 release, does it include the fix for #1141? 
Ticket #1141 was pushed to the 4.6 branch after the 4.6 GA. With the #1141 
fix, the standby AMFD updates the HA state in IMM after becoming active.

Also, please include the fix for #2009 when it is available (it will be on 
top of #1141).
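
To illustrate the idea behind #1141 (a minimal sketch only, not the actual 
patch): when the standby AMFD becomes active, it re-writes the cached HA 
state of every SI-SU assignment into IMM, so a value that the previous 
active AMFD failed to push (for example because its local IMMND was down) 
gets repaired. The SiSuAssignment record, the function names and the 
surrounding control flow below are hypothetical; only the standard IMM OI 
call and the saAmfSISUHAState attribute are real.
----------------------------------------------------------------
// Minimal sketch of a #1141-style refresh, assuming a hypothetical
// in-memory list of SI-SU assignments kept by AMFD.
#include <cstring>
#include <string>
#include <vector>
#include <saAis.h>
#include <saAmf.h>
#include <saImmOi.h>

struct SiSuAssignment {        // hypothetical record, not the real AMFD type
  std::string sisu_dn;         // DN of the SaAmfSIAssignment runtime object
  SaAmfHAStateT ha_state;      // e.g. SA_AMF_HA_ACTIVE
};

static SaAisErrorT update_sisu_ha_state(SaImmOiHandleT oi_handle,
                                        const SiSuAssignment& susi) {
  SaNameT dn;
  dn.length = static_cast<SaUint16T>(susi.sisu_dn.size());
  std::memcpy(dn.value, susi.sisu_dn.c_str(), dn.length);

  SaUint32T ha = susi.ha_state;   // saAmfSISUHAState is exposed as SaUint32T
  void* values[] = {&ha};
  SaImmAttrValuesT_2 attr = {const_cast<SaImmAttrNameT>("saAmfSISUHAState"),
                             SA_IMM_ATTR_SAUINT32T, 1, values};
  SaImmAttrModificationT_2 mod = {SA_IMM_ATTR_VALUES_REPLACE, attr};
  const SaImmAttrModificationT_2* mods[] = {&mod, nullptr};

  // TRY_AGAIN is still possible here; a real implementation keeps the
  // update queued and retries until IMMND is reachable again.
  return saImmOiRtObjectUpdate_2(oi_handle, &dn, mods);
}

// Called once when this AMFD goes from standby to active.
void refresh_ha_states_in_imm(SaImmOiHandleT oi_handle,
                              const std::vector<SiSuAssignment>& assignments) {
  for (const auto& susi : assignments)
    (void)update_sisu_ha_state(oi_handle, susi);  // failures get re-queued
}
----------------------------------------------------------------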


Messages from the AMFD trace of SC-1 that confirm this:
-------------------------------------------------
1) Active AMFD at SC-1 starts failover of node PL-5 at:
Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'

2) It sends an active assignment to SU2:
Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1

3) It seems the local IMMND (and also LGS and NTFS) went down before AMFD got a successful response:
Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN

4) AMFD at SC-1 also gets the response for this successful assignment:
Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >> avd_su_si_assign_evh: id:33, node:2060f, act:5, 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0

AMFD adds the IMM updates to its job queue:
Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' saAmfCSICompHAState
Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate

5) But AMFD at SC-1 could not update the HA state in IMM:
Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
----------------------------------------------------------------
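
The exec/execute lines above come from AMFD's IMM job queue: the update 
stays queued and is retried as long as IMM keeps returning TRY_AGAIN. A 
rough sketch of that pattern is below (class and function names are made up 
for illustration, not the real imm.cc code):
----------------------------------------------------------------
// Illustrative job queue that retries an IMM runtime update while the IMM
// agent answers SA_AIS_ERR_TRY_AGAIN (e.g. while IMMND is down).
#include <deque>
#include <functional>
#include <utility>
#include <saAis.h>

enum JobStatus { JOB_EXECUTED, JOB_ETRYAGAIN, JOB_ERR };

struct ImmJob {
  std::function<SaAisErrorT()> run;  // e.g. wraps saImmOiRtObjectUpdate_2()
};

class ImmJobQueue {
 public:
  void add(ImmJob job) { jobs_.push_back(std::move(job)); }

  // Called periodically from the main loop; processes only the head job.
  JobStatus execute() {
    if (jobs_.empty()) return JOB_EXECUTED;
    SaAisErrorT rc = jobs_.front().run();
    if (rc == SA_AIS_ERR_TRY_AGAIN)
      return JOB_ETRYAGAIN;          // keep the job, retry on the next tick
    jobs_.pop_front();               // finished, successfully or not
    return (rc == SA_AIS_OK) ? JOB_EXECUTED : JOB_ERR;
  }

 private:
  std::deque<ImmJob> jobs_;
};
----------------------------------------------------------------
In this incident the retries never get a chance to succeed: SC-1 itself goes 
down shortly afterwards with the server reboot, so the queued update is 
simply lost. That is the window the #1141 fix (and #2009 on top of it) is 
meant to close.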


Thanks,
Praveen


On 19-Oct-16 12:28 AM, David Hoyt wrote:
> Attached are the amfd logs from both controllers.
>
>
>
> Server-1: has 3 VMs for SC-1, PL-3 and PL-5. All the SUs within a 2N
> redundancy are Active.
>
> Server-2: has 3 VMs for SC-2, PL-4 and PL-6. All the SUs within a 2N
> redundancy are Standby
>
>
>
> SC-1: opensaf SU1 and SG-A, SU1
>
> SC-2: opensaf SU2 and SG-A, SU2
>
> PL-3: SG-B, SU1
>
> PL-4: SG-B, SU2
>
> PL-5: SG-C, SU1
>
> PL-6: SG-C, SU2
>
>
>
>
>
> Initial conditions: All the SUs within a 2N redundancy are Active on
> Server-1.
>
>
>
> I issued a reboot of Server-1 around Oct 18 17:23. From the amfd logs of
> SC-1, it appears PL-5 began the failover around 17:23:19.
>
> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>
>
>
>
>
> I waited to see if SG-C’s  HA state would correct itself but it did not:
>
>
>
> [root@sc-2 ~]# date ;  amf-state siass | grep -A 1 SG-C
>
> Tue Oct 18 17:36:26 UTC 2016
>
> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>
>         saAmfSISUHAState=STANDBY(2)
>
> --
>
> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>
>         saAmfSISUHAState=STANDBY(2)
>
> [root@sc-2 ~]#
>
>
>
>
>
> Even though the controller indicates that both SUs are standby, the logs
> from the original standby SG-C, SU2 show it going active:
>
>
>
> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning
> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>
> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE
> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
> 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF'
> by Impl-id: 253
>
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
> 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF'
> by Impl-id: 253
>
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
> 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF'
> by Impl-id: 253
>
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
> 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF'
> by Impl-id: 253
>
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
> 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
> by Impl-id: 253
>
> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
> 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by
> Impl-id: 253
>
> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
> message:51732
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
> message type 57 - ignoring
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
> message:51733
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
> message type 57 - ignoring
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node received
> for nodeId:2010f pid:1739
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 253
> <0, 2010f(down)> (safAmfService)
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 251
> <0, 2010f(down)> (safClmService)
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 250
> <0, 2010f(down)> (safLogService)
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 248
> <0, 2010f(down)> (@OpenSafImmPBE)
>
> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 249
> <0, 2010f(down)> (OsafImmPbeRt_B)
>
> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE
> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>
> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned
> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>
> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'
>
>
>
>
>
> Thanks,
>
> Dave
>
>
>
>
>
> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
> *Sent:* Tuesday, October 18, 2016 8:30 AM
> *To:* David Hoyt <david.h...@genband.com>;
> opensaf-users@lists.sourceforge.net
> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>
>
>
>
> Please see response inline with [Praveen].
>
> Thanks,
> Praveen
>
> On 17-Oct-16 11:46 PM, David Hoyt wrote:
>> Hi all,
>>
>> I'm encountering a scenario where opensaf shows the HA state of both
> SUs within a 2N redundancy Service Group as standby.
>> Setup:
>>
>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP
>>
>> - 2 controllers, 4 payloads
>>
>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)
>>
>> - PL-3 & PL-4 have SU1 & SU2 from SG-A (2N redundancy)
>>
>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)
>>
>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5
>>
>> - Likewise, server-2 has SC-2, PL-4 and PL-6
>>
>> I reboot server 1 and shortly afterwards, the SG-A SUs begin to
> failover. SU2 on PL-4 goes active.
>> Around the same time, the opensaf 2N SUs failover.
>> After the dust has settled, and server-1 comes back as well as the
> VMs, all appears fine except the SG-A SUs. They both have a standby HA
> state.
>>
>> Is there any way to correct this?
> [Praveen] I think there is no issue from the callback perspective, as
> SU2 on PL-4 was made active (the comps received callbacks). The only
> problem is the output of "amf-state siass".
> Please share the AMFD traces from both controllers.
>
>
>
>> Is there some audit that periodically checks the validity of the HA
> states?
>>
>> Now, when SG-A, SU1 recovers, I did swact the SUs and it corrected the
> HA state. However, if server-1 goes down for an extended period, the HA
> state of SG-A, SU2 will appear as Standby, when it's actually running as
> active.
>>
>>
>> Before the reboot:
>>
>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=ACTIVE(1)
>> --
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=STANDBY(2)
>> [root@jenga-56-sysvm-1 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=ACTIVE(1)
>> [root@sc-2 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=ACTIVE(1)
>> [root@sc-2 ~]#
>>
>>
>>
>> After the reboot:
>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=ACTIVE(1)
>> [root@sc-2 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
>> saAmfSISUHAState=STANDBY(2)
>> [root@sc-2 ~]#
>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=ACTIVE(1)
>> --
>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>> saAmfSISUHAState=STANDBY(2)
>> [root@sc-2 ~]#
>>
>>
