Hi Dave,

In ticket #1141, there are two patches in 4.6, with the following changesets:
changeset:   6839:964e043fa545
branch:      opensaf-4.6.x
parent:      6836:39806387b4c3
user:        praveen.malv...@oracle.com
date:        Thu Sep 17 10:33:32 2015 +0530
summary:     amfd: maintain runtime updates for su, comp, si and csi at standby [#1141]

and

changeset:   6896:18b4458e357a
branch:      opensaf-4.6.x
tag:         tip
parent:      6886:22d7d85287e6
user:        praveen.malv...@oracle.com
date:        Mon Sep 28 11:51:34 2015 +0530
summary:     amfd: include comp and su oper state at standby amfd [#1141]

Please share AMFD traces if you are still observing the issue after applying the above two patches.

Thanks,
Praveen

On 21-Oct-16 4:24 AM, David Hoyt wrote:
> Hi Praveen,
>
> Since I’m using OpenSAF 4.6.0, I did not have the fix for #1141.
> It looks like it was included as part of the 4.6.1 maintenance release.
>
> So, I looked at the file diff for #1141, made the same code changes,
> rebuilt the rpms, and replaced them in my lab.
> However, I’m still seeing the same behavior.
>
> Are there other changes I should look at adding?
>
> Thanks,
> Dave
>
> *From:* praveen malviya [mailto:praveen.malv...@oracle.com]
> *Sent:* Wednesday, October 19, 2016 4:30 AM
> *To:* David Hoyt <david.h...@genband.com>; opensaf-users@lists.sourceforge.net
> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>
> Analysis:
> The active controller SC-1 successfully performed fail-over of SU1 of SG-C
> by making SU2 active. But AMFD could not update this new HA state of SU2
> in IMM, as IMMND was already down by that time; AMFD at SC-1 got
> TRY_AGAIN from IMM.
>
> Since this is the OpenSAF 4.6 release, it does not include the fix for #1141.
> Ticket #1141 was pushed to the 4.6 branch after 4.6 GA. In the fix for #1141,
> the standby AMFD updates the HA state after becoming active.
>
> Also please include the fix for #2009 when available (which will be on
> top of #1141).
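For illustration, here is a minimal C sketch of the pattern the trace excerpts below show: AMFD queues a runtime-attribute update for saAmfSISUHAState and keeps retrying while saImmOiRtObjectUpdate_2() returns SA_AIS_ERR_TRY_AGAIN, which the IMM OI agent reports for as long as the local IMMND is down. This is not the actual imm.cc job code; the function names and the retry policy here are illustrative only.

/* sketch.c -- illustrative only, not OpenSAF source.  Shows one queued
 * runtime update of saAmfSISUHAState and how a TRY_AGAIN result leaves
 * the job on the queue for a later retry. */
#include <saAis.h>
#include <saImmOi.h>

/* Set saAmfSISUHAState on one SaAmfSIAssignment runtime object. */
SaAisErrorT update_ha_state_job(SaImmOiHandleT oi_handle,
                                const SaNameT *si_assignment_dn,
                                SaUint32T ha_state)
{
    void *values[] = { &ha_state };

    SaImmAttrValuesT_2 attr = {
        .attrName = (char *)"saAmfSISUHAState",
        .attrValueType = SA_IMM_ATTR_SAUINT32T,
        .attrValuesNumber = 1,
        .attrValues = values,
    };
    SaImmAttrModificationT_2 mod = {
        .modType = SA_IMM_ATTR_VALUES_REPLACE,
        .modAttr = attr,
    };
    const SaImmAttrModificationT_2 *mods[] = { &mod, NULL };

    /* Returns SA_AIS_ERR_TRY_AGAIN while the local IMMND is down. */
    return saImmOiRtObjectUpdate_2(oi_handle, si_assignment_dn, mods);
}

/* Job-queue behaviour as seen in the trace: keep the job and retry later. */
void execute_job(SaImmOiHandleT oi_handle, const SaNameT *dn)
{
    SaAisErrorT rc = update_ha_state_job(oi_handle, dn, 1 /* SA_AMF_HA_ACTIVE */);
    if (rc == SA_AIS_ERR_TRY_AGAIN) {
        /* leave the job queued; the main loop retries after a short delay */
    } else if (rc != SA_AIS_OK) {
        /* log the error and decide whether to re-queue or drop the job */
    }
}

In the reported case the old active AMFD never got past the TRY_AGAIN retries before the node went down, so the cached value in IMM stayed STANDBY; that is the gap the #1141 change closes by having the newly active AMFD push the state again.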
> Messages from the AMFD trace of SC-1 which confirm this:
> -------------------------------------------------
> 1) Active AMFD at SC-1 starts failover of node PL-5 at:
> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>
> 2) It sends an active assignment to SU2:
> Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1
>
> 3) It seems the local IMMND (also LGS and NTFS) went down before AMFD got a successful response:
> Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN
>
> 4) AMFD at SC-1 also gets the response for this successful assignment:
> Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >> avd_su_si_assign_evh: id:33, node:2060f, act:5, 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
>
> AMFD adds the IMM updates to its job queue:
> Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
> Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
> Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' saAmfCSICompHAState
> Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
>
> 5) But AMFD at SC-1 could not update the HA state in IMM:
> Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
> Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
> Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
> Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
> Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
> Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
> Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
> Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
> Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >> rt_object_update_common
> Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
> Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
> Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
> Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
> ----------------------------------------------------------------
>
> Thanks,
> Praveen
>
> On 19-Oct-16 12:28 AM, David Hoyt wrote:
>> Attached are the amfd logs from both controllers.
>>
>> Server-1 has 3 VMs: SC-1, PL-3 and PL-5. All the SUs within a 2N
>> redundancy are Active.
>> Server-2 has 3 VMs: SC-2, PL-4 and PL-6. All the SUs within a 2N
>> redundancy are Standby.
>>
>> SC-1: opensaf SU1 and SG-A, SU1
>> SC-2: opensaf SU2 and SG-A, SU2
>> PL-3: SG-B, SU1
>> PL-4: SG-B, SU2
>> PL-5: SG-C, SU1
>> PL-6: SG-C, SU2
>>
>> Initial conditions: All the SUs within a 2N redundancy are Active on
>> Server-1.
>>
>> I issued a reboot of Server-1 around Oct 18 17:23. From the amfd logs of
>> SC-1, it appears PL-5 began the failover around 17:23:19.
>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>>
>> I waited to see if SG-C’s HA state would correct itself, but it did not:
>>
>> [root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C
>> Tue Oct 18 17:36:26 UTC 2016
>> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>> saAmfSISUHAState=STANDBY(2)
>> --
>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>> saAmfSISUHAState=STANDBY(2)
>> [root@sc-2 ~]#
>>
>> Even though the controller indicates that both SUs are standby, the logs
>> from the original standby SG-C, SU2 show it going active:
>>
>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253
>> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS message:51732
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for message type 57 - ignoring
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS message:51733
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for message type 57 - ignoring
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node received for nodeId:2010f pid:1739
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 253 <0, 2010f(down)> (safAmfService)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 251 <0, 2010f(down)> (safClmService)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 250 <0, 2010f(down)> (safLogService)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 248 <0, 2010f(down)> (@OpenSafImmPBE)
>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 249 <0, 2010f(down)> (OsafImmPbeRt_B)
>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'
>>
>> Thanks,
>> Dave
>>
>> *From:* praveen malviya [mailto:praveen.malv...@oracle.com]
>> *Sent:* Tuesday, October 18, 2016 8:30 AM
>> *To:* David Hoyt <david.h...@genband.com>; opensaf-users@lists.sourceforge.net
>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>>
>> Please see responses inline, marked [Praveen].
>>
>> Thanks,
>> Praveen
>>
>> On 17-Oct-16 11:46 PM, David Hoyt wrote:
>>> Hi all,
>>>
>>> I'm encountering a scenario where opensaf shows the HA state of both
>>> SUs within a 2N redundancy Service Group as standby.
>>>
>>> Setup:
>>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP
>>> - 2 controllers, 4 payloads
>>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)
>>> - PL-3 & PL-4 have SU1 & SU2 from SG-A (2N redundancy)
>>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)
>>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5
>>> - Likewise, server-2 has SC-2, PL-4 and PL-6
>>>
>>> I reboot server-1 and shortly afterwards, the SG-A SUs begin to fail
>>> over. SU2 on PL-4 goes active.
>>> Around the same time, the opensaf 2N SUs fail over.
>>> After the dust has settled, and server-1 comes back as well as the
>>> VMs, all appears fine except the SG-A SUs. They both have a standby HA
>>> state.
>>>
>>> Is there any way to correct this?
>>
>> [Praveen] I think there is no issue from the callback perspective, as
>> SU2 on PL-4 was made active (comps received callbacks). The only problem
>> is the output of "amf-state siass".
>> Please share AMFD traces from both the controllers.
>>
>>> Is there some audit that periodically checks the validity of the HA
>>> states?
>>>
>>> Now, when SG-A, SU1 recovers, I did swact the SUs and it corrected the
>>> HA state. However, if server-1 goes down for an extended period, the HA
>>> state of SG-A, SU2 will appear as Standby, when it's actually running as
>>> active.
>>> Before the reboot:
>>>
>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=ACTIVE(1)
>>> --
>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=STANDBY(2)
>>> [root@jenga-56-sysvm-1 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=ACTIVE(1)
>>> [root@sc-2 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=ACTIVE(1)
>>> [root@sc-2 ~]#
>>>
>>> After the reboot:
>>>
>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>> saAmfSISUHAState=ACTIVE(1)
>>> [root@sc-2 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=STANDBY(2)
>>> --
>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
>>> saAmfSISUHAState=STANDBY(2)
>>> [root@sc-2 ~]#
>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=ACTIVE(1)
>>> --
>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>> saAmfSISUHAState=STANDBY(2)
>>> [root@sc-2 ~]#
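For completeness: the value that "amf-state siass" prints is the runtime attribute cached in IMM, so any reader of that attribute sees the same stale STANDBY until the (newly) active AMFD manages to push the update, which is what the #1141 fix addresses. Below is a minimal, illustrative C reader, not OpenSAF code; the DN is just the SG-C assignment from the outputs above, and on an OpenSAF installation it would typically be linked against the IMM OM library (e.g. -lSaImmOm).

/* check_ha_state.c -- illustrative sketch: read the cached
 * saAmfSISUHAState of one SI assignment directly from IMM. */
#include <stdio.h>
#include <string.h>
#include <saAis.h>
#include <saImmOm.h>

int main(void)
{
    const char *dn_str =
        "safSISU=safSu=SU2\\,safSg=SG-C\\,safApp=SG-C,safSi=SG-C,safApp=SG-C";
    SaNameT dn;
    dn.length = (SaUint16T)strlen(dn_str);
    memcpy(dn.value, dn_str, dn.length);

    SaVersionT ver = { 'A', 2, 11 };
    SaImmHandleT om_handle;
    SaImmAccessorHandleT acc_handle;
    SaAisErrorT rc;

    if ((rc = saImmOmInitialize(&om_handle, NULL, &ver)) != SA_AIS_OK ||
        (rc = saImmOmAccessorInitialize(om_handle, &acc_handle)) != SA_AIS_OK) {
        fprintf(stderr, "IMM OM init failed: %d\n", rc);
        return 1;
    }

    SaImmAttrNameT attr_names[] = { (char *)"saAmfSISUHAState", NULL };
    SaImmAttrValuesT_2 **attrs = NULL;

    rc = saImmOmAccessorGet_2(acc_handle, &dn, attr_names, &attrs);
    if (rc == SA_AIS_OK && attrs[0] != NULL && attrs[0]->attrValuesNumber > 0) {
        SaUint32T ha = *(SaUint32T *)attrs[0]->attrValues[0];
        printf("saAmfSISUHAState=%u (1=ACTIVE, 2=STANDBY, 3=QUIESCED, 4=QUIESCING)\n", ha);
    } else {
        fprintf(stderr, "saImmOmAccessorGet_2 failed: %d\n", rc);
    }

    saImmOmAccessorFinalize(acc_handle);
    saImmOmFinalize(om_handle);
    return rc == SA_AIS_OK ? 0 : 1;
}

Once the AMFD that took over (SC-2 in this scenario, with the #1141 fix applied) successfully executes its queued saImmOiRtObjectUpdate jobs, the same read should return ACTIVE(1) for SU2, matching what osafamfnd on the node already reported.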