Re: [users] both SUs within a 2N Service Group appear as STANDBY

David Hoyt Tue, 25 Oct 2016 09:06:46 -0700

Hi Praveen,

Sorry, but the changes you suggested don’t seem to make a difference.


I’m now looking at coming up with a manual procedure so that if we perform 
maintenance on a server, the VMs running on that server should be terminated 
first. That way, I’ll be able to control the VM termination order and hopefully 
avoid all these failovers that happen at the same time.

We were looking at doing this controlled procedure anyways but I’m still 
concerned about the uncontrolled scenario. It doesn’t appear to me that opensaf 
can handle this and then afterwards, it requires manual recovery. ☹

Regards,
David


From: praveen malviya [mailto:praveen.malv...@oracle.com]
Sent: Monday, October 24, 2016 2:56 AM
To: David Hoyt <david.h...@genband.com>
Cc: opensaf-users@lists.sourceforge.net
Subject: Re: [users] both SUs within a 2N Service Group appear as STANDBY

________________________________
NOTICE: This email was received from an EXTERNAL sender
________________________________


Hi Dave,

SC-2 became active in no time at "Oct 21 17:54:35.111631". But somehow
IMM got busy with something and continuously gave TRY_AGAIN untill SC-2
also went down at "Oct 21 17:54:34". When SC-2 came up and also the
payload hosting SU2 joined, SC-2 AMFD reassigned active state and it
updated it in IMM at:

Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create
safSi=SG-C,safApp=SG-C
Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >>
rt_object_create_common
Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU
Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR
attr:saAmfSISUHAState
Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR
attr:saAmfSISUHAReadinessState

But since SC-2 also went down the case is not longer remains same.
Please figure it out why IMMND got stuck.
But still I see a smaller patch is needed from AMF perspective. I will
try to consider this case in #2009 (already published). Attached is
1.patch that is extracted out of 2009.patch. Please try with 1.patch on
top of #1540 and #1141.

Thanks,
Praveen


Messages from SC-1 AMFD:
1)Node failove of PL-5 started:
Oct 21 17:52:20.232746 osafamfd [2864:ndproc.cc:0923] >>
avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'

2)AMFD starts failover of 'safSu=SU1,safSg=SG-C,safApp=SG-C' hosted on PL-5.
Oct 21 17:52:20.275267 osafamfd [2864:sg_2n_fsm.cc:3262] >> node_fail:
'safSu=SU1,safSg=SG-C,safApp=SG-C', 0

and sends active to its peer:
Oct 21 17:52:20.275511 osafamfd [2864:sgproc.cc:2114] >>
avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1

3)This time SC-1 AMFD itself was able to update in IMM:
Oct 21 17:52:20.363304 osafamfd [2864:imm.cc:0199] >> exec: Delete
safCSIComp=safComp=SG-C\,safSu=SU1\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C
Oct 21 17:52:20.364569 osafamfd [2864:mds_dt_trans.c:0576] >>
mdtm_process_poll_recv_data_tcp
Oct 21 17:52:20.364640 osafamfd [2864:imm.cc:0222] << exec
Oct 21 17:52:20.364651 osafamfd [2864:imm.cc:0334] << execute: 1
Oct 21 17:52:20.364657 osafamfd [2864:imm.cc:0330] >> execute
Oct 21 17:52:20.364662 osafamfd [2864:imm.cc:0199] >> exec: Delete
safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
Oct 21 17:52:20.366127 osafamfd [2864:mds_dt_trans.c:0576] >>
mdtm_process_poll_recv_data_tcp
Oct 21 17:52:20.366201 osafamfd [2864:imm.cc:0222] << exec
Oct 21 17:52:20.366210 osafamfd [2864:imm.cc:0334] << execute: 1

This means, a user will not see atleast above two runtime object using
the query amf-state csiass and amf-state siass respetively.

4) Down events of Director on SC-1 started coming:
Oct 21 17:53:22.303749 osafamfd [2864:ntfa_mds.c:0369] TR NTFS down.
Oct 21 17:53:22.307098 osafamfd [2864:lga_mds.c:0491] TR LGS down
Oct 21 17:53:22.311434 osafamfd [2864:clma_mds.c:0959] TR CLMS down
Local IMMND down:
Oct 21 17:53:22.324784 osafamfd [2864:imma_mds.c:0404] T3 IMMND DOWN

5)SC-1 AMFD get response for safSu=SU2,safSg=SG-C,safApp=SG-C for being
active:
Oct 21 17:54:19.465629 osafamfd [2864:sgproc.cc:0889] >>
avd_su_si_assign_evh: id:28, node:2060f, act:5,
'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
AMFD generated events for IMM and notification for SU2 being active:
Oct 21 17:54:19.466208 osafamfd [2864:imm.cc:1560] >>
avd_saImmOiRtObjectUpdate:
'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
saAmfSISUHAState
Oct 21 17:54:19.466217 osafamfd [2864:imm.cc:1580] <<
avd_saImmOiRtObjectUpdate
Oct 21 17:54:19.466223 osafamfd [2864:imm.cc:1560] >>
avd_saImmOiRtObjectUpdate:
'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
saAmfCSICompHAState
Oct 21 17:54:19.466231 osafamfd [2864:imm.cc:1580] <<
avd_saImmOiRtObjectUpdate
Oct 21 17:54:19.466236 osafamfd [2864:siass.cc:0446] >>
avd_gen_su_ha_state_changed_ntf: 'safSi=SG-C,safApp=SG-C' assigned to
'safSu=SU2,safSg=SG-C,safApp=SG-C' HA state 'ACTIVE'

Since NTFS is down, notification is not sent:
Oct 21 17:54:19.466522 osafamfd [2864:ntf.cc:0717] ER
sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)

6)SC-1 AMFD could not update new states of SU2 to IMM (local IMMND is down):
Oct 21 17:54:19.469032 osafamfd [2864:imm.cc:0153] >> exec: Update
'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
saAmfSISUHAState
Oct 21 17:54:19.469041 osafamfd [2864:imma_oi_api.c:2417] >>
rt_object_update_common
Oct 21 17:54:19.469048 osafamfd [2864:imma_oi_api.c:2435] T2
ERR_TRY_AGAIN: IMMND is DOWN
Oct 21 17:54:19.469054 osafamfd [2864:imm.cc:0167] TR TRY-AGAIN
Oct 21 17:54:19.469058 osafamfd [2864:imm.cc:0182] << exec
Oct 21 17:54:19.469063 osafamfd [2864:imm.cc:0334] << execute: 2
Oct 21 17:54:19.969631 osafamfd [2864:imm.cc:0330] >> execute
Oct 21 17:54:19.969651 osafamfd [2864:imm.cc:0153] >> exec: Update
'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
saAmfSISUHAState
Oct 21 17:54:19.969691 osafamfd [2864:imma_oi_api.c:2417] >>
rt_object_update_common
Oct 21 17:54:19.969698 osafamfd [2864:imma_oi_api.c:2435] T2
ERR_TRY_AGAIN: IMMND is DOWN

7)SC-1 goes down:
Oct 21 17:54:34.706944 osafamfd [2864:ndfsm.cc:0324] >>
avd_mds_avnd_down_evh: 2010f, 0x1a2ce20



Messages from SC-2 AMFD:
1)Failover successful:
Oct 21 17:54:35.091478 osafamfd [2673:role.cc:0075] >>
avd_role_change_evh: cause=1, role=1
Oct 21 17:54:35.091506 osafamfd [2673:role.cc:0254] >> avd_role_failover
Oct 21 17:54:35.091537 osafamfd [2673:role.cc:0255] NO FAILOVER StandBy
--> Active
Oct 21 17:54:35.091588 osafamfd [2673:mds_dt_trans.c:0576] >>
mdtm_process_poll_recv_dat

....
Oct 21 17:54:35.111631 osafamfd [2673:role.cc:0364] NO FAILOVER StandBy
--> Active DONE!
Oct 21 17:54:35.111637 osafamfd [2673:role.cc:0373] << avd_role_failover
Oct 21 17:54:35.111643 osafamfd [2673:role.cc:0153] <<
avd_role_change_evh: 1

2)After some updates to IMM, AMF is getting TRY_AGAIN:

Oct 21 17:54:35.137208 osafamfd [2673:imma_oi_api.c:2417] >>
rt_object_update_common
Oct 21 17:54:35.137472 osafamfd [2673:mds_dt_trans.c:0576] >>
mdtm_process_poll_recv_data_tcp
Oct 21 17:54:35.137514 osafamfd [2673:mds_dt_trans.c:0576] >>
mdtm_process_poll_recv_data_tcp
Oct 21 17:54:35.137543 osafamfd [2673:imma_proc.c:1346] TR ** Event type:12
Oct 21 17:54:35.137550 osafamfd [2673:imma_proc.c:0752] >>
imma_proc_rt_attr_update
Oct 21 17:54:35.137556 osafamfd [2673:imma_proc.c:0792] TR Posting RT
UPDATE CALLBACK to IPC mailbox
Oct 21 17:54:35.137564 osafamfd [2673:imma_proc.c:0798] <<
imma_proc_rt_attr_update
Oct 21 17:54:35.137569 osafamfd [2673:imma_proc.c:1239] >>
imma_proc_free_pointers
Oct 21 17:54:35.137573 osafamfd [2673:imma_proc.c:1332] <<
imma_proc_free_pointers
Oct 21 17:54:35.137594 osafamfd [2673:imma_oi_api.c:2690] <<
rt_object_update_common
Oct 21 17:54:35.137601 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN
Oct 21 17:54:35.137606 osafamfd [2673:imm.cc:0182] << exec
Oct 21 17:54:35.137611 osafamfd [2673:imm.cc:0334] << execute: 2

3)Node failover of PL-3:
Oct 21 17:55:29.917633 osafamfd [2673:ndproc.cc:0923] >>
avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
Oct 21 17:55:29.922470 osafamfd [2673:ndproc.cc:0929] << avd_node_failover

4)TRY_AGAIN continues till:
Oct 21 17:56:33.996617 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN

5)Down event for directors:
Oct 21 17:56:34.278064 osafamfd [2673:ntfa_mds.c:0369] TR NTFS down.
Oct 21 17:56:34.278561 osafamfd [2673:clma_mds.c:0959] TR CLMS down.
Oct 21 17:56:34.282331 osafamfd [2673:imma_mds.c:0404] T3 IMMND DOWN

6)Now AMFD sees down immnd down:
Oct 21 17:56:34.284237 osafamfd [2673:imma_oi_api.c:2417] >>
rt_object_update_common
Oct 21 17:56:34.284243 osafamfd [2673:imma_oi_api.c:2435] T2
ERR_TRY_AGAIN: IMMND is DOWN
Oct 21 17:56:34.284248 osafamfd [2673:imm.cc:0167] TR TRY-AGA

7)Finally SC-2 goes down and again comes as active:
Oct 21 17:57:47.017327 osafamfd [2673:ndfsm.cc:0324] >>
avd_mds_avnd_down_evh: 2020f, 0x265ceb0

Oct 21 17:58:01.000668 osafamfd [4070:role.cc:0175] >>
avd_active_role_initialization

8)After Payload hosting safSu=SU2,safSg=SG-C,safApp=SG-C comes up this
SU is chosen for active assignments:
Oct 21 18:00:23.108237 osafamfd [4070:siass.cc:0148] >> avd_susi_create:
safSu=SU2,safSg=SG-C,safApp=SG-C safSi=SG-C,safApp=SG-C state=1

9)Now it gets updated in IMM:
Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create
safSi=SG-C,safApp=SG-C
Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >>
rt_object_create_common
Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU
Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR
attr:saAmfSISUHAState
Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR
attr:saAmfSISUHAReadinessState
Oct 21 18:00:23.114506 osafamfd [4070:mds_dt_trans.c:0576] >>
mdtm_process_poll_recv_data_tcp


On 22-Oct-16 1:33 AM, David Hoyt wrote:
> Hi Praveen,
>
>
>
> First off, thank-you for all your help!!
>
>
>
> After applying the code changes you suggested, it did work but it took a
> long time (over 10 minutes) and not without side effects.
>
> Attached are the amfd logs from each controller.
>
>
>
>
>
> Here’s a summary:
>
> · all 2N service group SUs are Active on Server-1, Standby on
> Server-2
>
> · rebooted Server-1
>
> · PL-5, running with SG-C,SU1 is the first to begin its shutdown
> at 17:51:57
>
> Oct 21 17:51:57 SG-C-0 opensafd: Stopping OpenSAF Services
>
>
>
> · SG-C, SU2 begins to go active at 17:52:20. The startup script
> returns success at 17:54:19
>
> Oct 21 17:52:20 SG-C-1 osafamfnd[19830]: NO Assigning
> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>
> Oct 21 17:52:20 SG-C-1 osafdtmd[19770]: NO Lost contact with 'SG-C-0'
>
> Oct 21 17:54:19 SG-C-1 osafamfnd[19830]: NO Assigned
> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>
> Oct 21 17:54:35 SG-C-1 osafdtmd[19770]: NO Lost contact with 'sc-1'
>
>
>
> · I continued to see it with a Standby HA state for almost
> another 2 minutes
>
> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i sg-c
>
> Fri Oct 21 17:56:16 UTC 2016
>
> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>
> saAmfSISUHAState=STANDBY(2)
>
> [root@sc-2 ~]#
>
>
>
> · The same is true for the 2N redundancy OpenSAF SUs. I saw
> opensaf detect loss of mate at 17:53:22, initiate failover, and logged
> that it was active at 17:54:35.
>
> Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMD lost contact with peer IMMD
> (NCSMDS_RED_DOWN)
>
> Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMND DOWN on active controller
> f1 detected at standby immd!! f2. Possible failover
>
> ...
>
> Oct 21 17:54:35 sc-2 osafdtmd[2553]: NO Lost contact with 'sc-1'
>
> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Node Down event for node id 2010f:
>
> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Current role: STANDBY
>
> ...
>
> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Controller Failover: Setting role
> to ACTIVE
>
> Oct 21 17:54:35 sc-2 osafrded[2570]: NO RDE role set to ACTIVE
>
> Oct 21 17:54:35 sc-2 osaflogd[2619]: NO ACTIVE request
>
> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active
>
> Oct 21 17:54:35 sc-2 osafntfd[2634]: NO ACTIVE request
>
> Oct 21 17:54:35 sc-2 osafclmd[2647]: NO ACTIVE request
>
> Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ACTIVE request
>
> Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ellect_coord invoke from
> rda_callback ACTIVE
>
> ...
>
> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO Node 'SC-1' left the cluster
>
> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active DONE!
>
>
>
> · Again, querying the HA state of opensaf did not reflect this
> until 17:58:12
>
> [root@sc2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
> safSg=2N
>
> Fri Oct 21 17:56:31 UTC 2016
>
> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>
> saAmfSISUHAState=ACTIVE(1)
>
> --
>
> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>
> saAmfSISUHAState=STANDBY(2)
>
> [root@sc-2 ~]#
>
> …
>
> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
> safSg=2N
>
> Fri Oct 21 17:58:12 UTC 2016
>
> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>
> saAmfSISUHAState=ACTIVE(1)
>
> [root@sc-2 ~]#
>
>
>
> · Also, as a heads up, opensaf on SC-2 was restarted at 17:55:53.
>
> I have a script running on both controller nodes. It constantly pings
> the mate controller and if the ping fails, then a counter is started.
> When the count reaches the threshold, the HA state of the opensaf SU on
> the current node is checked. If it’s HA state is standby, then the
> script will initiate an opensaf restart.
>
> Basically, it’s a backup mechanism if the detection of the loss of the
> mate controller takes too long. I found this code was hit today at
> 17:54:53.
>
> Oct 21 17:54:53 sc-2 wait_for_mate_connection: Cannot ping mate controller
>
> Oct 21 17:54:53 sc-2 wait_for_mate_connection: threshold_counter: '1'
>
> ...
>
> Oct 21 17:55:53 sc-2 wait_for_mate_connection: threshold_counter: '15'
>
> Oct 21 17:55:53 sc-2 mate_down_handler: Mate connection down threshold
> exceeded - analyzing data...
>
> Oct 21 17:55:53 sc-2 mate_down_handler: Checking opensaf SU status...
>
> Oct 21 17:55:53 sc-2 mate_down_handler: opensaf SU is Standby, no mate
> available
>
> Oct 21 17:55:54 sc-2 mate_down_handler: Restarting opensaf
>
> Oct 21 17:55:54 sc-2 mate_down_handler: opensaf restart has been
> initiated...
>
>
>
> By 18:05:29, All the HA states were correct.
>
>
>
> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
> safSg=2N
>
> Fri Oct 21 18:05:29 UTC 2016
>
> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>
> saAmfSISUHAState=STANDBY(2)
>
> --
>
> --
>
> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>
> saAmfSISUHAState=ACTIVE(1)
>
> [root@sc-2 ~]#
>
>
>
>
>
> [root@sc-2 ~]# date; amf-state siass | grep -A 1 SG-C
>
> Fri Oct 21 18:05:46 UTC 2016
>
> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>
> saAmfSISUHAState=STANDBY(2)
>
> --
>
> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>
> saAmfSISUHAState=ACTIVE(1)
>
> [root@sc-2 ~]#
>
>
>
>
>
> Thanks,
>
> /David/
>
>
>
>
>
> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
> *Sent:* Friday, October 21, 2016 10:38 AM
> *To:* David Hoyt <david.h...@genband.com<mailto:david.h...@genband.com>>
> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>
>
>
> ------------------------------------------------------------------------
>
> NOTICE: This email was received from an EXTERNAL sender
>
> ------------------------------------------------------------------------
>
>
> One more related ticket need to be included:
>
> changeset: 7029:229ff887c7c8
> branch: opensaf-4.6.x
> parent: 7025:677617e23ea3
> user: praveen.malv...@oracle.com<mailto:praveen.malv...@oracle.com> 
> <mailto:praveen.malv...@oracle.com>
> date: Thu Oct 22 09:57:18 2015 +0530
> summary: amfd: fix deletion of runtime objects of class
> SaAmfSIAssignment [#1540]
>
> Thanks,
> Praveen
> On 21-Oct-16 12:06 PM, praveen malviya wrote:
>> Hi Dave,
>>
>> In ticket #1141, there are two patches in 4.6 with changesets:
>>
>> changeset: 6839:964e043fa545
>> branch: opensaf-4.6.x
>> parent: 6836:39806387b4c3
>> user: praveen.malv...@oracle.com<mailto:praveen.malv...@oracle.com> 
>> <mailto:praveen.malv...@oracle.com>
>> date: Thu Sep 17 10:33:32 2015 +0530
>> summary: amfd: maintain runtime updates for su, comp, si and csi at
>> standby [#1141]
>>
>> and
>> changeset: 6896:18b4458e357a
>> branch: opensaf-4.6.x
>> tag: tip
>> parent: 6886:22d7d85287e6
>> user: praveen.malv...@oracle.com<mailto:praveen.malv...@oracle.com> 
>> <mailto:praveen.malv...@oracle.com>
>> date: Mon Sep 28 11:51:34 2015 +0530
>> summary: amfd: include comp and su oper state at standby amfd[#1141]
>>
>> Please share AMFD traces if you are observing after applying above two
>> patches.
>>
>> Thanks,
>> Praveen
>> On 21-Oct-16 4:24 AM, David Hoyt wrote:
>>> Hi Praveen,
>>>
>>>
>>>
>>> Since I’m using OpenSAF 4.6.0, I did not have the fix for #1141.
>>>
>>> It looks like it was included as part of the 4.6.1 maintenance release.
>>>
>>>
>>>
>>> So, I looked at the file diff for #1141, made the same code changes,
>>> rebuilt the rpms, and replaced them in my lab.
>>>
>>> However, I’m still seeing the same behavior.
>>>
>>>
>>>
>>> Are there other changes I should look at adding?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Dave
>>>
>>>
>>>
>>>
>>>
>>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
>>> *Sent:* Wednesday, October 19, 2016 4:30 AM
>>> *To:* David Hoyt <david.h...@genband.com
<mailto:david.h...@genband.com%0b>> <mailto:david.h...@genband.com>>;
>>> opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>
> <mailto:opensaf-users@lists.sourceforge.net>
>>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as
> STANDBY
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> NOTICE: This email was received from an EXTERNAL sender
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>> Analysis:
>>> Active controller SC-1 successfully performed fail-over of SU1 of SG-C
>>> by making SU2 active. But AMFD could not update this new HA state of SU2
>>> in IMM as IMMND was already down by that time. AMFD at SC-1 got
>>> TRY_AGAIN from IMM.
>>>
>>> Since this is OpenSAF 4.6 release, it does include fix of #1141.
>>> Ticket #1141 was pushed in 4.6 branch after 4.6 GA. In the fix of #1141,
>>> standby AMFD updates HA state after becoming active.
>>>
>>> Also please include the fix of #2009 when available (which will be on
>>> top of #1141.)
>>>
>>>
>>> Messages from AMFD trace of SC-1 which confirms this:
>>> -------------------------------------------------
>>> 1)Active AMFD at SC-1 starts failover of Node PL-5 at:
>>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
>>> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>>>
>>> 2)It sends active assignment to SU2 :
>>> Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >>
>>> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1
>>>
>>> 3)It seems local IMMND (also LGS and NTFS) went down before AMFD gets
>>> successful response:
>>> Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN
>>>
>>> 4)AMFD at SC-1 gets response for this successful assignment also:
>>> Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >>
>>> avd_su_si_assign_evh: id:33, node:2060f, act:5,
>>>
>>> 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
>>> AMFD adds IMM update in job queue:
>>> Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >>
>>> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-
>>>
>>> C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
>>> Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] <<
>>> avd_saImmOiRtObjectUpdate
>>> Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >>
>>> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C
>>>
>>> \,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
>>> saAmfCSICompHAState
>>> Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] <<
>>> avd_saImmOiRtObjectUpdate
>>>
>>> 5)But AMFD at SC-1 could not update HA state in IMM:
>>> Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update
>>> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
>>> saAmfSISUHAState
>>> Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >>
>>> rt_object_update_common
>>> Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2
>>> ERR_TRY_AGAIN: IMMND is DOWN
>>> Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
>>> Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
>>> Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
>>> Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
>>> Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update
>>> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
>>> saAmfSISUHAState
>>> Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >>
>>> rt_object_update_common
>>> Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2
>>> ERR_TRY_AGAIN: IMMND is DOWN
>>> Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
>>> Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
>>> Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
>>> ----------------------------------------------------------------
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 19-Oct-16 12:28 AM, David Hoyt wrote:
>>>> Attached are the amfd logs from both controllers.
>>>>
>>>>
>>>>
>>>> Server-1: has 3 VMs for SC-1, PL-3 and PL-5. All the SUs within a 2N
>>>> redundancy are Active.
>>>>
>>>> Server-2: has 3 VMs for SC-2, PL-4 and PL-6. All the SUs within a 2N
>>>> redundancy are Standby
>>>>
>>>>
>>>>
>>>> SC-1: opensaf SU1 and SG-A, SU1
>>>>
>>>> SC-2: opensaf SU2 and SG-A, SU2
>>>>
>>>> PL-3: SG-B, SU1
>>>>
>>>> PL-4: SG-B, SU2
>>>>
>>>> PL-5: SG-C, SU1
>>>>
>>>> PL-6: SG-C, SU2
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Initial conditions: All the SUs within a 2N redundancy are Active on
>>>> Server-1.
>>>>
>>>>
>>>>
>>>> I issued a reboot of Server-1 around Oct 18 17:23. From the amfd logs of
>>>> SC-1, it appears PL-5 began the failover around 17:23:19.
>>>>
>>>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
>>>> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I waited to see if SG-C’s HA state would correct itself but it did not:
>>>>
>>>>
>>>>
>>>> [root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C
>>>>
>>>> Tue Oct 18 17:36:26 UTC 2016
>>>>
>>>> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>>>
>>>> saAmfSISUHAState=STANDBY(2)
>>>>
>>>> --
>>>>
>>>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>>>
>>>> saAmfSISUHAState=STANDBY(2)
>>>>
>>>> [root@sc-2 ~]#
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Even though the controller indicates that both SUs are standby, the logs
>>>> from the original standby SG-C, SU2 shows it going active:
>>>>
>>>>
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning
>>>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE
>>>> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>
>>>
> 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF'
>>>> by Impl-id: 253
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>
>>>
> 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF'
>>>> by Impl-id: 253
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>
>>>
> 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF'
>>>> by Impl-id: 253
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>
>>>
> 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF'
>>>> by Impl-id: 253
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>
>>>
> 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
>>>> by Impl-id: 253
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>> 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by
>>>> Impl-id: 253
>>>>
>>>> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
>>>> message:51732
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
>>>> message type 57 - ignoring
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
>>>> message:51733
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
>>>> message type 57 - ignoring
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node received
>>>> for nodeId:2010f pid:1739
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 253
>>>> <0, 2010f(down)> (safAmfService)
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 251
>>>> <0, 2010f(down)> (safClmService)
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 250
>>>> <0, 2010f(down)> (safLogService)
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 248
>>>> <0, 2010f(down)> (@OpenSafImmPBE)
>>>>
>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 249
>>>> <0, 2010f(down)> (OsafImmPbeRt_B)
>>>>
>>>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE
>>>> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>
>>>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned
>>>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>
>>>> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Dave
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
>>>> *Sent:* Tuesday, October 18, 2016 8:30 AM
>>>> *To:* David Hoyt <david.h...@genband.com
<mailto:david.h...@genband.com%0b>> <mailto:david.h...@genband.com
<mailto:david.h...@genband.com%0b>> 
<mailto:david.h...@genband.com%20%3cmailto:david.h...@genband.com>>>;
>>>> opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>
> <mailto:opensaf-users@lists.sourceforge.net>
>>> <mailto:opensaf-users@lists.sourceforge.net>
>>>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as
>>> STANDBY
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> NOTICE: This email was received from an EXTERNAL sender
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> Please see response inline with [Praveen].
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>> On 17-Oct-16 11:46 PM, David Hoyt wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm encountering a scenario where opensaf shows the HA state of both
>>>> SUs within a 2N redundancy Service Group as standby.
>>>>> Setup:
>>>>>
>>>>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP
>>>>>
>>>>> - 2 controllers, 4 payloads
>>>>>
>>>>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)
>>>>>
>>>>> - PL-3 & PL4 have SU1 & SU2 from SG-A (2N redundancy)
>>>>>
>>>>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)
>>>>>
>>>>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5
>>>>>
>>>>> - Likewise, server-2 has SC-2, PL-4 and PL-6
>>>>>
>>>>> I reboot server 1 and shortly afterwards, the SG-A SUs begin to
>>>> failover. SU2 on PL-4 goes active.
>>>>> Around the same time, the opensaf 2N SUs failover.
>>>>> After the dust has settled, and server-1 comes back as well as the
>>>> VMs, all appears fine except the SG-A SUs. They both have a standby HA
>>>> state.
>>>>>
>>>>> Is there any way to correct this?
>>>> [Praveen] I think there is no issue from the callback perspectives as
>>>> SU2 on PL-4 was made active(comps received callbacks). Only problem is
>>>> outout of "amf-state siass".
>>>> Please share AMFD traces from both the controller.
>>>>
>>>>
>>>>
>>>>> Is there some audit that periodically checks the validity of the HA
>>>> states?
>>>>>
>>>>> Now, when SG-A, SU1 recovers, I did swact the SUs and it corrected the
>>>> HA state. However, if server-1 goes down for an extended period, the HA
>>>> state of SG-A, SU2 will appear as Standby, when it's actually running as
>>>> active.
>>>>>
>>>>>
>>>>> Before the reboot:
>>>>>
>>>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>>>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>> --
>>>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> [root@jenga-56-sysvm-1 ~]#
>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> --
>>>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>> [root@sc-2 ~]#
>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> --
>>>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>> [root@sc-2 ~]#
>>>>>
>>>>>
>>>>>
>>>>> After the reboot:
>>>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
>>>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> --
>>>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>> [root@sc-2 ~]#
>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> --
>>>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> [root@sc-2 ~]#
>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>> --
>>>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>> [root@sc-2 ~]#
>>>>>
>>>>>
>>>>
>>>
> ------------------------------------------------------------------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Opensaf-users mailing list
>>>>> Opensaf-users@lists.sourceforge.net<mailto:Opensaf-users@lists.sourceforge.net>
> <mailto:Opensaf-users@lists.sourceforge.net>
>>> <mailto:Opensaf-users@lists.sourceforge.net>
>>>> <mailto:Opensaf-users@lists.sourceforge.net>
>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>>>>
>>>>
>>>
>>
>>
> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Opensaf-users mailing list
>> Opensaf-users@lists.sourceforge.net<mailto:Opensaf-users@lists.sourceforge.net>
> <mailto:Opensaf-users@lists.sourceforge.net>
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>
>
------------------------------------------------------------------------------
The Command Line: Reinvented for Modern Developers
Did the resurgence of CLI tooling catch you by surprise?
Reconnect with the command line and become more productive. 
Learn the new .NET and ASP.NET CLI. Get your free copy!
http://sdm.link/telerik
_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users

Re: [users] both SUs within a 2N Service Group appear as STANDBY

Reply via email to