Hi Dave,

SC-2 became active in no time, at "Oct 21 17:54:35.111631". But somehow IMM got busy with something and continuously returned TRY_AGAIN until SC-2 also went down at "Oct 21 17:54:34". When SC-2 came back up and the payload hosting SU2 also joined, the SC-2 AMFD reassigned the active state and updated it in IMM at:

Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create safSi=SG-C,safApp=SG-C
Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >> rt_object_create_common
Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU
Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR attr:saAmfSISUHAState
Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR attr:saAmfSISUHAReadinessState

But since SC-2 also went down, the case no longer remains the same. Please investigate why IMMND got stuck. Still, I see that a smaller patch is needed from the AMF perspective; I will try to cover this case in #2009 (already published). Attached is 1.patch, which is extracted from 2009.patch. Please try with 1.patch on top of #1540 and #1141.
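
One way to apply the patches in order could look like this (a minimal sketch; it assumes a Mercurial checkout of the opensaf-4.6.x branch, and the patch file names are only placeholders, so adjust them to whatever the files are actually called):

cd opensaf-4.6.x
hg import --no-commit 1141-runtime-updates-at-standby.patch   # first #1141 patch
hg import --no-commit 1141-oper-state-at-standby.patch        # second #1141 patch
hg import --no-commit 1540-siass-runtime-delete.patch         # #1540
patch -p1 < 1.patch                                           # attached, extracted from 2009.patch
# then rebuild and redeploy the OpenSAF packages as usual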

Thanks,
Praveen


Messages from SC-1 AMFD:
1) Node failover of PL-5 started:
Oct 21 17:52:20.232746 osafamfd [2864:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'

2)AMFD starts failover of 'safSu=SU1,safSg=SG-C,safApp=SG-C' hosted on PL-5.
Oct 21 17:52:20.275267 osafamfd [2864:sg_2n_fsm.cc:3262] >> node_fail: 'safSu=SU1,safSg=SG-C,safApp=SG-C', 0

and sends an active assignment to its peer:
Oct 21 17:52:20.275511 osafamfd [2864:sgproc.cc:2114] >> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1

3) This time the SC-1 AMFD itself was able to update IMM:
Oct 21 17:52:20.363304 osafamfd [2864:imm.cc:0199] >> exec: Delete safCSIComp=safComp=SG-C\,safSu=SU1\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C
Oct 21 17:52:20.364569 osafamfd [2864:mds_dt_trans.c:0576] >> mdtm_process_poll_recv_data_tcp
Oct 21 17:52:20.364640 osafamfd [2864:imm.cc:0222] << exec
Oct 21 17:52:20.364651 osafamfd [2864:imm.cc:0334] << execute: 1
Oct 21 17:52:20.364657 osafamfd [2864:imm.cc:0330] >> execute
Oct 21 17:52:20.364662 osafamfd [2864:imm.cc:0199] >> exec: Delete safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
Oct 21 17:52:20.366127 osafamfd [2864:mds_dt_trans.c:0576] >> mdtm_process_poll_recv_data_tcp
Oct 21 17:52:20.366201 osafamfd [2864:imm.cc:0222] << exec
Oct 21 17:52:20.366210 osafamfd [2864:imm.cc:0334] << execute: 1

This means a user will not see at least the above two runtime objects using the queries amf-state csiass and amf-state siass, respectively.
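
For example, queries like the following (the grep filters are just an illustration) would show nothing for the SU1 assignment of SG-C until the objects are recreated:

amf-state siass | grep -A 1 "safSu=SU1.*safSg=SG-C"
amf-state csiass | grep -A 1 "safSu=SU1.*safSg=SG-C"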

4) Down events for the directors on SC-1 started coming:
Oct 21 17:53:22.303749 osafamfd [2864:ntfa_mds.c:0369] TR NTFS down.
Oct 21 17:53:22.307098 osafamfd [2864:lga_mds.c:0491] TR LGS down
Oct 21 17:53:22.311434 osafamfd [2864:clma_mds.c:0959] TR CLMS down
Local IMMND down:
Oct 21 17:53:22.324784 osafamfd [2864:imma_mds.c:0404] T3 IMMND DOWN

5) SC-1 AMFD gets the response for safSu=SU2,safSg=SG-C,safApp=SG-C being active:
Oct 21 17:54:19.465629 osafamfd [2864:sgproc.cc:0889] >> avd_su_si_assign_evh: id:28, node:2060f, act:5, 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
AMFD generated the IMM update events and the notification for SU2 being active:
Oct 21 17:54:19.466208 osafamfd [2864:imm.cc:1560] >> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 21 17:54:19.466217 osafamfd [2864:imm.cc:1580] << avd_saImmOiRtObjectUpdate
Oct 21 17:54:19.466223 osafamfd [2864:imm.cc:1560] >> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' saAmfCSICompHAState
Oct 21 17:54:19.466231 osafamfd [2864:imm.cc:1580] << avd_saImmOiRtObjectUpdate
Oct 21 17:54:19.466236 osafamfd [2864:siass.cc:0446] >> avd_gen_su_ha_state_changed_ntf: 'safSi=SG-C,safApp=SG-C' assigned to 'safSu=SU2,safSg=SG-C,safApp=SG-C' HA state 'ACTIVE'

Since NTFS is down, the notification is not sent:
Oct 21 17:54:19.466522 osafamfd [2864:ntf.cc:0717] ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)

6) SC-1 AMFD could not update the new states of SU2 in IMM (local IMMND is down):
Oct 21 17:54:19.469032 osafamfd [2864:imm.cc:0153] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 21 17:54:19.469041 osafamfd [2864:imma_oi_api.c:2417] >> rt_object_update_common
Oct 21 17:54:19.469048 osafamfd [2864:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
Oct 21 17:54:19.469054 osafamfd [2864:imm.cc:0167] TR TRY-AGAIN
Oct 21 17:54:19.469058 osafamfd [2864:imm.cc:0182] << exec
Oct 21 17:54:19.469063 osafamfd [2864:imm.cc:0334] << execute: 2
Oct 21 17:54:19.969631 osafamfd [2864:imm.cc:0330] >> execute
Oct 21 17:54:19.969651 osafamfd [2864:imm.cc:0153] >> exec: Update 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 21 17:54:19.969691 osafamfd [2864:imma_oi_api.c:2417] >> rt_object_update_common
Oct 21 17:54:19.969698 osafamfd [2864:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN

7)SC-1 goes down:
Oct 21 17:54:34.706944 osafamfd [2864:ndfsm.cc:0324] >> avd_mds_avnd_down_evh: 2010f, 0x1a2ce20



Messages from SC-2 AMFD:
1)Failover successful:
Oct 21 17:54:35.091478 osafamfd [2673:role.cc:0075] >> avd_role_change_evh: cause=1, role=1
Oct 21 17:54:35.091506 osafamfd [2673:role.cc:0254] >> avd_role_failover
Oct 21 17:54:35.091537 osafamfd [2673:role.cc:0255] NO FAILOVER StandBy --> Active
Oct 21 17:54:35.091588 osafamfd [2673:mds_dt_trans.c:0576] >> mdtm_process_poll_recv_dat

....
Oct 21 17:54:35.111631 osafamfd [2673:role.cc:0364] NO FAILOVER StandBy --> Active DONE!
Oct 21 17:54:35.111637 osafamfd [2673:role.cc:0373] << avd_role_failover
Oct 21 17:54:35.111643 osafamfd [2673:role.cc:0153] << avd_role_change_evh: 1

2) After some updates to IMM, AMFD starts getting TRY_AGAIN:

Oct 21 17:54:35.137208 osafamfd [2673:imma_oi_api.c:2417] >> rt_object_update_common
Oct 21 17:54:35.137472 osafamfd [2673:mds_dt_trans.c:0576] >> mdtm_process_poll_recv_data_tcp
Oct 21 17:54:35.137514 osafamfd [2673:mds_dt_trans.c:0576] >> mdtm_process_poll_recv_data_tcp
Oct 21 17:54:35.137543 osafamfd [2673:imma_proc.c:1346] TR ** Event type:12
Oct 21 17:54:35.137550 osafamfd [2673:imma_proc.c:0752] >> imma_proc_rt_attr_update
Oct 21 17:54:35.137556 osafamfd [2673:imma_proc.c:0792] TR Posting RT UPDATE CALLBACK to IPC mailbox
Oct 21 17:54:35.137564 osafamfd [2673:imma_proc.c:0798] << imma_proc_rt_attr_update
Oct 21 17:54:35.137569 osafamfd [2673:imma_proc.c:1239] >> imma_proc_free_pointers
Oct 21 17:54:35.137573 osafamfd [2673:imma_proc.c:1332] << imma_proc_free_pointers
Oct 21 17:54:35.137594 osafamfd [2673:imma_oi_api.c:2690] << rt_object_update_common
Oct 21 17:54:35.137601 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN
Oct 21 17:54:35.137606 osafamfd [2673:imm.cc:0182] << exec
Oct 21 17:54:35.137611 osafamfd [2673:imm.cc:0334] << execute: 2

3)Node failover of PL-3:
Oct 21 17:55:29.917633 osafamfd [2673:ndproc.cc:0923] >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
Oct 21 17:55:29.922470 osafamfd [2673:ndproc.cc:0929] << avd_node_failover

4) TRY_AGAIN continues until:
Oct 21 17:56:33.996617 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN

5) Down events for the directors:
Oct 21 17:56:34.278064 osafamfd [2673:ntfa_mds.c:0369] TR NTFS down.
Oct 21 17:56:34.278561 osafamfd [2673:clma_mds.c:0959] TR CLMS down.
Oct 21 17:56:34.282331 osafamfd [2673:imma_mds.c:0404] T3 IMMND DOWN

6) Now AMFD sees the local IMMND down:
Oct 21 17:56:34.284237 osafamfd [2673:imma_oi_api.c:2417] >> rt_object_update_common
Oct 21 17:56:34.284243 osafamfd [2673:imma_oi_api.c:2435] T2 ERR_TRY_AGAIN: IMMND is DOWN
Oct 21 17:56:34.284248 osafamfd [2673:imm.cc:0167] TR TRY-AGA

7) Finally SC-2 goes down and comes up again as active:
Oct 21 17:57:47.017327 osafamfd [2673:ndfsm.cc:0324] >> avd_mds_avnd_down_evh: 2020f, 0x265ceb0

Oct 21 17:58:01.000668 osafamfd [4070:role.cc:0175] >> avd_active_role_initialization

8) After the payload hosting safSu=SU2,safSg=SG-C,safApp=SG-C comes up, this SU is chosen for the active assignment:
Oct 21 18:00:23.108237 osafamfd [4070:siass.cc:0148] >> avd_susi_create: safSu=SU2,safSg=SG-C,safApp=SG-C safSi=SG-C,safApp=SG-C state=1

9)Now it gets updated in IMM:
Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create safSi=SG-C,safApp=SG-C
Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >> rt_object_create_common
Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU
Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR attr:saAmfSISUHAState
Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR attr:saAmfSISUHAReadinessState
Oct 21 18:00:23.114506 osafamfd [4070:mds_dt_trans.c:0576] >> mdtm_process_poll_recv_data_tcp


On 22-Oct-16 1:33 AM, David Hoyt wrote:
Hi Praveen,



First off, thank-you for all your help!!



After applying the code changes you suggested, it did work but it took a
long time (over 10 minutes) and not without side effects.

Attached are the amfd logs from each controller.





Here’s a summary:

·         all 2N service group SUs are Active on Server-1, Standby on
Server-2

·         rebooted Server-1

·         PL-5, running with SG-C, SU1, is the first to begin its shutdown
at 17:51:57

Oct 21 17:51:57 SG-C-0 opensafd: Stopping OpenSAF Services



·         SG-C, SU2 begins to go active at 17:52:20. The startup script
returns success at 17:54:19

Oct 21 17:52:20 SG-C-1 osafamfnd[19830]: NO Assigning
'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'

Oct 21 17:52:20 SG-C-1 osafdtmd[19770]: NO Lost contact with 'SG-C-0'

Oct 21 17:54:19 SG-C-1 osafamfnd[19830]: NO Assigned
'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'

Oct 21 17:54:35 SG-C-1 osafdtmd[19770]: NO Lost contact with 'sc-1'



·         I continued to see it with a Standby HA state for almost
another 2 minutes

[root@sc-2 ~]# date; amf-state siass | grep -A 1 -i sg-c

Fri Oct 21 17:56:16 UTC 2016

safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C

saAmfSISUHAState=STANDBY(2)

[root@sc-2 ~]#



·         The same is true for the 2N redundancy OpenSAF SUs. I saw
opensaf detect the loss of the mate at 17:53:22, initiate failover, and log
that it was active at 17:54:35.

Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMD lost contact with peer IMMD
(NCSMDS_RED_DOWN)

Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMND DOWN on active controller
f1 detected at standby immd!! f2. Possible failover

...

Oct 21 17:54:35 sc-2 osafdtmd[2553]: NO Lost contact with 'sc-1'

Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Node Down event for node id 2010f:

Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Current role: STANDBY

...

Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Controller Failover: Setting role
to ACTIVE

Oct 21 17:54:35 sc-2 osafrded[2570]: NO RDE role set to ACTIVE

Oct 21 17:54:35 sc-2 osaflogd[2619]: NO ACTIVE request

Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active

Oct 21 17:54:35 sc-2 osafntfd[2634]: NO ACTIVE request

Oct 21 17:54:35 sc-2 osafclmd[2647]: NO ACTIVE request

Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ACTIVE request

Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ellect_coord invoke from
rda_callback ACTIVE

...

Oct 21 17:54:35 sc-2 osafamfd[2673]: NO Node 'SC-1' left the cluster

Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active DONE!



·         Again, querying the HA state of opensaf did not reflect this
until 17:58:12

[root@sc2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
safSg=2N

Fri Oct 21 17:56:31 UTC 2016

safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF

            saAmfSISUHAState=ACTIVE(1)

--

safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF

            saAmfSISUHAState=STANDBY(2)

[root@sc-2 ~]#

…

[root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
safSg=2N

Fri Oct 21 17:58:12 UTC 2016

safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF

        saAmfSISUHAState=ACTIVE(1)

[root@sc-2 ~]#



·         Also, as a heads up, opensaf on SC-2 was restarted at 17:55:53.

I have a script running on both controller nodes. It constantly pings
the mate controller and, if the ping fails, a counter is started.
When the count reaches a threshold, the HA state of the opensaf SU on
the current node is checked. If its HA state is standby, then the
script will initiate an opensaf restart.
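
A minimal sketch of how such a watchdog might look (illustrative only; the address, threshold, logger tags and restart command below are assumptions, not the actual script):

#!/bin/bash
MATE_IP="192.168.0.1"        # assumed address of the mate controller
THRESHOLD=15                 # failed pings before acting
counter=0
while true; do
    if ping -c 1 -W 1 "$MATE_IP" > /dev/null 2>&1; then
        counter=0
    else
        counter=$((counter + 1))
        logger -t wait_for_mate_connection "threshold_counter: '$counter'"
        if [ "$counter" -ge "$THRESHOLD" ]; then
            # Restart opensaf only if the local OpenSAF SU is still standby.
            if amf-state siass | grep -A 1 "safSu=SC-2" | grep -q STANDBY; then
                logger -t mate_down_handler "opensaf SU is Standby, no mate available"
                /etc/init.d/opensafd restart
            fi
            counter=0
        fi
    fi
    sleep 4
done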

Basically, it’s a backup mechanism in case detection of the loss of the
mate controller takes too long. I found this code was hit today at
17:54:53.

Oct 21 17:54:53 sc-2 wait_for_mate_connection: Cannot ping mate controller

Oct 21 17:54:53 sc-2 wait_for_mate_connection: threshold_counter: '1'

...

Oct 21 17:55:53 sc-2 wait_for_mate_connection: threshold_counter: '15'

Oct 21 17:55:53 sc-2 mate_down_handler: Mate connection down threshold
exceeded - analyzing data...

Oct 21 17:55:53 sc-2 mate_down_handler: Checking opensaf SU status...

Oct 21 17:55:53 sc-2 mate_down_handler: opensaf SU is Standby, no mate
available

Oct 21 17:55:54 sc-2 mate_down_handler: Restarting opensaf

Oct 21 17:55:54 sc-2 mate_down_handler: opensaf restart has been
initiated...



By 18:05:29, all the HA states were correct.



[root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
safSg=2N

Fri Oct 21 18:05:29 UTC 2016

safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF

        saAmfSISUHAState=STANDBY(2)

--

--

safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF

        saAmfSISUHAState=ACTIVE(1)

[root@sc-2 ~]#





[root@sc-2 ~]# date; amf-state siass | grep -A 1 SG-C

Fri Oct 21 18:05:46 UTC 2016

safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C

        saAmfSISUHAState=STANDBY(2)

--

safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C

        saAmfSISUHAState=ACTIVE(1)

[root@sc-2 ~]#





Thanks,

/David/





*From:*praveen malviya [mailto:praveen.malv...@oracle.com]
*Sent:* Friday, October 21, 2016 10:38 AM
*To:* David Hoyt <david.h...@genband.com>
*Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY





One more related ticket needs to be included:

changeset: 7029:229ff887c7c8
branch: opensaf-4.6.x
parent: 7025:677617e23ea3
user: praveen.malv...@oracle.com
date: Thu Oct 22 09:57:18 2015 +0530
summary: amfd: fix deletion of runtime objects of class
SaAmfSIAssignment [#1540]

Thanks,
Praveen
On 21-Oct-16 12:06 PM, praveen malviya wrote:
Hi Dave,

In ticket #1141, there are two patches in 4.6 with changesets:

changeset: 6839:964e043fa545
branch: opensaf-4.6.x
parent: 6836:39806387b4c3
user: praveen.malv...@oracle.com
date: Thu Sep 17 10:33:32 2015 +0530
summary: amfd: maintain runtime updates for su, comp, si and csi at
standby [#1141]

and
changeset: 6896:18b4458e357a
branch: opensaf-4.6.x
tag: tip
parent: 6886:22d7d85287e6
user: praveen.malv...@oracle.com
date: Mon Sep 28 11:51:34 2015 +0530
summary: amfd: include comp and su oper state at standby amfd[#1141]
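
One way to check whether these changesets are already present in your tree might be (assuming a Mercurial checkout of the opensaf-4.6.x branch):

hg log -r 964e043fa545      # first #1141 changeset
hg log -r 18b4458e357a      # second #1141 changeset
hg log -k "#1141"           # or search commit messages for the ticket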

Please share AMFD traces if you are still observing the issue after applying
the above two patches.

Thanks,
Praveen
On 21-Oct-16 4:24 AM, David Hoyt wrote:
Hi Praveen,



Since I’m using OpenSAF 4.6.0, I did not have the fix for #1141.

It looks like it was included as part of the 4.6.1 maintenance release.



So, I looked at the file diff for #1141, made the same code changes,
rebuilt the rpms, and replaced them in my lab.

However, I’m still seeing the same behavior.



Are there other changes I should look at adding?



Thanks,

Dave





*From:*praveen malviya [mailto:praveen.malv...@oracle.com]
*Sent:* Wednesday, October 19, 2016 4:30 AM
*To:* David Hoyt <david.h...@genband.com>;
opensaf-users@lists.sourceforge.net
*Subject:* Re: [users] both SUs within a 2N Service Group appear as
STANDBY






Analysis:
Active controller SC-1 successfully performed fail-over of SU1 of SG-C
by making SU2 active. But AMFD could not update this new HA state of SU2
in IMM as IMMND was already down by that time. AMFD at SC-1 got
TRY_AGAIN from IMM.

Since this is an OpenSAF 4.6 release, it does include the fix for #1141.
Ticket #1141 was pushed to the 4.6 branch after the 4.6 GA. In the fix of
#1141, the standby AMFD updates the HA state after becoming active.

Also, please include the fix of #2009 when available (it will be on
top of #1141).


Messages from the AMFD trace of SC-1 which confirm this:
-------------------------------------------------
1)Active AMFD at SC-1 starts failover of Node PL-5 at:
Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'

2) It sends an active assignment to SU2:
Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >>
avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1

3) It seems the local IMMND (also LGS and NTFS) went down before AMFD
got a successful response:
Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN

4) AMFD at SC-1 also gets the response for this successful assignment:
Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >> avd_su_si_assign_evh: id:33, node:2060f, act:5, 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
AMFD adds IMM update in job queue:
Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate
Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' saAmfCSICompHAState
Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] << avd_saImmOiRtObjectUpdate

5)But AMFD at SC-1 could not update HA state in IMM:
Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update
'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
saAmfSISUHAState
Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >>
rt_object_update_common
Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2
ERR_TRY_AGAIN: IMMND is DOWN
Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update
'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
saAmfSISUHAState
Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >>
rt_object_update_common
Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2
ERR_TRY_AGAIN: IMMND is DOWN
Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
----------------------------------------------------------------


Thanks,
Praveen


On 19-Oct-16 12:28 AM, David Hoyt wrote:
Attached are the amfd logs from both controllers.



Server-1: has 3 VMs for SC-1, PL-3 and PL-5. All the SUs within a 2N
redundancy are Active.

Server-2: has 3 VMs for SC-2, PL-4 and PL-6. All the SUs within a 2N
redundancy are Standby



SC-1: opensaf SU1 and SG-A, SU1

SC-2: opensaf SU2 and SG-A, SU2

PL-3: SG-B, SU1

PL-4: SG-B, SU2

PL-5: SG-C, SU1

PL-6: SG-C, SU2





Initial conditions: All the SUs within a 2N redundancy are Active on
Server-1.



I issued a reboot of Server-1 around Oct 18 17:23. From the amfd logs of
SC-1, it appears PL-5 began the failover around 17:23:19.

Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'





I waited to see if SG-C’s HA state would correct itself but it did not:



[root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C

Tue Oct 18 17:36:26 UTC 2016

safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C

saAmfSISUHAState=STANDBY(2)

--

safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C

saAmfSISUHAState=STANDBY(2)

[root@sc-2 ~]#





Even though the controller indicates that both SUs are standby, the logs
from the original standby SG-C, SU2 show it going active:



Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning
'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'

Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE
to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'

Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253

Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253

Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253

Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF' by Impl-id: 253

Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253

Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by Impl-id: 253

Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
message:51732

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
message type 57 - ignoring

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
message:51733

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
message type 57 - ignoring

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node received
for nodeId:2010f pid:1739

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 253
<0, 2010f(down)> (safAmfService)

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 251
<0, 2010f(down)> (safClmService)

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 250
<0, 2010f(down)> (safLogService)

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 248
<0, 2010f(down)> (@OpenSafImmPBE)

Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer disconnected 249
<0, 2010f(down)> (OsafImmPbeRt_B)

Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE
to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'

Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned
'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'

Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'





Thanks,

Dave





*From:*praveen malviya [mailto:praveen.malv...@oracle.com]
*Sent:* Tuesday, October 18, 2016 8:30 AM
*To:* David Hoyt <david.h...@genband.com>;
opensaf-users@lists.sourceforge.net
*Subject:* Re: [users] both SUs within a 2N Service Group appear as
STANDBY





Please see response inline with [Praveen].

Thanks,
Praveen

On 17-Oct-16 11:46 PM, David Hoyt wrote:
Hi all,

I'm encountering a scenario where opensaf shows the HA state of both
SUs within a 2N redundancy Service Group as standby.
Setup:

- Opensaf 4.6 running on RHEL 6.6 VMs with TCP

- 2 controllers, 4 payloads

- SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)

- PL-3 & PL4 have SU1 & SU2 from SG-A (2N redundancy)

- PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)

- Server-1 has three VMs consisting of SC-1, PL-3 and PL-5

- Likewise, server-2 has SC-2, PL-4 and PL-6

I reboot server 1 and, shortly afterwards, the SG-A SUs begin to fail
over. SU2 on PL-4 goes active.
Around the same time, the opensaf 2N SUs fail over.
After the dust has settled and server-1 and its VMs come back, all
appears fine except the SG-A SUs: they both have a standby HA state.

Is there any way to correct this?
[Praveen] I think there is no issue from the callback perspectives as
SU2 on PL-4 was made active(comps received callbacks). Only problem is
outout of "amf-state siass".
Please share AMFD traces from both the controller.



Is there some audit that periodically checks the validity of the HA
states?

Now, when SG-A, SU1 recovered, I did swact the SUs and that corrected the
HA state. However, if server-1 goes down for an extended period, the HA
state of SG-A, SU2 will appear as Standby when it's actually running as
active.


Before the reboot:

[root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
--
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=STANDBY(2)
[root@jenga-56-sysvm-1 ~]#
[root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
saAmfSISUHAState=STANDBY(2)
--
safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
saAmfSISUHAState=ACTIVE(1)
[root@sc-2 ~]#
[root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
saAmfSISUHAState=STANDBY(2)
--
safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
saAmfSISUHAState=ACTIVE(1)
[root@sc-2 ~]#



After the reboot:
[root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 safSg=2N
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=STANDBY(2)
--
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
[root@sc-2 ~]#
[root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
saAmfSISUHAState=STANDBY(2)
--
safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
saAmfSISUHAState=STANDBY(2)
[root@sc-2 ~]#
[root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
saAmfSISUHAState=ACTIVE(1)
--
safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
saAmfSISUHAState=STANDBY(2)
[root@sc-2 ~]#











diff --git a/osaf/services/saf/amf/amfd/ckpt_updt.cc b/osaf/services/saf/amf/amfd/ckpt_updt.cc
--- a/osaf/services/saf/amf/amfd/ckpt_updt.cc
+++ b/osaf/services/saf/amf/amfd/ckpt_updt.cc
@@ -459,6 +459,10 @@ uint32_t avd_ckpt_siass(AVD_CL_CB *cb, A
                        } else {
                                su_si_rel_ptr->comp_name = "";
                                su_si_rel_ptr->csi_name = "";
+                               //This is update of SUSI because of modification of assignment.
+                               if (action == NCS_MBCSV_ACT_UPDATE) {
+                                       avd_susi_update(su_si_rel_ptr, su_si_rel_ptr->state);
+                               }
                        }
                } else {
                        LOG_ER("%s:%u", __FUNCTION__, __LINE__);
diff --git a/osaf/services/saf/amf/amfd/siass.cc b/osaf/services/saf/amf/amfd/siass.cc
--- a/osaf/services/saf/amf/amfd/siass.cc
+++ b/osaf/services/saf/amf/amfd/siass.cc
@@ -127,9 +127,10 @@ void avd_susi_update(AVD_SU_SI_REL *susi

        TRACE("HA State %s of %s for %s", avd_ha_state[ha_state],
        susi->su->name.c_str(), susi->si->name.c_str());
-       saflog(LOG_NOTICE, amfSvcUsrName, "HA State %s of %s for %s",
-               avd_ha_state[ha_state], susi->su->name.c_str(), susi->si->name.c_str());
-
+       if (avd_cb->avail_state_avd == SA_AMF_HA_ACTIVE)
+               saflog(LOG_NOTICE, amfSvcUsrName, "HA State %s of %s for %s",
+                               avd_ha_state[ha_state], susi->su->name.c_str(), susi->si->name.c_str());
+
        avd_saImmOiRtObjectUpdate(Amf::to_string(&dn), "saAmfSISUHAState",
           SA_IMM_ATTR_SAUINT32T, &ha_state);
        osaf_extended_name_free(&dn);
