Hi Dave,

Please share the latest traces.

Thanks,
Praveen


On 25-Oct-16 9:09 PM, David Hoyt wrote:
> Hi Praveen,
>
>
>
> Sorry, but the changes you suggested don’t seem to make a difference.
>
>
>
> I’m now looking at coming up with a manual procedure so that before we
> perform maintenance on a server, the VMs running on that server are
> terminated first. That way, I’ll be able to control the VM termination
> order and hopefully avoid all of these failovers happening at the same
> time.
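>
> For reference, a minimal sketch of what that pre-maintenance drain could
> look like, assuming the VMs are managed through libvirt/KVM (the domain
> names and the ordering below are only placeholders):
>
>     # Shut down the VMs on this server in an explicit order (payloads
>     # first, controller last), waiting for each to stop before the next.
>     for vm in payload-vm-1 payload-vm-2 controller-vm; do
>         virsh shutdown "$vm"      # send an ACPI shutdown request to the guest
>         while virsh list --name | grep -qx "$vm"; do
>             sleep 5               # wait until this domain is no longer running
>         done
>     done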
>
>
>
> We were looking at doing this controlled procedure anyway, but I’m still
> concerned about the uncontrolled scenario. It doesn’t appear to me that
> opensaf can handle it, and afterwards it requires manual recovery.
>
>
>
> Regards,
>
> David
>
>
>
>
>
> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
> *Sent:* Monday, October 24, 2016 2:56 AM
> *To:* David Hoyt <david.h...@genband.com>
> *Cc:* opensaf-users@lists.sourceforge.net
> *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY
>
>
>
>
>
>
> Hi Dave,
>
> SC-2 became active in no time, at "Oct 21 17:54:35.111631". But somehow
> IMM got busy with something and continuously gave TRY_AGAIN until SC-2
> also went down at "Oct 21 17:54:34". When SC-2 came back up and the
> payload hosting SU2 also joined, the SC-2 AMFD reassigned the active
> state and updated it in IMM at:
>
> Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create
> safSi=SG-C,safApp=SG-C
> Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >>
> rt_object_create_common
> Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU
> Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR
> attr:saAmfSISUHAState
> Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR
> attr:saAmfSISUHAReadinessState
>
> But since SC-2 also went down, the case no longer remains the same.
> Please investigate why IMMND got stuck.
> Still, I see that a small patch is needed from the AMF perspective. I will
> try to cover this case in #2009 (already published). Attached is 1.patch,
> which is extracted out of 2009.patch. Please try with 1.patch on top of
> #1540 and #1141.
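>
> To spell out the intended order, a rough sketch (assuming a source tree
> that already carries the #1141 and #1540 changesets; the directory name
> is only an example):
>
>     cd opensaf-4.6.x               # tree with #1141 and #1540 already applied
>     patch -p1 --dry-run < 1.patch  # check that the attached patch applies cleanly
>     patch -p1 < 1.patch            # apply it, then rebuild the rpms as before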
>
> Thanks,
> Praveen
>
>
> Messages from SC-1 AMFD:
> 1)Node failover of PL-5 started:
> Oct 21 17:52:20.232746 osafamfd [2864:ndproc.cc:0923] >>
> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>
> 2)AMFD starts failover of 'safSu=SU1,safSg=SG-C,safApp=SG-C' hosted on PL-5.
> Oct 21 17:52:20.275267 osafamfd [2864:sg_2n_fsm.cc:3262] >> node_fail:
> 'safSu=SU1,safSg=SG-C,safApp=SG-C', 0
>
> and sends active to its peer:
> Oct 21 17:52:20.275511 osafamfd [2864:sgproc.cc:2114] >>
> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1
>
> 3)This time SC-1 AMFD itself was able to update in IMM:
> Oct 21 17:52:20.363304 osafamfd [2864:imm.cc:0199] >> exec: Delete
> safCSIComp=safComp=SG-C\,safSu=SU1\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C
> Oct 21 17:52:20.364569 osafamfd [2864:mds_dt_trans.c:0576] >>
> mdtm_process_poll_recv_data_tcp
> Oct 21 17:52:20.364640 osafamfd [2864:imm.cc:0222] << exec
> Oct 21 17:52:20.364651 osafamfd [2864:imm.cc:0334] << execute: 1
> Oct 21 17:52:20.364657 osafamfd [2864:imm.cc:0330] >> execute
> Oct 21 17:52:20.364662 osafamfd [2864:imm.cc:0199] >> exec: Delete
> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
> Oct 21 17:52:20.366127 osafamfd [2864:mds_dt_trans.c:0576] >>
> mdtm_process_poll_recv_data_tcp
> Oct 21 17:52:20.366201 osafamfd [2864:imm.cc:0222] << exec
> Oct 21 17:52:20.366210 osafamfd [2864:imm.cc:0334] << execute: 1
>
> This means a user will not see at least the above two runtime objects
> using the queries amf-state csiass and amf-state siass, respectively.
>
> 4) Down events of the directors on SC-1 started coming:
> Oct 21 17:53:22.303749 osafamfd [2864:ntfa_mds.c:0369] TR NTFS down.
> Oct 21 17:53:22.307098 osafamfd [2864:lga_mds.c:0491] TR LGS down
> Oct 21 17:53:22.311434 osafamfd [2864:clma_mds.c:0959] TR CLMS down
> Local IMMND down:
> Oct 21 17:53:22.324784 osafamfd [2864:imma_mds.c:0404] T3 IMMND DOWN
>
> 5)SC-1 AMFD gets the response for safSu=SU2,safSg=SG-C,safApp=SG-C being
> active:
> Oct 21 17:54:19.465629 osafamfd [2864:sgproc.cc:0889] >>
> avd_su_si_assign_evh: id:28, node:2060f, act:5,
> 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
> AMFD generated events for IMM and notification for SU2 being active:
> Oct 21 17:54:19.466208 osafamfd [2864:imm.cc:1560] >>
> avd_saImmOiRtObjectUpdate:
> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
> saAmfSISUHAState
> Oct 21 17:54:19.466217 osafamfd [2864:imm.cc:1580] <<
> avd_saImmOiRtObjectUpdate
> Oct 21 17:54:19.466223 osafamfd [2864:imm.cc:1560] >>
> avd_saImmOiRtObjectUpdate:
> 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
>
> saAmfCSICompHAState
> Oct 21 17:54:19.466231 osafamfd [2864:imm.cc:1580] <<
> avd_saImmOiRtObjectUpdate
> Oct 21 17:54:19.466236 osafamfd [2864:siass.cc:0446] >>
> avd_gen_su_ha_state_changed_ntf: 'safSi=SG-C,safApp=SG-C' assigned to
> 'safSu=SU2,safSg=SG-C,safApp=SG-C' HA state 'ACTIVE'
>
> Since NTFS is down, notification is not sent:
> Oct 21 17:54:19.466522 osafamfd [2864:ntf.cc:0717] ER
> sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
>
> 6)SC-1 AMFD could not update the new states of SU2 in IMM (the local IMMND is down):
> Oct 21 17:54:19.469032 osafamfd [2864:imm.cc:0153] >> exec: Update
> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
> saAmfSISUHAState
> Oct 21 17:54:19.469041 osafamfd [2864:imma_oi_api.c:2417] >>
> rt_object_update_common
> Oct 21 17:54:19.469048 osafamfd [2864:imma_oi_api.c:2435] T2
> ERR_TRY_AGAIN: IMMND is DOWN
> Oct 21 17:54:19.469054 osafamfd [2864:imm.cc:0167] TR TRY-AGAIN
> Oct 21 17:54:19.469058 osafamfd [2864:imm.cc:0182] << exec
> Oct 21 17:54:19.469063 osafamfd [2864:imm.cc:0334] << execute: 2
> Oct 21 17:54:19.969631 osafamfd [2864:imm.cc:0330] >> execute
> Oct 21 17:54:19.969651 osafamfd [2864:imm.cc:0153] >> exec: Update
> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
> saAmfSISUHAState
> Oct 21 17:54:19.969691 osafamfd [2864:imma_oi_api.c:2417] >>
> rt_object_update_common
> Oct 21 17:54:19.969698 osafamfd [2864:imma_oi_api.c:2435] T2
> ERR_TRY_AGAIN: IMMND is DOWN
>
> 7)SC-1 goes down:
> Oct 21 17:54:34.706944 osafamfd [2864:ndfsm.cc:0324] >>
> avd_mds_avnd_down_evh: 2010f, 0x1a2ce20
>
>
>
> Messages from SC-2 AMFD:
> 1)Failover successful:
> Oct 21 17:54:35.091478 osafamfd [2673:role.cc:0075] >>
> avd_role_change_evh: cause=1, role=1
> Oct 21 17:54:35.091506 osafamfd [2673:role.cc:0254] >> avd_role_failover
> Oct 21 17:54:35.091537 osafamfd [2673:role.cc:0255] NO FAILOVER StandBy
> --> Active
> Oct 21 17:54:35.091588 osafamfd [2673:mds_dt_trans.c:0576] >>
> mdtm_process_poll_recv_dat
>
> ....
> Oct 21 17:54:35.111631 osafamfd [2673:role.cc:0364] NO FAILOVER StandBy
> --> Active DONE!
> Oct 21 17:54:35.111637 osafamfd [2673:role.cc:0373] << avd_role_failover
> Oct 21 17:54:35.111643 osafamfd [2673:role.cc:0153] <<
> avd_role_change_evh: 1
>
> 2)After some updates to IMM, AMFD is getting TRY_AGAIN:
>
> Oct 21 17:54:35.137208 osafamfd [2673:imma_oi_api.c:2417] >>
> rt_object_update_common
> Oct 21 17:54:35.137472 osafamfd [2673:mds_dt_trans.c:0576] >>
> mdtm_process_poll_recv_data_tcp
> Oct 21 17:54:35.137514 osafamfd [2673:mds_dt_trans.c:0576] >>
> mdtm_process_poll_recv_data_tcp
> Oct 21 17:54:35.137543 osafamfd [2673:imma_proc.c:1346] TR ** Event type:12
> Oct 21 17:54:35.137550 osafamfd [2673:imma_proc.c:0752] >>
> imma_proc_rt_attr_update
> Oct 21 17:54:35.137556 osafamfd [2673:imma_proc.c:0792] TR Posting RT
> UPDATE CALLBACK to IPC mailbox
> Oct 21 17:54:35.137564 osafamfd [2673:imma_proc.c:0798] <<
> imma_proc_rt_attr_update
> Oct 21 17:54:35.137569 osafamfd [2673:imma_proc.c:1239] >>
> imma_proc_free_pointers
> Oct 21 17:54:35.137573 osafamfd [2673:imma_proc.c:1332] <<
> imma_proc_free_pointers
> Oct 21 17:54:35.137594 osafamfd [2673:imma_oi_api.c:2690] <<
> rt_object_update_common
> Oct 21 17:54:35.137601 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN
> Oct 21 17:54:35.137606 osafamfd [2673:imm.cc:0182] << exec
> Oct 21 17:54:35.137611 osafamfd [2673:imm.cc:0334] << execute: 2
>
> 3)Node failover of PL-3:
> Oct 21 17:55:29.917633 osafamfd [2673:ndproc.cc:0923] >>
> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
> Oct 21 17:55:29.922470 osafamfd [2673:ndproc.cc:0929] << avd_node_failover
>
> 4)TRY_AGAIN continues until:
> Oct 21 17:56:33.996617 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN
>
> 5)Down event for directors:
> Oct 21 17:56:34.278064 osafamfd [2673:ntfa_mds.c:0369] TR NTFS down.
> Oct 21 17:56:34.278561 osafamfd [2673:clma_mds.c:0959] TR CLMS down.
> Oct 21 17:56:34.282331 osafamfd [2673:imma_mds.c:0404] T3 IMMND DOWN
>
> 6)Now AMFD sees the local IMMND down:
> Oct 21 17:56:34.284237 osafamfd [2673:imma_oi_api.c:2417] >>
> rt_object_update_common
> Oct 21 17:56:34.284243 osafamfd [2673:imma_oi_api.c:2435] T2
> ERR_TRY_AGAIN: IMMND is DOWN
> Oct 21 17:56:34.284248 osafamfd [2673:imm.cc:0167] TR TRY-AGA
>
> 7)Finally SC-2 goes down and comes up again as active:
> Oct 21 17:57:47.017327 osafamfd [2673:ndfsm.cc:0324] >>
> avd_mds_avnd_down_evh: 2020f, 0x265ceb0
>
> Oct 21 17:58:01.000668 osafamfd [4070:role.cc:0175] >>
> avd_active_role_initialization
>
> 8)After the payload hosting safSu=SU2,safSg=SG-C,safApp=SG-C comes up, this
> SU is chosen for active assignments:
> Oct 21 18:00:23.108237 osafamfd [4070:siass.cc:0148] >> avd_susi_create:
> safSu=SU2,safSg=SG-C,safApp=SG-C safSi=SG-C,safApp=SG-C state=1
>
> 9)Now it gets updated in IMM:
> Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create
> safSi=SG-C,safApp=SG-C
> Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >>
> rt_object_create_common
> Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU
> Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR
> attr:saAmfSISUHAState
> Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR
> attr:saAmfSISUHAReadinessState
> Oct 21 18:00:23.114506 osafamfd [4070:mds_dt_trans.c:0576] >>
> mdtm_process_poll_recv_data_tcp
>
>
> On 22-Oct-16 1:33 AM, David Hoyt wrote:
>> Hi Praveen,
>>
>>
>>
>> First off, thank-you for all your help!!
>>
>>
>>
>> After applying the code changes you suggested, it did work, but it took a
>> long time (over 10 minutes) and not without side effects.
>>
>> Attached are the amfd logs from each controller.
>>
>>
>>
>>
>>
>> Here’s a summary:
>>
>> · all 2N service group SUs are Active on Server-1, Standby on
>> Server-2
>>
>> · rebooted Server-1
>>
>> · PL-5, running with SG-C,SU1 is the first to begin its shutdown
>> at 17:51:57
>>
>> Oct 21 17:51:57 SG-C-0 opensafd: Stopping OpenSAF Services
>>
>>
>>
>> · SG-C, SU2 begins to go active at 17:52:20. The startup script
>> returns success at 17:54:19
>>
>> Oct 21 17:52:20 SG-C-1 osafamfnd[19830]: NO Assigning
>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>>
>> Oct 21 17:52:20 SG-C-1 osafdtmd[19770]: NO Lost contact with 'SG-C-0'
>>
>> Oct 21 17:54:19 SG-C-1 osafamfnd[19830]: NO Assigned
>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>>
>> Oct 21 17:54:35 SG-C-1 osafdtmd[19770]: NO Lost contact with 'sc-1'
>>
>>
>>
>> · I continued to see it with a Standby HA state for almost
>> another 2 minutes
>>
>> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i sg-c
>>
>> Fri Oct 21 17:56:16 UTC 2016
>>
>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>
>> saAmfSISUHAState=STANDBY(2)
>>
>> [root@sc-2 ~]#
>>
>>
>>
>> ·         The same is true for the 2N redundancy OpenSAF SUs. I saw
>> opensaf detect the loss of its mate at 17:53:22, initiate a failover, and
>> log that it was active at 17:54:35.
>>
>> Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMD lost contact with peer IMMD
>> (NCSMDS_RED_DOWN)
>>
>> Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMND DOWN on active controller
>> f1 detected at standby immd!! f2. Possible failover
>>
>> ...
>>
>> Oct 21 17:54:35 sc-2 osafdtmd[2553]: NO Lost contact with 'sc-1'
>>
>> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Node Down event for node id 2010f:
>>
>> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Current role: STANDBY
>>
>> ...
>>
>> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Controller Failover: Setting role
>> to ACTIVE
>>
>> Oct 21 17:54:35 sc-2 osafrded[2570]: NO RDE role set to ACTIVE
>>
>> Oct 21 17:54:35 sc-2 osaflogd[2619]: NO ACTIVE request
>>
>> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active
>>
>> Oct 21 17:54:35 sc-2 osafntfd[2634]: NO ACTIVE request
>>
>> Oct 21 17:54:35 sc-2 osafclmd[2647]: NO ACTIVE request
>>
>> Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ACTIVE request
>>
>> Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ellect_coord invoke from
>> rda_callback ACTIVE
>>
>> ...
>>
>> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO Node 'SC-1' left the cluster
>>
>> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active DONE!
>>
>>
>>
>> ·         Again, the HA state reported by the query did not reflect this
>> until 17:58:12
>>
>> [root@sc2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
>> safSg=2N
>>
>> Fri Oct 21 17:56:31 UTC 2016
>>
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>
>> saAmfSISUHAState=ACTIVE(1)
>>
>> --
>>
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>
>> saAmfSISUHAState=STANDBY(2)
>>
>> [root@sc-2 ~]#
>>
>> …
>>
>> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
>> safSg=2N
>>
>> Fri Oct 21 17:58:12 UTC 2016
>>
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>
>> saAmfSISUHAState=ACTIVE(1)
>>
>> [root@sc-2 ~]#
>>
>>
>>
>> · Also, as a heads up, opensaf on SC-2 was restarted at 17:55:53.
>>
>> I have a script running on both controller nodes. It constantly pings
>> the mate controller, and if the ping fails, a counter is started. When
>> the count reaches a threshold, the HA state of the opensaf SU on the
>> current node is checked. If its HA state is standby, then the script
>> will initiate an opensaf restart.
>>
>> Basically, it’s a backup mechanism in case detection of the loss of the
>> mate controller takes too long. I found this code path was hit today at
>> 17:54:53 (a rough sketch of the logic follows the log excerpt below).
>>
>> Oct 21 17:54:53 sc-2 wait_for_mate_connection: Cannot ping mate controller
>>
>> Oct 21 17:54:53 sc-2 wait_for_mate_connection: threshold_counter: '1'
>>
>> ...
>>
>> Oct 21 17:55:53 sc-2 wait_for_mate_connection: threshold_counter: '15'
>>
>> Oct 21 17:55:53 sc-2 mate_down_handler: Mate connection down threshold
>> exceeded - analyzing data...
>>
>> Oct 21 17:55:53 sc-2 mate_down_handler: Checking opensaf SU status...
>>
>> Oct 21 17:55:53 sc-2 mate_down_handler: opensaf SU is Standby, no mate
>> available
>>
>> Oct 21 17:55:54 sc-2 mate_down_handler: Restarting opensaf
>>
>> Oct 21 17:55:54 sc-2 mate_down_handler: opensaf restart has been
>> initiated...
>>
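>> In rough shell terms, the watchdog amounts to something like the sketch
>> below (simplified; the mate address, threshold, SU name and restart
>> command are placeholders rather than the real script):
>>
>>     MATE_IP="192.168.0.2"   # placeholder: address of the mate controller
>>     THRESHOLD=15            # consecutive failed pings before acting
>>     count=0
>>     while true; do
>>         if ping -c 1 -W 2 "$MATE_IP" >/dev/null 2>&1; then
>>             count=0                          # mate reachable again, reset
>>         else
>>             count=$((count + 1))
>>             if [ "$count" -ge "$THRESHOLD" ]; then
>>                 # mate unreachable too long: check this node's OpenSAF SU HA state
>>                 ha=$(amf-state siass | grep -A 1 "safSu=SC-2" | grep saAmfSISUHAState)
>>                 if echo "$ha" | grep -q STANDBY; then
>>                     /etc/init.d/opensafd restart   # still standby with no mate: restart opensaf
>>                 fi
>>                 count=0
>>             fi
>>         fi
>>         sleep 4
>>     done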
>>
>>
>> By 18:05:29, all the HA states were correct.
>>
>>
>>
>> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2
>> safSg=2N
>>
>> Fri Oct 21 18:05:29 UTC 2016
>>
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>
>> saAmfSISUHAState=STANDBY(2)
>>
>> --
>>
>> --
>>
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>
>> saAmfSISUHAState=ACTIVE(1)
>>
>> [root@sc-2 ~]#
>>
>>
>>
>>
>>
>> [root@sc-2 ~]# date; amf-state siass | grep -A 1 SG-C
>>
>> Fri Oct 21 18:05:46 UTC 2016
>>
>> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>
>> saAmfSISUHAState=STANDBY(2)
>>
>> --
>>
>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>
>> saAmfSISUHAState=ACTIVE(1)
>>
>> [root@sc-2 ~]#
>>
>>
>>
>>
>>
>> Thanks,
>>
>> /David/
>>
>>
>>
>>
>>
>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
>> *Sent:* Friday, October 21, 2016 10:38 AM
>> *To:* David Hoyt <david.h...@genband.com>
>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as
>> STANDBY
>>
>>
>>
>>
>>
>> One more related ticket needs to be included:
>>
>> changeset: 7029:229ff887c7c8
>> branch: opensaf-4.6.x
>> parent: 7025:677617e23ea3
>> user: praveen.malv...@oracle.com
>> date: Thu Oct 22 09:57:18 2015 +0530
>> summary: amfd: fix deletion of runtime objects of class
>> SaAmfSIAssignment [#1540]
>>
>> Thanks,
>> Praveen
>> On 21-Oct-16 12:06 PM, praveen malviya wrote:
>>> Hi Dave,
>>>
>>> In ticket #1141, there are two patches in 4.6 with changesets:
>>>
>>> changeset: 6839:964e043fa545
>>> branch: opensaf-4.6.x
>>> parent: 6836:39806387b4c3
>>> user: praveen.malv...@oracle.com
>>> date: Thu Sep 17 10:33:32 2015 +0530
>>> summary: amfd: maintain runtime updates for su, comp, si and csi at
>>> standby [#1141]
>>>
>>> and
>>> changeset: 6896:18b4458e357a
>>> branch: opensaf-4.6.x
>>> tag: tip
>>> parent: 6886:22d7d85287e6
>>> user: praveen.malv...@oracle.com
>>> date: Mon Sep 28 11:51:34 2015 +0530
>>> summary: amfd: include comp and su oper state at standby amfd[#1141]
>>>
>>> Please share AMFD traces if you are still observing the issue after
>>> applying the above two patches.
>>>
>>> Thanks,
>>> Praveen
>>> On 21-Oct-16 4:24 AM, David Hoyt wrote:
>>>> Hi Praveen,
>>>>
>>>>
>>>>
>>>> Since I’m using OpenSAF 4.6.0, I did not have the fix for #1141.
>>>>
>>>> It looks like it was included as part of the 4.6.1 maintenance release.
>>>>
>>>>
>>>>
>>>> So, I looked at the file diff for #1141, made the same code changes,
>>>> rebuilt the rpms, and replaced them in my lab.
>>>>
>>>> However, I’m still seeing the same behavior.
>>>>
>>>>
>>>>
>>>> Are there other changes I should look at adding?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Dave
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
>>>> *Sent:* Wednesday, October 19, 2016 4:30 AM
>>>> *To:* David Hoyt <david.h...@genband.com>;
>>>> opensaf-users@lists.sourceforge.net
>>>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as
>>>> STANDBY
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Analysis:
>>>> Active controller SC-1 successfully performed fail-over of SU1 of SG-C
>>>> by making SU2 active. But AMFD could not update this new HA state of SU2
>>>> in IMM as IMMND was already down by that time. AMFD at SC-1 got
>>>> TRY_AGAIN from IMM.
>>>>
>>>> Since this is an OpenSAF 4.6 release, it may not include the fix for
>>>> #1141: ticket #1141 was pushed to the 4.6 branch after the 4.6 GA. In the
>>>> fix of #1141, the standby AMFD updates the HA state after becoming active.
>>>>
>>>> Also please include the fix of #2009 when available (it will go on top
>>>> of #1141).
>>>>
>>>>
>>>> Messages from the AMFD trace of SC-1 which confirm this:
>>>> -------------------------------------------------
>>>> 1)Active AMFD at SC-1 starts failover of Node PL-5 at:
>>>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
>>>> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>>>>
>>>> 2)It sends active assignment to SU2 :
>>>> Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >>
>>>> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1
>>>>
>>>> 3)It seems the local IMMND (and also LGS and NTFS) went down before AMFD
>>>> got a successful response:
>>>> Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN
>>>>
>>>> 4)AMFD at SC-1 gets response for this successful assignment also:
>>>> Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >>
>>>> avd_su_si_assign_evh: id:33, node:2060f, act:5,
>>>>
>>>> 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0
>>>> AMFD adds IMM update in job queue:
>>>> Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >>
>>>> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-
>>>>
>>>> C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState
>>>> Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] <<
>>>> avd_saImmOiRtObjectUpdate
>>>> Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >>
>>>> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C
>>>>
>>>> \,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
>>>> saAmfCSICompHAState
>>>> Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] <<
>>>> avd_saImmOiRtObjectUpdate
>>>>
>>>> 5)But AMFD at SC-1 could not update HA state in IMM:
>>>> Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update
>>>> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
>>>> saAmfSISUHAState
>>>> Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >>
>>>> rt_object_update_common
>>>> Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2
>>>> ERR_TRY_AGAIN: IMMND is DOWN
>>>> Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
>>>> Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec
>>>> Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2
>>>> Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute
>>>> Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update
>>>> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C'
>>>> saAmfSISUHAState
>>>> Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >>
>>>> rt_object_update_common
>>>> Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2
>>>> ERR_TRY_AGAIN: IMMND is DOWN
>>>> Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN
>>>> Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec
>>>> Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2
>>>> ----------------------------------------------------------------
>>>>
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>>
>>>> On 19-Oct-16 12:28 AM, David Hoyt wrote:
>>>>> Attached are the amfd logs from both controllers.
>>>>>
>>>>>
>>>>>
>>>>> Server-1: has 3 VMs for SC-1, PL-3 and PL-5. All the SUs within a 2N
>>>>> redundancy are Active.
>>>>>
>>>>> Server-2: has 3 VMs for SC-2, PL-4 and PL-6. All the SUs within a 2N
>>>>> redundancy are Standby
>>>>>
>>>>>
>>>>>
>>>>> SC-1: opensaf SU1 and SG-A, SU1
>>>>>
>>>>> SC-2: opensaf SU2 and SG-A, SU2
>>>>>
>>>>> PL-3: SG-B, SU1
>>>>>
>>>>> PL-4: SG-B, SU2
>>>>>
>>>>> PL-5: SG-C, SU1
>>>>>
>>>>> PL-6: SG-C, SU2
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Initial conditions: All the SUs within a 2N redundancy are Active on
>>>>> Server-1.
>>>>>
>>>>>
>>>>>
>>>>> I issued a reboot of Server-1 around Oct 18 17:23. From the amfd logs
>>>>> of SC-1, it appears PL-5 began the failover around 17:23:19.
>>>>>
>>>>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >>
>>>>> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I waited to see if SG-C’s HA state would correct itself but it did not:
>>>>>
>>>>>
>>>>>
>>>>> [root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C
>>>>>
>>>>> Tue Oct 18 17:36:26 UTC 2016
>>>>>
>>>>> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>>>>
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>
>>>>> --
>>>>>
>>>>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C
>>>>>
>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>
>>>>> [root@sc-2 ~]#
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Even though the controller indicates that both SUs are standby, the
>>>>> logs from the original standby SG-C, SU2 show it going active:
>>>>>
>>>>>
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning
>>>>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE
>>>>> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>> 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF'
>>>>> by Impl-id: 253
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>> 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF'
>>>>> by Impl-id: 253
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>> 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF'
>>>>> by Impl-id: 253
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>> 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF'
>>>>> by Impl-id: 253
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>> 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C'
>>>>> by Impl-id: 253
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object
>>>>> 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by
>>>>> Impl-id: 253
>>>>>
>>>>> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0'
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
>>>>> message:51732
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
>>>>> message type 57 - ignoring
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS
>>>>> message:51733
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for
>>>>> message type 57 - ignoring
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node
>>>>> received
>>>>> for nodeId:2010f pid:1739
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer
>>>>> disconnected 253
>>>>> <0, 2010f(down)> (safAmfService)
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer
>>>>> disconnected 251
>>>>> <0, 2010f(down)> (safClmService)
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer
>>>>> disconnected 250
>>>>> <0, 2010f(down)> (safLogService)
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer
>>>>> disconnected 248
>>>>> <0, 2010f(down)> (@OpenSafImmPBE)
>>>>>
>>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer
>>>>> disconnected 249
>>>>> <0, 2010f(down)> (OsafImmPbeRt_B)
>>>>>
>>>>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE
>>>>> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>>
>>>>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned
>>>>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C'
>>>>>
>>>>> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1'
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com]
>>>>> *Sent:* Tuesday, October 18, 2016 8:30 AM
>>>>> *To:* David Hoyt <david.h...@genband.com>;
>>>>> opensaf-users@lists.sourceforge.net
>>>>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as
>>>>> STANDBY
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Please see response inline with [Praveen].
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>> On 17-Oct-16 11:46 PM, David Hoyt wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm encountering a scenario where opensaf shows the HA state of both
>>>>> SUs within a 2N redundancy Service Group as standby.
>>>>>> Setup:
>>>>>>
>>>>>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP
>>>>>>
>>>>>> - 2 controllers, 4 payloads
>>>>>>
>>>>>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active)
>>>>>>
>>>>>> - PL-3 & PL4 have SU1 & SU2 from SG-A (2N redundancy)
>>>>>>
>>>>>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy)
>>>>>>
>>>>>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5
>>>>>>
>>>>>> - Likewise, server-2 has SC-2, PL-4 and PL-6
>>>>>>
>>>>>> I reboot server 1 and shortly afterwards, the SG-A SUs begin to
>>>>>> fail over. SU2 on PL-4 goes active.
>>>>>> Around the same time, the opensaf 2N SUs fail over.
>>>>>> After the dust has settled, and server-1 comes back as well as the
>>>>>> VMs, all appears fine except the SG-A SUs. They both have a standby HA
>>>>>> state.
>>>>>>
>>>>>> Is there any way to correct this?
>>>>> [Praveen] I think there is no issue from the callback perspective, as
>>>>> SU2 on PL-4 was made active (the comps received callbacks). The only
>>>>> problem is the output of "amf-state siass".
>>>>> Please share AMFD traces from both controllers.
>>>>>
>>>>>
>>>>>
>>>>>> Is there some audit that periodically checks the validity of the HA
>>>>>> states?
>>>>>>
>>>>>> Now, when SG-A SU1 recovered, I did swact the SUs and it corrected the
>>>>>> HA state. However, if server-1 goes down for an extended period, the HA
>>>>>> state of SG-A SU2 will appear as Standby when it's actually running as
>>>>>> active.
>>>>>>
>>>>>>
>>>>>> Before the reboot:
>>>>>>
>>>>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1
> safSg=2N
>>>>>>
> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>>> --
>>>>>>
> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> [root@jenga-56-sysvm-1 ~]#
>>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>>>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> --
>>>>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>>> [root@sc-2 ~]#
>>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>>>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> --
>>>>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>>> [root@sc-2 ~]#
>>>>>>
>>>>>>
>>>>>>
>>>>>> After the reboot:
>>>>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1
> safSg=2N
>>>>>>
> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> --
>>>>>>
> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>>> [root@sc-2 ~]#
>>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A
>>>>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> --
>>>>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> [root@sc-2 ~]#
>>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B
>>>>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>>> saAmfSISUHAState=ACTIVE(1)
>>>>>> --
>>>>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B
>>>>>> saAmfSISUHAState=STANDBY(2)
>>>>>> [root@sc-2 ~]#
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>>
>>
>
