Hi Dave, Please share the latest traces.
Thanks, Praveen On 25-Oct-16 9:09 PM, David Hoyt wrote: > Hi Praveen, > > > > Sorry, but the changes you suggested don’t seem to make a difference. > > > > I’m now looking at coming up with a manual procedure so that if we > perform maintenance on a server, the VMs running on that server should > be terminated first. That way, I’ll be able to control the VM > termination order and hopefully avoid all these failovers that happen at > the same time. > > > > We were looking at doing this controlled procedure anyways but I’m still > concerned about the uncontrolled scenario. It doesn’t appear to me that > opensaf can handle this and then afterwards, it requires manual recovery. > > > > Regards, > > David > > > > > > *From:*praveen malviya [mailto:praveen.malv...@oracle.com] > *Sent:* Monday, October 24, 2016 2:56 AM > *To:* David Hoyt <david.h...@genband.com> > *Cc:* opensaf-users@lists.sourceforge.net > *Subject:* Re: [users] both SUs within a 2N Service Group appear as STANDBY > > > > ------------------------------------------------------------------------ > > NOTICE: This email was received from an EXTERNAL sender > > ------------------------------------------------------------------------ > > > > Hi Dave, > > SC-2 became active in no time at "Oct 21 17:54:35.111631". But somehow > IMM got busy with something and continuously gave TRY_AGAIN until SC-2 > also went down at "Oct 21 17:54:34". When SC-2 came up and also the > payload hosting SU2 joined, SC-2 AMFD reassigned the active state and > updated it in IMM at: > > Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create > safSi=SG-C,safApp=SG-C > Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >> > rt_object_create_common > Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU > Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR > attr:saAmfSISUHAState > Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR > attr:saAmfSISUHAReadinessState > > But since SC-2 also went down, the case no longer remains the same. > Please figure out why IMMND got stuck. > Still, I see that a smaller patch is needed from the AMF perspective. I will > try to consider this case in #2009 (already published). Attached is > 1.patch, which is extracted out of 2009.patch. Please try with 1.patch on > top of #1540 and #1141. > > Thanks, > Praveen > > > Messages from SC-1 AMFD: > 1)Node failover of PL-5 started: > Oct 21 17:52:20.232746 osafamfd [2864:ndproc.cc:0923] >> > avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' > > 2)AMFD starts failover of 'safSu=SU1,safSg=SG-C,safApp=SG-C' hosted on PL-5.
> Oct 21 17:52:20.275267 osafamfd [2864:sg_2n_fsm.cc:3262] >> node_fail: > 'safSu=SU1,safSg=SG-C,safApp=SG-C', 0 > > and sends active to its peer: > Oct 21 17:52:20.275511 osafamfd [2864:sgproc.cc:2114] >> > avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1 > > 3)This time SC-1 AMFD itself was able to update in IMM: > Oct 21 17:52:20.363304 osafamfd [2864:imm.cc:0199] >> exec: Delete > safCSIComp=safComp=SG-C\,safSu=SU1\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C > Oct 21 17:52:20.364569 osafamfd [2864:mds_dt_trans.c:0576] >> > mdtm_process_poll_recv_data_tcp > Oct 21 17:52:20.364640 osafamfd [2864:imm.cc:0222] << exec > Oct 21 17:52:20.364651 osafamfd [2864:imm.cc:0334] << execute: 1 > Oct 21 17:52:20.364657 osafamfd [2864:imm.cc:0330] >> execute > Oct 21 17:52:20.364662 osafamfd [2864:imm.cc:0199] >> exec: Delete > safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C > Oct 21 17:52:20.366127 osafamfd [2864:mds_dt_trans.c:0576] >> > mdtm_process_poll_recv_data_tcp > Oct 21 17:52:20.366201 osafamfd [2864:imm.cc:0222] << exec > Oct 21 17:52:20.366210 osafamfd [2864:imm.cc:0334] << execute: 1 > > This means a user will not see at least the above two runtime objects using > the queries amf-state csiass and amf-state siass, respectively. > > 4) Down events of the directors on SC-1 started coming: > Oct 21 17:53:22.303749 osafamfd [2864:ntfa_mds.c:0369] TR NTFS down. > Oct 21 17:53:22.307098 osafamfd [2864:lga_mds.c:0491] TR LGS down > Oct 21 17:53:22.311434 osafamfd [2864:clma_mds.c:0959] TR CLMS down > Local IMMND down: > Oct 21 17:53:22.324784 osafamfd [2864:imma_mds.c:0404] T3 IMMND DOWN > > 5)SC-1 AMFD gets the response for safSu=SU2,safSg=SG-C,safApp=SG-C being > active: > Oct 21 17:54:19.465629 osafamfd [2864:sgproc.cc:0889] >> > avd_su_si_assign_evh: id:28, node:2060f, act:5, > 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0 > AMFD generated events for IMM and a notification for SU2 being active: > Oct 21 17:54:19.466208 osafamfd [2864:imm.cc:1560] >> > avd_saImmOiRtObjectUpdate: > 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' > saAmfSISUHAState > Oct 21 17:54:19.466217 osafamfd [2864:imm.cc:1580] << > avd_saImmOiRtObjectUpdate > Oct 21 17:54:19.466223 osafamfd [2864:imm.cc:1560] >> > avd_saImmOiRtObjectUpdate: > 'safCSIComp=safComp=SG-C\,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' > > saAmfCSICompHAState > Oct 21 17:54:19.466231 osafamfd [2864:imm.cc:1580] << > avd_saImmOiRtObjectUpdate > Oct 21 17:54:19.466236 osafamfd [2864:siass.cc:0446] >> > avd_gen_su_ha_state_changed_ntf: 'safSi=SG-C,safApp=SG-C' assigned to > 'safSu=SU2,safSg=SG-C,safApp=SG-C' HA state 'ACTIVE' > > Since NTFS is down, the notification is not sent: > Oct 21 17:54:19.466522 osafamfd [2864:ntf.cc:0717] ER > sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6) > > 6)SC-1 AMFD could not update the new states of SU2 in IMM (local IMMND is down): > Oct 21 17:54:19.469032 osafamfd [2864:imm.cc:0153] >> exec: Update > 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' > saAmfSISUHAState > Oct 21 17:54:19.469041 osafamfd [2864:imma_oi_api.c:2417] >> > rt_object_update_common > Oct 21 17:54:19.469048 osafamfd [2864:imma_oi_api.c:2435] T2 > ERR_TRY_AGAIN: IMMND is DOWN > Oct 21 17:54:19.469054 osafamfd [2864:imm.cc:0167] TR TRY-AGAIN > Oct 21 17:54:19.469058 osafamfd [2864:imm.cc:0182] << exec > Oct 21 17:54:19.469063 osafamfd [2864:imm.cc:0334] << execute: 2 > Oct 21 17:54:19.969631 osafamfd [2864:imm.cc:0330] >>
execute > Oct 21 17:54:19.969651 osafamfd [2864:imm.cc:0153] >> exec: Update > 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' > saAmfSISUHAState > Oct 21 17:54:19.969691 osafamfd [2864:imma_oi_api.c:2417] >> > rt_object_update_common > Oct 21 17:54:19.969698 osafamfd [2864:imma_oi_api.c:2435] T2 > ERR_TRY_AGAIN: IMMND is DOWN > > 7)SC-1 goes down: > Oct 21 17:54:34.706944 osafamfd [2864:ndfsm.cc:0324] >> > avd_mds_avnd_down_evh: 2010f, 0x1a2ce20 > > > > Messages from SC-2 AMFD: > 1)Failover successful: > Oct 21 17:54:35.091478 osafamfd [2673:role.cc:0075] >> > avd_role_change_evh: cause=1, role=1 > Oct 21 17:54:35.091506 osafamfd [2673:role.cc:0254] >> avd_role_failover > Oct 21 17:54:35.091537 osafamfd [2673:role.cc:0255] NO FAILOVER StandBy > --> Active > Oct 21 17:54:35.091588 osafamfd [2673:mds_dt_trans.c:0576] >> > mdtm_process_poll_recv_dat > > .... > Oct 21 17:54:35.111631 osafamfd [2673:role.cc:0364] NO FAILOVER StandBy > --> Active DONE! > Oct 21 17:54:35.111637 osafamfd [2673:role.cc:0373] << avd_role_failover > Oct 21 17:54:35.111643 osafamfd [2673:role.cc:0153] << > avd_role_change_evh: 1 > > 2)After some updates to IMM, AMF is getting TRY_AGAIN: > > Oct 21 17:54:35.137208 osafamfd [2673:imma_oi_api.c:2417] >> > rt_object_update_common > Oct 21 17:54:35.137472 osafamfd [2673:mds_dt_trans.c:0576] >> > mdtm_process_poll_recv_data_tcp > Oct 21 17:54:35.137514 osafamfd [2673:mds_dt_trans.c:0576] >> > mdtm_process_poll_recv_data_tcp > Oct 21 17:54:35.137543 osafamfd [2673:imma_proc.c:1346] TR ** Event type:12 > Oct 21 17:54:35.137550 osafamfd [2673:imma_proc.c:0752] >> > imma_proc_rt_attr_update > Oct 21 17:54:35.137556 osafamfd [2673:imma_proc.c:0792] TR Posting RT > UPDATE CALLBACK to IPC mailbox > Oct 21 17:54:35.137564 osafamfd [2673:imma_proc.c:0798] << > imma_proc_rt_attr_update > Oct 21 17:54:35.137569 osafamfd [2673:imma_proc.c:1239] >> > imma_proc_free_pointers > Oct 21 17:54:35.137573 osafamfd [2673:imma_proc.c:1332] << > imma_proc_free_pointers > Oct 21 17:54:35.137594 osafamfd [2673:imma_oi_api.c:2690] << > rt_object_update_common > Oct 21 17:54:35.137601 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN > Oct 21 17:54:35.137606 osafamfd [2673:imm.cc:0182] << exec > Oct 21 17:54:35.137611 osafamfd [2673:imm.cc:0334] << execute: 2 > > 3)Node failover of PL-3: > Oct 21 17:55:29.917633 osafamfd [2673:ndproc.cc:0923] >> > avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster' > Oct 21 17:55:29.922470 osafamfd [2673:ndproc.cc:0929] << avd_node_failover > > 4)TRY_AGAIN continues till: > Oct 21 17:56:33.996617 osafamfd [2673:imm.cc:0167] TR TRY-AGAIN > > 5)Down event for directors: > Oct 21 17:56:34.278064 osafamfd [2673:ntfa_mds.c:0369] TR NTFS down. > Oct 21 17:56:34.278561 osafamfd [2673:clma_mds.c:0959] TR CLMS down. 
> Oct 21 17:56:34.282331 osafamfd [2673:imma_mds.c:0404] T3 IMMND DOWN > > 6)Now AMFD sees down immnd down: > Oct 21 17:56:34.284237 osafamfd [2673:imma_oi_api.c:2417] >> > rt_object_update_common > Oct 21 17:56:34.284243 osafamfd [2673:imma_oi_api.c:2435] T2 > ERR_TRY_AGAIN: IMMND is DOWN > Oct 21 17:56:34.284248 osafamfd [2673:imm.cc:0167] TR TRY-AGA > > 7)Finally SC-2 goes down and again comes as active: > Oct 21 17:57:47.017327 osafamfd [2673:ndfsm.cc:0324] >> > avd_mds_avnd_down_evh: 2020f, 0x265ceb0 > > Oct 21 17:58:01.000668 osafamfd [4070:role.cc:0175] >> > avd_active_role_initialization > > 8)After Payload hosting safSu=SU2,safSg=SG-C,safApp=SG-C comes up this > SU is chosen for active assignments: > Oct 21 18:00:23.108237 osafamfd [4070:siass.cc:0148] >> avd_susi_create: > safSu=SU2,safSg=SG-C,safApp=SG-C safSi=SG-C,safApp=SG-C state=1 > > 9)Now it gets updated in IMM: > Oct 21 18:00:23.113831 osafamfd [4070:imm.cc:0089] >> exec: Create > safSi=SG-C,safApp=SG-C > Oct 21 18:00:23.113836 osafamfd [4070:imma_oi_api.c:2757] >> > rt_object_create_common > Oct 21 18:00:23.113842 osafamfd [4070:imma_oi_api.c:2863] TR attr:safSISU > Oct 21 18:00:23.113848 osafamfd [4070:imma_oi_api.c:2863] TR > attr:saAmfSISUHAState > Oct 21 18:00:23.113854 osafamfd [4070:imma_oi_api.c:2863] TR > attr:saAmfSISUHAReadinessState > Oct 21 18:00:23.114506 osafamfd [4070:mds_dt_trans.c:0576] >> > mdtm_process_poll_recv_data_tcp > > > On 22-Oct-16 1:33 AM, David Hoyt wrote: >> Hi Praveen, >> >> >> >> First off, thank-you for all your help!! >> >> >> >> After applying the code changes you suggested, it did work but it took a >> long time (over 10 minutes) and not without side effects. >> >> Attached are the amfd logs from each controller. >> >> >> >> >> >> Here’s a summary: >> >> · all 2N service group SUs are Active on Server-1, Standby on >> Server-2 >> >> · rebooted Server-1 >> >> · PL-5, running with SG-C,SU1 is the first to begin its shutdown >> at 17:51:57 >> >> Oct 21 17:51:57 SG-C-0 opensafd: Stopping OpenSAF Services >> >> >> >> · SG-C, SU2 begins to go active at 17:52:20. The startup script >> returns success at 17:54:19 >> >> Oct 21 17:52:20 SG-C-1 osafamfnd[19830]: NO Assigning >> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C' >> >> Oct 21 17:52:20 SG-C-1 osafdtmd[19770]: NO Lost contact with 'SG-C-0' >> >> Oct 21 17:54:19 SG-C-1 osafamfnd[19830]: NO Assigned >> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C' >> >> Oct 21 17:54:35 SG-C-1 osafdtmd[19770]: NO Lost contact with 'sc-1' >> >> >> >> · I continued to see it with a Standby HA state for almost >> another 2 minutes >> >> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i sg-c >> >> Fri Oct 21 17:56:16 UTC 2016 >> >> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C >> >> saAmfSISUHAState=STANDBY(2) >> >> [root@sc-2 ~]# >> >> >> >> · The same is true for the 2N redundancy OpenSAF SUs. I saw >> opensaf detect loss of mate at 17:53:22, initiate failover, and logged >> that it was active at 17:54:35. >> >> Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMD lost contact with peer IMMD >> (NCSMDS_RED_DOWN) >> >> Oct 21 17:53:22 sc-2 osafimmd[2594]: WA IMMND DOWN on active controller >> f1 detected at standby immd!! f2. Possible failover >> >> ... >> >> Oct 21 17:54:35 sc-2 osafdtmd[2553]: NO Lost contact with 'sc-1' >> >> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Node Down event for node id 2010f: >> >> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Current role: STANDBY >> >> ... 
>> >> Oct 21 17:54:35 sc-2 osaffmd[2582]: NO Controller Failover: Setting role >> to ACTIVE >> >> Oct 21 17:54:35 sc-2 osafrded[2570]: NO RDE role set to ACTIVE >> >> Oct 21 17:54:35 sc-2 osaflogd[2619]: NO ACTIVE request >> >> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active >> >> Oct 21 17:54:35 sc-2 osafntfd[2634]: NO ACTIVE request >> >> Oct 21 17:54:35 sc-2 osafclmd[2647]: NO ACTIVE request >> >> Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ACTIVE request >> >> Oct 21 17:54:35 sc-2 osafimmd[2594]: NO ellect_coord invoke from >> rda_callback ACTIVE >> >> ... >> >> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO Node 'SC-1' left the cluster >> >> Oct 21 17:54:35 sc-2 osafamfd[2673]: NO FAILOVER StandBy --> Active DONE! >> >> >> >> · Again, querying the HA state of opensaf did not reflect this >> until 17:58:12 >> >> [root@sc2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2 >> safSg=2N >> >> Fri Oct 21 17:56:31 UTC 2016 >> >> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >> >> saAmfSISUHAState=ACTIVE(1) >> >> -- >> >> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >> >> saAmfSISUHAState=STANDBY(2) >> >> [root@sc-2 ~]# >> >> … >> >> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2 >> safSg=2N >> >> Fri Oct 21 17:58:12 UTC 2016 >> >> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >> >> saAmfSISUHAState=ACTIVE(1) >> >> [root@sc-2 ~]# >> >> >> >> · Also, as a heads up, opensaf on SC-2 was restarted at 17:55:53. >> >> I have a script running on both controller nodes. It constantly pings >> the mate controller and if the ping fails, then a counter is started. >> When the count reaches the threshold, the HA state of the opensaf SU on >> the current node is checked. If it’s HA state is standby, then the >> script will initiate an opensaf restart. >> >> Basically, it’s a backup mechanism if the detection of the loss of the >> mate controller takes too long. I found this code was hit today at >> 17:54:53. >> >> Oct 21 17:54:53 sc-2 wait_for_mate_connection: Cannot ping mate controller >> >> Oct 21 17:54:53 sc-2 wait_for_mate_connection: threshold_counter: '1' >> >> ... >> >> Oct 21 17:55:53 sc-2 wait_for_mate_connection: threshold_counter: '15' >> >> Oct 21 17:55:53 sc-2 mate_down_handler: Mate connection down threshold >> exceeded - analyzing data... >> >> Oct 21 17:55:53 sc-2 mate_down_handler: Checking opensaf SU status... >> >> Oct 21 17:55:53 sc-2 mate_down_handler: opensaf SU is Standby, no mate >> available >> >> Oct 21 17:55:54 sc-2 mate_down_handler: Restarting opensaf >> >> Oct 21 17:55:54 sc-2 mate_down_handler: opensaf restart has been >> initiated... >> >> >> >> By 18:05:29, All the HA states were correct. 
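[For reference, below is a minimal sketch, in Python, of the mate-down watchdog David describes above. It is not his actual script: the mate hostname, the local SU DN, the threshold, the poll interval and the restart command are assumptions made to keep the sketch self-contained, and the HA-state check parses the same 'amf-state siass' output format quoted in this thread.]

#!/usr/bin/env python3
# Sketch of a mate-down watchdog: ping the mate controller, and if it stays
# unreachable past a threshold while the local OpenSAF SU is still standby,
# restart OpenSAF locally. Names and commands below are assumptions.
import subprocess
import time

MATE_HOST = "sc-1"                                    # assumed mate controller hostname
LOCAL_SU = r"safSu=SC-2\,safSg=2N\,safApp=OpenSAF"    # assumed local SU DN, escaped as in amf-state output
THRESHOLD = 15                                        # consecutive failed pings before acting
POLL_SECONDS = 4

def mate_reachable():
    # Single ICMP echo with a short timeout; a non-zero exit code means no reply.
    return subprocess.call(["ping", "-c", "1", "-W", "2", MATE_HOST],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

def local_su_is_standby():
    # Find the local SU's assignment in 'amf-state siass' output: a 'safSISU=...'
    # line followed by its 'saAmfSISUHAState=...' line, as shown in this thread.
    lines = subprocess.check_output(["amf-state", "siass"]).decode().splitlines()
    for dn, state in zip(lines, lines[1:]):
        if LOCAL_SU in dn and "saAmfSISUHAState" in state:
            return "STANDBY" in state
    return False

def main():
    failures = 0
    while True:
        if mate_reachable():
            failures = 0
        else:
            failures += 1
            if failures >= THRESHOLD and local_su_is_standby():
                # Mate unreachable and we are still standby: restart OpenSAF locally.
                # The restart command is an assumption (init-script based installation).
                subprocess.call(["/etc/init.d/opensafd", "restart"])
                failures = 0
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()

[Checking the local SU's HA state before restarting is the guard David describes: the node only restarts OpenSAF if it is still standby after the mate has been unreachable past the threshold.]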
>> >> >> >> [root@sc-2 ~]# date; amf-state siass | grep -A 1 -i opensaf | grep -A 2 >> safSg=2N >> >> Fri Oct 21 18:05:29 UTC 2016 >> >> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >> >> saAmfSISUHAState=STANDBY(2) >> >> -- >> >> -- >> >> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >> >> saAmfSISUHAState=ACTIVE(1) >> >> [root@sc-2 ~]# >> >> >> >> >> >> [root@sc-2 ~]# date; amf-state siass | grep -A 1 SG-C >> >> Fri Oct 21 18:05:46 UTC 2016 >> >> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C >> >> saAmfSISUHAState=STANDBY(2) >> >> -- >> >> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C >> >> saAmfSISUHAState=ACTIVE(1) >> >> [root@sc-2 ~]# >> >> >> >> >> >> Thanks, >> >> /David/ >> >> >> >> >> >> *From:*praveen malviya [mailto:praveen.malv...@oracle.com] >> *Sent:* Friday, October 21, 2016 10:38 AM >> *To:* David Hoyt <david.h...@genband.com <mailto:david.h...@genband.com>> >> *Subject:* Re: [users] both SUs within a 2N Service Group appear as > STANDBY >> >> >> >> ------------------------------------------------------------------------ >> >> NOTICE: This email was received from an EXTERNAL sender >> >> ------------------------------------------------------------------------ >> >> >> One more related ticket need to be included: >> >> changeset: 7029:229ff887c7c8 >> branch: opensaf-4.6.x >> parent: 7025:677617e23ea3 >> user: praveen.malv...@oracle.com <mailto:praveen.malv...@oracle.com> > <mailto:praveen.malv...@oracle.com> >> date: Thu Oct 22 09:57:18 2015 +0530 >> summary: amfd: fix deletion of runtime objects of class >> SaAmfSIAssignment [#1540] >> >> Thanks, >> Praveen >> On 21-Oct-16 12:06 PM, praveen malviya wrote: >>> Hi Dave, >>> >>> In ticket #1141, there are two patches in 4.6 with changesets: >>> >>> changeset: 6839:964e043fa545 >>> branch: opensaf-4.6.x >>> parent: 6836:39806387b4c3 >>> user: praveen.malv...@oracle.com <mailto:praveen.malv...@oracle.com> > <mailto:praveen.malv...@oracle.com> >>> date: Thu Sep 17 10:33:32 2015 +0530 >>> summary: amfd: maintain runtime updates for su, comp, si and csi at >>> standby [#1141] >>> >>> and >>> changeset: 6896:18b4458e357a >>> branch: opensaf-4.6.x >>> tag: tip >>> parent: 6886:22d7d85287e6 >>> user: praveen.malv...@oracle.com <mailto:praveen.malv...@oracle.com> > <mailto:praveen.malv...@oracle.com> >>> date: Mon Sep 28 11:51:34 2015 +0530 >>> summary: amfd: include comp and su oper state at standby amfd[#1141] >>> >>> Please share AMFD traces if you are observing after applying above two >>> patches. >>> >>> Thanks, >>> Praveen >>> On 21-Oct-16 4:24 AM, David Hoyt wrote: >>>> Hi Praveen, >>>> >>>> >>>> >>>> Since I’m using OpenSAF 4.6.0, I did not have the fix for #1141. >>>> >>>> It looks like it was included as part of the 4.6.1 maintenance release. >>>> >>>> >>>> >>>> So, I looked at the file diff for #1141, made the same code changes, >>>> rebuilt the rpms, and replaced them in my lab. >>>> >>>> However, I’m still seeing the same behavior. >>>> >>>> >>>> >>>> Are there other changes I should look at adding? 
>>>> >>>> >>>> Thanks, >>>> >>>> Dave >>>> >>>> >>>> >>>> >>>> >>>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com] >>>> *Sent:* Wednesday, October 19, 2016 4:30 AM >>>> *To:* David Hoyt <david.h...@genband.com>; >>>> opensaf-users@lists.sourceforge.net >>>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as >> STANDBY >>>> >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> NOTICE: This email was received from an EXTERNAL sender >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> >>>> Analysis: >>>> Active controller SC-1 successfully performed fail-over of SU1 of SG-C >>>> by making SU2 active. But AMFD could not update this new HA state of SU2 >>>> in IMM as IMMND was already down by that time. AMFD at SC-1 got >>>> TRY_AGAIN from IMM. >>>> >>>> Since this is the OpenSAF 4.6 GA release, it does not include the fix of #1141. >>>> Ticket #1141 was pushed in the 4.6 branch after 4.6 GA. In the fix of #1141, >>>> the standby AMFD updates the HA state after becoming active. >>>> >>>> Also please include the fix of #2009 when available (which will be on >>>> top of #1141.) >>>> >>>> >>>> Messages from the AMFD trace of SC-1 which confirm this: >>>> ------------------------------------------------- >>>> 1)Active AMFD at SC-1 starts failover of Node PL-5 at: >>>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> >>>> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' >>>> >>>> 2)It sends active assignment to SU2: >>>> Oct 18 17:23:19.986349 osafamfd [1826:sgproc.cc:2114] >> >>>> avd_sg_su_si_mod_snd: 'safSu=SU2,safSg=SG-C,safApp=SG-C', state 1 >>>> >>>> 3)It seems the local IMMND (also LGS and NTFS) went down before AMFD got a >>>> successful response: >>>> Oct 18 17:24:24.075904 osafamfd [1826:imma_mds.c:0404] T3 IMMND DOWN >>>> >>>> 4)AMFD at SC-1 also gets the response for this successful assignment: >>>> Oct 18 17:25:19.948061 osafamfd [1826:sgproc.cc:0889] >> >>>> avd_su_si_assign_evh: id:33, node:2060f, act:5, >>>> >>>> 'safSu=SU2,safSg=SG-C,safApp=SG-C', '', ha:1, err:1, single:0 >>>> AMFD adds the IMM update in the job queue: >>>> Oct 18 17:25:19.948758 osafamfd [1826:imm.cc:1506] >> >>>> avd_saImmOiRtObjectUpdate: 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG- >>>> >>>> C,safSi=SG-C,safApp=SG-C' saAmfSISUHAState >>>> Oct 18 17:25:19.948781 osafamfd [1826:imm.cc:1525] << >>>> avd_saImmOiRtObjectUpdate >>>> Oct 18 17:25:19.948787 osafamfd [1826:imm.cc:1506] >> >>>> avd_saImmOiRtObjectUpdate: 'safCSIComp=safComp=SG-C >>>> >>>> \,safSu=SU2\,safSg=SG-C\,safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' >>>> saAmfCSICompHAState >>>> Oct 18 17:25:19.948808 osafamfd [1826:imm.cc:1525] << >>>> avd_saImmOiRtObjectUpdate >>>> >>>> 5)But AMFD at SC-1 could not update the HA state in IMM: >>>> Oct 18 17:25:19.951838 osafamfd [1826:imm.cc:0151] >> exec: Update >>>> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' >>>> saAmfSISUHAState >>>> Oct 18 17:25:19.951861 osafamfd [1826:imma_oi_api.c:2417] >> >>>> rt_object_update_common >>>> Oct 18 17:25:19.951866 osafamfd [1826:imma_oi_api.c:2435] T2 >>>> ERR_TRY_AGAIN: IMMND is DOWN >>>> Oct 18 17:25:19.951874 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN >>>> Oct 18 17:25:19.951877 osafamfd [1826:imm.cc:0180] << exec >>>> Oct 18 17:25:19.951881 osafamfd [1826:imm.cc:0316] << execute: 2 >>>>
Oct 18 17:25:20.452499 osafamfd [1826:imm.cc:0312] >> execute >>>> Oct 18 17:25:20.452519 osafamfd [1826:imm.cc:0151] >> exec: Update >>>> 'safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C' >>>> saAmfSISUHAState >>>> Oct 18 17:25:20.452526 osafamfd [1826:imma_oi_api.c:2417] >> >>>> rt_object_update_common >>>> Oct 18 17:25:20.452531 osafamfd [1826:imma_oi_api.c:2435] T2 >>>> ERR_TRY_AGAIN: IMMND is DOWN >>>> Oct 18 17:25:20.452536 osafamfd [1826:imm.cc:0165] TR TRY-AGAIN >>>> Oct 18 17:25:20.452540 osafamfd [1826:imm.cc:0180] << exec >>>> Oct 18 17:25:20.452545 osafamfd [1826:imm.cc:0316] << execute: 2 >>>> ---------------------------------------------------------------- >>>> >>>> >>>> Thanks, >>>> Praveen >>>> >>>> >>>> On 19-Oct-16 12:28 AM, David Hoyt wrote: >>>>> Attached are the amfd logs from both controllers. >>>>> >>>>> >>>>> >>>>> Server-1: has 3 VMs for SC-1, PL-3 and PL-5. All the SUs within a 2N >>>>> redundancy are Active. >>>>> >>>>> Server-2: has 3 VMs for SC-2, PL-4 and PL-6. All the SUs within a 2N >>>>> redundancy are Standby >>>>> >>>>> >>>>> >>>>> SC-1: opensaf SU1 and SG-A, SU1 >>>>> >>>>> SC-2: opensaf SU2 and SG-A, SU2 >>>>> >>>>> PL-3: SG-B, SU1 >>>>> >>>>> PL-4: SG-B, SU2 >>>>> >>>>> PL-5: SG-C, SU1 >>>>> >>>>> PL-6: SG-C, SU2 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Initial conditions: All the SUs within a 2N redundancy are Active on >>>>> Server-1. >>>>> >>>>> >>>>> >>>>> I issued a reboot of Server-1 around Oct 18 17:23. From the amfd > logs of >>>>> SC-1, it appears PL-5 began the failover around 17:23:19. >>>>> >>>>> Oct 18 17:23:19.931170 osafamfd [1826:ndproc.cc:0923] >> >>>>> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> I waited to see if SG-C’s HA state would correct itself but it did not: >>>>> >>>>> >>>>> >>>>> [root@sc-2 ~]# date ; amf-state siass | grep -A 1 SG-C >>>>> >>>>> Tue Oct 18 17:36:26 UTC 2016 >>>>> >>>>> safSISU=safSu=SU1\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C >>>>> >>>>> saAmfSISUHAState=STANDBY(2) >>>>> >>>>> -- >>>>> >>>>> safSISU=safSu=SU2\,safSg=SG-C\,safApp=SG-C,safSi=SG-C,safApp=SG-C >>>>> >>>>> saAmfSISUHAState=STANDBY(2) >>>>> >>>>> [root@sc-2 ~]# >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Even though the controller indicates that both SUs are standby, the > logs >>>>> from the original standby SG-C, SU2 shows it going active: >>>>> >>>>> >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: NO Assigning >>>>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C' >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafamfnd[26392]: IN Assigning 'all CSIs' ACTIVE >>>>> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C' >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object >>>>> >>>> >> > 'safCSIComp=safComp=AMFWDOG\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=AMFWDOG,safSi=NoRed2,safApp=OpenSAF' >>>>> by Impl-id: 253 >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object >>>>> >>>> >> > 'safCSIComp=safComp=IMMND\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=IMMND,safSi=NoRed2,safApp=OpenSAF' >>>>> by Impl-id: 253 >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object >>>>> >>>> >> > 'safCSIComp=safComp=CLMNA\#safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safCsi=CLMNA,safSi=NoRed2,safApp=OpenSAF' >>>>> by Impl-id: 253 >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object >>>>> >>>> >> > 'safSISU=safSu=PL-5\#safSg=NoRed\#safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF' >>>>> by 
Impl-id: 253 >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object >>>>> >>>> >> > 'safCSIComp=safComp=SG-C\#safSu=SU1\#safSg=SG-C\#safApp=SG-C,safCsi=SG-C,safSi=SG-C,safApp=SG-C' >>>>> by Impl-id: 253 >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafimmnd[26354]: IN Delete runtime object >>>>> 'safSISU=safSu=SU1\#safSg=SG-C\#safApp=SG-C,safSi=SG-C,safApp=SG-C' by >>>>> Impl-id: 253 >>>>> >>>>> Oct 18 17:23:20 SG-C-1 osafdtmd[26332]: NO Lost contact with 'SG-C-0' >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS >>>>> message:51732 >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for >>>>> message type 57 - ignoring >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA DISCARD DUPLICATE FEVS >>>>> message:51733 >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: WA Error code 2 returned for >>>>> message type 57 - ignoring >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Global discard node > received >>>>> for nodeId:2010f pid:1739 >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer > disconnected 253 >>>>> <0, 2010f(down)> (safAmfService) >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer > disconnected 251 >>>>> <0, 2010f(down)> (safClmService) >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer > disconnected 250 >>>>> <0, 2010f(down)> (safLogService) >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer > disconnected 248 >>>>> <0, 2010f(down)> (@OpenSafImmPBE) >>>>> >>>>> Oct 18 17:24:24 SG-C-1 osafimmnd[26354]: NO Implementer > disconnected 249 >>>>> <0, 2010f(down)> (OsafImmPbeRt_B) >>>>> >>>>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: IN Assigned 'all CSIs' ACTIVE >>>>> to 'safComp=SG-C,safSu=SU2,safSg=SG-C,safApp=SG-C' >>>>> >>>>> Oct 18 17:25:19 SG-C-1 osafamfnd[26392]: NO Assigned >>>>> 'safSi=SG-C,safApp=SG-C' ACTIVE to 'safSu=SU2,safSg=SG-C,safApp=SG-C' >>>>> >>>>> Oct 18 17:25:37 SG-C-1 osafdtmd[26332]: NO Lost contact with 'sc-1' >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Dave >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> *From:*praveen malviya [mailto:praveen.malv...@oracle.com] >>>>> *Sent:* Tuesday, October 18, 2016 8:30 AM >>>>> *To:* David Hoyt <david.h...@genband.com > <mailto:david.h...@genband.com%0b>> <mailto:david.h...@genband.com > <mailto:david.h...@genband.com%0b>> > <mailto:david.h...@genband.com%20%3cmailto:david.h...@genband.com>>>; >>>>> opensaf-users@lists.sourceforge.net > <mailto:opensaf-users@lists.sourceforge.net> >> <mailto:opensaf-users@lists.sourceforge.net> >>>> <mailto:opensaf-users@lists.sourceforge.net> >>>>> *Subject:* Re: [users] both SUs within a 2N Service Group appear as >>>> STANDBY >>>>> >>>>> >>>>> >>>>> > ------------------------------------------------------------------------ >>>>> >>>>> NOTICE: This email was received from an EXTERNAL sender >>>>> >>>>> > ------------------------------------------------------------------------ >>>>> >>>>> >>>>> Please see response inline with [Praveen]. >>>>> >>>>> Thanks, >>>>> Praveen >>>>> >>>>> On 17-Oct-16 11:46 PM, David Hoyt wrote: >>>>>> Hi all, >>>>>> >>>>>> I'm encountering a scenario where opensaf shows the HA state of both >>>>> SUs within a 2N redundancy Service Group as standby. 
>>>>>> Setup: >>>>>> >>>>>> - Opensaf 4.6 running on RHEL 6.6 VMs with TCP >>>>>> >>>>>> - 2 controllers, 4 payloads >>>>>> >>>>>> - SC-1 & SC-2 are the VMs with the controller nodes (SC-1 is active) >>>>>> >>>>>> - PL-3 & PL-4 have SU1 & SU2 from SG-A (2N redundancy) >>>>>> >>>>>> - PL-5 & PL-6 have SU1 & SU2 from SG-B (2N redundancy) >>>>>> >>>>>> - Server-1 has three VMs consisting of SC-1, PL-3 and PL-5 >>>>>> >>>>>> - Likewise, server-2 has SC-2, PL-4 and PL-6 >>>>>> >>>>>> I reboot server-1 and shortly afterwards, the SG-A SUs begin to >>>>> failover. SU2 on PL-4 goes active. >>>>>> Around the same time, the opensaf 2N SUs failover. >>>>>> After the dust has settled, and server-1 comes back as well as the >>>>> VMs, all appears fine except the SG-A SUs. They both have a standby HA >>>>> state. >>>>>> >>>>>> Is there any way to correct this? >>>>> [Praveen] I think there is no issue from the callback perspective, as >>>>> SU2 on PL-4 was made active (comps received callbacks). The only problem is >>>>> the output of "amf-state siass". >>>>> Please share AMFD traces from both the controllers. >>>>> >>>>> >>>>> >>>>>> Is there some audit that periodically checks the validity of the HA >>>>> states? >>>>>> >>>>>> Now, when SG-A, SU1 recovers, I did swact the SUs and it corrected the >>>>> HA state. However, if server-1 goes down for an extended period, the HA >>>>> state of SG-A, SU2 will appear as Standby, when it's actually > running as >>>>> active. >>>>>> >>>>>> >>>>>> Before the reboot: >>>>>> >>>>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 > safSg=2N >>>>>> > safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> -- >>>>>> > safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> [root@jenga-56-sysvm-1 ~]# >>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A >>>>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> -- >>>>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> [root@sc-2 ~]# >>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B >>>>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> -- >>>>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> [root@sc-2 ~]# >>>>>> >>>>>> >>>>>> >>>>>> After the reboot: >>>>>> [root@sc-2 ~]# amf-state siass | grep -A 2 OpenSAF | grep -A 1 > safSg=2N >>>>>> > safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> -- >>>>>> > safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> [root@sc-2 ~]# >>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-A >>>>>> safSISU=safSu=SU1\,safSg=SG-A\,safApp=SG-A,safSi=SG-A,safApp=SG-A >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> -- >>>>>> safSISU=safSu=SU2\,safSg=SG-A\,safApp=DVN,safSi=SG-A,safApp=SG-A >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> [root@sc-2 ~]# >>>>>> [root@sc-2 ~]# amf-state siass | grep -A 1 SG-B >>>>>> safSISU=safSu=SU2\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B >>>>>> saAmfSISUHAState=ACTIVE(1) >>>>>> -- >>>>>> safSISU=safSu=SU1\,safSg=SG-B\,safApp=SG-B,safSi=SG-B,safApp=SG-B >>>>>> saAmfSISUHAState=STANDBY(2) >>>>>> [root@sc-2 ~]# >>>>>> >>>>>> >>>>> >>>> >> >
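[David's question above about an audit that periodically checks the validity of the HA states is not answered in this thread. Purely as an illustration, a small external check along the following lines could flag the reported symptom (an SI whose assignments all read STANDBY) from the 'amf-state siass' output shown above; the two-line safSISU / saAmfSISUHAState parsing is an assumption based on the outputs quoted here.]

#!/usr/bin/env python3
# Sketch: flag service instances whose SU assignments are all STANDBY,
# i.e. the double-standby symptom reported in this thread. Assumes the
# two-line "safSISU=... / saAmfSISUHAState=..." layout of 'amf-state siass'.
import subprocess
from collections import defaultdict

def read_assignments():
    lines = [l.strip() for l in
             subprocess.check_output(["amf-state", "siass"]).decode().splitlines()
             if l.strip()]
    assignments = defaultdict(list)          # SI DN -> list of HA states
    for dn, state in zip(lines, lines[1:]):
        if dn.startswith("safSISU=") and state.startswith("saAmfSISUHAState="):
            # The SU part of the DN uses escaped commas ("\,"), so the first
            # plain ",safSi=" marks where the SI's own DN begins.
            idx = dn.find(",safSi=")
            si = dn[idx + 1:] if idx != -1 else dn
            assignments[si].append(state.split("=", 1)[1])   # e.g. "STANDBY(2)"
    return assignments

def main():
    for si, states in sorted(read_assignments().items()):
        if len(states) > 1 and all(s.startswith("STANDBY") for s in states):
            print("WARNING: %s reports only STANDBY assignments: %s" % (si, ", ".join(states)))

if __name__ == "__main__":
    main()

[Run periodically, such a check would only detect the stale IMM view discussed in this thread; it would not repair it.]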
_______________________________________________ Opensaf-users mailing list Opensaf-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-users