The problem reported in this ticket was raised as a comment on the #3028 patch, 
but it was not addressed and the patch was pushed. Excerpts below:
----------------------------------------------------------
I checked these two test cases without your fix, and the second case also 
occurs without this patch. So, can you fix the first case?

Thanks
-Nagu

-----Original Message-----
From: Nagendra Kumar 
Sent: 28 March 2013 18:47
To: Hans Feldt; [email protected]
Cc: [email protected]
Subject: Re: [devel] [PATCH 1 of 1] avsv: keep IMM jobs in queue in noactive 
state (#3028)

Hi,

The following test cases failed during swap and look serious:

1.      Act Cont going to Quiesced and Std went for reboot. It never resets the 
flag.
2.      Act Cont going to Quiesced and Std becomes Act and Quiesced controller 
reboots. The following commands fail on the newly Act controller:
[root@AMF_SC-2 avd]# /etc/init.d/opensafd status 
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=<Empty>
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
        saAmfSISUHAState=<Empty>
safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
        saAmfSISUHAState=<Empty>
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=<Empty>


[root@AMF_SC-2 avd]# amf-state su
safSu=SC-2,safSg=NoRed,safApp=OpenSAF
        saAmfSUAdminState=UNLOCKED(1)
        saAmfSUOperState=<Empty>
        saAmfSUPresenceState=<Empty>
        saAmfSUReadinessState=<Empty>
safSu=SC-2,safSg=2N,safApp=OpenSAF
        saAmfSUAdminState=UNLOCKED(1)
        saAmfSUOperState=<Empty>
        saAmfSUPresenceState=<Empty>
        saAmfSUReadinessState=<Empty>
safSu=SC-1,safSg=2N,safApp=OpenSAF
        saAmfSUAdminState=UNLOCKED(1)
        saAmfSUOperState=<Empty>
        saAmfSUPresenceState=<Empty>
        saAmfSUReadinessState=<Empty>

Also, when the Stdby controller tries to rejoin, it gets the error below:
Mar 28 18:38:30 AMF_SC-1 kernel: TIPC: Established link <1.1.1:eth0-1.1.2:eth0> 
on network plane A
Mar 28 18:38:30 AMF_SC-1 osafrded[23261]: NO rde@2020f has 
standby state => possible fail over, waiting...

Please check.


Thanks
-Nagu



---

** [tickets:#483] subsequent si-swap of controllers fail when the standby 
controller fails to become active during si-swap**

**Status:** assigned
**Created:** Wed Jul 03, 2013 06:23 AM UTC by Sirisha Alla
**Last Updated:** Thu Jul 18, 2013 11:22 AM UTC
**Owner:** Nagendra Kumar

The issue is seen with changeset 4325 on a 4-node SLES VM setup. IMM is loaded 
with 25k objects.

Controller switchovers are run continuously alongside continuous IMM 
operations. The following is observed in the syslogs during the 92nd switchover:

On SC-1:

Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfd[2478]: NO safSi=SC-2N,safApp=OpenSAF 
Swap initiated
Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfd[2478]: NO Controller switch over 
initiated
Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfnd[2488]: NO Assigning 
'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'

On SC-2:

Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: NO Assigning 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafntfimcnd[5014]: NO exiting on signal 15
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafimmd[2366]: WA IMMD not re-electing coord 
for switch-over (si-swap) coord at (2010f)
Jul  2 19:15:54 SLES-64BIT-SLOT2 osaflckd[2561]: CR pthread_create FAILED: 
Resource temporarily unavailable
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: NO 
'safComp=GLD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: ER 
safComp=GLD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafevtd[2529]: ER pthread create failed: 
Resource temporarily unavailable
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafimmnd[2376]: NO Implementer disconnected 
1114 <1709, 2020f> (@OpenSafImmReplicatorB)

As a result, SC-2, which was supposed to become Active, went for reboot.

On SC-1:

Jul  2 19:15:59 SLES-64BIT-SLOT1 osaffmd[2392]: NO Role: QUIESCED, Node Down 
for node id: 2020f
Jul  2 19:15:59 SLES-64BIT-SLOT1 osaffmd[2392]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId = 
131343, SupervisionTime = 60
Jul  2 19:15:59 SLES-64BIT-SLOT1 kernel: [ 4258.864161] TIPC: Resetting link 
<1.1.1:eth0-1.1.2:eth1>, peer not responding
Jul  2 19:15:59 SLES-64BIT-SLOT1 kernel: [ 4258.864170] TIPC: Lost link 
<1.1.1:eth0-1.1.2:eth1> on network plane A
Jul  2 19:15:59 SLES-64BIT-SLOT1 kernel: [ 4258.864176] TIPC: Lost contact with 
<1.1.2>
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safAmfNode=SC-2,safAmfCluster=myAmfCluster OperState ENABLED => DISABLED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safAmfNode=SC-2,safAmfCluster=myAmfCluster'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: NO Node 'SC-2' left the cluster
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=NoRed,safApp=OpenSAF OperState ENABLED => DISABLED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=NoRed,safApp=OpenSAF PresenceState INSTANTIATED => 
UNINSTANTIATED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=NoRed,safApp=OpenSAF ReadinessState IN_SERVICE => 
OUT_OF_SERVICE
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: ER Alarm lost for 
safSi=NoRed2,safApp=OpenSAF
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=2N,safApp=OpenSAF OperState ENABLED => DISABLED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=2N,safApp=OpenSAF PresenceState INSTANTIATED => UNINSTANTIATED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=2N,safApp=OpenSAF ReadinessState IN_SERVICE => OUT_OF_SERVICE
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: 
safNode=SC-2,safCluster=myClmCluster LEFT, init view=2, cluster view=5
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: ER clms_node_exit_ntf failed 2
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: ER saImmOiRtObjectUpdate 
FAILED 9, 'safNode=SC-2,safCluster=myClmCluster'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: ER saImmOiRtObjectUpdate 
FAILED 9, 'safCluster=myClmCluster'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafimmd[2401]: WA IMMND DOWN on active 
controller f2 detected at standby immd!! f1. Possible failover
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafimmnd[2411]: WA DISCARD DUPLICATE FEVS 
message:40047
.............
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmd[2401]: NO Coord re-elected, resides 
at 2010f
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO This IMMND re-elected 
coord redundantly, failover ?
Jul  2 19:16:00 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafrded[2383]: NO rde_rde_set_role: role set 
to 1
Jul  2 19:16:00 SLES-64BIT-SLOT1 osaflogd[2432]: NO ACTIVE request
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafclmd[2457]: NO ACTIVE request
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1123 (safSmfService) <302, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1124 (safMsgGrpService) <307, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1125 (safCheckPointService) <322, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1126 (safLogService) <4, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1127 (safLckService) <308, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1128 (safEvtService) <309, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1129 (MsgQueueService131599) <2028, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer locally 
disconnected. Marking it as doomed 1129 <2028, 2010f> (MsgQueueService131599)
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1130 (safClmService) <17, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer disconnected 
1129 <2028, 2010f> (MsgQueueService131599)
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafsmfd[2508]: NO Starting SmfCbkUtilThread
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafamfnd[2488]: NO Assigned 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'

Jul  2 19:16:26 SLES-64BIT-SLOT1 kernel: [ 4285.539563] TIPC: Established link 
<1.1.1:eth0-1.1.2:eth1> on network plane A

Jul  2 19:16:42 SLES-64BIT-SLOT1 osafamfd[2478]: NO Node 'SC-2' joined the 
cluster

SC-1 went from Quiesced back to Active and SC-2 became Standby. When the next 
switchover is issued, it fails with the following message in the syslog:

Jul  2 19:33:24 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:25 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:26 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:27 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:28 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
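
To check a collected syslog for this symptom, a small sketch such as the 
following could count the rejected swap requests. It assumes each syslog entry 
sits on a single line (the excerpts in this mail are wrapped), and the function 
name is only illustrative.

```python
import re

# Matches the amfd rejection message quoted above; the PID varies per run.
REJECTED = re.compile(
    r"osafamfd\[\d+\]: NO SI Swap not possible, "
    r"Controller role switch under progress"
)


def count_rejected_swaps(lines):
    """Return how many syslog lines report a rejected si-swap."""
    return sum(1 for line in lines if REJECTED.search(line))
```

For example, `count_rejected_swaps(open("/var/log/messages"))` would scan a 
local syslog file, assuming that is where syslog is written on the node.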

Traces of AMFD and AMFND are attached along with syslogs.

The same issue has been observed twice after 90 switchovers. It looks like a 
timing issue.
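
The symptom is consistent with the earlier comment that the flag is never 
reset when the standby reboots mid-swap. Below is a minimal Python sketch of 
that suspicion, assuming a single "switch in progress" flag guards si-swap. The 
class, method and flag names are hypothetical; this is a toy model, not OpenSAF 
code.

```python
class Controller:
    """Toy model of the active controller's swap handling (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.role = "ACTIVE"
        self.swap_in_progress = False  # assumed guard flag

    def request_si_swap(self):
        # A new swap is refused while an earlier one is considered ongoing.
        if self.swap_in_progress:
            return "SI Swap not possible, Controller role switch under progress"
        self.swap_in_progress = True
        self.role = "QUIESCED"
        return "Swap initiated"

    def on_peer_down(self, reset_flag):
        # The peer rebooted before taking over, so this node returns to ACTIVE.
        self.role = "ACTIVE"
        if reset_flag:
            self.swap_in_progress = False


def run(reset_flag):
    """Swap, lose the peer mid-swap, then try to swap again."""
    sc1 = Controller("SC-1")
    first = sc1.request_si_swap()
    sc1.on_peer_down(reset_flag)
    second = sc1.request_si_swap()
    return first, second
```

With `run(False)` the second request is rejected, matching the syslog above; 
with `run(True)` it is accepted. Whether the real fix belongs in the node-down 
path is an assumption that the attached traces would need to confirm.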


