---

**[tickets:#483] Subsequent si-swap of controllers fails when the standby controller fails to become active during si-swap**

**Status:** unassigned
**Created:** Wed Jul 03, 2013 06:23 AM UTC by Sirisha Alla
**Last Updated:** Wed Jul 03, 2013 06:23 AM UTC
**Owner:** nobody

The issue is seen with changeset 4325 on a 4-node SLES VM setup. IMM is loaded 
with 25k objects.

Controller switchovers are performed continuously while IMM operations run 
continuously. The following is observed in the syslogs during the 92nd switchover:

On SC-1:

Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfd[2478]: NO safSi=SC-2N,safApp=OpenSAF 
Swap initiated
Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfd[2478]: NO Controller switch over 
initiated
Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfnd[2488]: NO Assigning 
'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'

On SC-2:

Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: NO Assigning 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafntfimcnd[5014]: NO exiting on signal 15
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafimmd[2366]: WA IMMD not re-electing coord 
for switch-over (si-swap) coord at (2010f)
Jul  2 19:15:54 SLES-64BIT-SLOT2 osaflckd[2561]: CR pthread_create FAILED: 
Resource temporarily unavailable
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: NO 
'safComp=GLD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: ER 
safComp=GLD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafamfnd[2442]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafevtd[2529]: ER pthread create failed: 
Resource temporarily unavailable
Jul  2 19:15:54 SLES-64BIT-SLOT2 osafimmnd[2376]: NO Implementer disconnected 
1114 <1709, 2020f> (@OpenSafImmReplicatorB)

As a result, SC-2, which was supposed to become Active, rebooted instead.
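
The log lines identify the nodes in two notations: the reboot messages print the node id in decimal (`131599`, `131343`), while the IMM and FM messages print it in hex (`2020f`, `2010f`). A minimal Python sketch for converting between the two when correlating lines; the helper names are illustrative only and are not part of OpenSAF:

```python
def node_id_to_hex(node_id: int) -> str:
    """Format a decimal node id as lowercase hex, as seen in the IMM/FM lines."""
    return format(node_id, "x")


def hex_to_node_id(text: str) -> int:
    """Parse a hex node id such as '2020f' back to its decimal value."""
    return int(text, 16)


print(node_id_to_hex(131599))  # 2020f (SC-2)
print(node_id_to_hex(131343))  # 2010f (SC-1)
```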

On SC-1:

Jul  2 19:15:59 SLES-64BIT-SLOT1 osaffmd[2392]: NO Role: QUIESCED, Node Down 
for node id: 2020f
Jul  2 19:15:59 SLES-64BIT-SLOT1 osaffmd[2392]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId = 
131343, SupervisionTime = 60
Jul  2 19:15:59 SLES-64BIT-SLOT1 kernel: [ 4258.864161] TIPC: Resetting link 
<1.1.1:eth0-1.1.2:eth1>, peer not responding
Jul  2 19:15:59 SLES-64BIT-SLOT1 kernel: [ 4258.864170] TIPC: Lost link 
<1.1.1:eth0-1.1.2:eth1> on network plane A
Jul  2 19:15:59 SLES-64BIT-SLOT1 kernel: [ 4258.864176] TIPC: Lost contact with 
<1.1.2>
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safAmfNode=SC-2,safAmfCluster=myAmfCluster OperState ENABLED => DISABLED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safAmfNode=SC-2,safAmfCluster=myAmfCluster'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: NO Node 'SC-2' left the cluster
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=NoRed,safApp=OpenSAF OperState ENABLED => DISABLED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=NoRed,safApp=OpenSAF PresenceState INSTANTIATED => 
UNINSTANTIATED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=NoRed,safApp=OpenSAF ReadinessState IN_SERVICE => 
OUT_OF_SERVICE
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: ER Alarm lost for 
safSi=NoRed2,safApp=OpenSAF
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=2N,safApp=OpenSAF OperState ENABLED => DISABLED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=2N,safApp=OpenSAF PresenceState INSTANTIATED => UNINSTANTIATED
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: WA State change notification 
lost for 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafamfd[2478]: 
safSu=SC-2,safSg=2N,safApp=OpenSAF ReadinessState IN_SERVICE => OUT_OF_SERVICE
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: 
safNode=SC-2,safCluster=myClmCluster LEFT, init view=2, cluster view=5
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: ER clms_node_exit_ntf failed 2
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: ER saImmOiRtObjectUpdate 
FAILED 9, 'safNode=SC-2,safCluster=myClmCluster'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafclmd[2457]: ER saImmOiRtObjectUpdate 
FAILED 9, 'safCluster=myClmCluster'
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafimmd[2401]: WA IMMND DOWN on active 
controller f2 detected at standby immd!! f1. Possible failover
Jul  2 19:15:59 SLES-64BIT-SLOT1 osafimmnd[2411]: WA DISCARD DUPLICATE FEVS 
message:40047
.............
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmd[2401]: NO Coord re-elected, resides 
at 2010f
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO This IMMND re-elected 
coord redundantly, failover ?
Jul  2 19:16:00 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafrded[2383]: NO rde_rde_set_role: role set 
to 1
Jul  2 19:16:00 SLES-64BIT-SLOT1 osaflogd[2432]: NO ACTIVE request
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafclmd[2457]: NO ACTIVE request
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1123 (safSmfService) <302, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1124 (safMsgGrpService) <307, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1125 (safCheckPointService) <322, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1126 (safLogService) <4, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1127 (safLckService) <308, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1128 (safEvtService) <309, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1129 (MsgQueueService131599) <2028, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer locally 
disconnected. Marking it as doomed 1129 <2028, 2010f> (MsgQueueService131599)
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer connected: 
1130 (safClmService) <17, 2010f>
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafimmnd[2411]: NO Implementer disconnected 
1129 <2028, 2010f> (MsgQueueService131599)
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafsmfd[2508]: NO Starting SmfCbkUtilThread
Jul  2 19:16:00 SLES-64BIT-SLOT1 osafamfnd[2488]: NO Assigned 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'

Jul  2 19:16:26 SLES-64BIT-SLOT1 kernel: [ 4285.539563] TIPC: Established link 
<1.1.1:eth0-1.1.2:eth1> on network plane A

Jul  2 19:16:42 SLES-64BIT-SLOT1 osafamfd[2478]: NO Node 'SC-2' joined the 
cluster

SC-1 went from Quiesced back to Active, and SC-2 became Standby. When the next 
switchover is then issued, it fails with the following message in the syslog:

Jul  2 19:33:24 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:25 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:26 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:27 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress
Jul  2 19:33:28 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, 
Controller role switch under progress

Traces of AMFD and AMFND are attached along with syslogs.

The same issue has been observed twice after 90 switchovers. It looks like a 
timing issue.
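
Since the failure only shows up after many switchovers, it may help to scan the full syslogs for the three messages that mark the sequence above. The following is a rough sketch, not an OpenSAF tool: the function name and the sample lines are illustrative, and the patterns are simply fragments copied from the excerpts in this ticket.

```python
import re
from collections import Counter

# Message fragments taken from the syslog excerpts in this ticket.
PATTERNS = {
    "swap_initiated": re.compile(r"Swap initiated"),
    "thread_create_failed": re.compile(r"pthread[_ ]create failed", re.IGNORECASE),
    "swap_rejected": re.compile(r"SI Swap not possible"),
}


def count_markers(lines):
    """Count how many lines match each marker pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines (unwrapped) from the excerpts above.
sample = [
    "Jul  2 19:15:52 SLES-64BIT-SLOT1 osafamfd[2478]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated",
    "Jul  2 19:15:54 SLES-64BIT-SLOT2 osaflckd[2561]: CR pthread_create FAILED: Resource temporarily unavailable",
    "Jul  2 19:15:54 SLES-64BIT-SLOT2 osafevtd[2529]: ER pthread create failed: Resource temporarily unavailable",
    "Jul  2 19:33:24 SLES-64BIT-SLOT1 osafamfd[2478]: NO SI Swap not possible, Controller role switch under progress",
]
counts = count_markers(sample)
print(dict(counts))
```

To run it against a real log, pass an open file object instead of `sample`, for example `count_markers(open(path))` with `path` pointing at the syslog file on the controller.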


