- **Component**: osaf --> clm


---

**[tickets:#528] order of service not guaranteed during failover**

**Status:** unassigned
**Created:** Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
**Last Updated:** Mon Oct 21, 2013 07:13 AM UTC
**Owner:** nobody

The issue is seen on changeset 4325 on a 4-node cluster of SLES VMs.

SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.

Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO 
'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER 
safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; 
timeout=60

SC-2 tried to become active but failed because CLMD got ERR_EXIST (rc:14) from 
saImmOiImplementerSet. The reason is that IMMND had not yet disconnected the old 
implementer held by node 2010f (SC-1, NodeId 131343). The following syslog shows 
the sequence.

Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [  101.408188] TIPC: Resetting link 
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [  101.408194] TIPC: Lost link 
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [  101.408198] TIPC: Lost contact with 
<1.1.1>
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS 
message:283106
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafrded[2332]: NO rde_rde_set_role: role set 
to 1
Jul 29 21:28:31 SLES-64BIT-SLOT2 osaflogd[2369]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafntfd[2379]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfd[2416]: NO FAILOVER StandBy --> Active
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet 
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: NO 
'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: ER 
safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; 
timeout=60
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS 
message:283107
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Global discard node 
received for nodeId:2010f pid:2382
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2353 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2350 <0, 2010f(down)> (safEvtService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2349 <0, 2010f(down)> (safLckService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2348 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2347 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2342 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2352 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2346 <0, 2010f(down)> (safSmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2343 <0, 2010f(down)> (safClmService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2344 <0, 2010f(down)> (safAmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2345 <0, 2010f(down)> (safLogService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMD lost contact with peer 
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs 
message 283106 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs 
message 283107 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO ACTIVE request
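
The ordering in the syslog above can be checked mechanically. The following is a minimal Python sketch, not part of OpenSAF: the helper names (`unwrap`, `first_index`, `set_failed_before_disconnect`) are made up for illustration. Because every entry carries the same whole-second timestamp, the check relies on the order of lines in the file, and it first rejoins the hard-wrapped continuation lines.

```python
import re

# A syslog entry starts with a "Mon DD HH:MM:SS" timestamp; any other
# non-blank line is treated as the wrapped tail of the previous entry.
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")


def unwrap(lines):
    """Join hard-wrapped continuation lines back onto their syslog entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] = entries[-1].rstrip() + " " + line.strip()
    return entries


def first_index(entries, *needles):
    """Position of the first entry containing all needles, or None."""
    for i, entry in enumerate(entries):
        if all(n in entry for n in needles):
            return i
    return None


def set_failed_before_disconnect(lines, implementer="safClmService"):
    """True if the implementer-set failure precedes the old implementer's disconnect."""
    entries = unwrap(lines)
    failed = first_index(entries, "saImmOiImplementerSet failed")
    freed = first_index(entries, "Implementer disconnected", "(%s)" % implementer)
    return failed is not None and freed is not None and failed < freed


# Two entries copied from the failing run above, in their original order.
sample = """\
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2343 <0, 2010f(down)> (safClmService)
""".splitlines()

print(set_failed_before_disconnect(sample))  # True
```

For a whole file, the lines can be passed in directly, for example `set_failed_before_disconnect(open(path))` with `path` pointing at a saved copy of the SC-2 syslog.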

The implementer is disconnected at IMMND only after CLMD has reported ERR_EXIST 
and gone into recovery. Ideally, the implementers held by the old node should be 
disconnected at IMMND before other OpenSAF processes try to reuse the same 
implementer name. This order needs to be guaranteed for the failover to always 
succeed. In this test the cluster went for a reboot. The issue is time-consuming 
and difficult to reproduce. 
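
The ordering dependency described above can be illustrated with a toy model. This is only a sketch: `ImplementerTable` and `failover` are hypothetical names and do not reflect how IMMND is actually implemented. The two constants are the standard `SaAisErrorT` values, and `SA_AIS_ERR_EXIST` corresponds to the `rc:14` in the CLMD log line.

```python
SA_AIS_OK = 1
SA_AIS_ERR_EXIST = 14  # the "rc:14" reported by osafclmd


class ImplementerTable:
    """Toy stand-in for the implementer-name bookkeeping (hypothetical)."""

    def __init__(self):
        self.owner = {}  # implementer name -> node id currently holding it

    def implementer_set(self, name, node_id):
        """Claim a name; refuse if some node still holds it."""
        if name in self.owner:
            return SA_AIS_ERR_EXIST
        self.owner[name] = node_id
        return SA_AIS_OK

    def discard_node(self, node_id):
        """Drop every implementer held by a node that has gone down."""
        for name in [n for n, o in self.owner.items() if o == node_id]:
            del self.owner[name]


def failover(discard_first):
    """Replay the failover with the two events in either order."""
    table = ImplementerTable()
    table.implementer_set("safClmService", 0x2010F)  # old active, SC-1
    if discard_first:
        # Order seen in the successful run: discard, then the new set.
        table.discard_node(0x2010F)
        return table.implementer_set("safClmService", 0x2020F)
    # Order seen in the failing run: the new set arrives before the discard.
    rc = table.implementer_set("safClmService", 0x2020F)
    table.discard_node(0x2010F)
    return rc


print(failover(discard_first=True))   # 1
print(failover(discard_first=False))  # 14
```

In the model the only difference between the two outcomes is which event is processed first, which is the point made in the ticket.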

AMF and IMM traces are available on request; the syslogs are attached. The 
clock on SC-2 is ahead by 7 seconds.
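
When the two syslogs are read side by side, the clock offset has to be removed first. The following sketch assumes that "ahead by 7 seconds" means the SC-2 clock reads 7 seconds later than the SC-1 clock, and takes the year from the ticket's creation date because syslog omits it. The function names are illustrative only.

```python
from datetime import datetime, timedelta

SC2_AHEAD = timedelta(seconds=7)  # offset stated in the ticket
YEAR = "2013"  # syslog omits the year; taken from the ticket's creation date


def parse(stamp):
    """Parse a syslog 'Mon DD HH:MM:SS' timestamp."""
    return datetime.strptime(YEAR + " " + stamp, "%Y %b %d %H:%M:%S")


def sc2_to_sc1_clock(stamp):
    """Express an SC-2 syslog timestamp on SC-1's clock (assumed direction)."""
    return parse(stamp) - SC2_AHEAD


kill_on_sc1 = parse("Jul 29 21:28:18")                  # osaffmd killed on SC-1
link_lost_on_sc2 = sc2_to_sc1_clock("Jul 29 21:28:31")  # TIPC link lost on SC-2

print(link_lost_on_sc2.strftime("%H:%M:%S"))             # 21:28:24
print((link_lost_on_sc2 - kill_on_sc1).total_seconds())  # 6.0
```

Under that assumption, SC-2 noticed the lost link about 6 seconds after FMD was killed on SC-1, not 13 seconds as the raw timestamps suggest.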

Syslog during successful failover:

Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS 
message:279866
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS 
message:279867
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Global discard node 
received for nodeId:2010f pid:2372
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2323 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2320 <0, 2010f(down)> (safEvtService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2319 <0, 2010f(down)> (safLckService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2318 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2317 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2312 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2321 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2316 <0, 2010f(down)> (safSmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2314 <0, 2010f(down)> (safClmService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2313 <0, 2010f(down)> (safAmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2315 <0, 2010f(down)> (safLogService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMD lost contact with peer 
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs 
message 279866 since it has recently been resent.
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs 
message 279867 since it has recently been resent.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: NO Role: STANDBY, Node Down for 
node id: 2010f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId = 
131599, SupervisionTime = 60
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [  102.192084] TIPC: Resetting link 
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [  102.192093] TIPC: Lost link 
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [  102.192100] TIPC: Lost contact with 
<1.1.1>
Jul 29 21:24:46 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafrded[2338]: NO rde_rde_set_role: role set 
to 1
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaflogd[2379]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafntfd[2389]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafclmd[2403]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO New coord elected, resides 
at 2020f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO This IMMND is now the NEW 
Coord
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO STARTING persistent back 
end process.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2326 <11, 2020f> (@safAmfService2020f)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2324 <3, 2020f> (@safLogService)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer connected: 
2328 (safAmfService) <11, 2020f>
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO Node 'SC-1' left the cluster
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active 
DONE!


