See also duplicate ticket #599
https://sourceforge.net/p/opensaf/tickets/599/
---
**[tickets:#528] order of service not guaranteed during failover**
**Status:** unassigned
**Created:** Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
**Last Updated:** Tue Jul 30, 2013 09:44 AM UTC
**Owner:** nobody
The issue was seen with changeset 4325 on a 4-node cluster of SLES VMs.
SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.
Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO
'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER
safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
timeout=60
SC-2 tried to become active but failed because CLMD got ERR_EXIST (rc:14) from
saImmOiImplementerSet. The reason is that IMMND had not yet disconnected the
old implementer on node 2010f (SC-1). The following SC-2 syslog shows the
sequence.
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408188] TIPC: Resetting link
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408194] TIPC: Lost link
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408198] TIPC: Lost contact with
<1.1.1>
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMND DOWN on active
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS
message:283106
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafrded[2332]: NO rde_rde_set_role: role set
to 1
Jul 29 21:28:31 SLES-64BIT-SLOT2 osaflogd[2369]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafntfd[2379]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfd[2416]: NO FAILOVER StandBy --> Active
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: NO
'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: ER
safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS
message:283107
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Global discard node
received for nodeId:2010f pid:2382
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2353 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2350 <0, 2010f(down)> (safEvtService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2349 <0, 2010f(down)> (safLckService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2348 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2347 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2342 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2352 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2346 <0, 2010f(down)> (safSmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2343 <0, 2010f(down)> (safClmService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2344 <0, 2010f(down)> (safAmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2345 <0, 2010f(down)> (safLogService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs
message 283106 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs
message 283107 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO ACTIVE request
The implementer is disconnected at IMMND only after CLMD has reported ERR_EXIST
and gone into recovery. Ideally, IMMND should disconnect the implementers of
the old node before other OpenSAF processes try to reuse the same implementer
names. This order needs to be guaranteed for failover to always succeed. In
this test the cluster went for a reboot. The issue is highly timing-dependent
and difficult to reproduce.
AMF and IMM traces are available and can be provided on request. The syslogs
are attached for now. Note that the clock on SC-2 is 7 seconds ahead.
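As an illustration only, one conceivable way to tolerate this race is for the
caller to retry the implementer set a bounded number of times while it keeps
getting ERR_EXIST. This is a sketch, not necessarily how this ticket should be
resolved, and it does not replace guaranteeing the order. `implementer_set` is
a hypothetical callable standing in for saImmOiImplementerSet, `make_fake` is a
hypothetical test double, and the attempt count and delay are arbitrary
assumptions. The value 14 is the rc seen in the syslog above.

```python
import time

SA_AIS_OK = 1
SA_AIS_ERR_EXIST = 14  # "rc:14" in the failing syslog


def set_implementer_with_retry(implementer_set, name, attempts=10,
                               delay=0.1, sleep=time.sleep):
    """Call implementer_set(name), retrying only while it returns ERR_EXIST.

    implementer_set is a hypothetical stand-in for saImmOiImplementerSet.
    Returns the last return code after at most `attempts` calls.
    """
    rc = implementer_set(name)
    for _ in range(attempts - 1):
        if rc != SA_AIS_ERR_EXIST:
            break
        sleep(delay)  # give IMMND time to discard the old node's implementers
        rc = implementer_set(name)
    return rc


def make_fake(fail_times):
    """Hypothetical test double: ERR_EXIST for the first fail_times calls."""
    calls = {"n": 0}

    def fake(name):
        calls["n"] += 1
        return SA_AIS_ERR_EXIST if calls["n"] <= fail_times else SA_AIS_OK

    return fake, calls


fake, calls = make_fake(fail_times=2)
rc = set_implementer_with_retry(fake, "safClmService", delay=0.01)
print(rc, calls["n"])
```

With the fake above, the first two calls return ERR_EXIST and the third
succeeds. Any other error code is returned immediately without retrying.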
Syslog during successful failover:
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMND DOWN on active
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS
message:279866
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS
message:279867
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Global discard node
received for nodeId:2010f pid:2372
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2323 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2320 <0, 2010f(down)> (safEvtService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2319 <0, 2010f(down)> (safLckService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2318 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2317 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2312 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2321 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2316 <0, 2010f(down)> (safSmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2314 <0, 2010f(down)> (safClmService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2313 <0, 2010f(down)> (safAmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2315 <0, 2010f(down)> (safLogService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs
message 279866 since it has recently been resent.
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs
message 279867 since it has recently been resent.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: NO Role: STANDBY, Node Down for
node id: 2010f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId =
131599, SupervisionTime = 60
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192084] TIPC: Resetting link
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192093] TIPC: Lost link
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192100] TIPC: Lost contact with
<1.1.1>
Jul 29 21:24:46 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafrded[2338]: NO rde_rde_set_role: role set
to 1
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaflogd[2379]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafntfd[2389]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafclmd[2403]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO New coord elected, resides
at 2020f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO This IMMND is now the NEW
Coord
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO STARTING persistent back
end process.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2326 <11, 2020f> (@safAmfService2020f)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2324 <3, 2020f> (@safLogService)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer connected:
2328 (safAmfService) <11, 2020f>
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO Node 'SC-1' left the cluster
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
DONE!
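
The two syslogs above differ in whether the saImmOiImplementerSet failure is
logged before IMMND logs the disconnect of the old implementer. Because all
events carry the same one-second timestamp, the check below relies on line
order rather than on time. It is a rough sketch for triaging a captured syslog;
the function name `diagnose` and its return labels are assumptions made for
this example, and the patterns are taken only from the log lines shown in this
ticket.

```python
import re


def diagnose(syslog_text, implementer="safClmService"):
    """Classify a syslog excerpt by the order of two events.

    Returns "no-failure", "no-disconnect", "race" or "other".
    """
    # The mail wraps long log lines, so collapse all whitespace first.
    flat = " ".join(syslog_text.split())
    fail = re.search(r"saImmOiImplementerSet failed rc:(\d+)", flat)
    disc = re.search(
        r"Implementer disconnected \d+ <\d+, \w+\(down\)> \("
        + re.escape(implementer) + r"\)",
        flat,
    )
    if fail is None:
        return "no-failure"
    if disc is None:
        return "no-disconnect"
    # Failure logged first: the old implementer was still registered.
    return "race" if fail.start() < disc.start() else "other"


sample = """\
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2343 <0, 2010f(down)> (safClmService)
"""
print(diagnose(sample))  # race
```

Run against the successful failover log, the same function returns
"no-failure", since no saImmOiImplementerSet error is logged there.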