I have looked at the only syslog provided. What I see first is
a long sequence of successful failovers back and forth.
The normal sequence, for a failover, is:

1) FMD reports node down event for peer.
Sep 18 13:49:35 SC-2 osaffmd[2791]: NO Node Down event for node id 2010f:

2) TIPC link loss:
Sep 18 13:49:35 SC-2 kernel: [  117.720128] TIPC: Resetting link 
<1.1.2:eth2-1.1.1:eth3>, peer not responding
Sep 18 13:49:35 SC-2 kernel: [  117.720141] TIPC: Lost link 
<1.1.2:eth2-1.1.1:eth3> on network plane A
Sep 18 13:49:35 SC-2 kernel: [  117.720150] TIPC: Lost contact with <1.1.1>

3) Standby IMMD gets a DOWN event for the active IMMND over MDS:
Sep 18 13:49:35 SC-2 osafimmd[2801]: WA IMMND DOWN on active controller f1 
detected at standby immd!! f2. Possible failover

4) IMMND ignores two duplicate FEVS messages sent by the local IMMD:
Sep 18 13:49:35 SC-2 osafimmnd[2811]: WA DISCARD DUPLICATE FEVS message:9589
Sep 18 13:49:35 SC-2 osafimmnd[2811]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:49:35 SC-2 osafimmnd[2811]: WA DISCARD DUPLICATE FEVS message:9590
Sep 18 13:49:35 SC-2 osafimmnd[2811]: WA Error code 2 returned for message type 
57 - ignoring

5) IMMD gets a DOWN event for the peer IMMD:
Sep 18 13:49:35 SC-2 osafimmd[2801]: WA IMMD lost contact with peer IMMD 
(NCSMDS_RED_DOWN)
Sep 18 13:49:35 SC-2 osafimmd[2801]: NO Skipping re-send of fevs message 9589 
since it has recently been resent.
Sep 18 13:49:35 SC-2 osafimmd[2801]: NO Skipping re-send of fevs message 9590 
since it has recently been resent.
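
The duplicate discards in steps 4 and 5 are consistent with the receiver tracking the highest FEVS sequence number it has applied and dropping anything at or below it. A minimal sketch of that idea (illustrative only, not OpenSAF source; the class name and the meaning of "Error code 2" are my assumptions):

```python
# Illustrative sketch: an IMMND-like receiver discarding duplicate FEVS
# messages by tracking the last applied sequence number.
# FevsReceiver and ERR_DUPLICATE are invented names for this example.

ERR_DUPLICATE = 2  # hypothetical stand-in for the "Error code 2" in the log

class FevsReceiver:
    def __init__(self):
        self.last_applied = 0  # highest FEVS sequence number processed so far

    def receive(self, seq_no):
        """Apply a FEVS message; return True if applied, False if discarded."""
        if seq_no <= self.last_applied:
            print(f"WA DISCARD DUPLICATE FEVS message:{seq_no}")
            print(f"WA Error code {ERR_DUPLICATE} returned for "
                  f"message type 57 - ignoring")
            return False
        self.last_applied = seq_no
        return True

rx = FevsReceiver()
rx.receive(9589)   # applied
rx.receive(9589)   # resent by IMMD after link trouble -> discarded
rx.receive(9590)   # applied
```

Under this model the discards in the log are harmless: the IMMD defensively resends in-flight messages around a failover, and the IMMND drops the copies it has already applied.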

6) Failover.
Sep 18 13:49:35 SC-2 osaffmd[2791]: NO Controller Failover: Setting role to 
ACTIVE
Sep 18 13:49:35 SC-2 osafrded[2782]: NO RDE role set to ACTIVE
Sep 18 13:49:35 SC-2 osafimmd[2801]: NO ACTIVE request
Sep 18 13:49:35 SC-2 osaflogd[2821]: NO ACTIVE request
Sep 18 13:49:35 SC-2 osafntfd[2834]: NO ACTIVE request
Sep 18 13:49:35 SC-2 osafclmd[2848]: NO ACTIVE request
Sep 18 13:49:35 SC-2 osafamfd[2867]: NO FAILOVER StandBy --> Active
Sep 18 13:49:35 SC-2 osafimmd[2801]: NO ellect_coord invoke from rda_callback 
ACTIVE
Sep 18 13:49:35 SC-2 osafimmd[2801]: NO New coord elected, resides at 2020f
Sep 18 13:49:35 SC-2 osafimmnd[2811]: NO 2PBE configured, 
IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
Sep 18 13:49:35 SC-2 osafimmnd[2811]: NO This IMMND is now the NEW Coord

========================
Then things go wrong in some way after the failover below.
Note that the sequence of events is different:

A5) IMMD gets a DOWN event for the peer IMMD. Possibly the IMMD crashed or was killed.
    I cannot see what occurred because I do not have syslogs for SC-1.
Sep 18 13:53:49 SC-2 osafimmd[2773]: WA IMMD lost contact with peer IMMD 
(NCSMDS_RED_DOWN)

B4) IMMND ignores two duplicate FEVS messages sent by the local IMMD
Sep 18 13:53:49 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10700
Sep 18 13:53:49 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:49 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10701
Sep 18 13:53:49 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring

C1) FMD reports node down event for peer.
Sep 18 13:53:54 SC-2 osaffmd[2763]: NO Node Down event for node id 2010f:

D2) TIPC link loss:
Sep 18 13:53:54 SC-2 kernel: [  117.540095] TIPC: Resetting link 
<1.1.2:eth2-1.1.1:eth3>, peer not responding
Sep 18 13:53:54 SC-2 kernel: [  117.540100] TIPC: Lost link 
<1.1.2:eth2-1.1.1:eth3> on network plane A
Sep 18 13:53:54 SC-2 kernel: [  117.540103] TIPC: Lost contact with <1.1.1>

E3) Standby IMMD gets a DOWN event for the active IMMND over MDS:
Sep 18 13:53:54 SC-2 osafimmd[2773]: WA IMMND DOWN on active controller f1 
detected at standby immd!! f2. Possible failover
Sep 18 13:53:54 SC-2 osafimmd[2773]: NO Skipping re-send of fevs message 10700 
since it has recently been resent.
Sep 18 13:53:54 SC-2 osafimmd[2773]: NO Skipping re-send of fevs message 10701 
since it has recently been resent.

F6) Failover.
Sep 18 13:53:54 SC-2 osaffmd[2763]: NO Controller Failover: Setting role to 
ACTIVE
Sep 18 13:53:54 SC-2 osafrded[2754]: NO RDE role set to ACTIVE
Sep 18 13:53:54 SC-2 osaflogd[2793]: NO ACTIVE request
Sep 18 13:53:54 SC-2 osafntfd[2806]: NO ACTIVE request
Sep 18 13:53:54 SC-2 osafclmd[2820]: NO ACTIVE request
Sep 18 13:53:54 SC-2 osafamfd[2843]: NO FAILOVER StandBy --> Active
Sep 18 13:53:54 SC-2 osafimmd[2773]: NO ACTIVE request
Sep 18 13:53:54 SC-2 osafimmd[2773]: NO ellect_coord invoke from rda_callback 
ACTIVE
Sep 18 13:53:54 SC-2 osafimmd[2773]: NO New coord elected, resides at 2020f
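
The step labels above encode the reordering: the letter gives the observed position, the digit the matching step of the normal sequence. Making the comparison explicit (illustrative only):

```python
# Illustrative only: comparing the normal failover step order with the order
# observed in the bad failover above.

normal_order = [1, 2, 3, 4, 5, 6]
# Observed: peer-IMMD DOWN (5), duplicate-FEVS discards (4), node down (1),
# TIPC link loss (2), IMMND DOWN over MDS (3), failover (6).
observed_order = [5, 4, 1, 2, 3, 6]

# Same events, different order: the peer-IMMD DOWN and the duplicate-FEVS
# discards now come FIRST, about five seconds (13:53:49 vs 13:53:54)
# before the node-down event.
assert sorted(observed_order) == normal_order and observed_order != normal_order
out_of_place = [s for s, n in zip(observed_order, normal_order) if s != n]
print("steps out of their normal position:", out_of_place)
```

The five-second head start of the IMMD DOWN event is what suggests the peer IMMD died on its own rather than as part of a node failure.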

After this we start to see QUADRUPLED FEVS messages (via MDS broadcast)
arriving at IMMND (note: one silent original and three duplicates).
There are also gaps where apparently no duplicates are sent.


Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10703
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10703
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10703
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10710
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10710
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10710
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: NO Implementer connected: 303 
(safMsgGrpService) <312, 2020f>
Sep 18 13:53:54 SC-2 osafimmnd[2783]: NO Implementer connected: 304 
(safCheckPointService) <316, 2020f>
Sep 18 13:53:54 SC-2 osafimmnd[2783]: NO Implementer disconnected 295 <4, 
2020f> (@OpenSafImmReplicatorB)
Sep 18 13:53:54 SC-2 osafntfd[2806]: NO handle_state_ntfimcn: osafntfimcnd 
process terminated. State change
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10713
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10713
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10713
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10714
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10714
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10714
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10717
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10717
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA Error code 2 returned for message type 
57 - ignoring
Sep 18 13:53:54 SC-2 osafimmnd[2783]: WA DISCARD DUPLICATE FEVS message:10717
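
Tallying the discarded sequence numbers from the log above confirms the pattern: each message has exactly three logged duplicates (plus the silently applied original, i.e. four arrivals in total), with gaps in between where no duplicates appear at all. A quick check:

```python
# Illustrative: counting the DISCARD lines above per FEVS sequence number
# and locating the gaps where no duplicates were logged.
from collections import Counter

discarded = [10703, 10703, 10703, 10710, 10710, 10710,
             10713, 10713, 10713, 10714, 10714, 10714,
             10717, 10717, 10717]

counts = Counter(discarded)
assert all(c == 3 for c in counts.values())  # 3 duplicates per message

seqs = sorted(counts)
gaps = [(a + 1, b - 1) for a, b in zip(seqs, seqs[1:]) if b - a > 1]
print("duplicate count per message:", dict(counts))
print("gaps with no logged duplicates:", gaps)
```

The gaps (10704-10709, 10711-10712, 10715-10716) show the quadrupling is not uniform, which points at something selective, e.g. retransmission logic, rather than a blanket broadcast fault.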

Finally we see split-brain symptoms:

Sep 18 13:54:52 SC-2 osafimmnd[2783]: WA Imm at this evs node has epoch 40, 
COORDINATOR appears to be a stragler!!, aborting.
Sep 18 13:54:52 SC-2 osafimmd[2773]: NO Successfully announced sync. New ruling 
epoch:40
Sep 18 13:54:52 SC-2 osafimmpbed: WA PBE lost contact with parent IMMND - 
Exiting
Sep 18 13:54:52 SC-2 osafamfnd[2853]: NO 
'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' component restart probation timer 
started (timeout: 60000000000 ns)
Sep 18 13:54:52 SC-2 osafamfnd[2853]: NO Restarting a component of 
'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
Sep 18 13:54:52 SC-2 osafamfnd[2853]: NO 
'safComp=IMMND,safSu=SC-2,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' 
: Recovery is 'componentRestart'
Sep 18 13:54:52 SC-2 osafntfimcnd[3340]: ER saImmOiDispatch() Fail 
SA_AIS_ERR_BAD_HANDLE (9)
Sep 18 13:54:52 SC-2 osafimmd[2773]: WA IMMND coordinator at 2020f apparently 
crashed => electing new coord
Sep 18 13:54:52 SC-2 osafimmd[2773]: ER Failed to find candidate for new IMMND 
coordinator
Sep 18 13:54:52 SC-2 osafimmd[2773]: ER Active IMMD has to restart the IMMSv. 
All IMMNDs will restart
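
The "COORDINATOR appears to be a stragler" abort suggests each IMMND compares its own epoch against the ruling epoch of the coordinator and refuses to follow a coordinator that is behind, a split-brain defence. A sketch of that guard (names invented, not OpenSAF source):

```python
# Illustrative sketch: an epoch guard that rejects a lagging ("straggler")
# coordinator, mirroring the syslog line above. EpochGuard is an invented name.

class EpochGuard:
    def __init__(self, local_epoch):
        self.local_epoch = local_epoch

    def check_coordinator(self, coord_epoch):
        """Return True if the coordinator may rule; False means abort."""
        if coord_epoch < self.local_epoch:
            print(f"WA Imm at this evs node has epoch {self.local_epoch}, "
                  "COORDINATOR appears to be a stragler!!, aborting.")
            return False
        return True

guard = EpochGuard(local_epoch=40)
guard.check_coordinator(39)  # stale coordinator -> abort, as in the syslog
```

If the quadrupled/reordered FEVS traffic caused nodes to disagree on the current epoch, this guard firing on the coordinator's own node would explain the cascade: the coordinator IMMND aborts, the IMMD finds no candidate, and the whole IMMSv restarts.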


---

** [tickets:#1112] 2pbe: immnd crashed on all nodes and led to cluster reset**

**Status:** assigned
**Milestone:** 4.3.3
**Created:** Thu Sep 18, 2014 11:07 AM UTC by surender khetavath
**Last Updated:** Thu Sep 18, 2014 12:57 PM UTC
**Owner:** Anders Bjornerstedt

changeset : 5697

The crash was observed as part of the failovers.

gdb on sc-1
(gdb) dir /home/staging/osaf/services/saf/immsv/immnd
Source directories searched: 
/home/staging/osaf/services/saf/immsv/immnd:$cdir:$cwd
(gdb) bt
#0  0x00007f91649a0b55 in raise () from /lib64/libc.so.6
#1  0x00007f91649a2131 in abort () from /lib64/libc.so.6
#2  0x0000000000426a43 in ImmModel::prepareForSync(bool) () at ImmModel.cc:2184
#3  0x0000000000425d69 in immModel_prepareForSync () at ImmModel.cc:1805
#4  0x0000000000418686 in immnd_process_evt () at immnd_evt.c:8152
#5  0x000000000040b83b in main () at immnd_main.c:343
(gdb) fr 2
#2  0x0000000000426a43 in ImmModel::prepareForSync(bool) () at ImmModel.cc:2184
2184                abort();
(gdb) fr 3
#3  0x0000000000425d69 in immModel_prepareForSync () at ImmModel.cc:1805
1805        ImmModel::instance(&cb->immModel)->prepareForSync(isJoining);

gdb on sc-2,pl3&4
#0  0x00007f753fae3b55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f753fae3b55 in raise () from /lib64/libc.so.6
#1  0x00007f753fae5131 in abort () from /lib64/libc.so.6
#2  0x0000000000418e40 in immnd_process_evt () at immnd_evt.c:8167
#3  0x000000000040b83b in main () at immnd_main.c:343


syslog on sc-1
Sep 18 13:54:53 SC-1 osafimmnd[2298]: ER Node is in a state that cannot accept 
start of sync, will terminate
Sep 18 13:54:53 SC-1 osafimmd[2288]: WA IMMND DOWN on active controller f2 
detected at standby immd!! f1. Possible failover
Sep 18 13:54:53 SC-1 osafimmd[2288]: ER Standby IMMD recieved reset message. 
All IMMNDs will restart.
Sep 18 13:54:53 SC-1 osafimmd[2288]: ER IMM RELOAD  => ensure cluster restart 
by IMMD exit at both SCs, exiting
Sep 18 13:54:59 SC-1 kernel: [   54.360115] eth3: no IPv6 routers present
Sep 18 13:54:59 SC-1 osaffmd[2278]: NO Node Down event for node id 2020f:
Sep 18 13:54:59 SC-1 osaffmd[2278]: NO Current role: STANDBY
Sep 18 13:54:59 SC-1 osaffmd[2278]: Rebooting OpenSAF NodeId = 0 EE Name = No 
EE Mapped, Reason: Failover occurred, but this node is not yet ready, OwnNodeId 
= 131343, SupervisionTime = 60
Sep 18 13:54:59 SC-1 kernel: [   54.680115] TIPC: Resetting link 
<1.1.1:eth3-1.1.2:eth2>, peer not responding
Sep 18 13:54:59 SC-1 kernel: [   54.680128] TIPC: Lost link 
<1.1.1:eth3-1.1.2:eth2> on network plane A
Sep 18 13:54:59 SC-1 kernel: [   54.680137] TIPC: Lost contact with <1.1.2>
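
The backtraces and the SC-1 syslog line "Node is in a state that cannot accept start of sync" fit together: the abort() in ImmModel::prepareForSync looks like a state guard that terminates the IMMND when a sync-start arrives while the node is in a state that cannot accept it. A hedged sketch of such a guard (the state names and accepting set are my assumptions, not taken from ImmModel.cc):

```python
# Illustrative sketch: a prepareForSync-style state guard. The states and
# the ACCEPTING_STATES set are invented for this example.

ACCEPTING_STATES = {"IMM_NODE_LOADED", "IMM_NODE_FULLY_AVAILABLE"}  # assumed

def prepare_for_sync(node_state, is_joining):
    """Admit a sync-start, or terminate (stands in for abort() in the bt)."""
    if node_state not in ACCEPTING_STATES:
        print("ER Node is in a state that cannot accept start of sync, "
              "will terminate")
        raise SystemExit(1)
    return "syncing" if is_joining else "sync-server"

# A node caught mid-transition by the reordered/duplicated FEVS traffic
# would hit the guard:
try:
    prepare_for_sync("IMM_NODE_ISOLATED", is_joining=True)
except SystemExit:
    pass  # in the real process this is abort(), restarting the IMMND
```

Under that reading, the sync announced right after the split-brain detection reached IMMNDs that were no longer in an acceptable state, and the guard tripping on all nodes produced the cluster-wide IMMND crash and reset reported in the ticket.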


