The fevs duplicates symptom appears to be due to a bug in TIPC.
It is fixed by this patch:
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/net/tipc?id=999417549c16dd0e3a382aa9f6ae61688db03181
That TIPC fix relates to the new MDS use of TIPC multicast in the
implementation of MDS_SENDTYPE_BCAST.
I can't say for sure that this was the only cause in #1112, but I am pretty sure
it contributed to the chaos.
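
To illustrate the symptom: if a multicast layer delivers the same broadcast more than once, a receiver that tracks sequence numbers will see repeats of numbers it has already processed. The sketch below is only an illustration, assuming each broadcast message carries a monotonically increasing sequence number. The class `DuplicateFilter` and its fields are hypothetical names, not the actual IMMND or MDS code.

```python
class DuplicateFilter:
    """Hypothetical helper: remembers the highest sequence number delivered."""

    def __init__(self):
        self.highest_seen = 0   # assumption: sequence numbers start at 1
        self.duplicates = 0     # count of messages discarded as duplicates

    def accept(self, seq_no):
        """Return True if seq_no is new, False if it was already delivered."""
        if seq_no <= self.highest_seen:
            self.duplicates += 1
            return False
        self.highest_seen = seq_no
        return True


# Simulated arrival order in which messages 2 and 3 are delivered twice.
arrivals = [1, 2, 2, 3, 3, 4]
flt = DuplicateFilter()
delivered = [seq for seq in arrivals if flt.accept(seq)]
print(delivered, flt.duplicates)
```

A filter like this hides the repeats from upper layers, but a large duplicate count still points at a fault in the transport below.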
/AndersBj
________________________________
From: surender khetavath [mailto:[email protected]]
Sent: den 22 september 2014 08:49
To: [opensaf:tickets]
Subject: [opensaf:tickets] #1112 mds: immnd crashes and massive fevs duplicate
messages seen
For now I am attaching the mds logs. I shall attach other logs if the problem is reproducible.
Attachment: logs_mds.tgz (408.9 kB; application/x-compressed-tar)
________________________________
[tickets:#1112]<http://sourceforge.net/p/opensaf/tickets/1112> mds: immnd
crashes and massive fevs duplicate messages seen
Status: unassigned
Milestone: 4.5.0
Created: Thu Sep 18, 2014 11:07 AM UTC by surender khetavath
Last Updated: Fri Sep 19, 2014 07:15 AM UTC
Owner: nobody
changeset : 5697
The crash was observed during failovers.
gdb on sc-1
(gdb) dir /home/staging/osaf/services/saf/immsv/immnd
Source directories searched:
/home/staging/osaf/services/saf/immsv/immnd:$cdir:$cwd
(gdb) bt
#0 0x00007f91649a0b55 in raise () from /lib64/libc.so.6
#1 0x00007f91649a2131 in abort () from /lib64/libc.so.6
#2 0x0000000000426a43 in ImmModel::prepareForSync(bool) () at ImmModel.cc:2184
#3 0x0000000000425d69 in immModel_prepareForSync () at ImmModel.cc:1805
#4 0x0000000000418686 in immnd_process_evt () at immnd_evt.c:8152
#5 0x000000000040b83b in main () at immnd_main.c:343
(gdb) fr 2
#2 0x0000000000426a43 in ImmModel::prepareForSync(bool) () at ImmModel.cc:2184
2184 abort();
(gdb) fr 3
#3 0x0000000000425d69 in immModel_prepareForSync () at ImmModel.cc:1805
1805 ImmModel::instance(&cb->immModel)->prepareForSync(isJoining);
gdb on sc-2,pl3&4
#0 0x00007f753fae3b55 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f753fae3b55 in raise () from /lib64/libc.so.6
#1 0x00007f753fae5131 in abort () from /lib64/libc.so.6
#2 0x0000000000418e40 in immnd_process_evt () at immnd_evt.c:8167
#3 0x000000000040b83b in main () at immnd_main.c:343
syslog on sc-1
Sep 18 13:54:53 SC-1 osafimmnd[2298]: ER Node is in a state that cannot accept start of sync, will terminate
Sep 18 13:54:53 SC-1 osafimmd[2288]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Sep 18 13:54:53 SC-1 osafimmd[2288]: ER Standby IMMD recieved reset message. All IMMNDs will restart.
Sep 18 13:54:53 SC-1 osafimmd[2288]: ER IMM RELOAD => ensure cluster restart by IMMD exit at both SCs, exiting
Sep 18 13:54:59 SC-1 kernel: [ 54.360115] eth3: no IPv6 routers present
Sep 18 13:54:59 SC-1 osaffmd[2278]: NO Node Down event for node id 2020f:
Sep 18 13:54:59 SC-1 osaffmd[2278]: NO Current role: STANDBY
Sep 18 13:54:59 SC-1 osaffmd[2278]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Failover occurred, but this node is not yet ready, OwnNodeId = 131343, SupervisionTime = 60
Sep 18 13:54:59 SC-1 kernel: [ 54.680115] TIPC: Resetting link <1.1.1:eth3-1.1.2:eth2>, peer not responding
Sep 18 13:54:59 SC-1 kernel: [ 54.680128] TIPC: Lost link <1.1.1:eth3-1.1.2:eth2> on network plane A
Sep 18 13:54:59 SC-1 kernel: [ 54.680137] TIPC: Lost contact with <1.1.2>
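
The syslog above uses two notations for node ids: the Node Down event appears to print the id in hex (2020f), while OwnNodeId is printed in decimal (131343). A short sketch of the conversion makes the two comparable. The helper name `node_id_hex` is made up for this illustration and is not part of OpenSAF.

```python
def node_id_hex(node_id):
    """Format a decimal node id as lowercase hex, without the 0x prefix."""
    return format(node_id, "x")


own_node = node_id_hex(131343)   # OwnNodeId as logged by osaffmd on SC-1
down_node = int("2020f", 16)     # node id from the Node Down event, as decimal

print(own_node, down_node)
```

Read this way, OwnNodeId 131343 is 2010f in hex, so the Node Down event for 2020f refers to a different node than the one writing the log, presumably the peer controller.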