Hey all,

We hit a heavy traffic spike last night that took our main gateway down.
Pacemaker was configured to fail over to our second one, but that one died
as well.

In a quick post-mortem, I found the following in the logs:

Apr 14 21:42:11 cesar1 kernel: [27613652.439846] BUG: soft lockup - CPU#4
stuck for 22s! [swapper/4:0]
Apr 14 21:42:11 cesar1 kernel: [27613652.440319] Stack:
Apr 14 21:42:11 cesar1 kernel: [27613652.440446] Call Trace:
Apr 14 21:42:11 cesar1 kernel: [27613652.440595]  <IRQ>
Apr 14 21:42:12 cesar1 kernel: [27613652.440828]  <EOI>
Apr 14 21:42:12 cesar1 kernel: [27613652.440979] Code: c1 51 da 03 81 48 c7
c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8
00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not connect to any LDAP
server as cn=admin,dc=rz,dc=dawanda,dc=com - Can't contact LDAP server
Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not search LDAP server
- Server is unavailable
Apr 14 21:42:24 cesar1 crmd: [7287]: ERROR: process_lrm_event: LRM
operation management-gateway-ip1_stop_0 (917) Timed Out (timeout=20000ms)
Apr 14 21:42:48 cesar1 kernel: [27613688.611501] BUG: soft lockup - CPU#7
stuck for 22s! [named:32166]
Apr 14 21:42:48 cesar1 kernel: [27613688.611914] Stack:
Apr 14 21:42:48 cesar1 kernel: [27613688.612036] Call Trace:
Apr 14 21:42:48 cesar1 kernel: [27613688.612200]  <IRQ>
Apr 14 21:42:48 cesar1 kernel: [27613688.612408]  <EOI>
Apr 14 21:42:48 cesar1 kernel: [27613688.612626] Code: c1 51 da 03 81 48 c7
c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8
00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
Apr 14 21:42:55 cesar1 kernel: [27613695.946295] BUG: soft lockup - CPU#0
stuck for 21s! [ksoftirqd/0:3]
Apr 14 21:42:55 cesar1 kernel: [27613695.946785] Stack:
Apr 14 21:42:55 cesar1 kernel: [27613695.946917] Call Trace:
Apr 14 21:42:55 cesar1 kernel: [27613695.947137] Code: c4 00 00 81 a8 44 e0
ff ff ff 01 00 00 48 63 80 44 e0 ff ff a9 00 ff ff 07 74 36 65 48 8b 04 25
c8 c4 00 00 83 a8 44 e0 ff ff 01 <5d> c3

We're using irqbalance so that the Ethernet card's hardware interrupts
don't all hit the first CPU when traffic comes in (a lesson learned from
the last, much more intensive DDoS).
However, since that didn't help this time, I'd like to find out what else
we can do.
Our gateway has to do NAT and needs a few other iptables rules in order to
run OpenStack behind it, so I can't just drop any of that.
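
For context, this is roughly the kind of manual interrupt spreading I've
been looking at as a next step; eth0, IRQ number 45 and the CPU mask are
placeholders, not our actual values:

  # Check how eth0's hardware interrupts are spread across CPUs
  # (single-queue NICs often end up with everything on one core):
  grep eth0 /proc/interrupts

  # Pin IRQ 45 (placeholder) to CPUs 0-3 via a hex CPU mask:
  echo f > /proc/irq/45/smp_affinity

  # Or spread the softirq side with Receive Packet Steering, here
  # over CPUs 0-3 for eth0's first receive queue:
  echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus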

Regarding the logs, I can see that something caused the CPU cores to get
stuck across a number of different processes.
Has anyone encountered error messages like the ones quoted above, or does
anyone know of other things one might do to keep huge amounts of
unsolicited incoming traffic from bringing a Linux node down?
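
For example, I've been wondering whether something along these lines in
front of our NAT rules would have made a difference (untested on our side,
and the limits below are guesses rather than recommendations):

  # Rate-limit new TCP SYNs per source IP; FORWARD would need a similar
  # rule for the traffic we NAT through to the OpenStack nodes:
  iptables -I INPUT -p tcp --syn -m hashlimit \
      --hashlimit-name synflood --hashlimit-mode srcip \
      --hashlimit-above 200/second --hashlimit-burst 500 -j DROP

  # Plus SYN cookies and a larger conntrack table, since the gateway NATs:
  sysctl -w net.ipv4.tcp_syncookies=1
  sysctl -w net.netfilter.nf_conntrack_max=262144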

Best regards,
Christian.
