Hello, a colleague of Jarrod's here. I was assigned to look into this issue and have identified the cause. I'm not familiar with the underlying kernel code, so please bear with me if I seem to be mistaken on anything.
Setup
-----

I've been testing under single-CPU kvm instances, though we see exactly the same thing on real multi-core hardware. There are two servers with essentially identical configurations. We use keepalived to control load-balancing between the two, but again we see the same problem if we configure each manually with "ip addr add" and "ipvsadm" directly.

Each server runs 4 HTTP(S) servers: 2 Apache 2.2 HTTP, 1 Apache 2.2 HTTPS and a custom HTTP server, and we can see the regular keepalived probes against them all in the logfiles. The first machine to boot becomes the keepalived master, the second the backup. If no further action is taken, things appear to function normally: ipvsadm reports normal running weights of around 250 for both servers. We're running a pure IPv4 workload, by the way.

Repro
-----

If we administratively disable the master by having our probe script assign it a weight of 0 (but not stopping keepalived, so the master/backup assignment remains the same), then the regular liveness probes are usually enough to trigger the problem. If not, firing off 100 or so simultaneous HTTPS requests from an external host against one of the load-balanced services is always enough. The effect is dramatic, rapid and predictable.

"top" on the backup reports a consistently high softirq (si) CPU% (100% if nothing else is using CPU, otherwise as high as it can get), which does not recover once the external load is removed. ksoftirqd stays near the top of the top process list. Interactive response and network throughput drop through the floor and the machine essentially becomes unusable. Attempting to reboot the box hangs at the end of the shutdown sequence after the "Rebooting..." message, and the box must be physically reset.

The ifconfig stats show that lo is passing 1M packets per second, compared to 50-100 packets per second under normal operation. The external interfaces show normal numbers. tcpdumping lo shows mostly packets from the external host to one of the load-balanced addresses: the vast majority are FIN, with quite a few RST and a much smaller number of ACK and data packets. This capture was made 5-10 minutes after the external load was removed, and it looks like the same small set of FIN packets is occurring again and again.

In addition, the following message is output via dmesg every minute or so:

[ 876.182162] BUG: soft lockup - CPU#0 stuck for 61s! [ipvs_syncmaster:11280]
[ 876.182162] Modules linked in: k8_edac e752x_edac edac_core ip_vs_wrr fuse ip_vs deflate zlib_deflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_x86_64 aes_generic xcbc sha256_generic sha1_generic md5 hmac crypto_hash cryptomgr crypto_null crypto_blkcipher crypto_algapi af_key xt_MARK iptable_mangle xt_tcpudp xt_policy xt_comment iptable_filter ip_tables x_tables floppy nvram i2c_piix4 i2c_core ata_piix ata_generic rtc e1000 virtio_net 8139cp mii virtio_blk virtio_pci virtio_ring virtio uhci_hcd ohci_hcd ehci_hcd [last unloaded: nf_conntrack]
[ 876.182162] CPU 0:
[ 876.182162] Modules linked in: k8_edac e752x_edac edac_core ip_vs_wrr fuse ip_vs deflate zlib_deflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_x86_64 aes_generic xcbc sha256_generic sha1_generic md5 hmac crypto_hash cryptomgr crypto_null crypto_blkcipher crypto_algapi af_key xt_MARK iptable_mangle xt_tcpudp xt_policy xt_comment iptable_filter ip_tables x_tables floppy nvram i2c_piix4 i2c_core ata_piix ata_generic rtc e1000 virtio_net 8139cp mii virtio_blk virtio_pci virtio_ring virtio uhci_hcd ohci_hcd ehci_hcd [last unloaded: nf_conntrack]
[ 876.182162] Pid: 11280, comm: ipvs_syncmaster Not tainted 2.6.27-rc4-30.0.jsullivan.CL1 #1
[ 876.182162] RIP: 0010:[<ffffffff80484740>]  [<ffffffff80484740>] skb_push+0x0/0x50
[ 876.182162] RSP: 0000:ffffffff80769c20  EFLAGS: 00000246
[ 876.182162] RAX: ffff880016921010 RBX: ffffffff80769c58 RCX: 0000000000000000
[ 876.182162] RDX: 0000000000000000 RSI: 000000000000000e RDI: ffff88005e86c0c0
[ 876.182162] RBP: ffffffff80769ba0 R08: ffffffff804b3d50 R09: ffffffffa013d9f0
[ 876.182162] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8020c406
[ 876.182162] R13: ffffffff80769ba0 R14: ffff88005e86c0c0 R15: 000000000000000e
[ 876.182162] FS:  0000000000000000(0000) GS:ffffffff806c7600(0000) knlGS:0000000000000000
[ 876.182162] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 876.182162] CR2: 00007fe50dff0020 CR3: 000000001a833000 CR4: 00000000000006e0
[ 876.182162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 876.182162] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 876.182162]
[ 876.182162] Call Trace:
[ 876.182162] <IRQ>  [<ffffffff804b3e6f>] ? ip_finish_output+0x11f/0x2c0
[ 876.182162] [<ffffffff804b4361>] ? ip_output+0x71/0xb0
[ 876.182162] [<ffffffffa0132071>] ? ip_vs_dr_xmit+0xc1/0x1d0 [ip_vs]
[ 876.182162] [<ffffffffa012c557>] ? ip_vs_in+0x1a7/0x360 [ip_vs]
[ 876.182162] [<ffffffff804a917f>] ? nf_iterate+0x5f/0x90
[ 876.182162] [<ffffffff804af970>] ? ip_local_deliver_finish+0x0/0x210
[ 876.182162] [<ffffffff804a9213>] ? nf_hook_slow+0x63/0xf0
[ 876.182162] [<ffffffff804af970>] ? ip_local_deliver_finish+0x0/0x210
[ 876.182162] [<ffffffff804a917f>] ? nf_iterate+0x5f/0x90
[ 876.182162] [<ffffffff804b0028>] ? ip_local_deliver+0x78/0x90
[ 876.182162] [<ffffffff804af717>] ? ip_rcv_finish+0x157/0x3b0
[ 876.182162] [<ffffffff804afeb6>] ? ip_rcv+0x1f6/0x2f0
[ 876.182162] [<ffffffff8048ad82>] ? netif_receive_skb+0x312/0x520
[ 876.182162] [<ffffffff804c9a30>] ? tcp_write_timer+0x0/0x680
[ 876.182162] [<ffffffff8048d947>] ? process_backlog+0x77/0xe0
[ 876.182162] [<ffffffff8048d0c9>] ? net_rx_action+0xf9/0x1b0
[ 876.182162] [<ffffffff80239d2a>] ? __do_softirq+0x7a/0xf0
[ 876.182162] [<ffffffff8020c9bc>] ? call_softirq+0x1c/0x30
[ 876.182162] <EOI>  [<ffffffff8020e35d>] ? do_softirq+0x3d/0x80
[ 876.182162] [<ffffffff8023a1ce>] ? local_bh_enable+0x9e/0xb0
[ 876.182162] [<ffffffff804abc17>] ? __ip_route_output_key+0x177/0xa90
[ 876.182162] [<ffffffff804b2556>] ? ip_cork_release+0x36/0x50
[ 876.182162] [<ffffffff804b3c52>] ? ip_push_pending_frames+0x2e2/0x3e0
[ 876.182162] [<ffffffff804ac561>] ? ip_route_output_flow+0x31/0x2b0
[ 876.182162] [<ffffffff804d1edf>] ? udp_sendmsg+0x56f/0x690
[ 876.182162] [<ffffffff804d8815>] ? inet_sendmsg+0x45/0x80
[ 876.182162] [<ffffffffa0133940>] ? sync_thread_master+0x0/0x200 [ip_vs]
[ 876.182162] [<ffffffff8047deef>] ? sock_sendmsg+0xdf/0x110
[ 876.182162] [<ffffffff8024bcaa>] ? enqueue_hrtimer+0x7a/0x100
[ 876.182162] [<ffffffff8024c8cc>] ? ktime_get_ts+0x4c/0x60
[ 876.182162] [<ffffffff80249360>] ? autoremove_wake_function+0x0/0x40
[ 876.182162] [<ffffffff8022d274>] ? hrtick_start_fair+0x154/0x170
[ 876.182162] [<ffffffff8022d31f>] ? pick_next_task_fair+0x8f/0xb0
[ 876.182162] [<ffffffff8047f594>] ? kernel_sendmsg+0x34/0x50
[ 876.182162] [<ffffffffa0132e87>] ? ip_vs_send_sync_msg+0x57/0x70 [ip_vs]
[ 876.182162] [<ffffffffa01339b0>] ? sync_thread_master+0x70/0x200 [ip_vs]
[ 876.182162] [<ffffffff80248f0d>] ? kthread+0x4d/0x80
[ 876.182162] [<ffffffff8020c659>] ? child_rip+0xa/0x11
[ 876.182162] [<ffffffff80248ec0>] ? kthread+0x0/0x80
[ 876.182162] [<ffffffff8020c64f>] ? child_rip+0x0/0x11

Bug Location
------------

The bug is present in the latest 2.6.35 kernel and at the head of the lvs-test-2.6 branch. Using git-bisect I identified the following commit within the 2.6.28 branch as the culprit:

commit f2428ed5e7bc89c7716ead22748cb5d076e204f0
Author: Simon Horman <ho...@verge.net.au>
Date:   Fri Sep 5 11:17:14 2008 +1000

    ipvs: load balance ipv6 connections from a local process

In particular, backing out this hunk (the last one in the patch) makes the problem go away:

diff --git a/net/ipv4/ipvs/ip_vs_core.c b/net/ipv4/ipvs/ip_vs_core.c
--- a/net/ipv4/ipvs/ip_vs_core.c
+++ b/net/ipv4/ipvs/ip_vs_core.c
@@ -1281,9 +1274,7 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
 	 * Big tappo: only PACKET_HOST, including loopback for local client
	 * Don't handle local packets on IPv6 for now
 	 */
-	if (unlikely(skb->pkt_type != PACKET_HOST ||
-		     (af == AF_INET6 || (skb->dev->flags & IFF_LOOPBACK ||
-					 skb->sk)))) {
+	if (unlikely(skb->pkt_type != PACKET_HOST)) {
 		IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s ignored\n",
 			      skb->pkt_type,
 			      iph.protocol,

This is within the function ip_vs_in(), which, as can be seen, is high up in the oops backtrace above. I don't understand the codebase anywhere near well enough to know exactly what each of the four original sub-conditions is trying to achieve (except the af check, that much is obvious). However, putting the sub-conditions back one by one shows that the IFF_LOOPBACK check appears to be the important one here: if I back out that hunk and then remove just the af check, or the af check plus the skb->sk check, the bug again disappears. I suspect that removing the last two conditions causes this function to attempt to handle more types of packet than it is prepared for, re-injecting some particular sub-type which it then gets to process again, re-inject again, process again... essentially causing an infinite loop within that kernel thread.
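For what it's worth, my (possibly naive) reading of the two restored sub-conditions, written out as a stand-alone predicate, is below. This is only an illustrative sketch assuming kernel context and the usual skbuff/netdevice headers; the helper name ip_vs_ignore_packet() is mine and does not exist anywhere in the real code, which open-codes the test inside ip_vs_in(), and I have left out the af == AF_INET6 part since, as above, removing it does not bring the bug back:

#include <linux/types.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/if_packet.h>

/* Hypothetical helper: which packets the pre-f2428ed5 check told
 * ip_vs_in() to leave alone. */
static bool ip_vs_ignore_packet(const struct sk_buff *skb)
{
	/* Frame not addressed to this host at layer 2
	 * (e.g. traffic seen in promiscuous mode). */
	if (skb->pkt_type != PACKET_HOST)
		return true;

	/* Arrived on a loopback device, i.e. locally generated
	 * traffic coming back in via lo. */
	if (skb->dev->flags & IFF_LOOPBACK)
		return true;

	/* Attached to a local socket (skb->sk set), i.e. traffic
	 * originated by a process on the director itself. */
	if (skb->sk)
		return true;

	return false;	/* an ordinary external packet; IPVS handles it */
}

If that reading is right, then without the last two checks ip_vs_in() also picks up the director's own loopback traffic, which would fit the lo packet counts and the re-injection loop described above.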
The ipvs directory has moved since that commit, so here is a patch against 2.6.35 HEAD which restores the final two conditions:

diff -r -U3 a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
--- a/net/netfilter/ipvs/ip_vs_core.c	2008-09-08 00:34:43.000000000 +0100
+++ b/net/netfilter/ipvs/ip_vs_core.c	2010-09-22 16:44:43.402466098 +0100
@@ -1274,7 +1274,9 @@
 	 * Big tappo: only PACKET_HOST, including loopback for local client
 	 * Don't handle local packets on IPv6 for now
 	 */
-	if (unlikely(skb->pkt_type != PACKET_HOST)) {
+	if (unlikely(skb->pkt_type != PACKET_HOST ||
+		     (skb->dev->flags & IFF_LOOPBACK ||
+		      skb->sk))) {
 		IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s ignored\n",
 			      skb->pkt_type,
 			      iph.protocol,

Note that the comment implies the original check was only about "local packets on IPv6" and that local IPv4 packets should have been handled. However, all the sub-conditions are ORed together, so they applied equally to IPv4 traffic; I therefore assume that removing just the af check would have been enough to handle IPv6 in the same way as IPv4.

John