Hi friends,

On one of our machines at work, we observed a sequence of events that starts with an OOM in a secondary cgroup and ends with the bond interface being down for up to 12 seconds. Below is an excerpt of dmesg from when the bond interface went down:

[Wed Jul 25 19:20:45 2018] Call Trace:
[Wed Jul 25 19:20:45 2018] <IRQ>
[Wed Jul 25 19:20:45 2018] ? dev_deactivate_queue.constprop.29+0x60/0x60
[Wed Jul 25 19:20:45 2018] call_timer_fn+0x30/0x120
[Wed Jul 25 19:20:45 2018] run_timer_softirq+0x3c8/0x420
[Wed Jul 25 19:20:45 2018] ? timerqueue_add+0x52/0x80
[Wed Jul 25 19:20:45 2018] ? enqueue_hrtimer+0x37/0x80
[Wed Jul 25 19:20:45 2018] ? recalibrate_cpu_khz+0x10/0x10
[Wed Jul 25 19:20:45 2018] __do_softirq+0xde/0x2b3
[Wed Jul 25 19:20:45 2018] irq_exit+0xae/0xb0
[Wed Jul 25 19:20:45 2018] smp_apic_timer_interrupt+0x70/0x120
[Wed Jul 25 19:20:45 2018] apic_timer_interrupt+0x7d/0x90
[Wed Jul 25 19:20:45 2018] </IRQ>
[Wed Jul 25 19:20:45 2018] RIP: 0010:cpuidle_enter_state+0xa2/0x2e0
[Wed Jul 25 19:20:45 2018] RSP: 0018:ffffffff9c403eb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[Wed Jul 25 19:20:45 2018] RAX: ffff9075c0821a40 RBX: 0006cdb159b6000e RCX: 000000000000001f
[Wed Jul 25 19:20:45 2018] RDX: 0006cdb159b6000e RSI: ffed6f2696159a35 RDI: 0000000000000000
[Wed Jul 25 19:20:45 2018] RBP: ffffd3edc100b900 R08: 0000000000000f48 R09: 0000000000000cfe
[Wed Jul 25 19:20:45 2018] R10: ffffffff9c403e90 R11: 0000000000000f12 R12: 0000000000000003
[Wed Jul 25 19:20:45 2018] R13: ffffffff9c4b03d8 R14: 0000000000000000 R15: 0006cdb159783c8e
[Wed Jul 25 19:20:45 2018] do_idle+0x181/0x1e0
[Wed Jul 25 19:20:45 2018] cpu_startup_entry+0x19/0x20
[Wed Jul 25 19:20:45 2018] start_kernel+0x400/0x408
[Wed Jul 25 19:20:45 2018] secondary_startup_64+0xa5/0xb0
[Wed Jul 25 19:20:45 2018] Code: 63 8e 60 04 00 00 eb 8f 4c 89 f7 c6 05 79 c7 b8 00 01 e8 00 7c fd ff 89 d9 48 89 c2 4c 89 f6 48 c7 c7 f0 38 28 9c e8 c7 a3 b6 ff <0f> 0b eb bd 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
[Wed Jul 25 19:20:45 2018] ---[ end trace 2ad2942fe3431402 ]---
[Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: initiating reset due to tx timeout
[Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: Reset adapter
[Wed Jul 25 19:20:48 2018] ixgbe 0000:19:00.0 eno1: initiating reset due to tx timeout
[Wed Jul 25 19:20:53 2018] ixgbe 0000:19:00.0 eno1: initiating reset due to tx timeout

We have observed similar behavior on a 4.15.11 kernel running on a different machine; the machine these logs are from runs a 4.14.52 kernel. More detailed dmesg output can be found here:
https://gist.github.com/bjhaid/49a1c58742ef2458984339503290ef9a

I would appreciate any help in figuring out the cause of this issue and a fix for it. I am posting here because the ixgbe driver was involved.

ixgbe driver version: 5.1.0-k
NICs: 2 x Intel Corporation Ethernet Controller 10G X550T
Bond: LACP

Thanks!
Abejide Ayodele
It always seems impossible until it's done. --Nelson Mandela

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired