[E1000-devel] OOM in secondary cgroup leading to networking loss

Àbéjídé Àyodélé Mon, 30 Jul 2018 10:56:11 -0700

Hi friends,

On one of our machines at work, we observed a sequence of events that starts
with an OOM in a secondary cgroup and ends with the bond interface being down
for up to 12 seconds. Below is the part of dmesg from when the bond interface
went down:


[Wed Jul 25 19:20:45 2018] Call Trace:
[Wed Jul 25 19:20:45 2018]  <IRQ>
[Wed Jul 25 19:20:45 2018]  ? dev_deactivate_queue.constprop.29+0x60/0x60
[Wed Jul 25 19:20:45 2018]  call_timer_fn+0x30/0x120
[Wed Jul 25 19:20:45 2018]  run_timer_softirq+0x3c8/0x420
[Wed Jul 25 19:20:45 2018]  ? timerqueue_add+0x52/0x80
[Wed Jul 25 19:20:45 2018]  ? enqueue_hrtimer+0x37/0x80
[Wed Jul 25 19:20:45 2018]  ? recalibrate_cpu_khz+0x10/0x10
[Wed Jul 25 19:20:45 2018]  __do_softirq+0xde/0x2b3
[Wed Jul 25 19:20:45 2018]  irq_exit+0xae/0xb0
[Wed Jul 25 19:20:45 2018]  smp_apic_timer_interrupt+0x70/0x120
[Wed Jul 25 19:20:45 2018]  apic_timer_interrupt+0x7d/0x90
[Wed Jul 25 19:20:45 2018]  </IRQ>
[Wed Jul 25 19:20:45 2018] RIP: 0010:cpuidle_enter_state+0xa2/0x2e0
[Wed Jul 25 19:20:45 2018] RSP: 0018:ffffffff9c403eb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[Wed Jul 25 19:20:45 2018] RAX: ffff9075c0821a40 RBX: 0006cdb159b6000e RCX: 000000000000001f
[Wed Jul 25 19:20:45 2018] RDX: 0006cdb159b6000e RSI: ffed6f2696159a35 RDI: 0000000000000000
[Wed Jul 25 19:20:45 2018] RBP: ffffd3edc100b900 R08: 0000000000000f48 R09: 0000000000000cfe
[Wed Jul 25 19:20:45 2018] R10: ffffffff9c403e90 R11: 0000000000000f12 R12: 0000000000000003
[Wed Jul 25 19:20:45 2018] R13: ffffffff9c4b03d8 R14: 0000000000000000 R15: 0006cdb159783c8e
[Wed Jul 25 19:20:45 2018]  do_idle+0x181/0x1e0
[Wed Jul 25 19:20:45 2018]  cpu_startup_entry+0x19/0x20
[Wed Jul 25 19:20:45 2018]  start_kernel+0x400/0x408
[Wed Jul 25 19:20:45 2018]  secondary_startup_64+0xa5/0xb0
[Wed Jul 25 19:20:45 2018] Code: 63 8e 60 04 00 00 eb 8f 4c 89 f7 c6 05 79 c7 b8 00 01 e8 00 7c fd ff 89 d9 48 89 c2 4c 89 f6 48 c7 c7 f0 38 28 9c e8 c7 a3 b6 ff <0f> 0b eb bd 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
[Wed Jul 25 19:20:45 2018] ---[ end trace 2ad2942fe3431402 ]---
[Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: initiating reset due to tx timeout
[Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: Reset adapter
[Wed Jul 25 19:20:48 2018] ixgbe 0000:19:00.0 eno1: initiating reset due to tx timeout
[Wed Jul 25 19:20:53 2018] ixgbe 0000:19:00.0 eno1: initiating reset due to tx timeout
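
For anyone who wants to measure the reset window from a log like the one
above, here is a minimal Python sketch. It is only an illustration, not part
of any ixgbe tooling: the helper names (reset_events, reset_window_seconds)
are made up for this example, and it assumes human-readable timestamps as
printed by "dmesg -T" in an English locale. Those timestamps have one-second
resolution and can drift, so treat the result as approximate.

```python
import re
from datetime import datetime

# Matches lines such as:
# [Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: initiating reset due to tx
# Only the start of the message is matched, so mail-wrapped lines still work.
RESET_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+ixgbe\s+\S+\s+(?P<iface>\S+):\s+initiating reset"
)

def reset_events(lines):
    """Return a list of (timestamp, interface) for each ixgbe tx-timeout reset."""
    events = []
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            # Timestamp format assumed to be the "dmesg -T" style shown above.
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            events.append((ts, m.group("iface")))
    return events

def reset_window_seconds(lines):
    """Seconds between the first and last reset, or None if there were none."""
    times = [ts for ts, _ in reset_events(lines)]
    if not times:
        return None
    return (max(times) - min(times)).total_seconds()

if __name__ == "__main__":
    import sys
    # Usage: dmesg -T | python3 reset_window.py
    print(reset_window_seconds(sys.stdin))
```

Note that this only measures the span between the first and last "initiating
reset" messages; the bond itself may stay down longer than that.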

We have observed similar behavior on a different machine that was running a
4.15.11 kernel; the machine these logs are from runs a 4.14.52 kernel. More
detailed dmesg output can be found here:

https://gist.github.com/bjhaid/49a1c58742ef2458984339503290ef9a

I would appreciate any help in figuring out the cause of this issue and a fix
for it. I am posting here because the ixgbe driver is involved.

ixgbe driver version 5.1.0-k

NICs:
2x Intel Corporation Ethernet Controller 10G X550T
Bond: LACP

Thanks!

Abejide Ayodele
It always seems impossible until it's done. --Nelson Mandela
