On Sat, Jul 28, 2018 at 5:44 PM, Àbéjídé Àyodélé
<abejideayod...@gmail.com> wrote:
> Hi friends,
>
> On one of our machines at work, we observed a sequence of events starting
> from an OOM in a secondary cgroup which ends up in the bond interface being
> down for a period of up to 12 seconds. Below is some piece of dmesg about when
> the bond interface went down:
>
> [Wed Jul 25 19:20:45 2018] Call Trace:
> [Wed Jul 25 19:20:45 2018]  <IRQ>
> [Wed Jul 25 19:20:45 2018]  ? dev_deactivate_queue.constprop.29+0x60/0x60
> [Wed Jul 25 19:20:45 2018]  call_timer_fn+0x30/0x120
> [Wed Jul 25 19:20:45 2018]  run_timer_softirq+0x3c8/0x420
> [Wed Jul 25 19:20:45 2018]  ? timerqueue_add+0x52/0x80
> [Wed Jul 25 19:20:45 2018]  ? enqueue_hrtimer+0x37/0x80
> [Wed Jul 25 19:20:45 2018]  ? recalibrate_cpu_khz+0x10/0x10
> [Wed Jul 25 19:20:45 2018]  __do_softirq+0xde/0x2b3
> [Wed Jul 25 19:20:45 2018]  irq_exit+0xae/0xb0
> [Wed Jul 25 19:20:45 2018]  smp_apic_timer_interrupt+0x70/0x120
> [Wed Jul 25 19:20:45 2018]  apic_timer_interrupt+0x7d/0x90
> [Wed Jul 25 19:20:45 2018]  </IRQ>
> [Wed Jul 25 19:20:45 2018] RIP: 0010:cpuidle_enter_state+0xa2/0x2e0
> [Wed Jul 25 19:20:45 2018] RSP: 0018:ffffffff9c403eb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
> [Wed Jul 25 19:20:45 2018] RAX: ffff9075c0821a40 RBX: 0006cdb159b6000e RCX: 000000000000001f
> [Wed Jul 25 19:20:45 2018] RDX: 0006cdb159b6000e RSI: ffed6f2696159a35 RDI: 0000000000000000
> [Wed Jul 25 19:20:45 2018] RBP: ffffd3edc100b900 R08: 0000000000000f48 R09: 0000000000000cfe
> [Wed Jul 25 19:20:45 2018] R10: ffffffff9c403e90 R11: 0000000000000f12 R12: 0000000000000003
> [Wed Jul 25 19:20:45 2018] R13: ffffffff9c4b03d8 R14: 0000000000000000 R15: 0006cdb159783c8e
> [Wed Jul 25 19:20:45 2018]  do_idle+0x181/0x1e0
> [Wed Jul 25 19:20:45 2018]  cpu_startup_entry+0x19/0x20
> [Wed Jul 25 19:20:45 2018]  start_kernel+0x400/0x408
> [Wed Jul 25 19:20:45 2018]  secondary_startup_64+0xa5/0xb0
> [Wed Jul 25 19:20:45 2018] Code: 63 8e 60 04 00 00 eb 8f 4c 89 f7 c6 05 79 c7 b8 00 01 e8 00 7c fd ff 89 d9 48 89 c2 4c 89 f6 48 c7 c7 f0 38 28 9c e8 c7 a3 b6 ff <0f> 0b eb bd 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
> [Wed Jul 25 19:20:45 2018] ---[ end trace 2ad2942fe3431402 ]---
> [Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: initiating reset due to tx timeout
> [Wed Jul 25 19:20:45 2018] ixgbe 0000:19:00.1 eno2: Reset adapter
> [Wed Jul 25 19:20:48 2018] ixgbe 0000:19:00.0 eno1: initiating reset due to tx timeout
> [Wed Jul 25 19:20:53 2018] ixgbe 0000:19:00.0 eno1: initiating reset due to tx timeout
>
> We have observed similar behavior on a 4.15.11 kernel running on a
> different machine; the machine these logs are from runs a 4.14.52
> kernel. A more detailed dmesg log can be found here:
>
> https://gist.github.com/bjhaid/49a1c58742ef2458984339503290ef9a
>
> I would appreciate any help in figuring out the cause of and fix for this
> issue; I am posting here because the ixgbe drivers were involved...
>
> ixgbe driver version 5.1.0-k
>
> NICs:
> 2 Intel Corporation Ethernet Controller 10G X550T
> Bond: LACP
>
> Thanks!
>
> Abejide Ayodele
> It always seems impossible until it's done. --Nelson Mandela

Hi Abejide,

There are a few possibilities in terms of what could be going on.

First, based on the dmesg log you provided, it seems the Tx clean-up
path is unable to run. I am assuming this because, from what I can
tell, the resets are triggered by the Tx watchdog in the network
stack rather than reported via the "Tx Hang" detection logic in the
driver.
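
If it helps to confirm that, here is a rough sketch that prints how many
times each Tx queue has hit the stack-level watchdog. It assumes a
reasonably recent kernel that exposes the per-queue tx_timeout counter in
sysfs, and it uses the eno2 name taken from your log:

#!/usr/bin/env python3
# Rough sketch: count stack-level Tx watchdog timeouts per queue.
# Assumes /sys/class/net/<iface>/queues/tx-*/tx_timeout exists (recent
# kernels) and that eno2 (from the dmesg log above) is the interface.
import glob
import os

IFACE = "eno2"  # adjust to the bond slave you are inspecting

for path in sorted(glob.glob(f"/sys/class/net/{IFACE}/queues/tx-*/tx_timeout")):
    queue = os.path.basename(os.path.dirname(path))
    with open(path) as f:
        count = f.read().strip()
    print(f"{IFACE} {queue}: {count} watchdog timeout(s)")

If those counters keep climbing while the driver's own "Tx Hang" message
never shows up in dmesg, that would line up with the watchdog theory.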

Second, the dmesg log you included seems to indicate that something
was causing your CPUs to spin. In the log you will find references to
"rcu_sched detected stalls on CPUs/tasks" for CPU 26, and it looks
like that CPU was stuck dumping logs over the serial connection for
roughly 20+ seconds: the log looks very similar to the one you
started out with, but now has a loop waiting on serial output
appended to the top.
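
To put a rough number on that, something like the sketch below could be
run over dmesg -T output (the same human-readable timestamp format as the
log you quoted); the 5 second threshold is just an arbitrary starting
point for flagging long pauses between consecutive log lines:

#!/usr/bin/env python3
# Rough sketch: flag long gaps between kernel log timestamps.
# Usage: dmesg -T | python3 loggaps.py  (loggaps.py is just a made-up name)
import re
import sys
from datetime import datetime

TS = re.compile(r"^\[(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})\]")
GAP_SECONDS = 5

prev = None
for line in sys.stdin:
    m = TS.match(line)
    if not m:
        continue  # wrapped continuation lines have no timestamp
    ts = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
    if prev is not None and (ts - prev).total_seconds() >= GAP_SECONDS:
        print(f"{(ts - prev).total_seconds():.0f}s gap before: {line.rstrip()}")
    prev = ts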

Those two observations combined make me wonder if one or more NAPI
threads handling the Tx clean-up on CPU 26 got stalled as well,
resulting in the Tx hang message being generated by CPU 0.
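
One way to check that mapping is a quick look at /proc/interrupts; the
ixgbe queue vectors normally show up with names like eno2-TxRx-3, so a
small sketch like this (again assuming the eno2 name from your log) shows
which CPUs have been servicing them:

#!/usr/bin/env python3
# Rough sketch: show which CPUs service the interrupts for a given NIC
# by parsing /proc/interrupts for vectors whose name contains the
# interface name (e.g. eno2-TxRx-3).
IFACE = "eno2"  # interface name taken from the dmesg log

with open("/proc/interrupts") as f:
    cpus = f.readline().split()            # header row: CPU0 CPU1 ...
    for line in f:
        if IFACE not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        counts = [int(x) for x in fields[1:1 + len(cpus)]]
        name = fields[-1]
        busy = [f"CPU{i}={c}" for i, c in enumerate(counts) if c]
        print(f"IRQ {irq} ({name}): {', '.join(busy) or 'no interrupts yet'}")

If the Tx queues for eno1/eno2 turn out to be serviced mostly on CPU 26,
that would make the stall theory a lot more plausible.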

This is mostly theory at this point, since I am only guessing based
on the logs provided. Have you been seeing this as a reproducible
issue, or is this something that has only occurred once? One thing
you may want to do is take a look at the serial output configuration.
If you are running at just 9600 baud, you may want to see if you can
configure a higher speed so that the logs can be flushed out quickly
enough to avoid long delays in serial output.
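
For reference, here is a rough sketch of how you could check what the
console is currently running at: it reads the console= parameter from
/proc/cmdline and queries the device with termios (the ttyS naming below
is only an assumption about your setup):

#!/usr/bin/env python3
# Rough sketch: report the configured serial console and its current
# output speed. Assumes a console=ttyS...[,speed] entry on the kernel
# command line; needs permission to open the tty device.
import os
import re
import termios

with open("/proc/cmdline") as f:
    cmdline = f.read()

consoles = re.findall(r"console=(\S+)", cmdline)
print("console= parameters:", consoles or "none found")

# Map termios baud constants (B9600, B115200, ...) back to plain numbers.
baud_names = {v: k[1:] for k, v in vars(termios).items()
              if k.startswith("B") and k[1:].isdigit()}

for con in consoles:
    if not con.startswith("ttyS"):
        continue  # only query serial consoles
    dev = "/dev/" + con.split(",")[0]
    fd = os.open(dev, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        ospeed = termios.tcgetattr(fd)[5]  # index 5 is the output speed
    finally:
        os.close(fd)
    print(f"{dev}: current output speed {baud_names.get(ospeed, ospeed)} baud")

If that comes back as 9600, bumping the console= entry on the kernel
command line to something like ttyS0,115200n8 should noticeably shorten
how long a dump like this ties up the CPU.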

Thanks.

- Alex
