On 6 January 2017 at 11:47, Uri Foox <[email protected]> wrote:
> Hey Joe,
>
> I do agree that the patches for the Linux kernel were not 1:1 with what
> our stack trace showed, but they were the only thing we found that even
> remotely explained our issue. Granted, after upgrading the kernel it was
> clear that it fixed nothing - so, back to the drawing board...
>
> Given your initial comment that something above the stack was most
> likely causing the issue, we went through our network switches and
> disconnected one of the network interfaces on each of the computing
> nodes that communicate with our Juniper switch, which routes internet
> traffic. Looking at the Juniper switch we see a lot of errors about
> interfaces flapping on/off. Their timing does not correlate exactly with
> the timing of the crashes (they are plus/minus a few minutes
> before/after a crash), but these errors appear to begin at the same
> day/time as our first kernel panic and have continued since. As soon as
> we disconnected the network interface, the Juniper stopped logging any
> error messages, and we have not experienced a kernel panic in nearly six
> hours, whereas before it was happening as frequently as every two hours.
> We won't declare victory yet, but it's the first time in a couple of
> weeks that we've had stability.
>
For completeness, I want to say - there's no good reason that Linux should
crash if it receives a bad packet. This condition may be triggered by
something external, but it's a bug in the kernel. I think there's supposed
to be a check after IPGRE decap to ensure the packet is big enough, and
that check doesn't exist. Somewhere between gre_cisco_rcv() and
ovs_flow_extract() (or, in newer kernels, key_extract()), this check is
supposed to happen and it's missing. (There's a rough sketch of the kind
of check I mean at the end of this mail.)

If you can alleviate your issue, that's great for you; if we can fix this
problem for other users of GRE (and potentially other tunnel types),
that's great for everybody. So I think it's worth digging a bit further if
we can. Having a minimal reproduction environment is always nice so we can
verify that any proposed fix does address the issue. I also wonder whether
this even affects the latest kernels, although the code has been
refactored considerably from v4.3 onwards.

> Here is a sample of the error messages in the Juniper log, if it tells
> you anything:
>
> Jan 2 00:34:23 pod2-core dfwd[1114]: CH_NET_SERV_KNOB_STATE read failed
> (rtslib err 2 - No such file or directory). Setting chassis state to
> NORMAL (All FPC) and retry in the idle phase (59 retries)
> Jan 2 00:34:48 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 567,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/30
> Jan 2 00:35:06 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
> Jan 2 00:36:06 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
> Jan 2 00:39:06 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
> Jan 2 00:44:33 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 567,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/30
>
> Where the interface 0/0/30 can be replaced with any interface that we
> have plugged into our computing nodes - they all showed errors.
>
> I suspect that your analysis is fairly accurate: essentially, this
> switch suffered some sort of failure that has manifested itself in an
> extremely odd way, sending some rogue packets that either the kernel or
> the version of OVS we are running cannot recover from.
>
> root@node-2:~# ovs-vswitchd -V
> ovs-vswitchd (Open vSwitch) 2.0.2
> Compiled Nov 28 2014 21:37:19
> OpenFlow versions 0x1:0x1
>
> I figured I would follow up with what we did to "solve" the issue. We're
> not really sure whether we should reboot or RMA the switch. For now, if
> the above gives Pravin or you any more insight, please do share.
>
> As a side note, I have to say I am extremely thankful for the replies to
> this thread. I figured posting something would have a low chance of
> getting any attention, but your confirmation of what I was able to piece
> together gave us the confidence to move in a direction that hopefully
> brings back stability.

It always helps when you bring very specific kernel traces,
impact/behaviour, and well-written descriptions. Thanks for reporting the
issue!
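
To make that concrete, here is roughly the shape of the check I have in
mind. This is only a sketch, not a patch against any particular tree, and
sane_after_decap() is a made-up name; the real check would have to live on
the GRE receive path before the OVS flow key extraction touches the inner
headers:

#include <linux/types.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/*
 * Illustration only: reject decapsulated packets that are too short to
 * hold an inner Ethernet header before anything starts parsing them.
 * pskb_may_pull() also guarantees those bytes sit in the linear area of
 * the skb, so the parser cannot read past the end of the buffer.
 */
static bool sane_after_decap(struct sk_buff *skb)
{
        /* Too short to carry even an inner Ethernet header? */
        if (skb->len < ETH_HLEN)
                return false;

        /* Make sure the header bytes are linear before they are parsed. */
        if (!pskb_may_pull(skb, ETH_HLEN))
                return false;

        return true;
}

/*
 * The call site would sit between the decap and the flow key extraction,
 * roughly:
 *
 *      if (!sane_after_decap(skb)) {
 *              kfree_skb(skb);
 *              return;
 *      }
 */

Whether ETH_HLEN is the right lower bound depends on what the key
extraction dereferences first, so treat that constant as a placeholder
rather than the actual requirement.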
