On 6 January 2017 at 11:47, Uri Foox <[email protected]> wrote:
> Hey Joe,
>
> I do agree that the patches for the Linux kernel were not 1:1 with what
> our stack trace showed, but they were the only thing we found that even
> remotely explained our issue. Granted, after upgrading the kernel it was
> clear that it fixed nothing - so, back to the drawing board...
>
> Given your initial comment that something above the stack was most
> likely causing the issue, we went through our network switches and
> disconnected one of the network interfaces on each of the computing
> nodes that communicate with our Juniper switch, which routes internet
> traffic. Looking at the Juniper switch we see a lot of errors about
> interfaces flapping on/off. Their timing does not correlate exactly with
> the timing of the crashes (they are plus/minus a few minutes
> before/after a crash), but these errors appear to begin at the same
> day/time as our first kernel panic and have continued since. As soon as
> we disconnected the network interface, the Juniper stopped logging any
> error messages, and we have not experienced a kernel panic in nearly six
> hours, whereas before it was happening as frequently as every two hours.
> We won't declare victory yet, but it's the first time in a couple of
> weeks that we've had stability.
>
For completeness, I want to say - there's no good reason that Linux should
crash if it receives a bad packet. This condition may be triggered by
something external, but it's a bug in the kernel. I think there's supposed
to be a check after IPGRE decap to ensure the packet is big enough, and
that check doesn't exist. Somewhere between gre_cisco_rcv() and
ovs_flow_extract() (or, in newer kernels, key_extract()), this check is
supposed to happen and it's missing. (There's a rough sketch of the kind
of check I mean at the end of this mail.)

If you can alleviate your issue, that's great for you; if we can fix this
problem for other users of GRE (and potentially other tunnel types),
that's great for everybody. So I think it's worth digging a bit further if
we can. Having a minimal reproduction environment is always nice so we can
verify that any proposed fix does address the issue. I also wonder whether
this even affects the latest kernels, although the code has been
refactored considerably from v4.3 onwards.

> Here is a sample of the error messages in the Juniper log, if it tells
> you anything:
>
> Jan 2 00:34:23 pod2-core dfwd[1114]: CH_NET_SERV_KNOB_STATE read failed
> (rtslib err 2 - No such file or directory). Setting chassis state to
> NORMAL (All FPC) and retry in the idle phase (59 retries)
> Jan 2 00:34:48 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 567,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/30
> Jan 2 00:35:06 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
> Jan 2 00:36:06 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
> Jan 2 00:39:06 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
> Jan 2 00:44:33 pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 567,
> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/30
>
> Where the interface 0/0/30 can be replaced with any interface that we
> have plugged into our computing nodes - they all showed errors.
>
> I suspect that your analysis is fairly accurate: essentially, this
> switch suffered some sort of failure that has manifested itself in an
> extremely odd way, sending some rogue packets that either the kernel or
> the version of OVS we are running cannot recover from.
>
> root@node-2:~# ovs-vswitchd -V
> ovs-vswitchd (Open vSwitch) 2.0.2
> Compiled Nov 28 2014 21:37:19
> OpenFlow versions 0x1:0x1
>
> I figured I would follow up with what we did to "solve" the issue. We're
> not really sure whether we should reboot or RMA the switch. For now, if
> the above gives Pravin or you any more insight, please do share.
>
> As a side note, I have to say I am extremely thankful for the replies to
> this thread. I figured posting something would have a low chance of
> getting any attention, but your confirmation of what I was able to piece
> together gave us the confidence to move in a direction that hopefully
> brings back stability.

It always helps when you bring very specific kernel traces,
impact/behaviour, and well-written descriptions. Thanks for reporting the
issue!
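
To make that concrete, here is roughly the shape of the check I have in
mind. This is only a sketch, not a patch against any particular tree, and
sane_after_decap() is a made-up name; the real check would have to live on
the GRE receive path before the OVS flow key extraction touches the inner
headers:

#include <linux/types.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/*
 * Illustration only: reject decapsulated packets that are too short to
 * hold an inner Ethernet header before anything starts parsing them.
 * pskb_may_pull() also guarantees those bytes sit in the linear area of
 * the skb, so the parser cannot read past the end of the buffer.
 */
static bool sane_after_decap(struct sk_buff *skb)
{
        /* Too short to carry even an inner Ethernet header? */
        if (skb->len < ETH_HLEN)
                return false;

        /* Make sure the header bytes are linear before they are parsed. */
        if (!pskb_may_pull(skb, ETH_HLEN))
                return false;

        return true;
}

/*
 * The call site would sit between the decap and the flow key extraction,
 * roughly:
 *
 *      if (!sane_after_decap(skb)) {
 *              kfree_skb(skb);
 *              return;
 *      }
 */

Whether ETH_HLEN is the right lower bound depends on what the key
extraction dereferences first, so treat that constant as a placeholder
rather than the actual requirement.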
