Hello all,

We are experiencing very random crashes on brand-new hardware with the ixgbe driver in Ubuntu. This happens both with the driver included in Ubuntu and with the updated driver from e1000.

The problem initially happened a couple of years ago on a server that we put together ourselves (Supermicro motherboard plus Intel X540-T2 PCIe adapter). So, when the crashing started to happen with Ubuntu 14.04, I wrote it off as “a hardware incompatibility that must have been overlooked” or some hardware failure that must have happened. I do not believe this ever happened even one single time while running Ubuntu 12.04. Removing the X540-T2 adapter from the system restores complete stability.

Since then, we have purchased multiple servers with on-board X540 (complete systems designed by Supermicro) and have had zero crashes over nearly 3 years.

In November, we received the newest addition to the datacenter, which was also a server designed by Supermicro, but instead of having on-board X540, we had to add a Supermicro add-on card with the X540 chip. This server performs excellently in all situations and under any workload. But it randomly crashes, and we are unable to get any identifying information at all to diagnose the problem. Working with Supermicro to resolve the problem, we have updated the firmware for both the server and the X540 add-on card, and neither has helped. And no crashes happen at all after removing the X540 adapter.

We use these servers as VM hosts connected to a Cisco switch using bonded RJ-45 10Gb links to the iSCSI target. As a last-ditch effort to “try anything that might help,” we are running Ubuntu 17.10 now, and the problem still persists. I have not tested without bonding enabled, as that is critical to our datacenter requirements. With the exact same OS/network configuration (also now on 17.10 to rule out potential causes), the servers with on-board X540 do not exhibit any of these issues.

The readme details problems with LRO with bridging/forwarding, but we are only bonding. Also, in some other Intel readme-type docs, we see some info about MSI/MSI-X and iSCSI, but it is a little confusing to me.
Either way, the module is loading the adapter with LRO disabled, but the bonding seems to reenable it (though, as said, the other servers with on-board X540 do not crash). Not sure if that matters.

One other thing that I wouldn’t expect to have any impact is that the bonding actually takes a little time to get going. The switch logs show that up to 3.5 minutes pass between the time the adapter first establishes a physical link and when it is finally accepted as a bonded team. Our other servers establish this link in near real time as the adapters are going online during the OS boot. So, while that’s a little odd, I wouldn’t think it has any direct relation to the crashing. But maybe it is a clue as to certain behavior of the OS and network adapter?

I have attempted to catch whatever error/crash/panic messages appear on the console, but have been unable to do so, since the crashing is so unbelievably random. It can run for 5 days without any issue, and then crash during a no-load period. One time it crashed when I purposely pushed the 10Gb adapters using iperf, but this only happened once and I have been unable to reproduce it. It doesn’t seem that the kdump crash tool is even loading when whatever this event is happens. It simply appears as a hard reset.

What we do have is SNMP data that is retrieved during the “crashing” period. The server actually takes about 5 or so minutes to completely crash.

During this crashing period, we notice the following data from SNMP:

- Ultra-high context switches (~3M/s); usually this is <30K/s
- Ultra-high interrupts (>10M/s); usually this is <30K/s
- Processor data is wacky (user, nice and system all ~30%); usually about 4% user, 2% system and 0.1% nice
- Ultra-high system I/O (>50Gb/s in, >15Gb/s out); usually about 2Mb/s in, 500Kb/s out
- Ultra-high IP (IPv4/6/ICMP) packets (>25M/s in, >12M/s out); usually less than a few hundred packets per second

During this crashing period, all VMs that are running on the host begin to die off, since it seems like they become unable to access the network storage. And after about 2 or 3 minutes of this madness, the system just resets. There is nothing in the system or IPMI logs, nor is there anything in any of the Ubuntu OS logs.

It might be pure speculation and coincidence, but these problems do not happen with our on-board X540, only with PCIe adapters that have been added.

Please help; we don’t know what to do or what to try for debugging purposes (but we’ll try anything).

Thank you in advance, any help would be greatly appreciated.

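Since neither kdump nor the OS logs capture anything, one way to get a local record of the storm might be a small logger that watches the same interrupt and context-switch counters that SNMP reports. The following is only a minimal sketch, assuming Python 3 on the host and the standard `intr` and `ctxt` lines in `/proc/stat`. The 1,000,000/s limit and the log path `/var/log/irq-storm.log` are hypothetical placeholders chosen for illustration (the limit sits between the usual <30K/s and the ~3M/s seen during a crash), and the function names are made up here.

```python
import os
import time

STORM_LIMIT = 1_000_000  # events per second; assumed threshold, not a measured one


def read_counters(stat_text):
    """Return (total interrupts, total context switches) from /proc/stat text."""
    intr = ctxt = None
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "intr":
            intr = int(fields[1])  # first number is the total across all IRQs
        elif fields[0] == "ctxt":
            ctxt = int(fields[1])
    if intr is None or ctxt is None:
        raise ValueError("intr/ctxt lines not found in stat text")
    return intr, ctxt


def rates(prev, curr, seconds):
    """Convert two counter snapshots into per-second rates."""
    return tuple((c - p) / seconds for p, c in zip(prev, curr))


def is_abnormal(intr_rate, ctxt_rate, limit=STORM_LIMIT):
    """True when either rate exceeds the (assumed) storm threshold."""
    return intr_rate > limit or ctxt_rate > limit


def watch(stat_path="/proc/stat", log_path="/var/log/irq-storm.log", interval=1.0):
    """Poll the counters forever and log any sample above the threshold."""
    with open(stat_path) as f:
        prev = read_counters(f.read())
    prev_time = time.monotonic()
    with open(log_path, "a") as log:
        while True:
            time.sleep(interval)
            with open(stat_path) as f:
                curr = read_counters(f.read())
            now = time.monotonic()
            intr_rate, ctxt_rate = rates(prev, curr, now - prev_time)
            if is_abnormal(intr_rate, ctxt_rate):
                stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
                log.write(f"{stamp} intr/s={intr_rate:.0f} ctxt/s={ctxt_rate:.0f}\n")
                log.flush()
                os.fsync(log.fileno())  # push to disk before a possible hard reset
            prev, prev_time = curr, now
```

Calling `watch()` as root would start the loop. The `fsync` after each line is there because the machine ends in a hard reset, so buffered output would likely be lost. Whether the host stays responsive enough during the storm to write anything is an open question.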
~~~

In addition to the previous info, all I have to go on is our previous experience with the X540 PCIe card on the Supermicro motherboard that had problems, as well as the following threads:

https://sourceforge.net/p/scst/mailman/message/32636869/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1547680
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183390
https://forums.freenas.org/index.php?threads/intel-x520-ixgbe-hangs.14288/
https://redmine.ixsystems.com/issues/4560
https://www.reddit.com/r/vmware/comments/43on8p/supermicro_10g_nics_drop/?st=jawx4jcp&sh=982d07f0#bottom-comments
http://www.synchronet.com/blog/intel-10-gigabit-x540-at2-driverfirmware-issues/
https://forums.freenas.org/index.php?threads/kernel-panic-on-supermicro-ssg-6048r-e1cr36l.39890/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1404409

Here is info from ethtool (these settings are true for both the servers that have problems as well as the ones that have zero issues).

Each adapter:

Features for enp97s0f1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]

Bonded interface:

Features for iscsi0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]

~ Laz Peterson
Paravis, LLC

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired