Hello all,

We are experiencing seemingly random crashes on brand new hardware with the ixgbe 
driver in Ubuntu.  This happens both with the in-tree Ubuntu driver and with the 
updated driver from e1000.

The problem initially happened a couple of years ago on a server that we put 
together ourselves (Supermicro motherboard plus Intel X540-T2 PCIe adapter).  
Since the crashing only started after upgrading to Ubuntu 14.04, I wrote it off as 
“a hardware incompatibility that must have been overlooked” or some hardware 
failure that must have occurred.  I do not believe this ever happened even one 
single time while running Ubuntu 12.04.  Removing the X540-T2 adapter from the 
system restores complete stability.

Since then, we have purchased multiple servers with on-board X540 (complete 
systems designed by Supermicro) and have had zero crashes over nearly 3 years.  
In November, we received the newest addition to the datacenter, also a server 
designed by Supermicro, but instead of having on-board X540 it required a 
Supermicro add-on card with the X540 chip.  This server performs excellently in 
all situations and under any workload, but it crashes randomly, and we are unable 
to get any identifying information at all to diagnose the problem.

Working with Supermicro to resolve the problem, we have updated the firmware for 
both the server and the X540 add-on card, and neither update has helped.  And no 
crashes happen at all after removing the X540 adapter.

We use these servers as VM hosts connected to a Cisco switch using bonded RJ-45 
10Gb links to the iSCSI target.  As a last-ditch effort to “try anything that 
might help,” we are now running Ubuntu 17.10, and the problem still persists.  I 
have not tested without bonding enabled, as bonding is critical to our datacenter 
requirements.  With the exact same OS/network configuration (also now on 17.10 
to rule out potential causes), the servers with on-board X540 do not exhibit 
any of these issues.  The ixgbe readme details problems with LRO when 
bridging/forwarding, but we are only bonding.  Also, some other Intel readme-type 
docs mention MSI/MSI-X and iSCSI, but that is a little confusing to me.  Either 
way, the module loads the adapter with LRO disabled, but the bond interface 
seems to re-enable it (though, as said, the other servers with on-board X540 do 
not crash).  Not sure if that matters.
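
For what it’s worth, this is roughly how I have been checking the LRO state on 
the slaves versus the bond.  enp97s0f1 and iscsi0 are the real names from this 
box; enp97s0f0 is just my guess at the second port’s name, so adjust as needed:

    # LRO state on each physical port as reported by ethtool
    ethtool -k enp97s0f0 | grep large-receive-offload
    ethtool -k enp97s0f1 | grep large-receive-offload

    # LRO state on the bond (this is the one that shows up as "on")
    ethtool -k iscsi0 | grep large-receive-offload

    # force it off everywhere, just to rule LRO out as a factor
    # (not 100% sure the bond honors this directly, hence also the slaves)
    ethtool -K iscsi0 lro off
    ethtool -K enp97s0f0 lro off
    ethtool -K enp97s0f1 lro off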

One other thing, which I wouldn’t expect to have any impact on matters, is that 
the bonding actually takes a little time to get going.  The switch logs show 
that the adapter takes up to 3.5 minutes between the time it first establishes 
a physical link and when it is finally accepted into the bonded team.  Our other 
servers establish this in near real time, as the adapters come online during the 
OS boot.  So, while that’s a little odd, I wouldn’t think it has any direct 
relation to the crashing … but maybe it is a clue as to certain behavior of the 
OS and network adapter?
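
For reference, this is how I have been watching the bond negotiate.  This 
assumes 802.3ad/LACP mode, which is what our Cisco ports are set up for; 
iscsi0 is the bond name on this host:

    # overall bond state plus per-slave MII/aggregator details
    cat /proc/net/bonding/iscsi0

    # refresh it every second while the bond is coming up
    watch -n 1 cat /proc/net/bonding/iscsi0

    # kernel messages about slaves joining/leaving the bond
    dmesg | grep -i -e bonding -e ixgbe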

I have attempted to catch whatever error/crash/panic messages appear on the 
console, but have been unable to, since the crashing is so unbelievably random.  
The server can run for 5 days without any issue and then crash during a no-load 
period.  One time it crashed while I was purposely pushing the 10Gb adapters 
with iperf, but this happened only once and I have been unable to reproduce it.  
And kdump does not even appear to trigger when whatever this event is happens.  
It simply looks like a hard reset.
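
Unless someone has a better idea, this is roughly what I am planning to try next 
to catch the panic, since kdump apparently never fires.  The interface name and 
IP address below are placeholders for whatever management NIC and log host are 
available:

    # confirm a crashkernel reservation exists and kdump-tools thinks it is ready
    grep crashkernel /proc/cmdline
    kdump-config show

    # send console output to another machine over UDP with netconsole
    # (eno1 and 192.168.1.50 are example values only)
    modprobe netconsole netconsole=@/eno1,@192.168.1.50/

    # make sure even low-priority kernel messages reach the console
    sysctl -w kernel.printk="7 4 1 7"

    # on the log host, listen for the messages (netconsole default port is 6666)
    nc -u -l 6666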

What we do have is SNMP data retrieved during the “crashing” period.  The 
server actually takes about 5 or so minutes to completely crash.  During this 
crashing period, we notice the following data from SNMP (a rough local sampling 
sketch follows the list):
- Ultra-high context switches (~3M/s) — usually this is <30K/s
- Ultra-high interrupts (>10M/s) — usually this is <30K/s
- Processor data is wacky (user, nice and system all ~30%) — usually about 4% 
user, 2% system and 0.1% nice
- Ultra-high system I/O (>50Gb/s in, >15Gb/s out) — usually about 2Mb/s in, 
500Kb/s out
- Ultra-high IP (IPv4/6/ICMP) packets (>25M/s in, >12M/s out) — usually less 
than a few hundred packets per second
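
Since the host stays half-alive for a few minutes before it resets, the idea is 
to keep appending interrupt and softirq counters to disk so that the last few 
samples survive the reset (the log path and 5-second interval are arbitrary):

    # snapshot interrupt and softirq counters every 5 seconds, with a sync so
    # the most recent samples survive the hard reset
    while true; do
        date >> /var/log/irq-snapshots.log
        cat /proc/interrupts >> /var/log/irq-snapshots.log
        cat /proc/net/softnet_stat >> /var/log/irq-snapshots.log
        sync
        sleep 5
    done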

During this crashing period, all VMs running on the host begin to die off, since 
they apparently become unable to reach the network storage.  After about 2 or 3 
minutes of this madness, the system just resets.  There is nothing in the system 
or IPMI logs, nor is there anything in any of the Ubuntu OS logs.

It might be pure speculation and coincidence, but these problems do not happen 
with our on-board X540 adapters, only with the PCIe adapters that we have added.

Please help; we don’t know what to do or what to try for debugging purposes 
(but we’ll try anything).  Thank you in advance; any help would be greatly 
appreciated.

~~~

In addition to the previous info, all I have to go on is our earlier experience 
with the X540 PCIe card on a Supermicro motherboard that had problems, as well 
as the following threads:

https://sourceforge.net/p/scst/mailman/message/32636869/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1547680
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183390
https://forums.freenas.org/index.php?threads/intel-x520-ixgbe-hangs.14288/
https://redmine.ixsystems.com/issues/4560
https://www.reddit.com/r/vmware/comments/43on8p/supermicro_10g_nics_drop/?st=jawx4jcp&sh=982d07f0#bottom-comments
http://www.synchronet.com/blog/intel-10-gigabit-x540-at2-driverfirmware-issues/
https://forums.freenas.org/index.php?threads/kernel-panic-on-supermicro-ssg-6048r-e1cr36l.39890/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1404409

Here is the feature info from ethtool (these settings are the same on both the 
servers that have problems and the ones that have zero issues).

Each adapter:
Features for enp97s0f1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: on [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]

Bonded interface:
Features for iscsi0:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]

~ Laz Peterson
Paravis, LLC