> I'd say we go with option #2. Please provide information on how to proceed,
> and how to undo any changes we test :)

ok, so first, these instructions may cause the card to hang; the system may need to be rebooted or the driver reloaded. The changes here can be undone by resetting the card, i.e. by rebooting or reloading the driver. Also please note these instructions are ONLY FOR i40e NICs!

The process here is to clear all the nic's hardware asserts, and then enable each of them one-by-one and try to reproduce the MDD event. That way, when it reproduces, we know exactly which hw assert triggered it.

First, find your nic's pci address, e.g.

  ethtool -i NIC | grep bus-info

Then (as root) cd to "/sys/kernel/debug/i40e/BUSID" (replace BUSID with your nic's actual pci addr). You should see a "command" file there.

Now zero out the registers:

  $ echo write 0xe648c 0 > command
  $ echo write 0x442f4 0 > command

Then, set a single bit, starting with 0x1 on the first register:

  $ echo write 0xe648c 0x1 > command

Do normal testing. There are 3 possibilities at this step:

  a) you test long enough to be sure the problem was avoided
  b) your system and/or nic hangs due to an "uncaught" MDD event
  c) you reproduce the problem, and see the TX error and PF reset

For either (a) or (b), that means this bit isn't the one we're looking for, so move to the next bit:

  $ echo write 0xe648c 0 > command
  $ echo write 0x442f4 0 > command
  $ echo write 0xe648c 0x2 > command

Then retest. Replace "0x2" with incrementing bits, as you test each bit. Note this is setting individual bits, so the sequence to test is (in hex) 1, 2, 4, 8, 10, 20, 40, 80, 100, etc. This is a 32 bit register so the highest bit to test is 0x80000000.

If you test all bits in register 0xe648c without reproducing the problem, then move on to register 0x442f4, testing bit-by-bit again starting at 0x1. You should be able to reproduce the problem with one of the bits set in one of these two registers, according to what I've been told by Intel.

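If typing the clear-then-set sequence by hand for every bit gets tedious, something like the following might help. This is only a rough sketch, not something from Intel or the driver: the function names bit_values and arm_bit are made up for illustration, and it assumes a POSIX shell with 64-bit arithmetic and that you pass in the path to the debugfs "command" file described above.

```shell
#!/bin/sh
# Registers named in the instructions above.
REG_A=0xe648c
REG_B=0x442f4

# Print the 32 single-bit test values in hex: 0x1, 0x2, 0x4, ... 0x80000000.
bit_values() {
    i=0
    while [ "$i" -lt 32 ]; do
        printf '0x%x\n' "$((1 << i))"
        i=$((i + 1))
    done
}

# Clear both registers, then set one value in one register.
# $1 = path to the debugfs "command" file (hypothetical argument layout)
# $2 = register to set, $3 = value to write
arm_bit() {
    echo "write $REG_A 0" > "$1"
    echo "write $REG_B 0" > "$1"
    echo "write $2 $3" > "$1"
}
```

For example, after sourcing the above as root, `arm_bit /sys/kernel/debug/i40e/BUSID/command "$REG_A" 0x4` would run the three writes for the third bit of the first register, and `bit_values` lists the values to work through. You would still test each bit by hand between calls; the sketch deliberately does not loop over the bits on its own, since each one needs a real test run.
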
As you set each bit, you should get output in your dmesg and/or syslog or kern.log, indicating the current value of the registers, e.g.:

  write: 0xe648c = 0x1

You can also manually read the registers at any time with:

  $ echo read 0xe648c > command
  $ echo read 0x442f4 > command

You should see the results in dmesg/logs, e.g.:

  read: 0xe648c = 0x1

Once/if you do reproduce the problem, make note of the values for both registers (i.e. what bit was set), and report that back here. I'll check with Intel to find what the specific bit indicates the problem was.

Thanks!

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1723127

Title:
  Intel i40e PF reset due to incorrect MDD detection (continues...)

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  In Progress

Bug description:
  This is a continuation from bug 1713553; a patch was added in that bug to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports.

  The issue is the i40e driver, when TSO is enabled, sometimes sees the NIC firmware issue a "MDD event" where MDD is "Malicious Driver Detection". This is vaguely defined in the i40e spec, but with no way to tell what the NIC actually saw that it didn't like. So, the driver can do nothing but print an error message and reset the PF (or VF). Unfortunately, this resets the interface, which causes an interruption in network traffic flow while the PF is resetting.

  See bug 1713553 for more details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723127/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp