> I'd say we go with option #2. Please provide information on how to proceed, 
> and how to
> undo any changes we test :)

ok, so first, these instructions may cause the card to hang; the system
may need to be rebooted or the driver reloaded.  The changes here can be
undone by resetting the card; rebooting or reloading the driver.

Also please note these instructions are ONLY FOR i40e NICs!

The process here is to clear all the nic's hardware asserts, and then
enable each of them one-by-one and try to reproduce the MDD event.  That
way, when it reproduces, we know exactly which hw assert triggered it.

First, find your nic's pci address, e.g. ethtool -i NIC | grep bus-info

Then (as root) cd to "/sys/kernel/debug/i40e/BUSID" (replace BUSID with
your nic's actual pci addr).  You should see a "command" file there.

Now zero out the registers:

$ echo write 0xe648c 0 > command
$ echo write 0x442f4 0 > command

Then, set a single bit; starting with 0x1 on the first register:

$ echo write 0xe648c 0x1 > command

Do normal testing.  There are 3 possibilities at this step:

a) you test long enough to be sure the problem was avoided
b) your system and/or nic hangs due to an "uncaught" MDD event
c) you reproduce the problem, and see the TX error and PF reset

For either (a) or (b), that means this bit isn't the one we're looking
for, so move to the next bit:

$ echo write 0xe648c 0 > command
$ echo write 0x442f4 0 > command
$ echo write 0xe648c 0x2 > command

Then retest.  Replace "0x2" with incrementing bits, as you test each
bit.  Note this is setting individual bits, so the sequence to test is
(in hex) 1, 2, 4, 8, 10, 20, 40, 80, 100, etc.  This is a 32 bit
register so the highest bit to test is 0x80000000.  If you test all bits
in register 0xe648c without reproducing the problem, then move on to
register 0x442f4 testing bit-by-bit again starting at 0x1 again.  You
should be able to reproduce the problem with one of the bits set in one
of these two registers, according to what I've been told by Intel.

As you set each bit, you should get output in your dmesg and/or syslog
or kern.log, indicating the current value of the registers, e.g.:

write: 0xe648c = 0x1

You can also manually read the registers at any time with:

$ echo read 0xe648c > command
$ echo read 0x442f4 > command

you should see the results in dmesg/logs, e.g.:

read: 0xe648c = 0x1


Once/if you do reproduce the problem, make note of the values for both 
registers (i.e. what bit was set), and report that back here.  I'll check with 
Intel to find what the specific bit indicates the problem was.

Thanks!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1723127

Title:
  Intel i40e PF reset due to incorrect MDD detection (continues...)

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  In Progress

Bug description:
  This is a continuation from bug 1713553; a patch was added in that bug
  to attempt to fix this, and it may have helped reduce the issue but
  appears not to have fixed it, based on more reports.

  The issue is the i40e driver, when TSO is enabled, sometimes sees the
  NIC firmware issue a "MDD event" where MDD is "Malicious Driver
  Detection".  This is vaguely defined in the i40e spec, but with no way
  to tell what the NIC actually saw that it didn't like.  So, the driver
  can do nothing but print an error message and reset the PF (or VF).
  Unfortunately, this resets the interface, which causes an interruption
  in network traffic flow while the PF is resetting.

  See bug 1713553 for more details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723127/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to