Because this bug is very difficult to reproduce, real verification
could take weeks rather than days. However, one reporter has been
running a test kernel I built here, which is the base 4.4.0-112 kernel
plus the two patches from this bug. In their testing, which has run
for 6 weeks now, the problem has not reproduced
and they have seen no other issues. Of course, that test kernel doesn't
have all the other patches that the -proposed kernel has, but that
testing is likely the best verification we can get for this particular
bug. I have also asked the same reporter to switch their testing from
my test kernel over to the -proposed kernel, and to report any
unexpected issues they see. If they do report any regression, I'll
communicate that here.
Based on that justification, I'll mark this bug as verified.
** Tags removed: verification-needed-artful verification-needed-xenial
** Tags added: verification-done-artful verification-done-xenial
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
Intel i40e PF reset due to incorrect MDD detection (continues...)
Status in linux package in Ubuntu:
Status in linux source package in Trusty:
Status in linux source package in Xenial:
Status in linux source package in Artful:
Status in linux source package in Bionic:
The i40e driver sometimes triggers a "malicious driver" event that the
firmware detects, which causes the firmware to reset the NIC and
interrupts the network connection. The reset causes at least a
temporary interruption in network traffic, and can cause further
problems, e.g. if the interface is in a bond.
The upstream patch to fix this adjusts how the driver fragments TX
data; the "malicious driver" event detected by the firmware results
from incorrectly crafted TX fragment descriptors (the firmware has
specific, complicated restrictions on these). The patch is from Intel,
and they suggested this specific patch to address the problem.
Additionally, I provided a test kernel with the patch to someone who
reported this to me, and they have been able to run ~6 weeks so far
without reproducing the issue; previously they could reproduce it in
as little as a day, but usually within 2-3 weeks.
The bug is unfortunately very difficult to reproduce, but as shown in
this (and previous) bug comments, some users of the i40e have traffic
that can consistently reproduce the problem (although it usually takes
days, or longer, to reproduce). A reproduction is easily detected, as
the network traffic will be interrupted and the system logs will
contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
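For anyone watching for a reproduction, a minimal sketch of scanning log text for that message follows. The function name and the regular expression are illustrative assumptions modeled only on the single log line quoted above; other kernel versions may word the message differently.

```python
import re

# Assumption: the pattern mirrors the one log line quoted in this bug.
# The PCI address (domain:bus:device.function) differs per port.
MDD_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_mdd_resets(log_text):
    """Return the PCI address from each matching PF reset message, in order."""
    return [m.group("pci") for m in MDD_RESET_RE.finditer(log_text)]


# Example with a line in the style of kernel log output.
sample = "[1234.5] i40e 0000:02:00.1: TX driver issue detected, PF reset issued\n"
print(find_mdd_resets(sample))
```

The text passed in could come from dmesg output or a saved syslog file; an empty list means no matching reset message was found in that text.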
The patch for this alters how TX is fragmented by the driver, so a
possible regression would likely cause problems in TX traffic and/or
additional "malicious driver detection" events.
This is a continuation from bug 1713553; a patch was added in that bug
to attempt to fix this, and it may have helped reduce the issue but
appears not to have fixed it, based on more reports.
The issue is that the i40e driver, when TSO is enabled, sometimes sees
the NIC firmware issue an "MDD event", where MDD is "Malicious Driver
Detection". This is vaguely defined in the i40e spec, with no way to
tell what the NIC actually saw that it didn't like. So the driver
can do nothing but print an error message and reset the PF (or VF).
Unfortunately, this resets the interface, which causes an interruption
in network traffic flow while the PF is resetting.
See bug 1713553 for more details.
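Since the problem is only seen with TSO enabled, it can help to confirm whether an interface is in that configuration. A minimal sketch, assuming the text comes from `ethtool -k <interface>`, which prints one "feature: state" line per offload; the function name is hypothetical and the sample text is abbreviated for illustration.

```python
def tso_enabled(ethtool_output):
    """Return True or False for the tcp-segmentation-offload line, None if absent."""
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name == "tcp-segmentation-offload":
            # The state may be followed by a suffix such as "[fixed]".
            words = value.split()
            return bool(words) and words[0] == "on"
    return None


# Abbreviated, illustrative sample of `ethtool -k` output.
sample = (
    "Features for eth0:\n"
    "rx-checksumming: on\n"
    "tcp-segmentation-offload: on\n"
    "\ttx-tcp-segmentation: on\n"
)
print(tso_enabled(sample))
```

This only reports the current setting; it does not change it and is not a fix for the MDD events.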