[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

Heitor Alves de Siqueira Wed, 10 Mar 2021 12:01:11 -0800

** Description changed:

  [Impact]
  The i40e driver sometimes triggers a Malicious Driver Detection (MDD) event
  in the firmware, which causes the firmware to reset the NIC. The reset
  interrupts network traffic at least temporarily and can cause further
  problems, e.g. if the interface is part of a bond.
  
  [Fix]
  MDD events issued for the PF are usually the result of a misconfigured TX
  descriptor and not of "bad" actions in the VFs. We don't need to reset the
  whole NIC for these events; TX hang checks should handle them if necessary.
  
  [Test Procedure]
  The bug is unfortunately difficult to reproduce, as there's no detailed 
documentation on how the i40e firmware detects and raises MDDs. We have seen 
reports of this happening in Xenial and Bionic, for workloads stressing i40e 
bonds in LACP mode.
  A successful reproduction is easy to detect, as network traffic will be
  interrupted and the system logs will contain a message like:
  i40e 0000:02:00.1: TX driver issue detected, PF reset issued
  
  An alternative test procedure makes use of the kprobes attached to the LP 
bug. The test setup is as follows:
  - Create 2 VFs on primary NIC
  - Passthrough VF 1 to a Bionic VM
  - Start iperf3 client on VM, going through i40evf interface
  - Start another iperf3 client on host, going through i40e interface
  Both iperf3 clients should be using an external server located on a separate 
host. By loading the kprobe module while iperf3 is running, we should be able 
to raise MDDs more consistently. MDD behaviour can change according to firmware 
version, so we may need to try with different sets of probes. The one with the 
most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the 
cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified 
of new data.
  
  [Regression Potential]
- Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected.
+ Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected, and that necessary PF resets will be issued by the netdev watchdog in 
case of any hung queues.
  
  ==
  [original description]
  
  This is a continuation from bug 1713553 and then bug 1723127; a patch
  was added in the first bug and then the second bug, to attempt to fix
  this, and it may have helped reduce the issue but appears not to have
  fixed it, based on more reports.
  
  See bug 1713553 and bug 1723127 for more details.


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1772675

Title:
  i40e PF reset due to incorrect MDD event

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  Won't Fix

Bug description:
  [Impact]
  The i40e driver sometimes triggers a Malicious Driver Detection (MDD) event
  in the firmware, which causes the firmware to reset the NIC. The reset
  interrupts network traffic at least temporarily and can cause further
  problems, e.g. if the interface is part of a bond.

  [Fix]
  MDD events issued for the PF are usually the result of a misconfigured TX
  descriptor and not of "bad" actions in the VFs. We don't need to reset the
  whole NIC for these events; TX hang checks should handle them if necessary.

  [Test Procedure]
  The bug is unfortunately difficult to reproduce, as there's no detailed 
documentation on how the i40e firmware detects and raises MDDs. We have seen 
reports of this happening in Xenial and Bionic, for workloads stressing i40e 
bonds in LACP mode.
  A successful reproduction is easy to detect, as network traffic will be
  interrupted and the system logs will contain a message like:
  i40e 0000:02:00.1: TX driver issue detected, PF reset issued
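
  To check for the message above automatically, a minimal Python sketch could
  scan captured log text for it. The regular expression is built from the one
  example line in this description; generalizing the PCI address part to other
  ports is an assumption, and the function name find_pf_resets is hypothetical.

```python
import re

# Pattern derived from the example log line in the bug description.
# Assumption: other ports log the same text with a different PCI address.
PF_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(log_text):
    """Return the PCI address from every PF reset message found in log_text."""
    return [match.group("pci") for match in PF_RESET_RE.finditer(log_text)]
```

  The log text could come from saved dmesg or journal output; an empty result
  means no PF reset message was found in that text.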

  An alternative test procedure makes use of the kprobes attached to the LP 
bug. The test setup is as follows:
  - Create 2 VFs on primary NIC
  - Passthrough VF 1 to a Bionic VM
  - Start iperf3 client on VM, going through i40evf interface
  - Start another iperf3 client on host, going through i40e interface
  Both iperf3 clients should be using an external server located on a separate 
host. By loading the kprobe module while iperf3 is running, we should be able 
to raise MDDs more consistently. MDD behaviour can change according to firmware 
version, so we may need to try with different sets of probes. The one with the 
most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the 
cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified 
of new data.
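
  The host side of the setup above could be sketched in Python as follows.
  This is only an illustration under stated assumptions: the interface name
  and server address are placeholders chosen by the tester, the VFs are
  created through the standard sriov_numvfs sysfs attribute, and the helper
  names are hypothetical. Passing the VF through to the VM and loading the
  kprobe module are not covered.

```python
from pathlib import Path


def sriov_numvfs_path(iface, sysfs_root="/sys"):
    """Build the path of the SR-IOV VF count attribute for a network interface."""
    return Path(sysfs_root) / "class" / "net" / iface / "device" / "sriov_numvfs"


def create_vfs(iface, count=2, sysfs_root="/sys"):
    """Request `count` VFs on `iface` (needs root on a real system)."""
    path = sriov_numvfs_path(iface, sysfs_root)
    # The kernel does not allow changing one nonzero VF count to another,
    # so the count is reset to 0 before the new value is written.
    path.write_text("0")
    path.write_text(str(count))
    return path


def iperf3_client_cmd(server, seconds=60):
    """Return an iperf3 client command line as an argument list."""
    # -c selects client mode and the server to connect to, -t the duration.
    return ["iperf3", "-c", server, "-t", str(seconds)]
```

  The command list returned by iperf3_client_cmd can be handed to
  subprocess.run on both the host and the VM, each pointing at the same
  external server.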

  [Regression Potential]
  Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected, and that necessary PF resets will be issued by the netdev watchdog in 
case of any hung queues.

  ==
  [original description]

  This is a continuation from bug 1713553 and then bug 1723127; a patch
  was added in the first bug and then the second bug, to attempt to fix
  this, and it may have helped reduce the issue but appears not to have
  fixed it, based on more reports.

  See bug 1713553 and bug 1723127 for more details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
