This bug is awaiting verification that the linux-nvidia-
tegra/5.15.0-1023.23 kernel in -proposed solves the problem. Please test
the kernel and update this bug with the results. If the problem is
solved, change the tag 'verification-needed-jammy-linux-nvidia-tegra' to
'verification-done-jammy-linux-nvidia-tegra'. If the problem still
exists, change the tag 'verification-needed-jammy-linux-nvidia-tegra' to
'verification-failed-jammy-linux-nvidia-tegra'.


If verification is not done within 5 working days from today, this fix
will be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-nvidia-tegra-v2 
verification-needed-jammy-linux-nvidia-tegra

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
      [Impact]
      * The issue causes a transmit hang on E810 ports with bonding enabled.
      * Based on the provided logs, the TX hang can last up to a couple of
        minutes, but in most scenarios the network recovers after the ice
        driver performs a PF reset (TX hang handler routine).
      * The issue was originally observed during Tempest tests on a newly
        created OpenStack cluster, resulting in a lack of certification.
      
      [Fix]
      * Initially, Intel engineers proposed a workaround that disables LAG
        initialization [1].
        This change was tested in an environment where the issue is easily
        reproduced.
        After multiple iterations, no reproduction was observed.
      * Shortly after, Intel proposed a patch [2] that disables LAG
        initialization if the NVM does not expose the proper capabilities.
      
      [Test Plan]
      * To reproduce the issue, a cluster of more than 20 nodes with
        Ceph-based storage was used. The problem could sometimes manifest
        while deploying the cluster, or during the Tempest test run after
        the cluster was already deployed.
      * The issue could appear on a random node, making reproduction hard to
        achieve.
      * Multiple stress tests on a single host with a similar configuration
        did not trigger a reproduction.
      
      [Where problems could occur]
      * All ice drivers with ice_lag_event_handler registered can expose the
        issue. This handler is not implemented in 20.04.
      * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
        capabilities (CVL4.3 wasn't checked), meaning that at some point an
        NVM with this capability will be released.
        Although the issue is potentially caused by using features without
        proper FW support [2], we want to take a closer look once NVMs with
        proper support are introduced.

      [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
      [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
            4d50fcdc2476eef94c14c6761073af5667bb43b6

      [Other Info]
      * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with
        the ice driver backported from a mainline kernel predating patch [2].
      * Original description of the case below:
      
      

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by MAAS. No issues during installation, but at
  that time the bond is not formed yet; later, when Linux is booted, the
  bond is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts
  start losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
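
  To check whether a node has hit this symptom, the kernel log can be
  scanned for the watchdog message quoted above. The following is only a
  minimal sketch: the pattern is modeled on the single message shown in
  this report, the exact wording may differ between kernel versions, and
  the function name find_tx_timeouts is made up for illustration.

```python
import re
import sys

# Pattern modeled on the message quoted in this report, e.g.
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
# Assumption: other kernels may word this slightly differently.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_lines, driver="ice"):
    """Return (interface, queue) pairs for TX timeouts logged by `driver`."""
    hits = []
    for line in log_lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


if __name__ == "__main__":
    # Reads kernel log text from standard input and prints one hit per line.
    for iface, queue in find_tx_timeouts(sys.stdin):
        print(f"{iface}: transmit queue {queue} timed out")
```

  The script reads from standard input, so kernel log output (for
  example from dmesg or journalctl -k) can be piped into it. It only
  reports the symptom; it does not tell whether the LAG initialization
  described in [Fix] is the cause.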
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw---- 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware                             20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw---- 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
  ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
  RebootRequiredPkgs: Error: path contained symlinks.
  RelatedPackageVersions:
   linux-restricted-modules-6.2.0-32-generic N/A
   linux-backports-modules-6.2.0-32-generic  N/A
   linux-firmware                            20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 6.2.0-32-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/26/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 03WYW4
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A02
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/26/2023:br2.12:svnDellInc.:pnPowerEdgeR7525:pvr:rvnDellInc.:rn03WYW4:rvrA02:cvnDellInc.:ct23:cvr:skuSKU=08FF;ModelName=PowerEdgeR7525:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7525
  dmi.product.sku: SKU=08FF;ModelName=PowerEdge R7525
  dmi.sys.vendor: Dell Inc.
  mtime.conffile..etc.logrotate.d.apport: 2023-09-15T13:17:01.203771

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/+subscriptions


