** Changed in: linux (Ubuntu Plucky)
       Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2121149

Title:
  [UBUNTU 24.04] s390/pci: Fix stale function handles in error handling

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Plucky:
  Fix Committed

Bug description:
  [ Impact ]

  s390/pci: Fix stale function handles in error handling

  In some error scenarios, multiple error events may be generated for the
  same PCI function before Linux has even started its automatic recovery
  process. In this case, Linux may recover successfully based on the
  first event but then fail recovery when handling a subsequent event.
  This is because events capture the function handle when they are
  created. By the time the subsequent event is handled, the handle stored
  with the error event is stale, and using it to reset the function
  fails.

  Fix this by retrieving a fresh function handle using CLP List PCI
  Functions and only processing events whose stored handle matches this
  handle. This effectively ignores error events that were captured
  before the latest disable/enable cycle. Relatedly, if the current
  handle is already disabled, do not attempt to simply reset the error
  state, as a re-enable is necessary and clearing the error state would
  fail.
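
  The decision described above can be sketched in C as follows. This is
  a simplified illustration, not the code from the upstream commits: the
  names classify_event, struct error_event, enum event_action and
  FH_ENABLED are hypothetical, and the sketch assumes that the top bit
  of a function handle marks the function as enabled.

```c
#include <stdint.h>

/* Hypothetical, simplified model; not the kernel's actual structures. */

/* Assumption for this sketch: the top bit of a handle means "enabled". */
#define FH_ENABLED 0x80000000u

enum event_action {
	EVENT_IGNORE_STALE,  /* event predates the latest disable/enable cycle */
	EVENT_FULL_RECOVERY, /* function is disabled, so a re-enable is needed */
	EVENT_RESET_ERROR,   /* function is enabled, try clearing error state */
};

struct error_event {
	uint32_t fh; /* function handle captured when the event was created */
};

/*
 * Decide how to treat an error event, given a freshly retrieved handle
 * (current_fh) for the same PCI function.
 */
static enum event_action classify_event(const struct error_event *ev,
					uint32_t current_fh)
{
	/* The stored handle no longer matches: the event is stale. */
	if (ev->fh != current_fh)
		return EVENT_IGNORE_STALE;

	/* Clearing the error state would fail on a disabled function. */
	if (!(current_fh & FH_ENABLED))
		return EVENT_FULL_RECOVERY;

	return EVENT_RESET_ERROR;
}
```

  In this model, a second event that still carries the handle from
  before the first recovery no longer matches the fresh handle and is
  classified as stale instead of triggering another reset attempt.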

  [ Fix ]

  Backport the following commits from upstream:
  - 45537926dd2a s390/pci: Fix stale function handles in error handling
  - b97a7972b1f4 s390/pci: Do not try re-enabling load/store if device is
    disabled

  [ Test Plan ]

  - Boot the system on an IBM Z mainframe with at least one PCI
    passthrough device available.
  - Enable debug logging to monitor how error events are processed in
    real time.
  - Trigger PCI error conditions, either through firmware error injection
    or by repeatedly disabling and re-enabling the device under load
    using sysfs interfaces.
  - While the device is busy handling real traffic, such as network or
    crypto operations, watch the kernel logs to see how error events are
    processed.
  - Verify that events carrying stale function handles are detected and
    ignored, and that recovery attempts against disabled devices escalate
    properly to a full reset.

  [ Regression Potential ]

  The fix affects how the s390 PCI error handler validates and uses
  function handles during recovery.
  A bug here could cause valid error events to be incorrectly ignored or
  recovery paths to escalate unnecessarily.
  Users may see PCI devices not recovering from transient errors, devices
  being reset or re-enabled more often than required, or even unexpected
  device removal.

  ---

  Description:   s390/pci: Fix stale function handles in error handling

  Symptom:
  In some error scenarios, automatic recovery may ultimately fail: Linux
  initially recovers successfully, but recovery then fails when it tries
  to handle another error event.

  Problem:
  In some error scenarios, multiple error events may be generated for the
  same PCI function before Linux has even started its automatic recovery
  process. In this case, Linux may recover successfully based on the
  first event but then fail recovery when handling a subsequent event.
  This is because events capture the function handle when they are
  created. By the time the subsequent event is handled, the handle stored
  with the error event is stale, and using it to reset the function
  fails.

  Solution:
  Fix this by retrieving a fresh function handle using CLP List PCI
  Functions and only processing events whose stored handle matches this
  handle. This effectively ignores error events that were captured
  before the latest disable/enable cycle. Relatedly, if the current
  handle is already disabled, do not attempt to simply reset the error
  state, as a re-enable is necessary and clearing the error state would
  fail.

  Reproduction:
  This may be reproduced in an artificial error scenario by issuing
  multiple zpcictl --reset-fw <dev> commands in quick succession, which
  generates multiple PEC 0x3A events with the same handle.

  Required Fixes / Upstream-IDs:
  45537926dd2aaa9190ac0fac5a0fbeefcadfea95
  b97a7972b1f4f81417840b9a2ab0c19722b577d5

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2121149/+subscriptions

