[Kernel-packages] [Bug 1887774] Re: [UBUNTU 20.04] zfcp: Fix panic on ERP timeout for previously dismissed ERP

Frank Heimes Thu, 16 Jul 2020 08:01:17 -0700

Kernel SRU request submitted:
https://lists.ubuntu.com/archives/kernel-team/2020-July/thread.html#112154
Updating status to 'In Progress'.


** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: ubuntu-z-systems
       Status: Triaged => In Progress

** Description changed:

+ SRU Justification:
+ ==================
+ 
+ [Impact]
+ 
+ * Linux kernel panics due to kernel page fault in IRQ context when
+ running zfcp_erp_timeout_handler() calling zfcp_erp_notify().
+ 
+ [Fix]
+ 
+ * 936e6b85da0476dd2edac7c51c68072da9fb4ba2 936e6b85da04 "scsi: zfcp: Fix
+ panic on ERP timeout for previously dismissed ERP action"
+ 
+ [Test Case]
+ 
+ * Requires an IBM z13/z13s or LinuxONE Rockhopper/Emperor system (or
+ newer) connected to a zfcp-capable storage subsystem.
+ 
+ * Initiate an (ERP) timeout (maybe by injection or by causing a slow
+ recovery otherwise).
+ 
+ * Monitor the system log for any kernel panics.
+ 
+ [Regression Potential]
+ 
+ * The regression potential can be considered medium, since the
+ modification is platform-specific (limited to s390x) and further
+ limited to the zfcp layer.
+ 
+ * Within zfcp, it is further limited to the error recovery procedure
+ (ERP) of fcp and only touches zfcp_erp.c, which means the code path is
+ mainly active under error conditions.
+ 
+ [Other]
+ 
+ * The above fix was accepted upstream with v5.8-rc3 and will therefore
+ make its way to groovy with kernel 5.8.
+ 
+ * Therefore this SRU request was submitted for bionic and focal only and
+ not for groovy.
+ 
+ __________
+ 
  Description:   zfcp: Fix panic on ERP timeout for previously dismissed ERP
  Symptom:       Linux kernel panic due to kernel page fault in IRQ context
-                when running zfcp_erp_timeout_handler() calling
-                zfcp_erp_notify().
+                when running zfcp_erp_timeout_handler() calling
+                zfcp_erp_notify().
  Problem:       Suppose that, for unrelated reasons, FSF requests on behalf
-                of recovery are very slow and can run into the ERP timeout.
-                In the case at hand, we did adapter recovery to a large
-                degree. However due to the slowness a LUN open is pending so
-                the corresponding fc_rport remains blocked. After
-                fast_io_fail_tmo we trigger close physical port recovery for
-                the port under which the LUN should have been opened. The
-                new higher order port recovery dismisses the pending LUN
-                open ERP action and dismisses the pending LUN open FSF
-                request. Such dismissal decouples the ERP action from the
-                pending corresponding FSF request by setting
-                zfcp_fsf_req->erp_action to NULL (among other things)
-                [zfcp_erp_strategy_check_fsfreq()].
-                If now the ERP timeout for the pending open LUN request runs
-                out, we must not use zfcp_fsf_req->erp_action in the ERP
-                timeout handler. This is a problem since v4.15 commit
-                75492a51568b ("s390/scsi: Convert timers to use
-                timer_setup()"). Before that we intentionally only passed
-                zfcp_erp_action as context argument to
-                zfcp_erp_timeout_handler().
-                Note: The lifetime of the corresponding zfcp_fsf_req object
-                continues until a (late) response or an (unrelated) adapter
-                recovery.
+                of recovery are very slow and can run into the ERP timeout.
+                In the case at hand, we did adapter recovery to a large
+                degree. However due to the slowness a LUN open is pending so
+                the corresponding fc_rport remains blocked. After
+                fast_io_fail_tmo we trigger close physical port recovery for
+                the port under which the LUN should have been opened. The
+                new higher order port recovery dismisses the pending LUN
+                open ERP action and dismisses the pending LUN open FSF
+                request. Such dismissal decouples the ERP action from the
+                pending corresponding FSF request by setting
+                zfcp_fsf_req->erp_action to NULL (among other things)
+                [zfcp_erp_strategy_check_fsfreq()].
+                If now the ERP timeout for the pending open LUN request runs
+                out, we must not use zfcp_fsf_req->erp_action in the ERP
+                timeout handler. This is a problem since v4.15 commit
+                75492a51568b ("s390/scsi: Convert timers to use
+                timer_setup()"). Before that we intentionally only passed
+                zfcp_erp_action as context argument to
+                zfcp_erp_timeout_handler().
+                Note: The lifetime of the corresponding zfcp_fsf_req object
+                continues until a (late) response or an (unrelated) adapter
+                recovery.
  Solution:      Just like the regular response path ignores dismissed
-                requests [zfcp_fsf_req_complete() =>
-                zfcp_fsf_protstatus_eval() => return early] the ERP timeout
-                handler now needs to ignore dismissed requests. So simply
-                return early in the ERP timeout handler if the FSF request
-                is marked as dismissed in its status flags. To protect
-                against the race where zfcp_erp_strategy_check_fsfreq()
-                dismisses and sets zfcp_fsf_req->erp_action to NULL after
-                our previous status flag check, return early if
-                zfcp_fsf_req->erp_action is NULL. After all, the former ERP
-                action does not need to be woken up as that was already done
-                as part of the dismissal above [zfcp_erp_action_dismiss()].
+                requests [zfcp_fsf_req_complete() =>
+                zfcp_fsf_protstatus_eval() => return early] the ERP timeout
+                handler now needs to ignore dismissed requests. So simply
+                return early in the ERP timeout handler if the FSF request
+                is marked as dismissed in its status flags. To protect
+                against the race where zfcp_erp_strategy_check_fsfreq()
+                dismisses and sets zfcp_fsf_req->erp_action to NULL after
+                our previous status flag check, return early if
+                zfcp_fsf_req->erp_action is NULL. After all, the former ERP
+                action does not need to be woken up as that was already done
+                as part of the dismissal above [zfcp_erp_action_dismiss()].
  
  Upstream-ID:   936e6b85da0476dd2edac7c51c68072da9fb4ba2 -> kernel 5.8
  
  Will be integrated into groovy with kernel 5.8.
  
  Please check that this is also integrated into 20.04.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1887774

Title:
  [UBUNTU 20.04] zfcp: Fix panic on ERP timeout for previously dismissed
  ERP

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  New
Status in linux source package in Focal:
  In Progress
Status in linux source package in Groovy:
  New

Bug description:
  SRU Justification:
  ==================

  [Impact]

  * Linux kernel panics due to kernel page fault in IRQ context when
  running zfcp_erp_timeout_handler() calling zfcp_erp_notify().

  [Fix]

  * 936e6b85da0476dd2edac7c51c68072da9fb4ba2 936e6b85da04 "scsi: zfcp:
  Fix panic on ERP timeout for previously dismissed ERP action"

  [Test Case]

  * Requires an IBM z13/z13s or LinuxONE Rockhopper/Emperor system (or
  newer) connected to a zfcp-capable storage subsystem.

  * Initiate an (ERP) timeout (maybe by injection or by causing a slow
  recovery otherwise).

  * Monitor the system log for any kernel panics.

  [Regression Potential]

  * The regression potential can be considered medium, since the
  modification is platform-specific (limited to s390x) and further
  limited to the zfcp layer.

  * Within zfcp, it is further limited to the error recovery procedure
  (ERP) of fcp and only touches zfcp_erp.c, which means the code path is
  mainly active under error conditions.

  [Other]

  * The above fix was accepted upstream with v5.8-rc3 and will therefore
  make its way to groovy with kernel 5.8.

  * Therefore this SRU request was submitted for bionic and focal only
  and not for groovy.

  __________

  Description:   zfcp: Fix panic on ERP timeout for previously dismissed ERP
  Symptom:       Linux kernel panic due to kernel page fault in IRQ context
                 when running zfcp_erp_timeout_handler() calling
                 zfcp_erp_notify().
  Problem:       Suppose that, for unrelated reasons, FSF requests on behalf
                 of recovery are very slow and can run into the ERP timeout.
                 In the case at hand, we did adapter recovery to a large
                 degree. However due to the slowness a LUN open is pending so
                 the corresponding fc_rport remains blocked. After
                 fast_io_fail_tmo we trigger close physical port recovery for
                 the port under which the LUN should have been opened. The
                 new higher order port recovery dismisses the pending LUN
                 open ERP action and dismisses the pending LUN open FSF
                 request. Such dismissal decouples the ERP action from the
                 pending corresponding FSF request by setting
                 zfcp_fsf_req->erp_action to NULL (among other things)
                 [zfcp_erp_strategy_check_fsfreq()].
                 If now the ERP timeout for the pending open LUN request runs
                 out, we must not use zfcp_fsf_req->erp_action in the ERP
                 timeout handler. This is a problem since v4.15 commit
                 75492a51568b ("s390/scsi: Convert timers to use
                 timer_setup()"). Before that we intentionally only passed
                 zfcp_erp_action as context argument to
                 zfcp_erp_timeout_handler().
                 Note: The lifetime of the corresponding zfcp_fsf_req object
                 continues until a (late) response or an (unrelated) adapter
                 recovery.
  Solution:      Just like the regular response path ignores dismissed
                 requests [zfcp_fsf_req_complete() =>
                 zfcp_fsf_protstatus_eval() => return early] the ERP timeout
                 handler now needs to ignore dismissed requests. So simply
                 return early in the ERP timeout handler if the FSF request
                 is marked as dismissed in its status flags. To protect
                 against the race where zfcp_erp_strategy_check_fsfreq()
                 dismisses and sets zfcp_fsf_req->erp_action to NULL after
                 our previous status flag check, return early if
                 zfcp_fsf_req->erp_action is NULL. After all, the former ERP
                 action does not need to be woken up as that was already done
                 as part of the dismissal above [zfcp_erp_action_dismiss()].

  Upstream-ID:   936e6b85da0476dd2edac7c51c68072da9fb4ba2 -> kernel 5.8

  Will be integrated into groovy with kernel 5.8.

  Please check that this is also integrated into 20.04.
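
The two early returns described under "Solution" above can be sketched as
a small userspace model. This is only an illustration of the control flow,
not the kernel patch itself: every struct, macro and function carrying a
model_/MODEL_ prefix is a hypothetical stand-in for its zfcp counterpart,
the flag values are made up, and the timer plumbing is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the zfcp status flags; values are made up. */
#define MODEL_FSFREQ_DISMISSED 0x1u
#define MODEL_ERP_TIMEDOUT     0x2u

struct model_erp_action {
	unsigned int status;
};

struct model_fsf_req {
	unsigned int status;
	struct model_erp_action *erp_action; /* set to NULL on dismissal */
};

/* Stand-in for zfcp_erp_notify(): it dereferences act, so act must be valid. */
static void model_erp_notify(struct model_erp_action *act, unsigned int set_mask)
{
	act->status |= set_mask;
}

/*
 * Stand-in for the ERP timeout handler.
 * Returns 1 if the ERP action was notified, 0 if the request was ignored.
 */
static int model_erp_timeout_handler(struct model_fsf_req *req)
{
	struct model_erp_action *act;

	/* Check 1: ignore requests already marked as dismissed. */
	if (req->status & MODEL_FSFREQ_DISMISSED)
		return 0;

	/*
	 * Check 2: the dismissal may race with check 1, so read the pointer
	 * once into a local and ignore the request if it is already NULL.
	 * (Assumption: kernel code would use a single-load helper such as
	 * READ_ONCE() here; a plain read is enough for this model.)
	 */
	act = req->erp_action;
	if (act == NULL)
		return 0;

	model_erp_notify(act, MODEL_ERP_TIMEDOUT);
	return 1;
}

/* Exercises the three cases; returns 1 if all behave as described. */
static int model_demo(void)
{
	struct model_erp_action act = { 0 };
	struct model_fsf_req live = { 0, &act };
	struct model_fsf_req dismissed = { MODEL_FSFREQ_DISMISSED, NULL };
	struct model_fsf_req racing = { 0, NULL }; /* flag not yet visible */

	if (model_erp_timeout_handler(&live) != 1)
		return 0;
	if (!(act.status & MODEL_ERP_TIMEDOUT))
		return 0;
	if (model_erp_timeout_handler(&dismissed) != 0)
		return 0;
	if (model_erp_timeout_handler(&racing) != 0)
		return 0;
	return 1;
}
```

Calling model_demo() from a test program covers a live request, a
dismissed request, and the race window where the pointer is already NULL
but the dismissed flag has not been observed. Without the second check,
the last case would pass NULL to the notify function, which corresponds to
the page fault described under "Symptom".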

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1887774/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
