Public bug reported:
Description: zfcp: fix failed recovery on gone remote port, non-NPIV
FCP dev
Symptom: With non-NPIV FCP devices, failed recovery on gone remote port.
As a follow-on error, failed paths after storage target NPIV
failover with IBM FCP storage based on Spectrum Virtualize such
as FlashSystem 9200, V7000, SAN Volume Controller, etc.
Problem: Suppose we have an environment with a number of non-NPIV FCP
devices (virtual HBAs / FCP devices / zfcp "adapter"s) sharing
the same physical FCP channel (HBA port) and its I_T nexus. Plus
a number of storage target ports zoned to such shared channel.
Now one target port logs out of the fabric causing an RSCN. Zfcp
reacts with an ADISC ELS and subsequent port recovery depending
on the ADISC result. This happens on all such FCP devices (in
different Linux images) concurrently as they all receive a copy
of this RSCN. In the following we look at one of those FCP
devices.
Requests other than FSF_QTCB_FCP_CMND can be slow until they get
a response.
Depending on which requests are affected by slow responses,
there are different recovery outcomes.
Solution: Here we want to fix failed recoveries on port or adapter level
by avoiding recovery requests that can be slow.
We need the cached N_Port_ID for the remote port "link" test
with ADISC. Just before sending the ADISC, we now intentionally
forget the old cached N_Port_ID. The idea is that on receiving
an RSCN for a port, we have to assume that any cached
information about this port is stale. This forces a fresh
GID_PN [FC-GS] nameserver lookup on any subsequent recovery for
the same port. Since we typically can still communicate with the
nameserver efficiently, we now reach steady state quicker:
Either the nameserver still does not know about the port so we
stop recovery, or the nameserver already knows the port
potentially with a new N_Port_ID and we can successfully and
quickly perform open port recovery. For the one case where
ADISC returns successfully, we re-initialize port->d_id, because
that case does not involve any port recovery.
This also solves a problem if the storage WWPN quickly logs into
the fabric again but with a different N_Port_ID, such as on
virtual WWPN takeover during target NPIV failover.
[https://www.redbooks.ibm.com/abstracts/redp5477.html] In that
case the RSCN from the storage FDISC was ignored by zfcp and we
could not successfully recover the failover. On some later
failback on the storage, we could have been lucky if the virtual
WWPN got the same old N_Port_ID from the SAN switch as we still
had cached. Then the related RSCN triggered a successful port
reopen recovery. However, there is no guarantee to get the same
N_Port_ID on NPIV FDISC.
Even though NPIV-enabled FCP devices are not affected by this
problem, this code change optimizes recovery time for gone
remote ports as a side effect. The timely drop of cached
N_Port_IDs prevents unnecessary slow open port attempts.
Reproduction: With an FCP channel shared by enough non-NPIV FCP devices,
perform SAN switch port disable on the storage side,
or trigger storage target NPIV failover such as with
Spectrum Virtualize CLI command "stopsystem -node 2".
Upstream-ID: 8c9db6679be4348b8aae108e11d4be2f83976e30
Master-BZ-ID: 196197
Distros: Ubuntu 20.04
Ubuntu 21.10
Ubuntu 22.04
Author: <[email protected]>
Component: kernel
** Affects: linux (Ubuntu)
Importance: Undecided
Assignee: Skipper Bug Screeners (skipper-screen-team)
Status: New
** Tags: architecture-s39064 bugnameltc-198451 severity-medium
targetmilestone-inin---
** Changed in: ubuntu
Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)
** Package changed: ubuntu => linux (Ubuntu)
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1977508
Title:
[UBUNTU 20.04] zfcp: fix failed recovery on gone remote port, non-NPIV
FCP dev
Status in linux package in Ubuntu:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1977508/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp