*** This bug is a duplicate of bug 2002256 ***
    https://bugs.launchpad.net/bugs/2002256

** Changed in: ubuntu-z-systems
   Importance: Undecided => High

** Changed in: linux (Ubuntu)
     Assignee: Skipper Bug Screeners (skipper-screen-team) => (unassigned)

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: ubuntu-z-systems
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2002984

Title:
  [UBUNTU 20.04] zfcp: fix double free of FSF request when qdio send
  fails

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  Description:   zfcp: fix double free of FSF request when qdio send
  fails

  Symptom:       When doing maintenance actions on FCP devices that turn off an
                 FCP device while I/O is still running on it in Linux - for
                 example turning off the channel path of the FCP device - the
                 Linux kernel crashes.

  Problem:       We used to use the wrong type of integer in
                 'zfcp_fsf_req_send()' to cache the FSF request ID when sending
                 a new FSF request. The cached ID is used in case the sending
                 fails and we need to remove the request from our internal hash
                 table again (so we don't keep an invalid reference and use it
                 when we free the request again).

                 In 'zfcp_fsf_req_send()' we used to cache the ID as 'int'
                 (signed and 32 bit wide), but the rest of the zfcp code (and
                 the firmware specification) handles the ID as
                 'unsigned long'/'u64' (unsigned and 64 bit wide [s390x ELF
                 ABI]).
                     For one this has the obvious problem that when the ID grows
                 past 32 bit (this can happen reasonably fast) it is truncated
                 to 32 bit when storing it in the cache variable and so doesn't
                 match the original ID anymore.
                     The second, less obvious problem is that even when the
                 original ID has not yet grown past 32 bit, as soon as the 32nd
                 bit is set in the original ID (0x80000000 = 2'147'483'648) we
                 will have a mismatch when we cast it back to 'unsigned long'.
                 Casting the signed type 'int' into the wider type
                 'unsigned long' uses a sign-extending instruction, and so
                 flips all leading zeros to ones (0x80000000 becomes
                 0xffffffff80000000).

                 If we can't successfully remove the request from the hash table
                 again after 'zfcp_qdio_send()' fails (this happens regularly
                 when zfcp cannot notify the adapter about new work because the
                 adapter is already gone during e.g. a ChpID toggle) we will end
                 up with a double free.
                     We unconditionally free the request in the calling function
                 when 'zfcp_fsf_req_send()' fails, but because the request is
                 still in the hash table we end up with a stale memory
                 reference, and once the zfcp adapter is either reset during
                 recovery or shutdown we end up freeing the same memory twice.

  Solution:      To fix this, simply change the type of the cache variable to
                 'unsigned long', like the rest of zfcp and also the argument
                 for 'zfcp_reqlist_find_rm()'. This prevents truncation and
                 wrong sign extension, so the request can be successfully
                 removed from the hash table.

  Reproduction:  Run I/O on an FCP device for so long that you have sent
                 2'147'483'648 requests. The current request number cannot be
                 read directly from user space, but can be read indirectly by
                 using 'zfcp_ping' and 'zfcpdbf' (use the correct
                 device-bus-ID):

                     sudo sh -c 'zfcp_ping -a "${0}" 0xFFFFFFFFFFFFFFFF \
                     2>/dev/null 1>&2; zfcpdbf "${0}" -x all -i SAN 2>/dev/null \
                     | grep -E -e "^(Timestamp|Request ID)[[:blank:]]+:" | tail \
                     -n2' 0.0.1700

                 After having reached 0x80000000 requests, stop all I/O on the
                 FCP device and start only a single process doing
                 single-threaded synchronous, direct I/O on the FCP device
                 (always only one outstanding I/O operation).

                 While this I/O process is running, turn off the channel path
                 (ChpID) that is used for the FCP device/subchannel. This will
                 not always trigger the bug, but occasionally it will.
                     Proof that it hit the correct code-path in zfcp can be
                 found by using 'zfcpdbf' again (use the correct device-bus-ID):

                     zfcpdbf 0.0.1700 -x all -i REC 2>/dev/null | grep 'fsrs__1'

                 In case you hit the correct code-path this will print some
                 lines starting with 'Tag'.

  Upstream-ID:   0954256e970ecf371b03a6c9af2cf91b9c4085ff
  Preventive:    yes

  Author:        Benjamin Block <bbl...@linux.ibm.com>
  Component:     kernel
  Link:          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0954256e970e

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2002984/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
