Public bug reported:

Problem description:

When an NVMe drive is assigned (hotplugged) to a Linux LPAR,
a BUG is hit in lib/list_debug.c. The device is then not accessible:
there is no /dev/ file for it and lspci does not report it either.


[ 1681.564462] list_add double add: new=00000000eed0f808, prev=00000000eed0f808, next=000000004070a300.
[ 1681.564489] ------------[ cut here ]------------
[ 1681.564490] kernel BUG at lib/list_debug.c:31!
[ 1681.564504] monitor event: 0040 ilc:2 [#1] SMP
[ 1681.564507] Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat 
iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter s390_trng ghash_s390 prng aes_s390 
des_s390 libdes sha512_s390 vfio_ccw sha1_s390 vfio_mdev mdev chsc_sch 
vfio_iommu_type1 eadm_sch vfio ip_tables dm_service_time nvme crc32_vx_s390 
sha256_s390 sha_common nvme_core qeth_l2 zfcp qeth scsi_transport_fc qdio 
ccwgroup dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt
[ 1681.564534] CPU: 6 PID: 139 Comm: kmcheck Not tainted 5.8.0-rc1+ #2
[ 1681.564535] Hardware name: IBM 8561 T01 701 (LPAR)
[ 1681.564536] Krnl PSW : 0704c00180000000 000000003ffcadb8 (__list_add_valid+0x70/0xa8)
[ 1681.564544]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 1681.564545] Krnl GPRS: 0000000000000040 0000000000000027 0000000000000058 0000000000000007
[ 1681.564546]            000000003ffcadb4 0000000000000000 0000000000000000 000003e0051a7ce0
[ 1681.564547]            000000004070a300 00000000eed0f808 00000000eed0f808 000000004070a300
[ 1681.564548]            00000000f56a2000 0000000040c2c788 000000003ffcadb4 000003e0051a7bc8
[ 1681.564583] Krnl Code: 000000003ffcada8: c02000302b09        larl    %r2,00000000405d03ba
                          000000003ffcadae: c0e5ffdd30b1        brasl   %r14,000000003fb70f10
                         #000000003ffcadb4: af000000            mc      0,0
                         >000000003ffcadb8: b9040054            lgr     %r5,%r4
                          000000003ffcadbc: c02000302aad        larl    %r2,00000000405d0316
                          000000003ffcadc2: b9040041            lgr     %r4,%r1
                          000000003ffcadc6: c0e5ffdd30a5        brasl   %r14,000000003fb70f10
                          000000003ffcadcc: af000000            mc      0,0
[ 1681.564592] Call Trace:
[ 1681.564594]  [<000000003ffcadb8>] __list_add_valid+0x70/0xa8
[ 1681.564596] ([<000000003ffcadb4>] __list_add_valid+0x6c/0xa8)
[ 1681.564599]  [<000000003faf2920>] zpci_create_device+0x60/0x1b0
[ 1681.564601]  [<000000003faf704a>] zpci_event_availability+0x282/0x2f0
[ 1681.564605]  [<0000000040367848>] chsc_process_crw+0x2b8/0xa18
[ 1681.564607]  [<000000004036f35c>] crw_collect_info+0x254/0x348
[ 1681.564610]  [<000000003fb2a6ea>] kthread+0x14a/0x168
[ 1681.564613]  [<00000000403a55c0>] ret_from_fork+0x24/0x2c
[ 1681.564614] Last Breaking-Event-Address:
[ 1681.564618]  [<000000003fb70f62>] printk+0x52/0x58
[ 1681.564620] ---[ end trace 7ea67c348aa67e14 ]---


uname:
Linux t83lp49.lnxne.boe 5.8.0-rc1+ #2 SMP Thu Jun 18 12:38:02 CEST 2020 s390x s390x s390x GNU/Linux

How to reproduce:
1. Unassign a NVMe drive in HMC from your LPAR
2. Reassign it to your LPAR again
3. dmesg
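
Step 3 can be narrowed to the two signature lines from the trace
above. The following is a minimal sketch, not part of the original
report; the function name check_list_debug_bug is hypothetical, and
it only matches the two message strings quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and succeeds
# (exit status 0) if either signature line from this report is present.
check_list_debug_bug() {
    grep -q -e 'list_add double add' \
            -e 'kernel BUG at lib/list_debug.c'
}
```

Possible usage after reassigning the drive:
dmesg | check_list_debug_bug && echo "list_debug BUG hit"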


========================== Solution

The issue with VF attach/detach stems from the fact that
on IBM Z, VFs can be enabled/disabled individually using

echo 0 > /sys/bus/pci/slots/<vf_fid>/power

If this was done with a VF linked to a parent PF, the
symlink in the parent (/sys/bus/pci/devices/<pf>/virtfnX)
became stale while the VF was disabled. When the VF was
turned back on, it was not linked to the PF again and so
could not be used, e.g. with QEMU, which relies on the links.
Similarly, stale virtfn links could remain after
removing VFs through

echo 0 > /sys/bus/pci/devices/<pf>/sriov_numvfs
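
One way to look for leftover virtfn links after either of the two
commands above is sketched below. This is an illustration only: the
function name find_stale_virtfn is hypothetical, and it assumes that
a stale link shows up as a dangling symlink, which may not cover
every way a link can go stale.

```shell
#!/bin/sh
# Hypothetical helper: print every virtfn* symlink below a PCI devices
# directory whose target no longer exists.
# $1 defaults to /sys/bus/pci/devices; another path can be passed
# to try the function on a scratch directory.
find_stale_virtfn() {
    devdir="${1:-/sys/bus/pci/devices}"
    for link in "$devdir"/*/virtfn*; do
        # Skip the unexpanded pattern and anything that is not a symlink.
        [ -L "$link" ] || continue
        # -e follows the link, so it fails when the target is gone.
        [ -e "$link" ] || echo "$link"
    done
}
```

Running find_stale_virtfn with no argument checks the live sysfs tree;
empty output means no dangling virtfn links were found.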

Furthermore, a pci_dev_put() was missing when
searching for the parent PF, potentially leaving
the parent PF's reference count too high.

This has been fixed upstream and in 5.8 stable
with the following 3 upstream commits:

3cddb79afc60bcdb5fd9dd7a1c64a8d03bdd460f s390/pci: fix zpci_bus_link_virtfn()
2f0230b2f2d5fd287a85583eefb5aed35b6fe510 s390/pci: re-introduce zpci_remove_device()
b97bf44f99155e57088e16974afb1f2d7b5287aa s390/pci: fix PF/VF linking on hot plug

These should apply cleanly after applying

b76fee1bc56c31a9d2a49592810eba30cc06d61a s390/pci: ignore stale configuration request event

from Bug 1891437.
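
The backport order described above could be scripted roughly as
follows. This is a sketch under assumptions: the function names
backport_order and apply_backports are hypothetical, the three fixes
are assumed to apply in the order they are listed here, and the
commits are assumed to be reachable from the repository the script
runs in.

```shell
#!/bin/sh
# Hypothetical helper: print the commit IDs from this report in the
# order they would be applied (prerequisite first, then the three fixes).
backport_order() {
    cat <<'EOF'
b76fee1bc56c31a9d2a49592810eba30cc06d61a
3cddb79afc60bcdb5fd9dd7a1c64a8d03bdd460f
2f0230b2f2d5fd287a85583eefb5aed35b6fe510
b97bf44f99155e57088e16974afb1f2d7b5287aa
EOF
}

# Hypothetical helper: cherry-pick the commits one by one, recording
# the upstream ID in each message (-x); stops at the first failure.
apply_backports() {
    for c in $(backport_order); do
        git cherry-pick -x "$c" || return 1
    done
}
```

apply_backports is meant to be run from inside the target kernel tree;
if the prerequisite from Bug 1891437 is already applied, its line can
be dropped from the list.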

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Skipper Bug Screeners (skipper-screen-team)
         Status: New


** Tags: architecture-s39064 bugnameltc-186885 severity-medium 
targetmilestone-inin20041

** Tags added: architecture-s39064 bugnameltc-186885 severity-medium
targetmilestone-inin20041

** Changed in: ubuntu
     Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)

** Package changed: ubuntu => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1892849

Title:
  [UBUNTU 20.04] zPCI attach/detach issues with PF/VF linking support

Status in linux package in Ubuntu:
  New
