Yes, the kernel(s) were the significant set (of about 30) zFCP related patches 
were applied to,
already landed in focal (-updates) respectively the groovy kernel (indicated by 
the Fix Released status at LP 1887124 - 
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1887124)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1881109

Title:
  [Ubuntu 20.04] LPAR crashes in block layer under high stress. Might be
  triggered by scsi errors.

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  We can reproduce a crash in the block layer with lots of stress on
  lots of SCSI disks (on an XIV storage server).

  We seem to have several scsi stalls in the logs/errors (needs to be
  analyzed further) but in the end we do crash with this this calltrace.

  [20832.901147] Failing address: 00007fe00dea8000 TEID: 00007fe00dea8403
  [20832.901159] Fault in home space mode while using kernel ASCE.
  [20832.901171] AS:000001d3cccf400b R2:000003fd0020800b R3:000003fd0020c007 
S:000003fc1cc78800 P:0000000000000400 
  [20832.901264] Oops: 0011 ilc:2 [#1] SMP 
  [20832.901280] Modules linked in: vhost_net vhost macvtap macvlan tap xfs 
xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables 
iptable_filter bpfilter bridge aufs overlay dm_service_time dm_multipath 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng chsc_sch eadm_sch vfio_ccw 
vfio_mdev mdev vfio_iommu_type1 vfio 8021q garp mrp stp llc sch_fq_codel drm 
drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress 
zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 linear dm_mirror dm_region_hash 
dm_log qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 
libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common 
zfcp scsi_transport_fc dasd_eckd_mod dasd_mod qeth qdio ccwgroup
  [20832.901516] CPU: 29 PID: 389709 Comm: CPU 0/KVM Kdump: loaded Not tainted 
5.4.0-29-generic #33-Ubuntu
  [20832.901530] Hardware name: IBM 8561 T01 708 (LPAR)
  [20832.901542] Krnl PSW : 0404e00180000000 000001d3cbd559be 
(try_to_wake_up+0x4e/0x700)
  [20832.901575]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 
RI:0 EA:3
  [20832.901744] Krnl GPRS: 000003fc917cd988 00007fe000000000 00007fe00000001e 
0000000000000003
  [20832.901750]            0000000000000000 000000000000004c 040003fd005a4600 
0000000000000003
  [20832.901753]            0000000000000003 00007fe00dea8454 0000000000000000 
00007fe00dea7b00
  [20832.901754]            000003f44bd9b300 000001d3cc587088 00007fe0021c7ae0 
00007fe0021c7a60
  [20832.901767] Krnl Code: 000001d3cbd559b2: 41902954            la      
%r9,2388(%r2)
                            000001d3cbd559b6: 582003ac            l       
%r2,940
                           #000001d3cbd559ba: a7180000            lhi     %r1,0
                           >000001d3cbd559be: ba129000            cs      
%r1,%r2,0(%r9)
                            000001d3cbd559c2: a77401c9            brc     
7,000001d3cbd55d54
                            000001d3cbd559c6: e310b0080004        lg      
%r1,8(%r11)
                            000001d3cbd559cc: b9800018            ngr     
%r1,%r8
                            000001d3cbd559d0: a774001f            brc     
7,000001d3cbd55a0e
  [20832.901784] Call Trace:
  [20832.901816] ([<000001d3cc57e0ac>] cleanup_critical+0x0/0x474)
  [20832.901823]  [<000001d3cc1d16ba>] rq_qos_wake_function+0x8a/0xa0 
  [20832.901827]  [<000001d3cbd74bde>] __wake_up_common+0x9e/0x1b0 
  [20832.901829]  [<000001d3cbd750e4>] __wake_up_common_lock+0x94/0xe0 
  [20832.901830]  [<000001d3cbd7515a>] __wake_up+0x2a/0x40 
  [20832.901835]  [<000001d3cc1e8640>] wbt_done+0x90/0xe0 
  [20832.901837]  [<000001d3cc1d17be>] __rq_qos_done+0x3e/0x60 
  [20832.901841]  [<000001d3cc1bd5b0>] blk_mq_free_request+0xe0/0x140 
  [20832.901848]  [<000001d3cc35fc60>] dm_softirq_done+0x140/0x230 
  [20832.901849]  [<000001d3cc1bbfbc>] blk_done_softirq+0xbc/0xe0 
  [20832.901850]  [<000001d3cc57e710>] __do_softirq+0x100/0x360 
  [20832.901853]  [<000001d3cbd2525e>] irq_exit+0x9e/0xc0 
  [20832.901856]  [<000001d3cbcb0b18>] do_IRQ+0x78/0xb0 
  [20832.901859]  [<000001d3cc57dc28>] ext_int_handler+0x128/0x12c 
  [20832.901860]  [<000001d3cc57d306>] sie_exit+0x0/0x46 
  [20832.901866] ([<000001d3cbce944a>] __vcpu_run+0x27a/0xc30)
  [20832.901870]  [<000001d3cbcf29a8>] kvm_arch_vcpu_ioctl_run+0x2d8/0x840 
  [20832.901875]  [<000001d3cbcdd242>] kvm_vcpu_ioctl+0x282/0x770 
  [20832.901880]  [<000001d3cbf85f66>] do_vfs_ioctl+0x376/0x690 
  [20832.901881]  [<000001d3cbf86304>] ksys_ioctl+0x84/0xb0 
  [20832.901883]  [<000001d3cbf8639a>] __s390x_sys_ioctl+0x2a/0x40 
  [20832.901885]  [<000001d3cc57d5f2>] system_call+0x2a6/0x2c8 
  [20832.901885] Last Breaking-Event-Address:
  [20832.901889]  [<000001d3cbd5607e>] wake_up_process+0xe/0x20
  [20832.901897] Kernel panic - not syncing: Fatal exception in interrupt

  Addl. comment:
  In all dumps we've seen so far a crash similar to what Christian posted in
  comment #2 would happen soon after the SCSI Midlayer went through Error
  Recovery, after issuing SCSI Aborts and potentially SCSI Device and Target
  resets.

  From our most recent repro:
  [20226.914939] sd 1:0:3:16687: [sdatp] tag#1256 Done: TIMEOUT_ERROR Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK
  [20226.914948] sd 1:0:3:16687: [sdatp] tag#1256 CDB: Read(10) 28 00 00 00 25 
30 00 00 08 00
  [20226.914953] sd 1:0:3:16687: [sdatp] tag#1256 abort scheduled
  [20227.626587] sd 1:0:2:10: [sdlw] tag#1071 Done: SUCCESS Result: 
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
  [20227.626593] sd 1:0:2:10: [sdlw] tag#1071 CDB: Report target port groups a3 
0a 00 00 00 00 00 00 10 00 00 00
  [20227.626749] sd 1:0:2:2: [sdlb] tag#3388 cmd abort failed
  [20227.626862] sd 1:0:2:2: [sdlb] tag#2356 aborting command
  [20227.626865] sd 1:0:2:2: [sdlb] tag#2356 cmd abort failed
  [20227.626867] sd 1:0:2:2: [sdlb] tag#2353 aborting command
  [20227.626868] sd 1:0:2:2: [sdlb] tag#2353 cmd abort failed
  [20227.626870] sd 1:0:2:2: [sdlb] tag#2351 aborting command
  [20227.626871] sd 1:0:2:2: [sdlb] tag#2351 cmd abort failed
  [20227.626873] sd 1:0:2:2: [sdlb] tag#1255 aborting command
  [20227.626874] sd 1:0:2:2: [sdlb] tag#1255 cmd abort failed
  [20227.626876] sd 1:0:2:2: [sdlb] tag#1254 aborting command
  [20227.626877] sd 1:0:2:2: [sdlb] tag#1254 cmd abort failed
  [20227.626879] sd 1:0:2:2: [sdlb] tag#1252 aborting command
  [20227.626881] sd 1:0:2:2: [sdlb] tag#1252 cmd abort failed
  [20227.626883] sd 1:0:2:2: [sdlb] tag#1250 aborting command
  [20227.626884] sd 1:0:2:2: [sdlb] tag#1250 cmd abort failed
  [20227.626886] sd 1:0:2:2: [sdlb] tag#1248 aborting command
  [20227.626889] sd 1:0:2:16687: [sdahy] tag#1072 Done: NEEDS_RETRY Result: 
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
  [20227.626891] sd 1:0:2:2: [sdlb] tag#1248 cmd abort failed
  [20227.626893] sd 1:0:2:2: [sdlb] tag#1245 aborting command
  [20227.626895] sd 1:0:2:2: [sdlb] tag#1245 cmd abort failed
  [20227.626897] sd 1:0:2:16687: [sdahy] tag#1072 CDB: Read(10) 28 00 00 00 59 
98 00 00 08 00
  [20227.626899] sd 1:0:2:2: [sdlb] tag#1244 aborting command
  [20227.626901] sd 1:0:2:2: [sdlb] tag#1244 cmd abort failed
  [20227.626903] sd 1:0:2:2: [sdlb] tag#1242 aborting command
  [20227.626904] sd 1:0:2:2: [sdlb] tag#1242 cmd abort failed
  [20227.626906] sd 1:0:2:2: [sdlb] tag#1240 aborting command
  [20227.626907] sd 1:0:2:2: [sdlb] tag#1240 cmd abort failed
  [20227.626909] sd 1:0:2:2: [sdlb] tag#302 aborting command
  [20227.626910] sd 1:0:2:2: [sdlb] tag#302 cmd abort failed
  [20227.626911] sd 1:0:2:2: [sdlb] tag#297 aborting command
  [20227.626912] sd 1:0:2:2: [sdlb] tag#297 cmd abort failed
  [20227.626914] sd 1:0:2:2: [sdlb] tag#295 aborting command
  [20227.626915] sd 1:0:2:2: [sdlb] tag#295 cmd abort failed
  [20227.626917] sd 1:0:2:2: [sdlb] tag#248 aborting command
  [20227.626972] sd 1:0:2:2: [sdlb] tag#248 cmd abort failed
  [20227.626977] sd 1:0:2:2: [sdlb] tag#246 aborting command
  [20227.626979] sd 1:0:2:2: [sdlb] tag#246 cmd abort failed
  [20227.626981] sd 1:0:2:2: [sdlb] tag#244 aborting command
  [20227.626982] sd 1:0:2:2: [sdlb] tag#244 cmd abort failed
  [20227.626984] sd 1:0:2:2: [sdlb] tag#242 aborting command
  [20227.626985] sd 1:0:2:2: [sdlb] tag#242 cmd abort failed
  [20227.626986] sd 1:0:2:2: [sdlb] tag#241 aborting command
  [20227.626988] sd 1:0:2:2: [sdlb] tag#241 cmd abort failed
  [20227.626989] sd 1:0:2:2: [sdlb] tag#239 aborting command
  [20227.626990] sd 1:0:2:2: [sdlb] tag#239 cmd abort failed
  [20227.626991] sd 1:0:2:2: [sdlb] tag#237 aborting command
  [20227.626992] sd 1:0:2:2: [sdlb] tag#237 cmd abort failed
  [20227.626994] sd 1:0:2:2: [sdlb] tag#236 aborting command
  [20227.626995] sd 1:0:2:2: [sdlb] tag#236 cmd abort failed
  [20227.626996] sd 1:0:2:2: [sdlb] tag#234 aborting command
  [20227.626997] sd 1:0:2:2: [sdlb] tag#234 cmd abort failed
  [20227.626999] sd 1:0:2:2: [sdlb] tag#199 aborting command
  [20227.627000] sd 1:0:2:2: [sdlb] tag#199 cmd abort failed
  [20227.627001] sd 1:0:2:2: [sdlb] tag#195 aborting command
  [20227.627003] sd 1:0:2:2: [sdlb] tag#195 cmd abort failed
  [20227.627004] sd 1:0:2:2: [sdlb] tag#193 aborting command
  [20227.627006] sd 1:0:2:2: [sdlb] tag#193 cmd abort failed
  [20227.627009] sd 1:0:3:2: [sdaia] tag#2355 aborting command
  [20227.627011] sd 1:0:3:2: [sdaia] tag#2355 cmd abort failed
  [20227.627013] sd 1:0:3:2: [sdaia] tag#2354 aborting command
  [20227.627014] sd 1:0:3:2: [sdaia] tag#2354 cmd abort failed
  [20227.627016] sd 1:0:3:2: [sdaia] tag#2352 aborting command
  [20227.627017] sd 1:0:3:2: [sdaia] tag#2352 cmd abort failed
  [20227.627019] sd 1:0:3:2: [sdaia] tag#2350 aborting command
  [20227.627020] sd 1:0:3:2: [sdaia] tag#2350 cmd abort failed
  [20227.627022] sd 1:0:3:2: [sdaia] tag#1253 aborting command
  [20227.627023] sd 1:0:3:2: [sdaia] tag#1253 cmd abort failed
  [20227.627025] sd 1:0:3:2: [sdaia] tag#1251 aborting command
  [20227.627026] sd 1:0:3:2: [sdaia] tag#1251 cmd abort failed
  [20227.627028] sd 1:0:3:2: [sdaia] tag#1249 aborting command
  [20227.627030] sd 1:0:3:2: [sdaia] tag#1249 cmd abort failed
  [20227.627031] sd 1:0:3:2: [sdaia] tag#1247 aborting command
  [20227.627032] sd 1:0:3:2: [sdaia] tag#1247 cmd abort failed
  [20227.627034] sd 1:0:3:2: [sdaia] tag#1246 aborting command
  [20227.627035] sd 1:0:3:2: [sdaia] tag#1246 cmd abort failed
  [20227.627037] sd 1:0:3:2: [sdaia] tag#1243 aborting command
  [20227.627038] sd 1:0:3:2: [sdaia] tag#1243 cmd abort failed
  [20227.627039] sd 1:0:3:2: [sdaia] tag#1241 aborting command
  [20227.627041] sd 1:0:3:2: [sdaia] tag#1241 cmd abort failed
  [20227.627042] sd 1:0:3:2: [sdaia] tag#1239 aborting command
  [20227.627043] sd 1:0:3:2: [sdaia] tag#1239 cmd abort failed
  [20227.627045] sd 1:0:3:2: [sdaia] tag#301 aborting command
  [20227.627047] sd 1:0:3:2: [sdaia] tag#301 cmd abort failed
  [20227.627049] sd 1:0:3:2: [sdaia] tag#299 aborting command
  [20227.627050] sd 1:0:3:2: [sdaia] tag#299 cmd abort failed
  [20227.627052] sd 1:0:3:2: [sdaia] tag#298 aborting command
  [20227.627053] sd 1:0:3:2: [sdaia] tag#298 cmd abort failed
  [20227.627055] sd 1:0:3:2: [sdaia] tag#296 aborting command
  [20227.627056] sd 1:0:3:2: [sdaia] tag#296 cmd abort failed
  [20227.627058] sd 1:0:3:2: [sdaia] tag#294 aborting command
  [20227.627059] sd 1:0:3:2: [sdaia] tag#294 cmd abort failed
  [20227.627061] sd 1:0:3:2: [sdaia] tag#247 aborting command
  [20227.627063] sd 1:0:3:2: [sdaia] tag#247 cmd abort failed
  [20227.627064] sd 1:0:3:2: [sdaia] tag#245 aborting command
  [20227.627065] sd 1:0:3:2: [sdaia] tag#245 cmd abort failed
  [20227.627067] sd 1:0:3:2: [sdaia] tag#243 aborting command
  [20227.627068] sd 1:0:3:2: [sdaia] tag#243 cmd abort failed
  [20227.627070] sd 1:0:3:2: [sdaia] tag#240 aborting command
  [20227.627071] sd 1:0:3:2: [sdaia] tag#240 cmd abort failed
  [20227.627073] sd 1:0:3:2: [sdaia] tag#238 aborting command
  [20227.627074] sd 1:0:3:2: [sdaia] tag#238 cmd abort failed
  [20227.627075] sd 1:0:3:2: [sdaia] tag#235 aborting command
  [20227.627077] sd 1:0:3:2: [sdaia] tag#235 cmd abort failed
  [20227.627079] sd 1:0:3:2: [sdaia] tag#198 aborting command
  [20227.627081] sd 1:0:3:2: [sdaia] tag#198 cmd abort failed
  [20227.627082] sd 1:0:3:2: [sdaia] tag#197 aborting command
  [20227.627083] sd 1:0:3:2: [sdaia] tag#197 cmd abort failed
  [20227.627085] sd 1:0:3:2: [sdaia] tag#196 aborting command
  [20227.627086] sd 1:0:3:2: [sdaia] tag#196 cmd abort failed
  [20227.627088] sd 1:0:3:2: [sdaia] tag#194 aborting command
  [20227.627089] sd 1:0:3:2: [sdaia] tag#194 cmd abort failed
  [20227.627091] sd 1:0:3:2: [sdaia] tag#192 aborting command
  [20227.627092] sd 1:0:3:2: [sdaia] tag#192 cmd abort failed
  [20227.627095] sd 1:0:3:16687: [sdatp] tag#1256 aborting command
  [20227.627096] sd 1:0:3:16687: [sdatp] tag#1256 cmd abort failed
  [20227.632049] qdio: 0.0.7100 ZFCP on SC 21e using AI:1 QEBSM:0 PRI:1 TDD:1 
SIGA: W AP
  [20227.705142] scsi host1: scsi_eh_1: waking up 0/57/57
  [20227.705150] scsi host1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 28
  [20227.705282] scsi host1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 28
  [20227.705436] scsi host1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1
  [20227.705437] scsi host1: Total of 57 commands on 3 devices require eh work
  [20227.705534] sd 1:0:2:2: scsi_eh_1: Sending BDR
  [20228.787915] sd 1:0:2:2: [sdlb] tag#3388 scsi_eh_done result: 2
  [20228.787928] sd 1:0:2:2: [sdlb] tag#3388 Done: SUCCESS Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK
  [20228.787931] sd 1:0:2:2: [sdlb] tag#3388 CDB: Test Unit Ready 00 00 00 00 
00 00
  [20228.787933] sd 1:0:2:2: [sdlb] tag#3388 Sense Key : Unit Attention 
[current]
  [20228.787937] sd 1:0:2:2: [sdlb] tag#3388 Add. Sense: Power on, reset, or 
bus device reset occurred
  [20228.787939] sd 1:0:2:2: [sdlb] tag#3388 scsi_send_eh_cmnd timeleft: 1000
  [20228.787940] sd 1:0:2:2: Power-on or device reset occurred
  [20228.787950] sd 1:0:2:2: [sdlb] tag#3388 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2006
  [20228.787952] sd 1:0:2:2: [sdlb] tag#3388 scsi_eh_tur return: 2001
  [20228.788219] sd 1:0:2:2: [sdlb] tag#3388 scsi_eh_done result: 0
  [20228.788236] sd 1:0:2:2: [sdlb] tag#3388 scsi_send_eh_cmnd timeleft: 1000
  [20228.788237] sd 1:0:2:2: [sdlb] tag#3388 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2002
  [20228.788239] sd 1:0:2:2: [sdlb] tag#3388 scsi_eh_tur return: 2002
  [20228.788297] sd 1:0:3:2: scsi_eh_1: Sending BDR
  [20241.124934] sd 1:0:3:2: [sdaia] tag#2355 Done: FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK
  [20241.124948] sd 1:0:3:2: [sdaia] tag#2355 CDB: Test Unit Ready 00 00 00 00 
00 00
  [20241.124951] sd 1:0:3:2: [sdaia] tag#2355 scsi_send_eh_cmnd timeleft: 0
  [20241.124953] sd 1:0:3:2: [sdaia] tag#2355 scsi_eh_tur return: 2003
  [20241.125181] sd 1:0:3:16687: scsi_eh_1: Sending BDR
  [20241.128570] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_done result: c0000
  [20241.128574] sd 1:0:3:16687: [sdatp] tag#1256 Done: SUCCESS Result: 
hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK
  [20241.128576] sd 1:0:3:16687: [sdatp] tag#1256 CDB: Test Unit Ready 00 00 00 
00 00 00
  [20241.128577] sd 1:0:3:16687: [sdatp] tag#1256 scsi_send_eh_cmnd timeleft: 
1000
  [20241.128580] sd 1:0:3:16687: [sdatp] tag#1256 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2003
  [20241.128581] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_tur return: 2003
  [20241.128583] scsi host1: scsi_eh_1: Sending target reset to target 3
  [20247.159773] scsi 1:0:3:0: scsi scan: device exists on 1:0:3:0
  [20247.159810] scsi 1:0:3:0: scsi scan: Sending REPORT LUNS to (try 0)
  [20247.265072] sd 1:0:3:2: [sdaia] tag#2355 scsi_eh_done result: 2
  [20247.265111] sd 1:0:3:2: [sdaia] tag#2355 Done: SUCCESS Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK
  [20247.265117] sd 1:0:3:2: [sdaia] tag#2355 CDB: Test Unit Ready 00 00 00 00 
00 00
  [20247.265120] sd 1:0:3:2: [sdaia] tag#2355 Sense Key : Unit Attention 
[current]
  [20247.265123] sd 1:0:3:2: [sdaia] tag#2355 Add. Sense: Power on, reset, or 
bus device reset occurred
  [20247.265124] sd 1:0:3:2: [sdaia] tag#2355 scsi_send_eh_cmnd timeleft: 1000
  [20247.265128] sd 1:0:3:2: Power-on or device reset occurred
  [20247.265141] sd 1:0:3:2: [sdaia] tag#2355 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2006
  [20247.265143] sd 1:0:3:2: [sdaia] tag#2355 scsi_eh_tur return: 2001
  [20247.265254] sd 1:0:3:2: [sdaia] tag#2355 scsi_eh_done result: 0
  [20247.265278] sd 1:0:3:2: [sdaia] tag#2355 scsi_send_eh_cmnd timeleft: 1000
  [20247.265281] sd 1:0:3:2: [sdaia] tag#2355 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2002
  [20247.265282] sd 1:0:3:2: [sdaia] tag#2355 scsi_eh_tur return: 2002
  [20247.265439] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_done result: 2
  [20247.265454] sd 1:0:3:16687: [sdatp] tag#1256 Done: SUCCESS Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK
  [20247.265456] sd 1:0:3:16687: [sdatp] tag#1256 CDB: Test Unit Ready 00 00 00 
00 00 00
  [20247.265458] sd 1:0:3:16687: [sdatp] tag#1256 Sense Key : Unit Attention 
[current]
  [20247.265459] sd 1:0:3:16687: [sdatp] tag#1256 Add. Sense: Power on, reset, 
or bus device reset occurred
  [20247.265461] sd 1:0:3:16687: [sdatp] tag#1256 scsi_send_eh_cmnd timeleft: 
1000
  [20247.265462] sd 1:0:3:16687: Power-on or device reset occurred
  [20247.265469] sd 1:0:3:16687: [sdatp] tag#1256 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2006
  [20247.265470] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_tur return: 2001
  [20247.265568] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_done result: 0
  [20247.265582] sd 1:0:3:16687: [sdatp] tag#1256 scsi_send_eh_cmnd timeleft: 
1000
  [20247.265584] sd 1:0:3:16687: [sdatp] tag#1256 scsi_send_eh_cmnd: 
scsi_eh_completed_normally 2002
  [20247.265585] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_tur return: 2002
  [20247.265593] sd 1:0:2:2: [sdlb] tag#3388 scsi_eh_1: flush retry cmd
  [20247.265603] sd 1:0:2:2: [sdlb] tag#2356 scsi_eh_1: flush retry cmd
  [20247.265607] sd 1:0:2:2: [sdlb] tag#2353 scsi_eh_1: flush retry cmd
  [20247.265610] sd 1:0:2:2: [sdlb] tag#2351 scsi_eh_1: flush retry cmd
  [20247.265612] sd 1:0:2:2: [sdlb] tag#1255 scsi_eh_1: flush retry cmd
  [20247.265615] sd 1:0:2:2: [sdlb] tag#1254 scsi_eh_1: flush retry cmd
  [20247.265617] sd 1:0:2:2: [sdlb] tag#1252 scsi_eh_1: flush retry cmd
  [20247.265619] sd 1:0:2:2: [sdlb] tag#1250 scsi_eh_1: flush retry cmd
  [20247.265621] sd 1:0:2:2: [sdlb] tag#1248 scsi_eh_1: flush retry cmd
  [20247.265622] sd 1:0:2:2: [sdlb] tag#1245 scsi_eh_1: flush retry cmd
  [20247.265624] sd 1:0:2:2: [sdlb] tag#1244 scsi_eh_1: flush retry cmd
  [20247.265625] sd 1:0:2:2: [sdlb] tag#1242 scsi_eh_1: flush retry cmd
  [20247.265626] sd 1:0:2:2: [sdlb] tag#1240 scsi_eh_1: flush retry cmd
  [20247.265629] sd 1:0:2:2: [sdlb] tag#302 scsi_eh_1: flush retry cmd
  [20247.265631] sd 1:0:2:2: [sdlb] tag#297 scsi_eh_1: flush retry cmd
  [20247.265633] sd 1:0:2:2: [sdlb] tag#295 scsi_eh_1: flush retry cmd
  [20247.265635] sd 1:0:2:2: [sdlb] tag#248 scsi_eh_1: flush retry cmd
  [20247.265638] sd 1:0:2:2: [sdlb] tag#246 scsi_eh_1: flush retry cmd
  [20247.265640] sd 1:0:2:2: [sdlb] tag#244 scsi_eh_1: flush retry cmd
  [20247.265642] sd 1:0:2:2: [sdlb] tag#242 scsi_eh_1: flush retry cmd
  [20247.265643] sd 1:0:2:2: [sdlb] tag#241 scsi_eh_1: flush retry cmd
  [20247.265645] sd 1:0:2:2: [sdlb] tag#239 scsi_eh_1: flush retry cmd
  [20247.265647] sd 1:0:2:2: [sdlb] tag#237 scsi_eh_1: flush retry cmd
  [20247.265648] sd 1:0:2:2: [sdlb] tag#236 scsi_eh_1: flush retry cmd
  [20247.265649] sd 1:0:2:2: [sdlb] tag#234 scsi_eh_1: flush retry cmd
  [20247.265651] sd 1:0:2:2: [sdlb] tag#199 scsi_eh_1: flush retry cmd
  [20247.265653] sd 1:0:2:2: [sdlb] tag#195 scsi_eh_1: flush retry cmd
  [20247.265654] sd 1:0:2:2: [sdlb] tag#193 scsi_eh_1: flush retry cmd
  [20247.265655] sd 1:0:3:2: [sdaia] tag#2355 scsi_eh_1: flush retry cmd
  [20247.265658] sd 1:0:3:2: [sdaia] tag#2354 scsi_eh_1: flush retry cmd
  [20247.265659] sd 1:0:3:2: [sdaia] tag#2352 scsi_eh_1: flush retry cmd
  [20247.265660] sd 1:0:3:2: [sdaia] tag#2350 scsi_eh_1: flush retry cmd
  [20247.265662] sd 1:0:3:2: [sdaia] tag#1253 scsi_eh_1: flush retry cmd
  [20247.265664] sd 1:0:3:2: [sdaia] tag#1251 scsi_eh_1: flush retry cmd
  [20247.265665] sd 1:0:3:2: [sdaia] tag#1249 scsi_eh_1: flush retry cmd
  [20247.265666] sd 1:0:3:2: [sdaia] tag#1247 scsi_eh_1: flush retry cmd
  [20247.265668] sd 1:0:3:2: [sdaia] tag#1246 scsi_eh_1: flush retry cmd
  [20247.265669] sd 1:0:3:2: [sdaia] tag#1243 scsi_eh_1: flush retry cmd
  [20247.265670] sd 1:0:3:2: [sdaia] tag#1241 scsi_eh_1: flush retry cmd
  [20247.265671] sd 1:0:3:2: [sdaia] tag#1239 scsi_eh_1: flush retry cmd
  [20247.265673] sd 1:0:3:2: [sdaia] tag#301 scsi_eh_1: flush retry cmd
  [20247.265674] sd 1:0:3:2: [sdaia] tag#299 scsi_eh_1: flush retry cmd
  [20247.265676] sd 1:0:3:2: [sdaia] tag#298 scsi_eh_1: flush retry cmd
  [20247.265677] sd 1:0:3:2: [sdaia] tag#296 scsi_eh_1: flush retry cmd
  [20247.265678] sd 1:0:3:2: [sdaia] tag#294 scsi_eh_1: flush retry cmd
  [20247.265679] sd 1:0:3:2: [sdaia] tag#247 scsi_eh_1: flush retry cmd
  [20247.265681] sd 1:0:3:2: [sdaia] tag#245 scsi_eh_1: flush retry cmd
  [20247.265682] sd 1:0:3:2: [sdaia] tag#243 scsi_eh_1: flush retry cmd
  [20247.265683] sd 1:0:3:2: [sdaia] tag#240 scsi_eh_1: flush retry cmd
  [20247.265684] sd 1:0:3:2: [sdaia] tag#238 scsi_eh_1: flush retry cmd
  [20247.265686] sd 1:0:3:2: [sdaia] tag#235 scsi_eh_1: flush retry cmd
  [20247.265687] sd 1:0:3:2: [sdaia] tag#198 scsi_eh_1: flush retry cmd
  [20247.265688] sd 1:0:3:2: [sdaia] tag#197 scsi_eh_1: flush retry cmd
  [20247.265689] sd 1:0:3:2: [sdaia] tag#196 scsi_eh_1: flush retry cmd
  [20247.265690] sd 1:0:3:2: [sdaia] tag#194 scsi_eh_1: flush retry cmd
  [20247.265692] sd 1:0:3:2: [sdaia] tag#192 scsi_eh_1: flush retry cmd
  [20247.265693] sd 1:0:3:16687: [sdatp] tag#1256 scsi_eh_1: flush retry cmd
  [20247.266024] scsi host1: waking up host to restart
  [20247.268104] scsi 1:0:3:0: scsi scan: REPORT LUNS successful (try 0) result 
0x0
  [20247.268107] scsi 1:0:3:0: scsi scan: REPORT LUN scan
  [20247.268114] scsi 1:0:3:0: scsi scan: device exists on 1:0:3:0
  [20247.268117] sd 1:0:3:1: scsi scan: device exists on 1:0:3:1
  [20247.268119] sd 1:0:3:2: scsi scan: device exists on 1:0:3:2
  ....
  [20247.269151] sd 1:0:3:16686: scsi scan: device exists on 1:0:3:16686
  [20247.269158] sd 1:0:3:16687: scsi scan: device exists on 1:0:3:16687
  [20247.269776] scsi host1: scsi_eh_1: sleeping
  [20247.273485] scsi 1:0:0:0: scsi scan: INQUIRY pass 1 length 36
  [20247.273912] scsi 1:0:0:0: scsi scan: INQUIRY successful with code 0x0
  [20247.273944] scsi 1:0:0:0: scsi scan: INQUIRY pass 2 length 109
  [20247.274650] scsi 1:0:0:0: scsi scan: INQUIRY successful with code 0x0
  [20247.274685] scsi 1:0:0:0: scsi scan: peripheral device type of 31, no 
device added
  [20247.278681] scsi 1:0:0:0: scsi scan: Sending REPORT LUNS to (try 0)
  [20247.279981] scsi 1:0:0:0: scsi scan: REPORT LUNS successful (try 0) result 
0x0
  [20247.279983] scsi 1:0:0:0: scsi scan: REPORT LUN scan
  [20247.279984] scsi 1:0:0:0: scsi scan: device exists on 1:0:0:0
  [20247.281405] scsi 1:0:1:0: scsi scan: INQUIRY pass 1 length 36
  [20247.281517] scsi 1:0:1:0: scsi scan: INQUIRY successful with code 0x0
  [20247.281529] scsi 1:0:1:0: scsi scan: INQUIRY pass 2 length 109
  [20247.281607] scsi 1:0:1:0: scsi scan: INQUIRY successful with code 0x0
  [20247.281612] scsi 1:0:1:0: scsi scan: peripheral device type of 31, no 
device added
  [20247.282750] scsi 1:0:1:0: scsi scan: Sending REPORT LUNS to (try 0)
  [20247.282872] scsi 1:0:1:0: scsi scan: REPORT LUNS successful (try 0) result 
0x0
  [20247.282873] scsi 1:0:1:0: scsi scan: REPORT LUN scan
  [20247.282875] scsi 1:0:1:0: scsi scan: device exists on 1:0:1:0
  [20247.295705] sd 1:0:2:2: Power-on or device reset occurred
  [20247.295717] sd 1:0:2:2: tag#2125 Done: ADD_TO_MLQUEUE Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK
  [20247.295719] sd 1:0:2:2: tag#2125 CDB: Report target port groups a3 2a 00 
00 00 00 00 00 00 80 00 00
  [20247.295721] sd 1:0:2:2: tag#2125 Sense Key : Unit Attention [current]
  [20247.295723] sd 1:0:2:2: tag#2125 Add. Sense: Power on, reset, or bus 
device reset occurred
  [20247.299125] sd 1:0:2:2: alua: port group 00 state A non-preferred supports 
tolusnA
  [20247.299734] sd 1:0:2:2: alua: port group 00 state A non-preferred supports 
tolusnA
  [20247.315850] sd 1:0:3:16687: alua: port group 00 state A non-preferred 
supports tolusnA
  [20482.227493] Unable to handle kernel pointer dereference in virtual kernel 
address space
  [20482.227509] Failing address: 001fffe00c3c0000 TEID: 001fffe00c3c0403
  [20482.227511] Fault in home space mode while using kernel ASCE.
  [20482.227513] AS:000003fa68ac800b R2:000003fa3400800b R3:000003fa3400c007 
S:000003f9fd093800 P:0000000000000400
  [20482.227562] Oops: 0011 ilc:2 [#1] SMP
  [20482.227569] Modules linked in: vhost_net vhost macvtap macvlan tap xfs 
xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables 
iptable_filter bpfilter bridge aufs overlay dm_service_time dm_multipath 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng chsc_sch eadm_sch vfio_ccw 
vfio_mdev mdev vfio_iommu_type1 vfio 8021q garp mrp stp llc sch_fq_codel drm 
drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress 
zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 linear dm_mirror dm_region_hash 
dm_log qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 
libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common 
zfcp scsi_transport_fc dasd_eckd_mod dasd_mod qeth qdio ccwgroup
  [20482.227654] CPU: 17 PID: 0 Comm: swapper/17 Kdump: loaded Not tainted 
5.4.0-29-generic #33-Ubuntu
  [20482.227656] Hardware name: ...
  [20482.227658] Krnl PSW : 0404e00180000000 000003fa67b299be 
(try_to_wake_up+0x4e/0x700)
  [20482.227672]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 
RI:0 EA:3
  [20482.227737] Krnl GPRS: 000003fa273e8588 001fffe000000000 001fffe000000012 
0000000000000003
  [20482.227739]            0000000000000000 0000000000000033 040003fa67a84b18 
0000000000000003
  [20482.227741]            0000000000000003 001fffe00c3c0284 0000000000000000 
001fffe00c3bf930
  [20482.227743]            000003fa330fa200 000003fa6835b088 001fffe001f87ae0 
001fffe001f87a60
  [20482.227758] Krnl Code: 000003fa67b299b2: 41902954          la      
%r9,2388(%r2)
                            000003fa67b299b6: 582003ac          l       %r2,940
                           #000003fa67b299ba: a7180000          lhi     %r1,0
                           >000003fa67b299be: ba129000          cs      
%r1,%r2,0(%r9)
                            000003fa67b299c2: a77401c9          brc     
7,000003fa67b29d54
                            000003fa67b299c6: e310b0080004      lg      
%r1,8(%r11)
                            000003fa67b299cc: b9800018          ngr     %r1,%r8
                            000003fa67b299d0: a774001f          brc     
7,000003fa67b29a0e
  [20482.227775] Call Trace:
  [20482.227784] ([<000003fa6887c8b0>] __irq_regs+0x0/0x8)
  [20482.227791]  [<000003fa67fa56ba>] rq_qos_wake_function+0x8a/0xa0
  [20482.227795]  [<000003fa67b48bde>] __wake_up_common+0x9e/0x1b0
  [20482.227797]  [<000003fa67b490e4>] __wake_up_common_lock+0x94/0xe0
  [20482.227799]  [<000003fa67b4915a>] __wake_up+0x2a/0x40
  [20482.227804]  [<000003fa67fbc640>] wbt_done+0x90/0xe0
  [20482.227806]  [<000003fa67fa57be>] __rq_qos_done+0x3e/0x60
  [20482.227809]  [<000003fa67f915b0>] blk_mq_free_request+0xe0/0x140
  [20482.227816]  [<000003fa68133c60>] dm_softirq_done+0x140/0x230
  [20482.227817]  [<000003fa67f8ffbc>] blk_done_softirq+0xbc/0xe0
  [20482.227822]  [<000003fa68352710>] __do_softirq+0x100/0x360
  [20482.227826]  [<000003fa67af925e>] irq_exit+0x9e/0xc0
  [20482.227829]  [<000003fa67a84b18>] do_IRQ+0x78/0xb0
  [20482.227831]  [<000003fa68351c28>] ext_int_handler+0x128/0x12c
  [20482.227835]  [<000003fa67a7b5bc>] enabled_wait+0x3c/0xd0
  [20482.227840] Last Breaking-Event-Address:
  [20482.227844]  [<000003fa67b2a07e>] wake_up_process+0xe/0x20
  [20482.227872] Kernel panic - not syncing: Fatal exception in interrupt

  You can see at 20226.914939 the start of common code scsi-mod error handling
  (SCSI EH). It finds a bunch of commands that have been running for too long
  (30s IIRC), and it tries to issue aborts for those commands. At 20227.632049
  the FCP HBA resets; this reset is not directly related to the aborts issued by
  SCSI EH - IOW we don't just reset the adapter each time we get an abort. A
  side-effect of the HBA reset ist that the Abort Commands are not issued, hence
  SCSI EH escalates to the next step and issues BDR at 20227.705534, 
subsequently
  even Target reset at 20241.128583. But the HBA reset also mean the commands
  that the midlayer wanted to be abort are effectively aborted. There won't be
  any more completions for them (we also have mutual exclusion due to the CAS
  operation in scsi_times_out() and scsi_mq_done() with SCMD_STATE_COMPLETE 
being
  set in the command state).

  Subsequently I/O works again. Evidence is that the TURs that the midlayer 
sends
  during SCSI EH are successful (e.g. at 20228.787915 directly after the BDR).
  After that we also see the midlayer scanning for new devices and such, which
  all works fine apparently - and is also normal I/O.

  The reason why our HBA goes through reset, and why the timeouts happen is a
  separate issue that we'll research further.

  As far as I've been able to find out, the crash in the WBT code is not 
directly
  related to the HBA reset. We couldn't find ultimate proof in any of the data,
  but it seems the crash is somehow related to the fact that the scsi midlayer
  went through EH, and later tried to wake up a thread waiting in WBT that 
points
  into nothing. So my current hypotheses is that WBT is somehow not integrated
  well in this error handling case with the SCSI midlayer.

  The same test with ECKD storage instead of SCSI does not crash.

  
  We also  chased after the requests that eventually
  caused the crash in the examples I posted above. Digging into the dump we 
could
  recover the request address at 0x3f89b5bb200. This request was serviced in a
  dm-multipath request queue, and was cloned into 0x3f9cc25f300, which was
  serviced by the SCSI midlayer. From there I could also see the LBA that this
  request was operating on:

  crash> print *((struct request *) 0x3f9cc25f300)
  $20 = {
    ....
    __data_len = 0x0,
    __sector = 0x1598e5a0,
    bio = 0x0,
    biotail = 0x3f8c86a5a10,
    ....
  }

  After all data is serviced the __sector address is not increased anymore, so 
we
  can see that this request was sent against LBA 0x1598e5a0. When I went through
  out HBA traces this LBA never occurred with an error, so currently I don't 
have
  much evidence pointing to the HBA driver messing this up.


  Like I said, I currently feel like this crash results from some bad 
interaction
  after SCSI EH between the SCSI midlayer and WBT. But I don't have any strong
  evidence for that as of now, so I can't really be certain.

  It is also a bit surprising that the time distance between End of EH and the
  crash is so long:

  [20247.315850] sd 1:0:3:16687: alua: port group 00 state A non-preferred 
supports tolusnA
  [20482.227493] Unable to handle kernel pointer dereference in virtual kernel 
address space

  nearly 235 seconds (~3 minutes 55 seconds).

  Addl. Comment:

  I did a little research, and it is indeed possible to disable WBT
  during runtime, its just a bit un-userfriendly for lots of disks, so I
  wrote a quick`n`dirty udev-rule file.

  Is it possible for you to retry this test with WBT disabled? I think
  it might prove useful to know whether we can handle the occasional
  queue-slowdown with that feature disabled, and it should further
  narrow down what components are affected (IOW, if it runs fine with
  WBT disabled, someone needs to audit the WBT code, and such).

  Here is the udev-rule to disable WBT for every block-device on the
  system:

      t35lp49 (1) ~ # cat /etc/udev/rules.d/99-disable-blk-wbt.rules
      #
      # Possible workaround for LTC 185913:
      #     Disable the Writeback Throttling Feature.
      #     Main interface is 'queue/wbt_lat_usec'. When writing '0' into it, 
the
      #     feature should be disabled for the queue in question
      #
      # Rule Syntax: https://www.freedesktop.org/software/systemd/man/udev.html
      #

      SUBSYSTEM!="block", GOTO="wbt_disable_end"
      ACTION=="remove", GOTO="wbt_disable_end"

      TEST{0644}=="queue/wbt_lat_usec", ATTR{queue/wbt_lat_usec}!="0",
  ATTR{queue/wbt_lat_usec}="0"

      LABEL="wbt_disable_end"

  Its a bit coarse (like I said, quick`n`dirty), but as I don't really
  know your storage-setup, this should catch everything. You can also
  test whether it worked before the test with `cat
  /sys/class/block/*/queue/wbt_lat_usec`; it should display '0'.

  Regarding the queue slowdowns we meanwhile have a theory as well, but
  that will require running a modified kernel, so lets do one step at a
  time.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1881109/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to