Hi,

While testing device-failure cases with failmode=wait we found the
following:
- The vdevs use iSCSI targets on the backend.
- We induced I/O failure by dropping all packets from the iSCSI target.
- As a result, the zpool saw I/O errors on the vdevs.
- The spa_sync() thread got stuck waiting for its I/Os to complete.

After this we repaired the iSCSI path by restoring the packet flow.

And when I then tried 'zpool clear':

   - With failmode=continue, it fails to clear the pool even though the
   underlying iSCSI device has recovered.
   - With failmode=wait, zfs_ioc_clear() fails to reopen the device but does
   not report an error. "zpool clear" therefore wrongly assumes that
   zfs_ioc_clear() succeeded and issues zfs_ioc_log_history(), which gets
   stuck in the kernel at dmu_tx_assign() waiting for spa_sync() to complete
   (see the sketch after this list).
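
For the failmode=wait case, the problematic pattern looks roughly like the
following. This is a simplified sketch of the ZFS_IOC_CLEAR path written for
illustration, not the actual zfs_ioctl.c code; the helper name
zfs_ioc_clear_sketch is ours:

static int
zfs_ioc_clear_sketch(spa_t *spa, vdev_t *vd)
{
        spa_vdev_state_enter(spa, SCL_NONE);

        /* Reset the error counters and ask for the vdev to be reopened... */
        vdev_clear(spa, vd);

        /*
         * ...but whether the reopen actually succeeded (here it does not,
         * because vdev_remove_wanted is still set) never makes it into the
         * value returned to userland.
         */
        (void) spa_vdev_state_exit(spa, vd, 0);

        return (0);     /* so "zpool clear" believes the clear succeeded */
}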

Below is the stack where 'zpool clear' is stuck in the kernel:

-- snip --
PID: 21037 TASK: ffff8800ba96aa80 CPU: 2 COMMAND: "zpool"
#0 [ffff880220323b30] __schedule at ffffffff816cb514
#1 [ffff880220323be0] schedule at ffffffff816cbc10
#2 [ffff880220323c00] cv_wait_common at ffffffffa08d9275 [spl]
#3 [ffff880220323c80] __cv_wait at ffffffffa08d9305 [spl]
#4 [ffff880220323c90] txg_wait_synced at ffffffffa0999c99 [zfs]
#5 [ffff880220323ce0] dmu_tx_wait at ffffffffa095426e [zfs]
#6 [ffff880220323d50] dmu_tx_assign at ffffffffa095436a [zfs]
#7 [ffff880220323d80] spa_history_log_nvl at ffffffffa099106e [zfs]
#8 [ffff880220323dd0] spa_history_log at ffffffffa0991194 [zfs]
#9 [ffff880220323e00] zfs_ioc_log_history at ffffffffa09c380b [zfs]
#10 [ffff880220323e40] zfsdev_ioctl at ffffffffa09c4357 [zfs]
#11 [ffff880220323eb0] do_vfs_ioctl at ffffffff81216072
#12 [ffff880220323f00] sys_ioctl at ffffffff81216402
#13 [ffff880220323f50] entry_SYSCALL_64_fastpath at ffffffff816cf76e
RIP: 00007f17bcc9ba77 RSP: 00007ffd708a7f98 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f17bcc9ba77
RDX: 00007ffd708a7fa0 RSI: 0000000000005a3f RDI: 0000000000000003
RBP: 0000000001157060 R8: 0000000001157f20 R9: 616332666162362d
R10: 0000000039373030 R11: 0000000000000246 R12: 000000000061fce0
R13: 000000000061e980 R14: 000000000061ead0 R15: 000000000000000e
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
-- snip --
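
The frames above match the history-logging path: zfs_ioc_log_history()
assigns its transaction with TXG_WAIT, and with TXG_WAIT dmu_tx_assign()
keeps retrying via dmu_tx_wait(), which ends up in txg_wait_synced().
Roughly, and paraphrased rather than quoted from dmu_tx.c (the _sketch
suffix and simplified signature are ours):

int
dmu_tx_assign_sketch(dmu_tx_t *tx)
{
        int err;

        /*
         * TXG_WAIT semantics: if the tx cannot be assigned to the open
         * txg, wait and retry.  dmu_tx_wait() may call txg_wait_synced(),
         * which sleeps until spa_sync() finishes a txg -- and in this
         * scenario spa_sync() is itself stuck, so the ioctl never returns.
         */
        while ((err = dmu_tx_try_assign(tx)) != 0) {
                dmu_tx_unassign(tx);
                dmu_tx_wait(tx);
        }
        return (0);
}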

Upon further analysis we found that vdev_reopen() was failing because the
I/Os issued by vdev_probe() were failing: vdev_accessible(), called from
zio_vdev_io_start(), was returning false because vdev_remove_wanted was
still set on the vdev (a sketch of this check follows the stack trace
below). The flag was set when the vdev originally faulted, and it should
have been cleared by spa_async_remove(), which is spawned via
spa_sync()->spa_async_dispatch(). However, the spa_sync() thread for this
pool is stuck waiting for sync I/Os to complete:
-- snip --
PID: 24311 TASK: ffff8802a03d6a40 CPU: 2 COMMAND: "txg_sync"
#0 [ffff8802a03eb940] __schedule at ffffffff816cb514
#1 [ffff8802a03eb9f0] schedule at ffffffff816cbc10
#2 [ffff8802a03eba10] schedule_timeout at ffffffff816ce7ca
#3 [ffff8802a03ebad0] io_schedule_timeout at ffffffff816cb1f4
#4 [ffff8802a03ebb10] cv_wait_common at ffffffffa08d6072 [spl]
#5 [ffff8802a03ebb90] __cv_wait_io at ffffffffa08d6228 [spl]
#6 [ffff8802a03ebba0] zio_wait at ffffffffa0a43bdd [zfs]
#7 [ffff8802a03ebbf0] dsl_pool_sync_mos at ffffffffa09993e7 [zfs]
#8 [ffff8802a03ebc20] dsl_pool_sync at ffffffffa099a1e2 [zfs]
#9 [ffff8802a03ebcb0] spa_sync at ffffffffa09be347 [zfs]
#10 [ffff8802a03ebd60] txg_sync_thread at ffffffffa09d8681 [zfs]
#11 [ffff8802a03ebe80] thread_generic_wrapper at ffffffffa08cfb11 [spl]
#12 [ffff8802a03ebec0] kthread at ffffffff810a263c
#13 [ffff8802a03ebf50] ret_from_fork at ffffffff816cfacf
-- snip --
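
Here is the sketch of the check mentioned above: while vdev_remove_wanted is
set, vdev_accessible() fails every I/O to the vdev, including the probe I/Os
issued by vdev_probe(), so vdev_reopen() can never succeed. This is a
paraphrase of the function as we read it in vdev.c, not verbatim source:

boolean_t
vdev_accessible(vdev_t *vd, zio_t *zio)
{
        ASSERT(zio->io_vd == vd);

        /*
         * A vdev flagged for removal is treated like a dead vdev, so even
         * the probe I/Os from vdev_probe() are refused here when
         * zio_vdev_io_start() consults this check.
         */
        if (vdev_is_dead(vd) || vd->vdev_remove_wanted)
                return (B_FALSE);

        if (zio->io_type == ZIO_TYPE_READ)
                return (!vd->vdev_cant_read);
        if (zio->io_type == ZIO_TYPE_WRITE)
                return (!vd->vdev_cant_write);

        return (B_TRUE);
}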

So this is effectively a soft deadlock: the pool cannot be cleared until
vdev_remove_wanted is handled, and spa_sync(), which is supposed to clear
it, cannot get there until the outstanding I/Os complete.
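
For reference, the only place vdev_remove_wanted gets cleared is the async
task that spa_sync() dispatches via spa_async_dispatch(). Roughly (a
paraphrase of spa_async_remove() in spa.c, not verbatim source):

static void
spa_async_remove(spa_t *spa, vdev_t *vd)
{
        if (vd->vdev_remove_wanted) {
                /* Clear the flag and mark the vdev REMOVED. */
                vd->vdev_remove_wanted = B_FALSE;
                vdev_set_state(vd, B_FALSE, VDEV_STATE_REMOVED,
                    VDEV_AUX_NONE);
                /* the vdev's error counters are reset here as well */
        }

        for (int c = 0; c < vd->vdev_children; c++)
                spa_async_remove(spa, vd->vdev_child[c]);
}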


