Hi,

While testing device-failure cases with failmode=wait we found the following:

- The vdevs use iSCSI targets on the backend.
- We induced I/O failure by dropping all packets from the iSCSI target.
- This resulted in the zpool seeing I/O errors on the vdevs.
- The spa_sync() thread was stuck waiting for I/Os to complete.
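For context, the failmode property controls how the ZIO pipeline reacts to a catastrophic I/O failure. The sketch below is illustrative only -- handle_fatal_zio_error() is a hypothetical helper, and the real logic lives in zio_done()/zio_suspend() in module/zfs/zio.c and differs by version -- but it shows why failmode=wait leaves spa_sync() blocked: every fatal I/O error suspends the pool and parks the zio until the pool is resumed.

/*
 * Illustrative sketch only (hypothetical helper, assumes ZFS
 * kernel headers); see zio_done()/zio_suspend() in
 * module/zfs/zio.c for the real logic.
 */
static int
handle_fatal_zio_error(spa_t *spa, zio_t *zio)
{
	switch (spa_get_failmode(spa)) {
	case ZIO_FAILURE_MODE_WAIT:
		/*
		 * Suspend the pool and park the zio until the pool
		 * is resumed (e.g. by "zpool clear").  Anything
		 * waiting on this zio -- such as spa_sync() in
		 * zio_wait() -- blocks indefinitely.
		 */
		zio_suspend(spa, zio);
		return (0);

	case ZIO_FAILURE_MODE_CONTINUE:
		/*
		 * Fail the I/O back to the caller with EIO where
		 * possible; sync I/Os still end up suspending the
		 * pool.
		 */
		return (SET_ERROR(EIO));

	case ZIO_FAILURE_MODE_PANIC:
		/* Panic the system on any uncorrectable failure. */
		panic("Pool '%s': uncorrectable I/O failure",
		    spa_name(spa));
	}
	return (0);
}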
We then repaired the iSCSI path by restoring the packet flow and ran 'zpool clear':

- With failmode=continue, 'zpool clear' fails to clear the pool even though the underlying iSCSI device has recovered.
- With failmode=wait, zfs_ioc_clear() fails to reopen the device but does not report an error. "zpool clear" therefore wrongly assumes that zfs_ioc_clear() succeeded and issues zfs_ioc_log_history(), which gets stuck in the kernel at dmu_tx_assign() waiting for spa_sync() to complete.

Below is the stack where 'zpool clear' is stuck in the kernel:

-- snip --
PID: 21037  TASK: ffff8800ba96aa80  CPU: 2  COMMAND: "zpool"
 #0 [ffff880220323b30] __schedule at ffffffff816cb514
 #1 [ffff880220323be0] schedule at ffffffff816cbc10
 #2 [ffff880220323c00] cv_wait_common at ffffffffa08d9275 [spl]
 #3 [ffff880220323c80] __cv_wait at ffffffffa08d9305 [spl]
 #4 [ffff880220323c90] txg_wait_synced at ffffffffa0999c99 [zfs]
 #5 [ffff880220323ce0] dmu_tx_wait at ffffffffa095426e [zfs]
 #6 [ffff880220323d50] dmu_tx_assign at ffffffffa095436a [zfs]
 #7 [ffff880220323d80] spa_history_log_nvl at ffffffffa099106e [zfs]
 #8 [ffff880220323dd0] spa_history_log at ffffffffa0991194 [zfs]
 #9 [ffff880220323e00] zfs_ioc_log_history at ffffffffa09c380b [zfs]
#10 [ffff880220323e40] zfsdev_ioctl at ffffffffa09c4357 [zfs]
#11 [ffff880220323eb0] do_vfs_ioctl at ffffffff81216072
#12 [ffff880220323f00] sys_ioctl at ffffffff81216402
#13 [ffff880220323f50] entry_SYSCALL_64_fastpath at ffffffff816cf76e
    RIP: 00007f17bcc9ba77  RSP: 00007ffd708a7f98  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f17bcc9ba77
    RDX: 00007ffd708a7fa0  RSI: 0000000000005a3f  RDI: 0000000000000003
    RBP: 0000000001157060   R8: 0000000001157f20   R9: 616332666162362d
    R10: 0000000039373030  R11: 0000000000000246  R12: 000000000061fce0
    R13: 000000000061e980  R14: 000000000061ead0  R15: 000000000000000e
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
-- snip --

Upon further analysis we found that vdev_reopen() was failing because the I/Os issued by vdev_probe() were failing: vdev_accessible(), called from zio_vdev_io_start(), was failing because vdev->vdev_remove_wanted was set on the vdev. This flag was set when the vdev originally faulted, and it should have been cleared by spa_async_remove(), which should have been spawned via spa_sync()->spa_async_dispatch(). However, the spa_sync() thread for this pool is stuck waiting for sync I/Os to complete.
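For reference, the check that rejects the probe I/Os looks like this (paraphrased from vdev_accessible() in module/zfs/vdev.c; check your tree for the exact code):

boolean_t
vdev_accessible(vdev_t *vd, zio_t *zio)
{
	ASSERT(zio->io_vd == vd);

	/*
	 * A dead vdev, or one with a pending async removal, is
	 * never accessible -- so the vdev_probe() zios fail while
	 * vdev_remove_wanted is still set.
	 */
	if (vdev_is_dead(vd) || vd->vdev_remove_wanted)
		return (B_FALSE);

	if (zio->io_type == ZIO_TYPE_READ && !vd->vdev_cant_read)
		return (B_TRUE);

	if (zio->io_type == ZIO_TYPE_WRITE && !vd->vdev_cant_write)
		return (B_TRUE);

	return (B_FALSE);
}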
-- snip --
PID: 24311  TASK: ffff8802a03d6a40  CPU: 2  COMMAND: "txg_sync"
 #0 [ffff8802a03eb940] __schedule at ffffffff816cb514
 #1 [ffff8802a03eb9f0] schedule at ffffffff816cbc10
 #2 [ffff8802a03eba10] schedule_timeout at ffffffff816ce7ca
 #3 [ffff8802a03ebad0] io_schedule_timeout at ffffffff816cb1f4
 #4 [ffff8802a03ebb10] cv_wait_common at ffffffffa08d6072 [spl]
 #5 [ffff8802a03ebb90] __cv_wait_io at ffffffffa08d6228 [spl]
 #6 [ffff8802a03ebba0] zio_wait at ffffffffa0a43bdd [zfs]
 #7 [ffff8802a03ebbf0] dsl_pool_sync_mos at ffffffffa09993e7 [zfs]
 #8 [ffff8802a03ebc20] dsl_pool_sync at ffffffffa099a1e2 [zfs]
 #9 [ffff8802a03ebcb0] spa_sync at ffffffffa09be347 [zfs]
#10 [ffff8802a03ebd60] txg_sync_thread at ffffffffa09d8681 [zfs]
#11 [ffff8802a03ebe80] thread_generic_wrapper at ffffffffa08cfb11 [spl]
#12 [ffff8802a03ebec0] kthread at ffffffff810a263c
#13 [ffff8802a03ebf50] ret_from_fork at ffffffff816cfacf
-- snip --

So this is effectively a soft deadlock: the pool cannot be cleared until vdev_remove_wanted is handled, and spa_sync(), which is supposed to clear it, will never get there until the outstanding I/Os complete.
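For completeness, the only place vdev_remove_wanted gets cleared is the async-remove task, roughly as below (paraphrased from spa_async_remove() in module/zfs/spa.c; details vary by version). Because the task is dispatched from spa_sync() via spa_async_dispatch(), it can never run while spa_sync() is stuck in zio_wait(), which closes the loop on the deadlock:

static void
spa_async_remove(spa_t *spa, vdev_t *vd)
{
	if (vd->vdev_remove_wanted) {
		/* Clear the flag that makes vdev_accessible() fail. */
		vd->vdev_remove_wanted = B_FALSE;
		vd->vdev_delayed_close = B_FALSE;
		vdev_set_state(vd, B_FALSE, VDEV_STATE_REMOVED,
		    VDEV_AUX_NONE);

		/* Reset the per-vdev error counters. */
		vd->vdev_stat.vs_read_errors = 0;
		vd->vdev_stat.vs_write_errors = 0;
		vd->vdev_stat.vs_checksum_errors = 0;

		vdev_state_dirty(vd->vdev_top);
	}

	for (int c = 0; c < vd->vdev_children; c++)
		spa_async_remove(spa, vd->vdev_child[c]);
}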
