Oops, I hit send a little too early. Please ignore this email.
I will send out the complete email shortly.

On Wed, Feb 15, 2017 at 3:56 PM, sanjeev bagewadi wrote:

> Hi,
>
> While testing device failure cases with failmode=wait, we found the
> following:
> - The vdevs use iSCSI targets as the backend.
> - We induced I/O failures by dropping all packets from the iSCSI target.
> - This resulted in the zpool seeing I/O errors on the vdevs.
> - The spa_sync() thread was stuck waiting for I/Os to complete.
>
> After this, we fixed the iSCSI path by restoring the packet flow.
>
> When I then tried 'zpool clear':
>
>    - With failmode=continue, it fails to clear the pool even though the
>    underlying iSCSI device has recovered.
>    - With failmode=wait, zfs_ioc_clear() does not report an error even
>    though it failed to reopen the device. Hence "zpool clear" wrongly
>    assumes that zfs_ioc_clear() succeeded and issues
>    zfs_ioc_log_history(), which gets stuck in the kernel at
>    dmu_tx_assign() waiting for spa_sync() to complete.
>
> Below is the stack where 'zpool clear' is stuck in the kernel:
>
> -- snip --
> PID: 21037 TASK: ffff8800ba96aa80 CPU: 2 COMMAND: "zpool"
> #0 [ffff880220323b30] __schedule at ffffffff816cb514
> #1 [ffff880220323be0] schedule at ffffffff816cbc10
> #2 [ffff880220323c00] cv_wait_common at ffffffffa08d9275 [spl]
> #3 [ffff880220323c80] __cv_wait at ffffffffa08d9305 [spl]
> #4 [ffff880220323c90] txg_wait_synced at ffffffffa0999c99 [zfs]
> #5 [ffff880220323ce0] dmu_tx_wait at ffffffffa095426e [zfs]
> #6 [ffff880220323d50] dmu_tx_assign at ffffffffa095436a [zfs]
> #7 [ffff880220323d80] spa_history_log_nvl at ffffffffa099106e [zfs]
> #8 [ffff880220323dd0] spa_history_log at ffffffffa0991194 [zfs]
> #9 [ffff880220323e00] zfs_ioc_log_history at ffffffffa09c380b [zfs]
> #10 [ffff880220323e40] zfsdev_ioctl at ffffffffa09c4357 [zfs]
> #11 [ffff880220323eb0] do_vfs_ioctl at ffffffff81216072
> #12 [ffff880220323f00] sys_ioctl at ffffffff81216402
> #13 [ffff880220323f50] entry_SYSCALL_64_fastpath at ffffffff816cf76e
> RIP: 00007f17bcc9ba77 RSP: 00007ffd708a7f98 RFLAGS: 00000246
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f17bcc9ba77
> RDX: 00007ffd708a7fa0 RSI: 0000000000005a3f RDI: 0000000000000003
> RBP: 0000000001157060 R8: 0000000001157f20 R9: 616332666162362d
> R10: 0000000039373030 R11: 0000000000000246 R12: 000000000061fce0
> R13: 000000000061e980 R14: 000000000061ead0 R15: 000000000000000e
> ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
> -- snip --
>
> Further analysis showed that vdev_reopen() was failing because the I/Os
> issued by vdev_probe() were failing. vdev_accessible(), called by
> zio_vdev_io_start(), was failing because vdev->vdev_remove_wanted was set
> on the vdev. This flag was set when the vdev originally faulted, and it
> should have been cleared by spa_async_remove(), which should have been
> spawned by spa_sync()->spa_async_dispatch().
> However, the spa_sync() thread for this pool is stuck waiting for sync I/Os
> to complete:
> -- snip --
> PID: 24311 TASK: ffff8802a03d6a40 CPU: 2 COMMAND: "txg_sync"
> #0 [ffff8802a03eb940] __schedule at ffffffff816cb514
> #1 [ffff8802a03eb9f0] schedule at ffffffff816cbc10
> #2 [ffff8802a03eba10] schedule_timeout at ffffffff816ce7ca
> #3 [ffff8802a03ebad0] io_schedule_timeout at ffffffff816cb1f4
> #4 [ffff8802a03ebb10] cv_wait_common at ffffffffa08d6072 [spl]
> #5 [ffff8802a03ebb90] __cv_wait_io at ffffffffa08d6228 [spl]
> #6 [ffff8802a03ebba0] zio_wait at ffffffffa0a43bdd [zfs]
> #7 [ffff8802a03ebbf0] dsl_pool_sync_mos at ffffffffa09993e7 [zfs]
> #8 [ffff8802a03ebc20] dsl_pool_sync at ffffffffa099a1e2 [zfs]
> #9 [ffff8802a03ebcb0] spa_sync at ffffffffa09be347 [zfs]
> #10 [ffff8802a03ebd60] txg_sync_thread at ffffffffa09d8681 [zfs]
> #11 [ffff8802a03ebe80] thread_generic_wrapper at ffffffffa08cfb11 [spl]
> #12 [ffff8802a03ebec0] kthread at ffffffff810a263c
> #13 [ffff8802a03ebf50] ret_from_fork at ffffffff816cfacf
> -- snip --
>
> So this is effectively a soft deadlock: the zpool cannot be cleared until
> "vdev_remove_wanted" is handled, and spa_sync(), which is supposed to clear
> it, won't get there until the I/Os complete.
>
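
For illustration only (this is not ZFS code), the circular dependency described in the quoted message can be modelled as a small state machine. The Python sketch below uses hypothetical names throughout (ToyPool, fault_device, restore_path, try_sync, try_clear, recovers). It also assumes that, with failmode=wait, the pending sync I/Os are only reissued once a clear succeeds; that is an assumption of the model, not something shown in the traces.

```python
from dataclasses import dataclass


@dataclass
class ToyPool:
    """Hypothetical stand-in for the pool state described in the thread."""
    device_reachable: bool = True    # is the iSCSI path passing packets?
    remove_wanted: bool = False      # models vdev->vdev_remove_wanted
    sync_ios_pending: bool = False   # models spa_sync() blocked in zio_wait()


def fault_device(pool: ToyPool) -> None:
    """Drop the iSCSI path: the vdev faults and sync I/O gets stuck."""
    pool.device_reachable = False
    pool.remove_wanted = True
    pool.sync_ios_pending = True


def restore_path(pool: ToyPool) -> None:
    """Restore packet flow; by itself this does not change any pool flag."""
    pool.device_reachable = True


def try_sync(pool: ToyPool) -> bool:
    """Models spa_sync(): it reaches the async dispatch (which would clear
    remove_wanted) only after its sync I/Os have completed."""
    if pool.sync_ios_pending:
        return False
    pool.remove_wanted = False
    return True


def try_clear(pool: ToyPool) -> bool:
    """Models 'zpool clear': the reopen probe fails while remove_wanted is
    set. Assumption: a successful clear is what lets pending I/Os finish."""
    if pool.remove_wanted or not pool.device_reachable:
        return False
    pool.sync_ios_pending = False
    return True


def recovers(pool: ToyPool, rounds: int = 10) -> bool:
    """Alternate sync and clear attempts; report whether the pool recovers."""
    for _ in range(rounds):
        try_sync(pool)
        try_clear(pool)
        if not pool.remove_wanted and not pool.sync_ios_pending:
            return True
    return False


if __name__ == "__main__":
    pool = ToyPool()
    fault_device(pool)
    restore_path(pool)
    print("recovers after path restore:", recovers(pool))
```

In this model neither step can go first: try_clear waits on remove_wanted and try_sync waits on the pending I/Os, so the pool stays stuck after the path is restored. Clearing remove_wanted by hand (outside of try_sync) is enough to break the cycle in the model, which matches the dependency described above.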


