Oops... hit send a little too early. Please ignore this email; I will send out the complete email shortly.

On Wed, Feb 15, 2017 at 3:56 PM, sanjeev bagewadi <[email protected]> wrote:
> Hi,
>
> While testing for device failure cases with failmode=wait we found the
> following :
> - The vdevs are using iscsi targets in the backend.
> - induced IO failure by dropping all packets from the iscsi-target
> - this resulted in the zpool seeing IO errors on the vdevs.
> - the spa_sync() thread was stuck waiting for IOs to complete.
>
> And after this fixed the iscsi path by restoring the packet flow.
>
> And when I tried 'zpool clear' :
>
> - fails to clear (if failmode=continue) the pool although the
>   underlying iSCSI device has recovered.
> - And if failmode=wait, the zfs_ioc_clear() although it has failed to
>   reopen the device, does not report an error. And hence, "zpool clear"
>   wrongly assumes that the zfs_ioc_clear() succeeded and
>   issues zfs_ioc_log_history() which gets stuck in the kernel at
>   dmu_tx_assign() waiting for spa_sync() to complete.
>
> Below is the stack where 'zpool clear' is stuck in the kernel :
>
> -- snip --
> PID: 21037 TASK: ffff8800ba96aa80 CPU: 2 COMMAND: "zpool"
> #0 [ffff880220323b30] __schedule at ffffffff816cb514
> #1 [ffff880220323be0] schedule at ffffffff816cbc10
> #2 [ffff880220323c00] cv_wait_common at ffffffffa08d9275 [spl]
> #3 [ffff880220323c80] __cv_wait at ffffffffa08d9305 [spl]
> #4 [ffff880220323c90] txg_wait_synced at ffffffffa0999c99 [zfs]
> #5 [ffff880220323ce0] dmu_tx_wait at ffffffffa095426e [zfs]
> #6 [ffff880220323d50] dmu_tx_assign at ffffffffa095436a [zfs]
> #7 [ffff880220323d80] spa_history_log_nvl at ffffffffa099106e [zfs]
> #8 [ffff880220323dd0] spa_history_log at ffffffffa0991194 [zfs]
> #9 [ffff880220323e00] zfs_ioc_log_history at ffffffffa09c380b [zfs]
> #10 [ffff880220323e40] zfsdev_ioctl at ffffffffa09c4357 [zfs]
> #11 [ffff880220323eb0] do_vfs_ioctl at ffffffff81216072
> #12 [ffff880220323f00] sys_ioctl at ffffffff81216402
> #13 [ffff880220323f50] entry_SYSCALL_64_fastpath at ffffffff816cf76e
> RIP: 00007f17bcc9ba77 RSP: 00007ffd708a7f98 RFLAGS: 00000246
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f17bcc9ba77
> RDX: 00007ffd708a7fa0 RSI: 0000000000005a3f RDI: 0000000000000003
> RBP: 0000000001157060 R8: 0000000001157f20 R9: 616332666162362d
> R10: 0000000039373030 R11: 0000000000000246 R12: 000000000061fce0
> R13: 000000000061e980 R14: 000000000061ead0 R15: 000000000000000e
> ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
> -- snip --
>
> Upon further analysis found that the vdev_reopen() was failing because,
> the IOs issued by vdev_probe() were failing. vdev_accessible() called by
> zio_vdev_io_start() was failing and this was because,
> vdev->vdev_remove_wanted was set on the vdev. This was set when the
> vdev originally faulted. And this should have been cleared by
> spa_async_remove() which should have been spawned by
> spa_sync()->spa_async_dispatch().
> However, the spa_sync() thread for this pool is stuck waiting for sync-IOs
> to complete.
>
> -- snip --
> PID: 24311 TASK: ffff8802a03d6a40 CPU: 2 COMMAND: "txg_sync"
> #0 [ffff8802a03eb940] __schedule at ffffffff816cb514
> #1 [ffff8802a03eb9f0] schedule at ffffffff816cbc10
> #2 [ffff8802a03eba10] schedule_timeout at ffffffff816ce7ca
> #3 [ffff8802a03ebad0] io_schedule_timeout at ffffffff816cb1f4
> #4 [ffff8802a03ebb10] cv_wait_common at ffffffffa08d6072 [spl]
> #5 [ffff8802a03ebb90] __cv_wait_io at ffffffffa08d6228 [spl]
> #6 [ffff8802a03ebba0] zio_wait at ffffffffa0a43bdd [zfs]
> #7 [ffff8802a03ebbf0] dsl_pool_sync_mos at ffffffffa09993e7 [zfs]
> #8 [ffff8802a03ebc20] dsl_pool_sync at ffffffffa099a1e2 [zfs]
> #9 [ffff8802a03ebcb0] spa_sync at ffffffffa09be347 [zfs]
> #10 [ffff8802a03ebd60] txg_sync_thread at ffffffffa09d8681 [zfs]
> #11 [ffff8802a03ebe80] thread_generic_wrapper at ffffffffa08cfb11 [spl]
> #12 [ffff8802a03ebec0] kthread at ffffffff810a263c
> #13 [ffff8802a03ebf50] ret_from_fork at ffffffff816cfacf
> -- snip --
>
> So, that is sort of a soft deadlock because, the zpool cannot be cleared
> unless the "vdev_remove_wanted" is handled and spa_sync() which is supposed
> to clear it, won't get there unless the IOs complete.
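
The circular dependency described in the quoted analysis can be sketched as a small wait-for graph. This is only a toy model in Python, not ZFS code: the node names are informal labels taken from the description, and the last edge (suspended sync IOs complete only once the pool is cleared) is an assumption based on a reading of failmode=wait rather than something the stacks show directly.

```python
# Toy wait-for graph for the soft deadlock described above.
# Each key "waits for" the entries in its list. The labels are informal,
# hypothetical names for illustration; they are not ZFS data structures.
WAITS_FOR = {
    "zpool clear": ["vdev_reopen succeeds"],
    "vdev_reopen succeeds": ["vdev_remove_wanted cleared"],
    "vdev_remove_wanted cleared": ["spa_sync reaches spa_async_dispatch"],
    "spa_sync reaches spa_async_dispatch": ["sync IOs complete"],
    # Assumption: with failmode=wait the IOs only resume after a clear.
    "sync IOs complete": ["zpool clear"],
}


def find_cycle(graph, start):
    """Depth-first search; return the nodes of a cycle reachable from start, or None."""
    path = []        # current DFS path, in order
    on_path = set()  # same nodes as path, for O(1) membership checks
    visited = set()  # nodes already fully explored or in progress

    def visit(node):
        if node in on_path:
            # Found a back edge: slice out the cycle and close it.
            return path[path.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        return None

    return visit(start)


if __name__ == "__main__":
    cycle = find_cycle(WAITS_FOR, "zpool clear")
    print(" -> ".join(cycle) if cycle else "no cycle")
```

Removing any one edge from the model, for example handling vdev_remove_wanted outside of spa_sync(), makes find_cycle() return None, which is the kind of change a fix would need to make in the real code paths.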
