Several FreeBSD users have reported a panic caused by the
avl_is_empty(&dn->dn_dbufs) assertion in dnode_sync_free(). I was also able to
reproduce the problem with ZFS on Linux 0.6.5. There do not seem to be any
reports from illumos users.
I think that the following stack traces demonstrate the problem rather well
(they are a little unusual as they come from Linux's crash utility, but should
be legible):
crash> foreach UN bt
PID: 703 TASK: ffff88003b8a4440 CPU: 0 COMMAND: "txg_sync"
#0 [ffff880039fa3848] __schedule at ffffffff8160918d
#1 [ffff880039fa38b0] schedule at ffffffff816096e9
#2 [ffff880039fa38c0] spl_panic at ffffffffa0012645 [spl]
#3 [ffff880039fa3a48] dnode_sync at ffffffffa062b7cf [zfs]
#4 [ffff880039fa3b38] dmu_objset_sync_dnodes at ffffffffa0612dd7 [zfs]
#5 [ffff880039fa3b78] dmu_objset_sync at ffffffffa06130d5 [zfs]
#6 [ffff880039fa3c50] dsl_pool_sync at ffffffffa0641a8a [zfs]
#7 [ffff880039fa3cd0] spa_sync at ffffffffa0664408 [zfs]
#8 [ffff880039fa3da0] txg_sync_thread at ffffffffa067b970 [zfs]
#9 [ffff880039fa3e98] thread_generic_wrapper at ffffffffa000e18a [spl]
#10 [ffff880039fa3ec8] kthread at ffffffff8109726f
#11 [ffff880039fa3f50] ret_from_fork at ffffffff81614198
PID: 716 TASK: ffff88003b8a6660 CPU: 0 COMMAND: "trial"
#0 [ffff88003c68f738] __schedule at ffffffff8160918d
#1 [ffff88003c68f7a0] schedule at ffffffff816096e9
#2 [ffff88003c68f7b0] cv_wait_common at ffffffffa0014d15 [spl]
#3 [ffff88003c68f818] __cv_wait at ffffffffa0014e65 [spl]
#4 [ffff88003c68f828] txg_wait_synced at ffffffffa067a70f [zfs]
#5 [ffff88003c68f868] dsl_sync_task at ffffffffa064b017 [zfs]
#6 [ffff88003c68f928] dsl_destroy_head at ffffffffa06eee62 [zfs]
#7 [ffff88003c68f978] dmu_recv_cleanup_ds at ffffffffa06194ed [zfs]
#8 [ffff88003c68fa98] dmu_recv_stream at ffffffffa061a992 [zfs]
#9 [ffff88003c68fc20] zfs_ioc_recv at ffffffffa06b1bad [zfs]
#10 [ffff88003c68fe50] zfsdev_ioctl at ffffffffa06b3c86 [zfs]
#11 [ffff88003c68feb8] do_vfs_ioctl at ffffffff811d9ca5
#12 [ffff88003c68ff30] sys_ioctl at ffffffff811d9f21
#13 [ffff88003c68ff80] system_call_fastpath at ffffffff81614249
RIP: 00007ff39d5c0257 RSP: 00007ff38e5c2008 RFLAGS: 00010206
RAX: 0000000000000010 RBX: ffffffff81614249 RCX: 0000000000000024
RDX: 00007ff38e5c21d0 RSI: 0000000000005a1b RDI: 0000000000000004
RBP: 00007ff38e5c57b0 R8: 342d663438372d62 R9: 636430382d646335
R10: 643266636131612d R11: 0000000000000246 R12: 0000000000000060
R13: 00007ff38e5c3200 R14: 00007ff3880080a0 R15: 00007ff38e5c21d0
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
PID: 31758 TASK: ffff88003b332d80 CPU: 0 COMMAND: "dbu_evict"
#0 [ffff88003b723ca0] __schedule at ffffffff8160918d
#1 [ffff88003b723d08] schedule_preempt_disabled at ffffffff8160a8d9
#2 [ffff88003b723d18] __mutex_lock_slowpath at ffffffff81608625
#3 [ffff88003b723d78] mutex_lock at ffffffff81607a8f
#4 [ffff88003b723d90] dbuf_rele at ffffffffa05fd290 [zfs]
#5 [ffff88003b723db0] dmu_buf_rele at ffffffffa05fe57e [zfs]
#6 [ffff88003b723dc0] bpobj_close at ffffffffa05f78ed [zfs]
#7 [ffff88003b723dd8] dsl_deadlist_close at ffffffffa0636e19 [zfs]
#8 [ffff88003b723e10] dsl_dataset_evict at ffffffffa062d78b [zfs]
#9 [ffff88003b723e28] taskq_thread at ffffffffa000f912 [spl]
#10 [ffff88003b723ec8] kthread at ffffffff8109726f
#11 [ffff88003b723f50] ret_from_fork at ffffffff81614198
In 100% of the cases where I hit the assertion, it was with DMU_OT_BPOBJ dnodes.
Justin thinks that the situation is harmless and the assertion can be removed.
I agree with him.
But on the other hand, I wonder if something could be done in the DSL to avoid
the described situation.
I mean, it seems that bpo_cached_dbuf is a rare (the only?) case where a dbuf
can be held beyond the lifetime of its dnode...
--
Andriy Gapon
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer