There have been several reports from FreeBSD users about getting a panic due to
the avl_is_empty(&dn->dn_dbufs) assertion in dnode_sync_free().  I was also able
to reproduce the problem with ZFS on Linux 0.6.5. There do not seem to be any
reports from illumos users.

I think that the following stack traces demonstrate the problem rather well (the
stack traces are a little bit unusual as they come from Linux's crash utility,
but should be legible):
crash> foreach UN bt
PID: 703    TASK: ffff88003b8a4440  CPU: 0   COMMAND: "txg_sync"
 #0 [ffff880039fa3848] __schedule at ffffffff8160918d
 #1 [ffff880039fa38b0] schedule at ffffffff816096e9
 #2 [ffff880039fa38c0] spl_panic at ffffffffa0012645 [spl]
 #3 [ffff880039fa3a48] dnode_sync at ffffffffa062b7cf [zfs]
 #4 [ffff880039fa3b38] dmu_objset_sync_dnodes at ffffffffa0612dd7 [zfs]
 #5 [ffff880039fa3b78] dmu_objset_sync at ffffffffa06130d5 [zfs]
 #6 [ffff880039fa3c50] dsl_pool_sync at ffffffffa0641a8a [zfs]
 #7 [ffff880039fa3cd0] spa_sync at ffffffffa0664408 [zfs]
 #8 [ffff880039fa3da0] txg_sync_thread at ffffffffa067b970 [zfs]
 #9 [ffff880039fa3e98] thread_generic_wrapper at ffffffffa000e18a [spl]
#10 [ffff880039fa3ec8] kthread at ffffffff8109726f
#11 [ffff880039fa3f50] ret_from_fork at ffffffff81614198

PID: 716    TASK: ffff88003b8a6660  CPU: 0   COMMAND: "trial"
 #0 [ffff88003c68f738] __schedule at ffffffff8160918d
 #1 [ffff88003c68f7a0] schedule at ffffffff816096e9
 #2 [ffff88003c68f7b0] cv_wait_common at ffffffffa0014d15 [spl]
 #3 [ffff88003c68f818] __cv_wait at ffffffffa0014e65 [spl]
 #4 [ffff88003c68f828] txg_wait_synced at ffffffffa067a70f [zfs]
 #5 [ffff88003c68f868] dsl_sync_task at ffffffffa064b017 [zfs]
 #6 [ffff88003c68f928] dsl_destroy_head at ffffffffa06eee62 [zfs]
 #7 [ffff88003c68f978] dmu_recv_cleanup_ds at ffffffffa06194ed [zfs]
 #8 [ffff88003c68fa98] dmu_recv_stream at ffffffffa061a992 [zfs]
 #9 [ffff88003c68fc20] zfs_ioc_recv at ffffffffa06b1bad [zfs]
#10 [ffff88003c68fe50] zfsdev_ioctl at ffffffffa06b3c86 [zfs]
#11 [ffff88003c68feb8] do_vfs_ioctl at ffffffff811d9ca5
#12 [ffff88003c68ff30] sys_ioctl at ffffffff811d9f21
#13 [ffff88003c68ff80] system_call_fastpath at ffffffff81614249
    RIP: 00007ff39d5c0257  RSP: 00007ff38e5c2008  RFLAGS: 00010206
    RAX: 0000000000000010  RBX: ffffffff81614249  RCX: 0000000000000024
    RDX: 00007ff38e5c21d0  RSI: 0000000000005a1b  RDI: 0000000000000004
    RBP: 00007ff38e5c57b0   R8: 342d663438372d62   R9: 636430382d646335
    R10: 643266636131612d  R11: 0000000000000246  R12: 0000000000000060
    R13: 00007ff38e5c3200  R14: 00007ff3880080a0  R15: 00007ff38e5c21d0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

PID: 31758  TASK: ffff88003b332d80  CPU: 0   COMMAND: "dbu_evict"
 #0 [ffff88003b723ca0] __schedule at ffffffff8160918d
 #1 [ffff88003b723d08] schedule_preempt_disabled at ffffffff8160a8d9
 #2 [ffff88003b723d18] __mutex_lock_slowpath at ffffffff81608625
 #3 [ffff88003b723d78] mutex_lock at ffffffff81607a8f
 #4 [ffff88003b723d90] dbuf_rele at ffffffffa05fd290 [zfs]
 #5 [ffff88003b723db0] dmu_buf_rele at ffffffffa05fe57e [zfs]
 #6 [ffff88003b723dc0] bpobj_close at ffffffffa05f78ed [zfs]
 #7 [ffff88003b723dd8] dsl_deadlist_close at ffffffffa0636e19 [zfs]
 #8 [ffff88003b723e10] dsl_dataset_evict at ffffffffa062d78b [zfs]
 #9 [ffff88003b723e28] taskq_thread at ffffffffa000f912 [spl]
#10 [ffff88003b723ec8] kthread at ffffffff8109726f
#11 [ffff88003b723f50] ret_from_fork at ffffffff81614198

In 100% of the cases where I hit the assertion it was with DMU_OT_BPOBJ dnodes.
Justin thinks that the situation is harmless and that the assertion can be
removed.  I agree with him.
On the other hand, I wonder if something could be done in the DSL to avoid
the described situation.
I mean, it seems that bpo_cached_dbuf is a rare (the only?) case where a dbuf
can be held beyond the lifetime of its dnode...

-- 
Andriy Gapon

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
