I think I found a bug in ZFS on Illumos.  If it indeed is a bug, it got
introduced by:

commit 86bb58aec7165f8a0303564575c65e5a2ad58bf1
Author: Alex Reece <[email protected]>
Date:   Tue Aug 19 10:20:31 2014 -0800

    5095 panic when adding a duplicate dbuf to dn_dbufs

Anyway, here it goes...

Each dnode has an AVL tree of dbufs.  The comparator is dbuf_compare [1] and
it compares:

 - level
 - blkid
 - state
 - address

The comparison happens in this order.

Now, suppose that we have a dnode that has a dbuf.  Let's say that it is for
level=0, blkid=0.  It just so happens that its state is DB_EVICTING.  For
completeness sake let's say that it's address is 0x1234.

Now, someone calls dbuf_hold_impl [2] trying to get level=0, blkid=0.  It
calls dbuf_find which returns NULL since all all the dbufs for that dnode
are DB_EVICTING so we take the branch on line 1917.  We end up calling
dbuf_create [3] which locks the dnode sets up a new dbuf with level=0,
blkid=0, and state=DB_EVICTING [4].  It then inserts the dbuf into the hash
table as well as the dnode's AVL tree.  Then, it sets the state to
DB_UNCACHED [5].

This is the bug.

Let's say that the new dbuf has address of 0x5678.  When dbuf_create calls
avl_add, it's adding a node <0,0,DB_EVICTING,0x5678> which compares as
higher than the one existing node <0,0,DB_EVICTING,0x1234>.  This is fine.
Then, when dbuf_create changes the state, it changes the key of the new dbuf
to <0,0,DB_UNCACHED,0x5678> without ensuring that the AVL tree is still
ordered properly.  It turns out it isn't given the definition of
dbuf_states_t [6]: DB_UNCACHED is 0, DB_EVICTING is 5.  In other words,
changing the state changes the key from:

        <0,0,5,0x5678> which compares as greater than <0,0,5,0x1234>

to:

        <0,0,0,0x5678> which compares as smaller than <0,0,5,0x1234>

After the avl_add, the nodes are ordered as [<0,0,5,0x1234>, <0,0,5,0x5678>].
After the state change, the nodes should be ordered as [<0,0,0,0x5678>,
<0,0,5,0x1234>], but they are not.

IOW, dbuf_create breaks the AVL tree's invariants necessary for it to remain
balanced and functional.

Why doesn't this blow up all the time?  Well, the global hash table is used
for virtually all lookups instead of the AVL tree.


I thought about possible solutions, and the one I like the best so far
involves making the comparator only check the level and blkid.  Then, since
there would only be one place that'd insert dbufs into the AVL tree, there'd
be only one place that could potentially introduce a duplicate key
(dbuf_create).  When executing there, we are already holding the
dn_dbufs_mtx, and so we can check for a conflicting dbuf (it will be
DB_EVICTING by definition) and either attempt to free it there (similar to
what dbuf_clear does, perhaps) or just shove it off to the side on a list of
"evicting dbufs".  I'm not really familiar with the locking around dbufs, so
I'm not really sure which idea is better (or even possible!).

To be clearer with the separate evicting dbufs list... I'm thinking of
changing dnode_t:

-       avl_tree_t dn_dbufs;
+       avl_tree_t dn_dbufs_tree;
+       list_t dn_dbufs_list; /* DB_EVICTING dbufs */

dmu_dbuf_impl_t:

-       avl_node_t db_link;
+       union {
+               avl_node_t tree;
+               list_node_t list;
+       } db_link;

And then changing dbuf_create to do something like (probably via a helper
function):

tmp = avl_find(&dn->dn_dbufs_tree, db, &where);
if (tmp) {
        avl_remove(&dn->dn_dbufs_tree, tmp);
        list_insert_tail(&dn->dn_dbufs_list, tmp);
        
        /* refresh 'where' for the subsequent insertion */
        avl_find(&dn->dn_dbufs_tree, db, &where);
}

avl_insert(&dn->dn_dbufs_tree, db, where);

Of course, dbuf_destroy would have to know whether the node is on the dbuf
tree or list, but that's easy with either an extra bool or not using an
union for the avl tree/linked list node structs.

Any thoughts?

Thanks,

Jeff.

P.S. My emails in the past were not getting delivered to the open-zfs list.
If this is still a problem, who should I talk to?

[1] 
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dnode.c#63
[2] 
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1903
[3] 
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1707
[4] 
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1763
[5] 
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1774
[6] 
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/dbuf.h#74

-- 
I abhor a system designed for the "user", if that word is a coded pejorative
meaning "stupid and unsophisticated."
                - Ken Thompson
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to