On Wed, Dec 03, 2014 at 05:10:11PM -0800, Alex Reece wrote:
> I think you're totally right, that is a bug - we change the dbuf in the avl
> tree in a way that changes its order in the avl tree without updating the
> tree.
> 
> I don't particularly know why this hasn't blown up before, but I suspect
> its because whenever we do an avl_find() in this tree, we are looking for a
> dbuf that will necessarily compare the same to both of the out-of-order
> dbufs (because a dbuf of a different <level,blockid> is different and a
> dbuf of the same <level,blockid> will be a special DB_SEARCH dbuf).
> 
> Unfortunately, there are other code paths that change the db_state so I
> don't know if your proposed solution is sufficient.

Hrm?  I don't see a problem with db_state changing - my proposal takes it
out of the comparator, and conflicts are resolved at insert time.

> After talking about it with Matt, we believe there is an easier solution:
> we can change the comparison to be based on the tuple
> 
>     <level, blockid, is_search, address>
> 
> level, blockid, and address are the same, and is_search is a boolean
> indicating that this is a special dbuf only used for searching (e.g. for an
> avl_find).

Interesting...  So, all dbufs are inserted with is_search == false, and
compare as larger than dbufs with the same level & blockid with is_search ==
true.  That should do it.

> We need the is_search field to ensure that when we do an
> avl_find in dbuf_destroy, we will get all the dbufs of the specified level
> and blockid -- without it, we would have to rely on a known relationship
> between the address of our search dbuf (probably on the stack) and the
> other dbufs in the tree (probably in the heap). Fortunately, we can
> continue to use the existing DB_SEARCH state to determine is_search rather
> than using additional space in the dbuf.

True.

So, the reason I managed to find this is because I tried to remove the
global dbuf hash table - making use of the per-dnode dbuf avl trees to find
dbufs.  I was kinda hoping to make the change a bit more friendlier for
that goal.  (I don't know if it makes sense to ultimately ditch the hash
table, so I suppose it doesn't make sense to worry about it too much.  Just
FYI.)

> I'll try this out and see if it works,

Thanks for taking a stab at this.

Jeff.

> ~Alex
> 
> 
> On Wed, Dec 3, 2014 at 2:53 PM, Josef 'Jeff' Sipek via illumos-zfs <
> [email protected]> wrote:
> 
> > I think I found a bug in ZFS on Illumos.  If it indeed is a bug, it got
> > introduced by:
> >
> > commit 86bb58aec7165f8a0303564575c65e5a2ad58bf1
> > Author: Alex Reece <[email protected]>
> > Date:   Tue Aug 19 10:20:31 2014 -0800
> >
> >     5095 panic when adding a duplicate dbuf to dn_dbufs
> >
> > Anyway, here it goes...
> >
> > Each dnode has an AVL tree of dbufs.  The comparator is dbuf_compare [1]
> > and
> > it compares:
> >
> >  - level
> >  - blkid
> >  - state
> >  - address
> >
> > The comparison happens in this order.
> >
> > Now, suppose that we have a dnode that has a dbuf.  Let's say that it is
> > for
> > level=0, blkid=0.  It just so happens that its state is DB_EVICTING.  For
> > completeness sake let's say that it's address is 0x1234.
> >
> > Now, someone calls dbuf_hold_impl [2] trying to get level=0, blkid=0.  It
> > calls dbuf_find which returns NULL since all all the dbufs for that dnode
> > are DB_EVICTING so we take the branch on line 1917.  We end up calling
> > dbuf_create [3] which locks the dnode sets up a new dbuf with level=0,
> > blkid=0, and state=DB_EVICTING [4].  It then inserts the dbuf into the hash
> > table as well as the dnode's AVL tree.  Then, it sets the state to
> > DB_UNCACHED [5].
> >
> > This is the bug.
> >
> > Let's say that the new dbuf has address of 0x5678.  When dbuf_create calls
> > avl_add, it's adding a node <0,0,DB_EVICTING,0x5678> which compares as
> > higher than the one existing node <0,0,DB_EVICTING,0x1234>.  This is fine.
> > Then, when dbuf_create changes the state, it changes the key of the new
> > dbuf
> > to <0,0,DB_UNCACHED,0x5678> without ensuring that the AVL tree is still
> > ordered properly.  It turns out it isn't given the definition of
> > dbuf_states_t [6]: DB_UNCACHED is 0, DB_EVICTING is 5.  In other words,
> > changing the state changes the key from:
> >
> >         <0,0,5,0x5678> which compares as greater than <0,0,5,0x1234>
> >
> > to:
> >
> >         <0,0,0,0x5678> which compares as smaller than <0,0,5,0x1234>
> >
> > After the avl_add, the nodes are ordered as [<0,0,5,0x1234>,
> > <0,0,5,0x5678>].
> > After the state change, the nodes should be ordered as [<0,0,0,0x5678>,
> > <0,0,5,0x1234>], but they are not.
> >
> > IOW, dbuf_create breaks the AVL tree's invariants necessary for it to
> > remain
> > balanced and functional.
> >
> > Why doesn't this blow up all the time?  Well, the global hash table is used
> > for virtually all lookups instead of the AVL tree.
> >
> >
> > I thought about possible solutions, and the one I like the best so far
> > involves making the comparator only check the level and blkid.  Then, since
> > there would only be one place that'd insert dbufs into the AVL tree,
> > there'd
> > be only one place that could potentially introduce a duplicate key
> > (dbuf_create).  When executing there, we are already holding the
> > dn_dbufs_mtx, and so we can check for a conflicting dbuf (it will be
> > DB_EVICTING by definition) and either attempt to free it there (similar to
> > what dbuf_clear does, perhaps) or just shove it off to the side on a list
> > of
> > "evicting dbufs".  I'm not really familiar with the locking around dbufs,
> > so
> > I'm not really sure which idea is better (or even possible!).
> >
> > To be clearer with the separate evicting dbufs list... I'm thinking of
> > changing dnode_t:
> >
> > -       avl_tree_t dn_dbufs;
> > +       avl_tree_t dn_dbufs_tree;
> > +       list_t dn_dbufs_list; /* DB_EVICTING dbufs */
> >
> > dmu_dbuf_impl_t:
> >
> > -       avl_node_t db_link;
> > +       union {
> > +               avl_node_t tree;
> > +               list_node_t list;
> > +       } db_link;
> >
> > And then changing dbuf_create to do something like (probably via a helper
> > function):
> >
> > tmp = avl_find(&dn->dn_dbufs_tree, db, &where);
> > if (tmp) {
> >         avl_remove(&dn->dn_dbufs_tree, tmp);
> >         list_insert_tail(&dn->dn_dbufs_list, tmp);
> >
> >         /* refresh 'where' for the subsequent insertion */
> >         avl_find(&dn->dn_dbufs_tree, db, &where);
> > }
> >
> > avl_insert(&dn->dn_dbufs_tree, db, where);
> >
> > Of course, dbuf_destroy would have to know whether the node is on the dbuf
> > tree or list, but that's easy with either an extra bool or not using an
> > union for the avl tree/linked list node structs.
> >
> > Any thoughts?
> >
> > Thanks,
> >
> > Jeff.
> >
> > P.S. My emails in the past were not getting delivered to the open-zfs list.
> > If this is still a problem, who should I talk to?
> >
> > [1]
> > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dnode.c#63
> > [2]
> > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1903
> > [3]
> > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1707
> > [4]
> > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1763
> > [5]
> > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1774
> > [6]
> > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/dbuf.h#74
> >
> > --
> > I abhor a system designed for the "user", if that word is a coded
> > pejorative
> > meaning "stupid and unsophisticated."
> >                 - Ken Thompson
> >
> >
> > -------------------------------------------
> > illumos-zfs
> > Archives: https://www.listbox.com/member/archive/182191/=now
> > RSS Feed:
> > https://www.listbox.com/member/archive/rss/182191/24785990-b9be02ef
> > Modify Your Subscription:
> > https://www.listbox.com/member/?member_id=24785990&id_secret=24785990-113172ec
> > Powered by Listbox: http://www.listbox.com
> >

-- 
Evolution, n.:
  A hypothetical process whereby infinitely improbable events occur with
  alarming frequency, order arises from chaos, and no one is given credit.
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to