On Wed, Dec 03, 2014 at 05:10:11PM -0800, Alex Reece wrote: > I think you're totally right, that is a bug - we change the dbuf in the avl > tree in a way that changes its order in the avl tree without updating the > tree. > > I don't particularly know why this hasn't blown up before, but I suspect > its because whenever we do an avl_find() in this tree, we are looking for a > dbuf that will necessarily compare the same to both of the out-of-order > dbufs (because a dbuf of a different <level,blockid> is different and a > dbuf of the same <level,blockid> will be a special DB_SEARCH dbuf). > > Unfortunately, there are other code paths that change the db_state so I > don't know if your proposed solution is sufficient.
Hrm? I don't see a problem with db_state changing - my proposal takes it out of the comparator, and conflicts are resolved at insert time. > After talking about it with Matt, we believe there is an easier solution: > we can change the comparison to be based on the tuple > > <level, blockid, is_search, address> > > level, blockid, and address are the same, and is_search is a boolean > indicating that this is a special dbuf only used for searching (e.g. for an > avl_find). Interesting... So, all dbufs are inserted with is_search == false, and compare as larger than dbufs with the same level & blockid with is_search == true. That should do it. > We need the is_search field to ensure that when we do an > avl_find in dbuf_destroy, we will get all the dbufs of the specified level > and blockid -- without it, we would have to rely on a known relationship > between the address of our search dbuf (probably on the stack) and the > other dbufs in the tree (probably in the heap). Fortunately, we can > continue to use the existing DB_SEARCH state to determine is_search rather > than using additional space in the dbuf. True. So, the reason I managed to find this is because I tried to remove the global dbuf hash table - making use of the per-dnode dbuf avl trees to find dbufs. I was kinda hoping to make the change a bit more friendlier for that goal. (I don't know if it makes sense to ultimately ditch the hash table, so I suppose it doesn't make sense to worry about it too much. Just FYI.) > I'll try this out and see if it works, Thanks for taking a stab at this. Jeff. > ~Alex > > > On Wed, Dec 3, 2014 at 2:53 PM, Josef 'Jeff' Sipek via illumos-zfs < > [email protected]> wrote: > > > I think I found a bug in ZFS on Illumos. If it indeed is a bug, it got > > introduced by: > > > > commit 86bb58aec7165f8a0303564575c65e5a2ad58bf1 > > Author: Alex Reece <[email protected]> > > Date: Tue Aug 19 10:20:31 2014 -0800 > > > > 5095 panic when adding a duplicate dbuf to dn_dbufs > > > > Anyway, here it goes... > > > > Each dnode has an AVL tree of dbufs. The comparator is dbuf_compare [1] > > and > > it compares: > > > > - level > > - blkid > > - state > > - address > > > > The comparison happens in this order. > > > > Now, suppose that we have a dnode that has a dbuf. Let's say that it is > > for > > level=0, blkid=0. It just so happens that its state is DB_EVICTING. For > > completeness sake let's say that it's address is 0x1234. > > > > Now, someone calls dbuf_hold_impl [2] trying to get level=0, blkid=0. It > > calls dbuf_find which returns NULL since all all the dbufs for that dnode > > are DB_EVICTING so we take the branch on line 1917. We end up calling > > dbuf_create [3] which locks the dnode sets up a new dbuf with level=0, > > blkid=0, and state=DB_EVICTING [4]. It then inserts the dbuf into the hash > > table as well as the dnode's AVL tree. Then, it sets the state to > > DB_UNCACHED [5]. > > > > This is the bug. > > > > Let's say that the new dbuf has address of 0x5678. When dbuf_create calls > > avl_add, it's adding a node <0,0,DB_EVICTING,0x5678> which compares as > > higher than the one existing node <0,0,DB_EVICTING,0x1234>. This is fine. > > Then, when dbuf_create changes the state, it changes the key of the new > > dbuf > > to <0,0,DB_UNCACHED,0x5678> without ensuring that the AVL tree is still > > ordered properly. It turns out it isn't given the definition of > > dbuf_states_t [6]: DB_UNCACHED is 0, DB_EVICTING is 5. In other words, > > changing the state changes the key from: > > > > <0,0,5,0x5678> which compares as greater than <0,0,5,0x1234> > > > > to: > > > > <0,0,0,0x5678> which compares as smaller than <0,0,5,0x1234> > > > > After the avl_add, the nodes are ordered as [<0,0,5,0x1234>, > > <0,0,5,0x5678>]. > > After the state change, the nodes should be ordered as [<0,0,0,0x5678>, > > <0,0,5,0x1234>], but they are not. > > > > IOW, dbuf_create breaks the AVL tree's invariants necessary for it to > > remain > > balanced and functional. > > > > Why doesn't this blow up all the time? Well, the global hash table is used > > for virtually all lookups instead of the AVL tree. > > > > > > I thought about possible solutions, and the one I like the best so far > > involves making the comparator only check the level and blkid. Then, since > > there would only be one place that'd insert dbufs into the AVL tree, > > there'd > > be only one place that could potentially introduce a duplicate key > > (dbuf_create). When executing there, we are already holding the > > dn_dbufs_mtx, and so we can check for a conflicting dbuf (it will be > > DB_EVICTING by definition) and either attempt to free it there (similar to > > what dbuf_clear does, perhaps) or just shove it off to the side on a list > > of > > "evicting dbufs". I'm not really familiar with the locking around dbufs, > > so > > I'm not really sure which idea is better (or even possible!). > > > > To be clearer with the separate evicting dbufs list... I'm thinking of > > changing dnode_t: > > > > - avl_tree_t dn_dbufs; > > + avl_tree_t dn_dbufs_tree; > > + list_t dn_dbufs_list; /* DB_EVICTING dbufs */ > > > > dmu_dbuf_impl_t: > > > > - avl_node_t db_link; > > + union { > > + avl_node_t tree; > > + list_node_t list; > > + } db_link; > > > > And then changing dbuf_create to do something like (probably via a helper > > function): > > > > tmp = avl_find(&dn->dn_dbufs_tree, db, &where); > > if (tmp) { > > avl_remove(&dn->dn_dbufs_tree, tmp); > > list_insert_tail(&dn->dn_dbufs_list, tmp); > > > > /* refresh 'where' for the subsequent insertion */ > > avl_find(&dn->dn_dbufs_tree, db, &where); > > } > > > > avl_insert(&dn->dn_dbufs_tree, db, where); > > > > Of course, dbuf_destroy would have to know whether the node is on the dbuf > > tree or list, but that's easy with either an extra bool or not using an > > union for the avl tree/linked list node structs. > > > > Any thoughts? > > > > Thanks, > > > > Jeff. > > > > P.S. My emails in the past were not getting delivered to the open-zfs list. > > If this is still a problem, who should I talk to? > > > > [1] > > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dnode.c#63 > > [2] > > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1903 > > [3] > > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1707 > > [4] > > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1763 > > [5] > > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1774 > > [6] > > http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/dbuf.h#74 > > > > -- > > I abhor a system designed for the "user", if that word is a coded > > pejorative > > meaning "stupid and unsophisticated." > > - Ken Thompson > > > > > > ------------------------------------------- > > illumos-zfs > > Archives: https://www.listbox.com/member/archive/182191/=now > > RSS Feed: > > https://www.listbox.com/member/archive/rss/182191/24785990-b9be02ef > > Modify Your Subscription: > > https://www.listbox.com/member/?member_id=24785990&id_secret=24785990-113172ec > > Powered by Listbox: http://www.listbox.com > > -- Evolution, n.: A hypothetical process whereby infinitely improbable events occur with alarming frequency, order arises from chaos, and no one is given credit. _______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
