Have you considered using thread local storage to store the current open transaction so that zfs_zinactive can use it when this recursion occurs?
On Apr 1, 2014, at 1:24 AM, Jorgen Lundman <[email protected]> wrote: > > Hello list, > > > This email is a bit wordy. > > > One of the biggest hassles with the OS X port as been vnode_create(), and > we are still exploring which ways to best handle this call. > > Normal operations, vnode_create() gets you the new vnode, and the > vnode_reclaim thread periodically calls VNOP_RECLAIM(). > > However, when the kernel decides it is low on vnodes, it will call VNOPs > directly, as the calling thread. That is: VNOP_RECLAIM, VNOP_FSYNC, > VNOP_PAGEOUT. > > Take a simple zfs_create, it can end up looking like this; > > zfs_create() > { > > tx = dmu_tx_create(os); > > dmu_tx_hold_sa_create(); > dmu_tx_hold_zap(); > dmu_tx_hold_sa(); > dmu_tx_assign(); > > zfs_mknode(); -------> vnode_create() > zfs_mknode(); ---------> VNOP_FSYNC() > zfs_mknode(); -----------> zil_commit() > zfs_mknode(); -------> vnode_create() > zfs_mknode(); ---------> VNOP_RECLAIM() > zfs_mknode(); -----------> zfs_zinactive() > > zfs_acl_ids_free(); > dmu_tx_commit(); > > zil_commit(); > } > > > So basically, since "current_thread" already holds locks as we enter > vnode_create(), the reclaim will (sometimes) attempt to hold the same locks > and we are either deadlocked, or panic(locking against myself). > > Since we can be in an open txg, and fsync() attempts to call zil_commit() > we will also deadlock. > > (This also happens to vnop_pageout instead of vnop_fsync for mmap files). > > > In OS X, the "struct vnode" is an opaque type, so changing the contents is > "not supposed to happen". > > The *only* time that the "vdata", "vtype" and "vnops" can be assigned to a > vnode, is at the vnode_create() call. There is no way to pre-allocate > vnodes like FreeBSD does, and set those details "later". > > Attempting to defer reclaims from happening leads me to my favorite source > line in the kernel: > > if (VNOP_RECLAIM(vp, ctx)) > panic("vclean: cannot reclaim"); > > > > Initially we attempted to work around this issue by creating a "reclaim > list" in VNOP_RECLAIM. Immediately release z_vnode "vp", as the kernel > demands it, and place the znode on the list. A "reclaim thread" then mops > up the release of the znode independently. > > This does mean we have znodes with NULL z_vnode (or zp without vp). Have to > test for this case in zfs_zget() etc. > > This does not address vnop_fsync and vnop_pageout. Both can not run during > the vnode_create() call, and if we try to fix them after that fact, they no > longer have z_vnode. Issue another zget? Or just skip the call entirely? > > Is it possible to make the txg/zil to handle over-lapping (recursive? > nesting?) txgs? That is generally undesirable, but since vnode_create will > call VNOP_FSYNC(), that could be one way to handle it. > > --- > > We also attempted the "reverse". Let VNOP_RECLAIM() etc happen, but every > time we are about to call vnode_create(), we create a new thread, just to > call vnode_create (no locking against ourselves). (or rather, a > vnode_create thread pool) > > Then we wait for the z_vnode field to be filled in "later", after the > dmu_tx_commit() has been issued. Ie, we run with znode with NULL z_vnode > for a while (!) > > We STILL have znodes with NULL z_vnode. Just at the other end, we have to > make vnop_lookup() detect this and wait for the z_vnode to be filled in. > > Alas, this has complications in zfs_lock_dir(), which holds a dir, then > calls zfs_zget(). Especially so in zfs_rename, which holds two dirs, and > need to wait for both z_vnode. > > So, reclaim, fsync and pageout are now hack-free, but we have to place > blocking calls waiting for z_vnode to be filled in zfs_vnops.c, as well as > finding a solutions to the zfs_lock_dir() problem. > > Possibly we could call zfs_mknode before locking and starting the txg > group, but surely then we are not really transactional any more. > > --- > > At the moment, the best looking solution is the reclaim-thread, and > skipping fsync/pageout whenever we are coming from "inside" a > vnode_create() call. But it concerns me to do so. > > > Thoughts? > > Lund > > > -- > Jorgen Lundman | <[email protected]> > Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) > Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) > Japan | +81 (0)3 -3375-1767 (home) > _______________________________________________ > developer mailing list > [email protected] > http://lists.open-zfs.org/mailman/listinfo/developer _______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
