Have you considered using thread local storage to store the current open 
transaction so that zfs_zinactive can use it when this recursion occurs?

On Apr 1, 2014, at 1:24 AM, Jorgen Lundman <[email protected]> wrote:

> 
> Hello list,
> 
> 
> This email is a bit wordy.
> 
> 
> One of the biggest hassles with the OS X port as been vnode_create(), and
> we are still exploring which ways to best handle this call.
> 
> Normal operations, vnode_create() gets you the new vnode, and the
> vnode_reclaim thread periodically calls VNOP_RECLAIM().
> 
> However, when the kernel decides it is low on vnodes, it will call VNOPs
> directly, as the calling thread. That is: VNOP_RECLAIM, VNOP_FSYNC,
> VNOP_PAGEOUT.
> 
> Take a simple zfs_create, it can end up looking like this;
> 
> zfs_create()
> {
> 
> tx = dmu_tx_create(os);
> 
> dmu_tx_hold_sa_create();
> dmu_tx_hold_zap();
> dmu_tx_hold_sa();
> dmu_tx_assign();
> 
> zfs_mknode(); -------> vnode_create()
> zfs_mknode(); ---------> VNOP_FSYNC()
> zfs_mknode(); -----------> zil_commit()
> zfs_mknode(); -------> vnode_create()
> zfs_mknode(); ---------> VNOP_RECLAIM()
> zfs_mknode(); -----------> zfs_zinactive()
> 
> zfs_acl_ids_free();
> dmu_tx_commit();
> 
> zil_commit();
> }
> 
> 
> So basically, since "current_thread" already holds locks as we enter
> vnode_create(), the reclaim will (sometimes) attempt to hold the same locks
> and we are either deadlocked, or panic(locking against myself).
> 
> Since we can be in an open txg, and fsync() attempts to call zil_commit()
> we will also deadlock.
> 
> (This also happens to vnop_pageout instead of vnop_fsync for mmap files).
> 
> 
> In OS X, the "struct vnode" is an opaque type, so changing the contents is
> "not supposed to happen".
> 
> The *only* time that the "vdata", "vtype" and "vnops" can be assigned to a
> vnode, is at the vnode_create() call. There is no way to pre-allocate
> vnodes like FreeBSD does, and set those details "later".
> 
> Attempting to defer reclaims from happening leads me to my favorite source
> line in the kernel:
> 
>  if (VNOP_RECLAIM(vp, ctx))
>       panic("vclean: cannot reclaim");
> 
> 
> 
> Initially we attempted to work around this issue by creating a "reclaim
> list" in VNOP_RECLAIM. Immediately release z_vnode "vp", as the kernel
> demands it, and place the znode on the list. A "reclaim thread" then mops
> up the release of the znode independently.
> 
> This does mean we have znodes with NULL z_vnode (or zp without vp). Have to
> test for this case in zfs_zget() etc.
> 
> This does not address vnop_fsync and vnop_pageout. Both can not run during
> the vnode_create() call, and if we try to fix them after that fact, they no
> longer have z_vnode. Issue another zget? Or just skip the call entirely?
> 
> Is it possible to make the txg/zil to handle over-lapping (recursive?
> nesting?) txgs? That is generally undesirable, but since vnode_create will
> call VNOP_FSYNC(), that could be one way to handle it.
> 
> ---
> 
> We also attempted the "reverse". Let VNOP_RECLAIM() etc happen, but every
> time we are about to call vnode_create(), we create a new thread, just to
> call vnode_create (no locking against ourselves). (or rather, a
> vnode_create thread pool)
> 
> Then we wait for the z_vnode field to be filled in "later", after the
> dmu_tx_commit() has been issued. Ie, we run with znode with NULL z_vnode
> for a while (!)
> 
> We STILL have znodes with NULL z_vnode. Just at the other end, we have to
> make vnop_lookup() detect this and wait for the z_vnode to be filled in.
> 
> Alas, this has complications in zfs_lock_dir(), which holds a dir, then
> calls zfs_zget(). Especially so in zfs_rename, which holds two dirs, and
> need to wait for both z_vnode.
> 
> So, reclaim, fsync and pageout are now hack-free, but we have to place
> blocking calls waiting for z_vnode to be filled in zfs_vnops.c, as well as
> finding a solutions to the zfs_lock_dir() problem.
> 
> Possibly we could call zfs_mknode before locking and starting the txg
> group, but surely then we are not really transactional any more.
> 
> ---
> 
> At the moment, the best looking solution is the reclaim-thread, and
> skipping fsync/pageout whenever we are coming from "inside" a
> vnode_create() call. But it concerns me to do so.
> 
> 
> Thoughts?
> 
> Lund
> 
> 
> -- 
> Jorgen Lundman       | <[email protected]>
> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
> Japan                | +81 (0)3 -3375-1767          (home)
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to