Hello list,

This email is a bit wordy.


One of the biggest hassles with the OS X port as been vnode_create(), and
we are still exploring which ways to best handle this call.

Normal operations, vnode_create() gets you the new vnode, and the
vnode_reclaim thread periodically calls VNOP_RECLAIM().

However, when the kernel decides it is low on vnodes, it will call VNOPs
directly, as the calling thread. That is: VNOP_RECLAIM, VNOP_FSYNC,
VNOP_PAGEOUT.

Take a simple zfs_create, it can end up looking like this;

zfs_create()
{

 tx = dmu_tx_create(os);

 dmu_tx_hold_sa_create();
 dmu_tx_hold_zap();
 dmu_tx_hold_sa();
 dmu_tx_assign();

 zfs_mknode(); -------> vnode_create()
 zfs_mknode(); ---------> VNOP_FSYNC()
 zfs_mknode(); -----------> zil_commit()
 zfs_mknode(); -------> vnode_create()
 zfs_mknode(); ---------> VNOP_RECLAIM()
 zfs_mknode(); -----------> zfs_zinactive()

 zfs_acl_ids_free();
 dmu_tx_commit();

 zil_commit();
}


So basically, since "current_thread" already holds locks as we enter
vnode_create(), the reclaim will (sometimes) attempt to hold the same locks
and we are either deadlocked, or panic(locking against myself).

Since we can be in an open txg, and fsync() attempts to call zil_commit()
we will also deadlock.

(This also happens to vnop_pageout instead of vnop_fsync for mmap files).


In OS X, the "struct vnode" is an opaque type, so changing the contents is
"not supposed to happen".

The *only* time that the "vdata", "vtype" and "vnops" can be assigned to a
vnode, is at the vnode_create() call. There is no way to pre-allocate
vnodes like FreeBSD does, and set those details "later".

Attempting to defer reclaims from happening leads me to my favorite source
line in the kernel:

  if (VNOP_RECLAIM(vp, ctx))
       panic("vclean: cannot reclaim");



Initially we attempted to work around this issue by creating a "reclaim
list" in VNOP_RECLAIM. Immediately release z_vnode "vp", as the kernel
demands it, and place the znode on the list. A "reclaim thread" then mops
up the release of the znode independently.

This does mean we have znodes with NULL z_vnode (or zp without vp). Have to
test for this case in zfs_zget() etc.

This does not address vnop_fsync and vnop_pageout. Both can not run during
the vnode_create() call, and if we try to fix them after that fact, they no
longer have z_vnode. Issue another zget? Or just skip the call entirely?

Is it possible to make the txg/zil to handle over-lapping (recursive?
nesting?) txgs? That is generally undesirable, but since vnode_create will
call VNOP_FSYNC(), that could be one way to handle it.

---

We also attempted the "reverse". Let VNOP_RECLAIM() etc happen, but every
time we are about to call vnode_create(), we create a new thread, just to
call vnode_create (no locking against ourselves). (or rather, a
vnode_create thread pool)

Then we wait for the z_vnode field to be filled in "later", after the
dmu_tx_commit() has been issued. Ie, we run with znode with NULL z_vnode
for a while (!)

We STILL have znodes with NULL z_vnode. Just at the other end, we have to
make vnop_lookup() detect this and wait for the z_vnode to be filled in.

Alas, this has complications in zfs_lock_dir(), which holds a dir, then
calls zfs_zget(). Especially so in zfs_rename, which holds two dirs, and
need to wait for both z_vnode.

So, reclaim, fsync and pageout are now hack-free, but we have to place
blocking calls waiting for z_vnode to be filled in zfs_vnops.c, as well as
finding a solutions to the zfs_lock_dir() problem.

Possibly we could call zfs_mknode before locking and starting the txg
group, but surely then we are not really transactional any more.

---

At the moment, the best looking solution is the reclaim-thread, and
skipping fsync/pageout whenever we are coming from "inside" a
vnode_create() call. But it concerns me to do so.


Thoughts?

Lund


-- 
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to