On 23/11/2014 18:57, Sean Bruno wrote:
> 31071 100995 rm - mi_switch+0xe1 sleepq_wait+0x3a
> sleeplk+0x18d __lockmgr_args+0x9ab vop_stdlock+0x3c VOP_LOCK1_APV+0xab
> _vn_lock+0x43 zfs_lookup+0x45d zfs_freebsd_lookup+0x6d
> VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1
> lookup+0x5a1 namei+0x534 kern_rmdirat+0x8d amd64_syscall+0x3fb
> 31075 100693 mv - mi_switch+0xe1 sleepq_wait+0x3a
> sleeplk+0x18d __lockmgr_args+0xd5d vop_stdlock+0x3c VOP_LOCK1_APV+0xab
> _vn_lock+0x43 vputx+0x28a zfs_rename_unlock+0x3e zfs_freebsd_rename+0xe39
> VOP_RENAME_APV+0xab kern_renameat+0x4a6 amd64_syscall+0x3fb Xfast_syscall+0xfb
The stack traces alone are not sufficient to analyze the problem without
examining the relevant vnodes and vnode locks. But I believe that I have seen
reports of this kind of problem before, and I think that I understand what is
going on. As part of my job, I tried to develop a fix [*] for this problem and
have had some positive feedback on it.
But the fix is not just a few lines of changes; it is a lot of modifications to
a lot of files. Besides, my changes alter quite a bit more code than the bare
minimum required to fix the problem, and even that minimum would be quite a few changes.
So, right now I would like to describe the problem as I understand it. Some
general information about the FreeBSD VFS and its difference from Solaris VFS
[**] can be useful, but is not really required.
I'll try to explain by example. If we look at any mature and "native" FreeBSD
filesystem with read-write support - ffs, maybe tmpfs - then we can make the
following observations. In most of the vnode operation implementations there
are no calls to vnode locking functions. E.g., for an operation like vop_remove
the two vnodes in question are already locked at the VFS layer. In some cases
VOPs do locking, but it is trivial; e.g., in vop_create a newly created vnode
must be returned locked.
Naturally, if we look at the VFS code we see a lot of vnode locking for various
purposes: locking the vnodes for a vop_remove call, or locking vnodes
during their life cycle management, so that, for example, a vnode is not
destroyed while there is an ongoing operation on it. Also, we can see locking
in VFS namei / lookup implementation where we need to hold onto a directory
vnode while looking up a child vnode by name.
But there are two vnode operation implementations where we can see a non-trivial
vnode locking "dance". Those are vop_rename and vop_lookup. Anyone is welcome
to take a cursory look at the first hundred or so lines of ufs_rename().
The point of the above observations is that both VFS and a filesystem driver do
vnode locking. And, thus, both VFS and the driver must cooperate by using the
same locking protocol.
Now, if we look at the ZFS ZPL code and most prominently at zfs_rename() we see
that there is quite a bit of locking going on there, e.g. see zfs_rename_lock,
but the locks in question are all ZFS-internal locks. We do not see the vnode
locks being used at all.
From this comes a suspicion, or even a conclusion, that ZFS currently does not
use the same vnode locking protocol that is expected from any filesystem driver.
There is a weird form of redundancy between the fine grained ZFS locks that got
ported over and the FreeBSD vnode locks. In some cases the ZFS locks are always
uncontested, because the vnode locks held at the VFS level ensure serialized
access. In other cases there is no protection at all, because one thread is in
VFS code which uses the vnode locks and another thread is in ZFS code which uses
the ZFS locks and thus there is no real synchronization between those threads.
My solution to this problem was to completely eliminate (at least) the following
znode locks:
kmutex_t z_lock; /* znode modification lock */
krwlock_t z_parent_lock; /* parent lock for directories */
krwlock_t z_name_lock; /* "master" lock for dirent locks */
zfs_dirlock_t *z_dirlocks; /* directory entry lock list */
and to ensure that the proper vnode locking protocol is followed.
That required substantial changes to the rename and lookup code.
Finally, this is not really a suggestion to test or discuss my changes, but
rather a call to discuss the problem and other possible ways to fix it.
I do not preclude any options, including making changes to our VFS (and thus to
all the filesystems) :-)