While investigating a performance issue affecting timeshares at our institution (which I am provisionally blaming on other clients driving up IO load on the fileservers), I encountered a rerun of an issue that's been reported on openafs-info twice before:
[42342.692729] afs: disk cache read error in CacheItems slot 100849 off 8067940/8750020 code -5/80
(repeated)

But this one ends differently than
https://lists.openafs.org/pipermail/openafs-info/2018-October/042576.html or
https://lists.openafs.org/pipermail/openafs-info/2020-April/042930.html:

[42342.697743] afs: Failed to invalidate cache chunks for fid NNN.NNN.NNN.NNN; our local disk cache may be throwing errors. We must invalidate these chunks to avoid possibly serving incorrect data, so we'll retry until we succeed. If AFS access seems to hang, this may be why.
[42342.697771] openafs: assertion failed: WriteLocked(&tvc->lock), file: /var/lib/dkms/openafs/1.8.6-2.el7_9/build/src/libafs/MODLOAD-3.10.0-1160.6.1.el7.x86_64-SP/afs_daemons.c, line: 606

The first thing I'm going to assert is that this isn't a hardware error: it affects multiple virtual systems, and the kernel logs no IO errors. My assertion is that the EIO is coming from osi_rdwr, which turns a short read or write into EIO (the first sketch at the end of this message illustrates the pattern).

The working theory of myself and others who have looked at this is that the source of the problem is using ext4 as a cache (and perhaps also the dedicated cache filesystem being >80% full), and we're remediating that on these systems. This still leaves us with two problems in openafs:

* The use of EIO, leading to claims that people have hardware errors when they may not.
* The lock breakage.

For the former, I'd recommend either logging the short IOs or returning a different code (perhaps ENODATA, if available?) to differentiate them from hardware errors; the first sketch below shows both options.

For the latter, I believe there's an inconsistency in the locking requirements of afs_InvalidateAllSegments. This comment claims the lock is held:

/*
 * Ask a background daemon to do this request for us. Note that _we_ hold
 * the write lock on 'avc', while the background daemon does the work. This
 * is a little weird, but it helps avoid any issues with lock ordering
 * or if our caller does not expect avc->lock to be dropped while
 * running.
 */

When called from afs_StoreAllSegments's error path, avc->lock is clearly held, because afs_StoreAllSegments itself downgrades and upgrades the lock. When called from afs_dentry_iput via afs_InactiveVCache, it seems that it isn't: none of the callers on any platform seems to lock the vcache before calling inactive (unless on some platforms there's aliasing between a VFS-level lock and vc->lock), and afs_remunlink expects to be called with avc unlocked. The second sketch below mirrors the shape of this mismatch.
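To make the first point concrete, here is a small standalone sketch. It is not the actual osi_rdwr code; cache_read and its signature are made up for illustration. It just shows the pattern described above, where a short read is collapsed into EIO and so becomes indistinguishable from a hardware error, plus the alternative of logging the short IO and returning a distinct code such as ENODATA:

/*
 * Illustrative sketch only -- not OpenAFS code. Demonstrates how a short
 * read can be reported as EIO, and how it could instead be logged and
 * returned as a code distinct from hardware errors.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int
cache_read(int fd, void *buf, size_t len, off_t off)
{
    ssize_t got = pread(fd, buf, len, off);

    if (got < 0)
        return -errno;          /* genuine I/O error reported by the kernel */

    if ((size_t)got != len) {
        /* Behaviour as described above: a short read becomes EIO, so
         * callers and admins cannot tell it apart from bad hardware. */
        return -EIO;

        /*
         * Suggested alternative: log the short I/O and return a code that
         * is distinct from hardware errors, e.g. ENODATA where available:
         *
         *   fprintf(stderr, "short cache read: wanted %zu got %zd at %lld\n",
         *           len, got, (long long)off);
         *   return -ENODATA;
         */
    }
    return 0;
}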
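And to make the locking mismatch concrete, here is a minimal userspace analogue (pthread rwlock, placeholder names, not OpenAFS code). invalidate_segments stands in for afs_InvalidateAllSegments and asserts the caller already holds the write lock, as the quoted comment requires; store_error_path follows that convention, while inactive_path does not and trips the assertion -- the same shape as the WriteLocked(&tvc->lock) panic in afs_daemons.c:

#include <assert.h>
#include <pthread.h>

struct vnode_like {
    pthread_rwlock_t lock;
    int write_held;              /* stand-in for WriteLocked(&avc->lock) */
};

static void
invalidate_segments(struct vnode_like *v)
{
    assert(v->write_held);       /* analogue of the failed assertion */
    /* ... invalidate cache chunks, retrying on error ... */
}

static void
store_error_path(struct vnode_like *v)
{
    /* Analogue of afs_StoreAllSegments's error path: lock is held. */
    pthread_rwlock_wrlock(&v->lock);
    v->write_held = 1;
    invalidate_segments(v);
    v->write_held = 0;
    pthread_rwlock_unlock(&v->lock);
}

static void
inactive_path(struct vnode_like *v)
{
    /* Analogue of afs_dentry_iput -> afs_InactiveVCache: no lock taken. */
    invalidate_segments(v);      /* aborts on the assert above */
}

int
main(void)
{
    struct vnode_like v = { PTHREAD_RWLOCK_INITIALIZER, 0 };

    store_error_path(&v);        /* fine: convention followed */
    inactive_path(&v);           /* mirrors the reported panic */
    return 0;
}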