On Wed, Apr 01, 2020 at 01:53:04PM +0100, Chris Cooke wrote:
> Hi,
>
> A machine of ours recently became unresponsive - these are the messages
> reported by journalctl for the time it happened:
>
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 320036 off 25602900/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 319975 off 25598020/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 231566 off 18525300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 320007 off 25600580/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 239740 off 19179220/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 229899 off 18391940/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 319838 off 25587060/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: failed to store file (5/0)
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in
> CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: openafs: afs_InvalidateAllSegments
> tdc count
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: ------------[ cut here ]------------
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: kernel BUG at
> /builddir/build/BUILD/openafs-1.8.4/src/libafs/MODLOAD-3.10.0-1062.7.1.el7.x86_64-SP/afs_segments.c:556!
>
> Nothing further was logged until a reboot nearly an hour later.
> The machine runs Scientific Linux 7.6, and here's the output of "rpm -q
> kernel openafs" :
>
> kernel-3.10.0-1062.7.1.el7.x86_64
> openafs-1.8.4-1.el7.x86_64

That line number in 1.8.4 is a panic call in afs_InvalidateAllSegments():

543         if (afs_indexUnique[index] == avc->f.fid.Fid.Unique) {
544             tdc = afs_GetValidDSlot(index);
545             if (!tdc) {
546                 /* In the case of fatal errors during stores, we MUST
547                  * invalidate all of the relevant chunks. Otherwise, the chunks
548                  * will be left with the 'new' data that was never successfully
549                  * written to the server, but the DV in the dcache is still the
550                  * old DV. So, we may indefinitely serve data to applications
551                  * that is not actually in the file on the fileserver. If we
552                  * cannot afs_GetValidDSlot the appropriate entries, currently
553                  * there is no way to ensure the dcache is invalidated. So for
554                  * now, to avoid risking serving bad data from the cache, panic
555                  * instead. */
556                 osi_Panic("afs_InvalidateAllSegments tdc count");

(The "disk cache read error" messages logged just before it come from
afs_UFSGetDSlot() before it returns failure, which then propagates back up
to the afs_GetValidDSlot() call.)
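
For anyone trying to read the numbers in those messages, here is a minimal
sketch in C that pulls the fields out of one unwrapped message. The field
meanings (slot, offset/file size, return code/record size) are assumptions
inferred from the quoted log, not checked against the OpenAFS source, and the
struct and function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one "disk cache read error" message. The meaning given to each
 * field is an assumption inferred from the numbers in the quoted log. */
struct cache_read_error {
    long slot;       /* CacheItems slot index */
    long offset;     /* byte offset of the failed read */
    long file_size;  /* assumed: size of the CacheItems file */
    int  code;       /* assumed: negated errno from the read */
    int  rec_size;   /* assumed: expected size of one slot record */
};

/* Parse one unwrapped message; returns 0 on success, -1 on mismatch. */
static int parse_cache_read_error(const char *msg, struct cache_read_error *e)
{
    int n = sscanf(msg,
                   "afs: disk cache read error in CacheItems slot %ld "
                   "off %ld/%ld code %d/%d",
                   &e->slot, &e->offset, &e->file_size,
                   &e->code, &e->rec_size);
    return n == 5 ? 0 : -1;
}

/* Bytes left over once slot * rec_size is subtracted from the offset.
 * The same value across messages would suggest a fixed-size file header. */
static long implied_header_bytes(const struct cache_read_error *e)
{
    return e->offset - e->slot * (long)e->rec_size;
}

/* Human-readable text for the code, treating it as a negated errno. */
static const char *error_text(const struct cache_read_error *e)
{
    return strerror(-e->code);
}
```

For the first quoted message (slot 320036, off 25602900, record size 80) the
leftover is 25602900 - 320036 * 80 = 20 bytes, and the other messages give the
same 20. If the code is indeed a negated errno, -5 is EIO on Linux.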

Essentially, we believed something should be in a cache file (e.g., because
we already held an in-memory handle to it), but it wasn't there when we went
looking for it. Had this machine been running for a long time without a
restart or a flush of the AFS cache? How full is (or was) the partition
that the disk cache lives on?
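
As a rough way to answer the "how full" question programmatically, here is a
small C sketch around the POSIX statvfs() call. The helper name
fs_used_percent is made up for illustration, and the cache directory you pass
in is site-specific (something like /usr/vice/cache is only an assumption;
check your cacheinfo file for the real location).

```c
#include <sys/statvfs.h>

/* Percentage of blocks in use on the filesystem containing `path`.
 * Returns 0 and stores the result in *pct, or -1 if statvfs() fails
 * or the filesystem reports zero total blocks. */
static int fs_used_percent(const char *path, double *pct)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0 || vfs.f_blocks == 0)
        return -1;

    /* f_bfree counts all free blocks, including those reserved for root. */
    *pct = 100.0 * (double)(vfs.f_blocks - vfs.f_bfree)
                 / (double)vfs.f_blocks;
    return 0;
}
```

Calling fs_used_percent() on the cache directory and printing *pct gives
roughly the same block-usage figure that df reports for that partition. It
only describes the present state, so it cannot say how full the partition was
at the time of the crash.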

There's not a whole lot to go on if we only have the one instance of the
crash, I fear.

-Ben