On 20 Feb 2017, at 0:25, Benjamin Kaduk wrote:

> On Sun, Feb 19, 2017 at 11:49:40AM -0500, Garance A Drosehn wrote:
>
>> Is there something I could do with those core files which would help
>> to figure out what the problem is with this file server? I also
>> have plenty of log files, if those would provide some clues.
>
> Well, it's not entirely clear. One could of course load them up in
> gdb and see what the backtrace looks like, but given the described
> behavior, if I were in this situation I would be looking at hardware
> diagnostics on this machine (memtest86, SMART output, bonnie++,
> etc.). I do not believe that openafs is expected to be particularly
> robust against failing hardware...
>
> -Ben
This file server is a virtual machine, running on hardware and
disk storage that is shared with several other virtual machines. I
realize the main culprit could certainly be hardware, but if it is,
then I won't be the only sysadmin who will be unhappy about that! :)
Being a VM also makes it harder to run true HW-level diagnostics.
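About the only disk checks I can run from inside the guest are crude
ones. Something like the following is what I have in mind (the device
name /dev/sda is just a placeholder for whatever disk this guest
actually sees, and SMART data often isn't passed through by the
hypervisor anyway):

    $ smartctl -a /dev/sda                           # often no real SMART data in a VM
    $ dd if=/dev/sda of=/dev/null bs=1M count=4096   # crude sequential-read sanity check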
It has been a while since I've used gdb to investigate a core file,
so right now I'm trying to resurrect those ancient corners of my
brain.
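If I remember right, the first steps would be something like the
following (assuming the cores came from the volserver binary; the
binary path and core-file name here are just placeholders for
whatever this machine actually uses):

    $ gdb /usr/afs/bin/volserver core.NNNN
    (gdb) bt full                # backtrace of the faulting thread, with locals
    (gdb) thread apply all bt    # backtraces for every thread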
FWIW, here's an anonymized version of the log entries for the last
vos move which worked. This is from the Volser log file on the
(just-upgraded) destination file server, which is the one I'm
claiming is broken:
- 12:14:24 - VReadVolumeDiskHeader: Couldn't open header for volume 53...561 (errno 2).
- 12:14:24 - admin_gad on gads_mac.rpi.edu is executing CreateVolume 'c._worked_.i41'
- 12:14:24 - 1 Volser: CreateVolume: volume 53...561 (c._worked_.i41) created
- 12:14:24 - <LocalAuth> on fs_src.rpi.edu is executing Restore 53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
- 12:14:26 - <LocalAuth> on fs_src.rpi.edu is executing Restore 53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
- 12:14:26 - RestoreVolume Cleanup: Removed 0 inodes for volume 53...561
- 12:14:27 - VReadVolumeDiskHeader: Couldn't open header for volume 53...563 (errno 2).
- 12:14:27 - admin_gad on gads_mac.rpi.edu is executing Clone Volume new name=c._worked_.i41.backup
- 12:14:27 - 1 Volser: Clone: Cloning volume 53...561 to new volume 53...563
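As an aside: errno 2 is ENOENT ("No such file or directory"), so I
read those VReadVolumeDiskHeader lines as the volserver simply
checking whether a header already exists before creating the volume;
the same message shows up in the failing case below, too. The
errno-to-message mapping is easy to double-check on any Linux box:

    $ python3 -c 'import errno, os; print(errno.ENOENT, os.strerror(errno.ENOENT))'
    2 No such file or directory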
And here are the entries for the first vos move which failed:
- 12:14:28 - VReadVolumeDiskHeader: Couldn't open header for volume 53...175 (errno 2).
- 12:14:28 - admin_gad on gads_mac.rpi.edu is executing CreateVolume 'c._failed_.2006a'
- 12:14:28 - 1 Volser: CreateVolume: volume 53...175 (c._failed_.2006a) created
- 12:14:28 - <LocalAuth> on fs_src.rpi.edu is executing Restore 53...175
- 12:14:28 - 1 Volser: ReadVnodes: IH_CREATE: Structure needs cleaning - restore aborted
- 12:14:28 - SYNC_ask: negative response on circuit 'FSSYNC'
- 12:14:28 - FSYNC_askfs: FSSYNC request denied for reason=101
- 12:14:28 - SYNC_ask: negative response on circuit 'FSSYNC'
- 12:14:28 - FSYNC_askfs: FSSYNC request denied for reason=101
- 12:17:26 - 1 Volser: GetVolInfo: Could not attach volume 53...175 (/vicepa:V053...175.vol) error=101
Does that suggest hardware issues? (I have no idea if it does...)
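One possible data point: "Structure needs cleaning" is the Linux
strerror() text for EUCLEAN (errno 117), which the kernel typically
returns when it detects on-disk filesystem corruption and wants an
fsck run:

    $ python3 -c 'import errno, os; print(errno.EUCLEAN, os.strerror(errno.EUCLEAN))'
    117 Structure needs cleaning

If that's what is happening, then (assuming /vicepa is an ext
filesystem) a read-only check might confirm it without changing
anything; the device name below is just a placeholder:

    $ fsck.ext4 -n /dev/sdX1    # -n: report problems, but fix nothing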
Looking at the log files, I can also see some 'vos release's where an
RO instance of the replicated volume is on this broken server. If I
do a 'vos examine' of the volume, the RO instance on this server
shows as the "old release", while the instances on the other file
servers show as the "new release". I'm going to guess that this is
NotGood(tm).
--
Garance Alistair Drosehn = [email protected]
Senior Systems Programmer or [email protected]
Rensselaer Polytechnic Institute; Troy, NY; USA