On 20 Feb 2017, at 16:07, Garance A Drosehn wrote:

> On 20 Feb 2017, at 0:25, Benjamin Kaduk wrote:
>
>> [...] if I was in this situation, I would be looking at
>> hardware diagnostics on this machine (memtest86, SMART
>> output, bonnie++, etc.).  I do not believe that openafs
>> is expected to be particularly robust against failing
>> hardware...
> [...skipping lots...]
>
> In any case, it now seems almost certain that the crash on
> Feb 8th is the primary cause for all the problems we're seeing.

In case anyone is curious, I did succeed in moving the volumes off
the broken file server.  As I mentioned elsewhere, I was lucky in
that most of the busier volumes had been moved off this server
before the crash happened.  Many of the remaining ones were solo-RO
instances, where the RW volume is on a different file server.  For
those I simply destroyed the RO instance and re-created a new RO
instance on a different file server.

With the others, I ran into a problem where a plain 'vos move' did
not work.  However, a 'vos move -live' did work.  Since none of
these volumes were being actively modified, I assume that a
'vos move -live' was not much of a risk.  I also did the moves late
in the evening, just to reduce the risk a bit more.

As of right now, a 'vos listvol' of the broken file server shows
only four volumes.  All four are ones where a 'vos move' *to* the
broken file server had failed.  It was those failures which were
the first obvious signs that something was broken.  While
'vos listvol' on the broken file server shows those volumes, a
'vos examine' of each one shows that the VLDB places them on other
file servers.  This makes sense, since those vos moves failed
before they finished.

I'm pretty confident that all I need to do now is run the commands
to remove this file server completely from our cell.  Then I'll
destroy these virtual disks, create some new virtual disks, and use
those to build a new file server from scratch.

So, all things considered, this could have gone much worse than it
did.
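For anyone following along, the steps above can be sketched as the
following dry-run script.  It only echoes the 'vos' commands it
would run; the server names, partition, and volume name are
hypothetical placeholders, not the actual names from our cell.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps described above.
# All names below are assumed placeholders, not real hosts/volumes.
BROKEN=oldfs.example.edu      # the failing file server (assumed)
TARGET=newfs.example.edu      # a healthy file server (assumed)
PART=/vicepa                  # partition on both servers (assumed)
VOL=project.docs              # an example volume name (assumed)

run() {  # print each command instead of executing it
    echo "would run: $*"
}

# 1. Solo-RO instances: destroy the RO clone on the broken server,
#    then define a new RO site elsewhere and release to populate it.
run vos remove -server "$BROKEN" -partition "$PART" -id "$VOL.readonly"
run vos addsite -server "$TARGET" -partition "$PART" -id "$VOL"
run vos release -id "$VOL"

# 2. RW volumes where a plain 'vos move' failed: retry with -live,
#    which skips the temporary clone.  Reasonably safe here only
#    because the volumes were idle at the time.
run vos move -id "$VOL" -fromserver "$BROKEN" -frompartition "$PART" \
    -toserver "$TARGET" -topartition "$PART" -live

# 3. Verify nothing is left on the server, then drop its entries
#    from the VLDB before rebuilding it from scratch.
run vos listvol -server "$BROKEN"
run vos changeaddr -oldaddr "$BROKEN" -remove
```

Obviously, drop the echo wrapper (and double-check each volume with
'vos examine' first) before running any of this for real.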
-- 
Garance Alistair Drosehn               =      [email protected]
Senior Systems Programmer              or    [email protected]
Rensselaer Polytechnic Institute;  Troy, NY;  USA

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
