Hi,

On one of my 1.6.0 fileservers I am having intermittent trouble (once every few months) with the disk array: for no discernible reason it starts kicking disks out of a RAID5 until the whole RAID is taken offline. This produces many "rejected I/O to offline device" kernel messages, and eventually the kernel gives up on the disk entirely and the device node disappears from /dev.

I can power-cycle the array and it comes back, but so far I have also had to reboot the server to straighten things out, because the AFS fileserver cannot be recovered for the following reason.
Other userspace programs doing I/O to the disk array eventually fail with -EIO, and I can umount -f the other mounts. Unfortunately, I have not been able to figure out how to get rid of the fileserver processes so that I can umount -f the vice partitions still pointing at the dead device and recover from there. The fileserver process is in D state, presumably wedged in I/O, and sending it kill -9 has no effect.

Is there something in the design of the fileserver that would prevent it from failing and dying cleanly when something evil happens to the underlying data store?

Sorry if this is a bit confusing; it's hard to explain what is going on.

-- 
Ryan C. Underwood, <[email protected]>
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
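For context on the "kill -9 has no effect" symptom above: a process in D (uninterruptible sleep) has any pending signal, including SIGKILL, held until the kernel operation it is blocked in returns, which never happens if the underlying device is gone. A minimal sketch for confirming the state (the PID here is hypothetical; on a real box you would take it from something like `pgrep fileserver`):

```shell
# Hypothetical PID of the stuck fileserver process; substitute your own.
# We use this shell's PID purely so the commands run standalone.
pid=$$

# STAT column: "D" means uninterruptible sleep. A SIGKILL sent to such
# a process is queued but cannot be acted on until it leaves the kernel.
ps -o pid=,stat=,comm= -p "$pid"

# On Linux, wchan names the kernel function the process is sleeping in,
# which hints at whether it is wedged in the block/I-O layer.
cat "/proc/$pid/wchan"; echo
```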
