Hi,

On one of my 1.6.0 fileservers I am having intermittent (once every few
months) trouble with the disk array: for no discernible reason, it starts
kicking disks out of a RAID5 until the RAID is offlined.  This results in many
"rejected I/O to offline device" kernel messages, and eventually the kernel
gives up on the disk entirely and the device node disappears from /dev.  I can
power cycle the array and it comes back.  But so far I have had to also reboot
the server to straighten it out because the AFS fileserver cannot be recovered
for the following reason.

Other userspace programs doing I/O to the disk array eventually fail with
-EIO, and I can umount -f the other mounts.  Unfortunately, I have not
been able to figure out how to get rid of the fileserver processes so I can
umount -f the vice partitions that are still pointing to the dead device and
straighten everything out from there.  The fileserver process is in D state,
presumably wedged in I/O.  Sending it kill -9 has no effect.  Is there
something in the design of the fileserver that would prevent it from failing
and dying cleanly if something evil happens to the underlying data store?
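
For anyone wanting to confirm the same state on their own machine, here is a
minimal sketch that lists processes currently in D state by reading
/proc/<pid>/stat.  It assumes a Linux /proc layout; the function names
parse_stat and uninterruptible_processes are made up for illustration and are
not part of OpenAFS or any existing tool.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    # The state letter is the first field after the closing parenthesis.
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes whose state is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A process in D state is in uninterruptible sleep, so signals, SIGKILL
included, are generally not acted on until the blocking kernel call returns.
That would be consistent with kill -9 appearing to do nothing here.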

Sorry if this is a bit confusing, it's hard to explain what is going on.

-- 
Ryan C. Underwood, <[email protected]>
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
