On 3/13/2015 5:52 PM, mathog wrote:
A bit off topic, but some of you may have run into something similar.

Today I was called in to try to fix a server which had stopped working. Not my machine; the usual sysop is out sick. The model is a Dell PowerEdge T320 with a PERC H710P RAID controller.

The symptoms reported were "it stopped working, could not find 'ls', and wouldn't reboot past grub". (Evidently it could find 'reboot'.)

Got into the BIOS and ran a RAID consistency check, which took 3 hours. It didn't say whether it had passed or failed, or put up any sort of status message whatsoever, but there were no failure lights lit on the disks.

On a reboot it gives:

  grub error 8: kernel must be loaded before booting.

It is a CentOS 6.5 system, so I booted it with an installation disk of that flavor and dropped down into a shell.

This is where it gets strange.

/boot is on /dev/sdb1. When mounted, that directory is empty, but when unmounted, fsck shows 10 files in it taking up about 12 MB. Pretty clear why it wouldn't boot with nothing in /boot. Not sure what the 10 files fsck sees are, perhaps part of the filesystem itself (ext2, I think). I had never tried running fsck on an empty file system in a partition before.

/bin is missing entirely, so that's why "ls" stopped working. /usr/bin is still there, which is why reboot was OK.

/var/log/messages shows that the machine was logging what look like corrected disk errors (sense errors) for /dev/sdb1 for days before it failed.
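To put a number on those logged errors, a rough count per disk can help. This is only a sketch: the function name count_sense_errors is made up for illustration, and it assumes the kernel lines tag the disk as "[sdb]" and contain the word "Sense", which may not match the exact format in this machine's log.

```shell
# Count log lines that mention a SCSI sense condition for one disk.
# Assumption: kernel lines look roughly like
#   "... kernel: sd 0:2:1:0: [sdb] Sense Key : Medium Error [current]"
count_sense_errors() {
    logfile=$1   # path to the log, e.g. a copy of /var/log/messages
    disk=$2      # bare disk name without /dev/, e.g. sdb
    # First keep only lines tagged with this disk, then count the ones
    # that mention "sense" (case-insensitive).
    grep -F "[$disk]" "$logfile" | grep -ci 'sense'
}
```

From the rescue shell this would be run as something like "count_sense_errors /path/to/var/log/messages sdb", with the path adjusted to wherever the root filesystem is mounted.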

Tried copying the contents of another machine's /boot (which is supposed to be an exact copy of this one) into /boot, and rebooting, but grub didn't get any farther than it had before. Probably grub needs to be reinstalled, but with /bin missing, and who knows what else gone besides, it seems like a full OS reinstall would be in order.

Off the top of my head, if it weren't for the sense errors on /dev/sdb1, I would think that this might have been the result of an accidental (or hacker's)

  rm -rf /

Anybody run into a hardware/software glitch with symptoms like this on a similar system???

Is there some way on these sorts of Dells to run per-disk diagnostics from BIOS or UEFI even if the disks are already grouped into a virtual disk by the controller? I suspect that the disk which is /dev/sdb may really be on its way out, but I couldn't get smartctl to work off the DVD or from the copy on disk. (The smartctl commands used were tested on the twin machine, and they worked there.) The BIOS showed that SMART was disabled on all of the disks. Web searches for diagnostics for this controller all referenced software that requires a running OS, nothing built into the BIOS/UEFI. (It is set to use BIOS.)
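For reference, smartctl can address a physical disk behind a MegaRAID-type controller with its "-d megaraid,N" device type, where N is the controller's ID for that disk. Whether that applies here is an assumption: as far as I know the PERC H710P is LSI-based, and the commands already tried may have used this form. The helper below (smart_cmds is a made-up name) only prints candidate commands for a guessed range of IDs; it does not run anything.

```shell
# Print (do not run) one smartctl command per candidate physical-disk ID.
# Assumption: the controller exposes its disks through smartctl's
# "-d megaraid,N" device type; the right N values are not known in advance.
smart_cmds() {
    dev=$1     # block device of the virtual disk, e.g. /dev/sdb
    first=$2   # first disk ID to try
    last=$3    # last disk ID to try
    id=$first
    while [ "$id" -le "$last" ]; do
        echo "smartctl -a -d megaraid,$id $dev"
        id=$((id + 1))
    done
}
```

Running "smart_cmds /dev/sdb 0 3" lists four commands to try by hand, which keeps the guessing about disk IDs visible rather than hidden in a loop.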

I might start looking at non-RAID problems first. Maybe you have some bad memory or CPU? Errant rm could do it too, as you mentioned.

Skylar
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
