Re: [Beowulf] RAID question

Skylar Thompson Sat, 14 Mar 2015 07:51:06 -0700

On 3/13/2015 5:52 PM, mathog wrote:

A bit off topic, but some of you may have run into something similar.
Today I was called in to try and fix a server which had stopped working. Not my machine, the usual sysop is out sick. The
model is a Dell PowerEdge T320 with a Raid PERC H710P controller.
The symptoms reported were "it stopped working, could not find 'ls', and wouldn't reboot past grub". (Evidently it could find 'reboot'.)
Got into the BIOS and ran RAID consistency check, which took 3 hours. It didn't say if it had passed or failed, or put up any sort of status message whatsoever, but there were no failure lights lit on the disks.
On a reboot it gives:

  grub error 8: kernel must be loaded before booting.
It is a Centos 6.5 system, so booted it with an installation disk of that flavor, and dropped down into a shell.
This is where it gets strange.

/boot is in /dev/sdb1.  When mounted that directory is empty but
when unmounted fsck shows 10 files in it taking up about 12Mb. Pretty clear why it wouldn't boot with nothing in /boot. Not sure what the 10 files fsck sees are, perhaps part of the filesystem. (ext2 I think). I had never tried running fsck on an empty file system in a partition before.
/bin is missing entirely, so that's why "ls" stopped working. /usr/bin is still there, which is why reboot was OK.
/var/log/messages shows that the machine was logging what look like corrected disk errors (sense errors) for /dev/sdb1 for days before it failed.
Tried copying the contents of another machine's /boot (which is supposed to be an exact copy of this one) into /boot, and rebooting, but grub didn't get any farther than it had before. Probably grub needs to be reinstalled, but with /bin missing, and who knows what else gone besides, it seems like a full OS reinstall would be in order.
Off the top of my head, if it weren't for the sense errors on /dev/sdb1, I would think that this might have been the result of an accidental (or hacker's)
  rm -rf /
Anybody run into a hardware/software glitch with symptoms like this on a similar system???
Is there some way on these sorts of Dell's to run per disk diagnostics from BIOS or UEFI even if they are already grouped into a virtual disk by the controller? I suspect that the disk which is /dev/sdb may really be on its way out, but I couldn't get smartctl to work off the DVD or from the copy on disk. (The smartctl commands used were tested on the twin machine, and they worked there.) The BIOS showed that SMART was disabled on all of the disks. Web searches for diagnostics for this controller all referenced software that requires a running OS, nothing built into the BIOS/UEFI. (It is set to use BIOS.)
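
[Editorial sketch, not part of the original message.] One way to get a rough picture of how long the sense errors had been accumulating, and on which device, is to tally the matching lines in /var/log/messages. The following is only a minimal sketch: the regular expression and the function name count_sense_errors are assumptions made for illustration, and the real kernel messages on this machine may be worded differently (for example, they may be tagged [sdb] rather than with the partition name sdb1).

```python
import re
from collections import Counter

# Assumed line shape (hypothetical, adjust to the actual log):
#   "... kernel: sd 0:2:1:0: [sdb] Sense Key : Recovered Error [current]"
# The pattern captures the bracketed device tag on lines mentioning "Sense".
SENSE_LINE = re.compile(r"\[(sd[a-z]+)\].*\bSense\b", re.IGNORECASE)


def count_sense_errors(lines):
    """Return a Counter mapping device name to the number of sense-related lines."""
    counts = Counter()
    for line in lines:
        match = SENSE_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Typical use from a rescue shell, with the damaged root mounted somewhere:
#   with open("/mnt/sysimage/var/log/messages", errors="replace") as log:
#       print(count_sense_errors(log).most_common())
```

The mount point /mnt/sysimage in the usage comment is the usual CentOS rescue-mode location, but it is an assumption here; use whatever path the damaged root filesystem was actually mounted on.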

I might start looking at non-RAID problems first. Maybe you have some bad memory or CPU? Errant rm could do it too, as you mentioned.


Skylar
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
