Feargal Reilly wrote:

Hi,

I have a server which went down overnight, and
would not subsequently boot. A reboot was performed by
facilities staff before I got to look at it so I don't know what
was showing on the console. The reason for the outage is
unknown, and nothing showed in /var/log/messages, other than
routine ntpd time sync messages.

The server in question is an Intel SR1425BK1 server running
FreeBSD 6.2 amd64 GENERIC with a SATA RAID-1 array
provided by an onboard LSILogic MegaRAID controller.

When booted, it would pass the various BIOS screens without
problem, the RAID utility would say that the array was optimal,
and then FreeBSD would start to boot, but it couldn't get past
boot2:

FreeBSD/amd64 BOOT
Default: 0:ad(0,a)/boot/loader
boot:

At this point, the server emitted a single continuous beep, and
nothing else happened. Keyboard input did nothing, although
Ctrl-Alt-Del still worked, and at one point a heart symbol
appeared after I hit keys randomly for a while.

My question is, what could have caused this failure?
My initial guesses were either a memory failure or a really
badly corrupted boot sector, but I'm not convinced by either
explanation, for reasons outlined below.

I urgently needed the data to be online again, so I yanked one
disk out of the machine and inserted it into another host, and
took the server back to the office.

There, I yanked a memory module, and it booted fine, albeit
complaining about the degraded RAID array. However, when I
reinserted the memory, it continued to boot. I didn't have the
foresight to try it before I fiddled with the disks, but I can't
imagine that it had been seated incorrectly as the server had
been up for two months without problem. Also, the BIOS tests
passed, although I know they aren't too in depth. I'll run
sysutils/memtest anyway, and see what that throws up.

Meanwhile, I inserted a replacement disk and rebuilt the RAID-1
array, and it is still booting fine, so my best guess now is a
corrupted boot sector. The disk that I removed to insert into
another host was ad4, which I'm guessing is the disk that it
would have been trying to boot from in the first place. So a
bad sector could be responsible, but it would seem to be very
convenient, as there does not appear to be any other data
corruption on the disk.

Also, I've run a short SMART test, and everything is okay as far
as it is concerned. I'm in the process of running a long test,
but that won't finish before I leave the office. If it were a
corrupted sector, would it be able to get to boot2?

Any other suggestions as to what caused the failure? I know I've
changed the conditions and may never be able to reproduce it
(nor do I want to), but if I have failing hardware, I'd like a
best guess as to where it is.

Thanks for your time,

-fr.

Aloha,

I have had memory chips walk out of their slots on several occasions. Sometimes it's vibration, and in Hawaii we also have occasional humidity issues that tend to cause this. I have learned to spray the sockets and card connections with contact cleaner about every 6 months to avoid this problem, especially in areas where servers are not in a cool environment.



~Al Plant - Honolulu, Hawaii -  Phone:  808-284-2740
 + http://hawaiidakine.com + http://freebsdinfo.org + [EMAIL PROTECTED] +
 + http://internetohana.org   - Supporting - FreeBSD 6.* - 7.* +
"All that's really worth doing is what we do for others." - Lewis Carroll


_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
