R. Scott Belford wrote:
Warren mentioned an issue familiar to me when he brought up the trouble he was having with a server locking up. He and Ray have concluded that it is a hardware problem. I have experienced something similar, and I am wondering what conditions can lead to a server lockup with no hints in the logs, and why it is definitely a hardware problem.

In my scenario, a server locked up after about 5 days of hard use. It has happened twice. Both times I was using Red Hat's 7.2 enterprise kernel (2.4.9-34enterprise). I blamed it on a default kernel setting that I did not understand, so I changed to the stock 2.4.9-34smp kernel with Red Hat 7.2. After about 30 days, the same lockup occurred. By lockup I mean that both remote and local terminal sessions are frozen, and pressing Ctrl+Alt+Del will not reboot the machine. My only hint is a series of "failed to set personality on (some pid #)" messages on the screen. An ugly power-down is the only "fix."

Upon reboot, there are no hints in the logs; that is, there are no hints in the /var/log directory. Perhaps I could look somewhere else. As far as the logs and the server are concerned, everything is just hunky-dory. Here is what I wonder:

What can cause this? Is the machine that is locking up on Warren and Ray staying up for as many days as mine? Can hardware problems take 30 days to manifest themselves?

I have been told that /proc/sys/fs/file-max must be set high enough to handle one's active files. If this number is reached, does a server lock? Is there a way to check how many files are open?
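
On the question of checking how many files are open: on Linux, /proc/sys/fs/file-nr reports three numbers, namely the file handles currently allocated, the handles allocated but unused, and the system-wide maximum (the same value as file-max). Reaching the limit would normally show up as failed opens ("Too many open files in system") rather than a silent hard lock, though that is general kernel behavior and not something confirmed for this machine. Below is a minimal sketch for reading the counters, assuming a Python 3 interpreter is available; the helper names parse_file_nr and in_use_fraction are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import NamedTuple

# Kernel-provided counters; the path is standard on Linux.
FILE_NR = Path("/proc/sys/fs/file-nr")


class FileHandles(NamedTuple):
    allocated: int  # handles the kernel has allocated
    unused: int     # allocated handles not currently in use
    maximum: int    # system-wide limit (same value as file-max)


def parse_file_nr(text: str) -> FileHandles:
    """Parse the three whitespace-separated integers in file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileHandles(allocated, unused, maximum)


def in_use_fraction(handles: FileHandles) -> float:
    """Fraction of the system-wide limit that is actually in use."""
    return (handles.allocated - handles.unused) / handles.maximum


if __name__ == "__main__":
    handles = parse_file_nr(FILE_NR.read_text())
    print(f"allocated={handles.allocated} unused={handles.unused} "
          f"max={handles.maximum} in_use={in_use_fraction(handles):.1%}")
```

Run periodically (from cron, say), this would show whether the count creeps toward the maximum over the days before a lockup.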

Is there another software or kernel setting that can lead to a lockup, say, max inodes or something?

If you have any suggestions, insights, or experiences that you can share, I would be most grateful.

scott

A great thing to do would be to run memtest86 on the system, especially if you have things randomly segfaulting. Bad memory can be tricky to spot and diagnose.

Another thing that happened to someone recently was the motherboard not setting the voltage correctly with AUTO. Forcing the voltage to the value in the spec sheets fixed his problems.

This definitely sounds like a hardware issue (possibly thermal shutdown?). Normally the kernel manages to at least throw up an Oops on hardware failure, but occasionally a hard lock is the result. If you can find something that reliably triggers the problem, you can go a long way toward diagnosing the cause. Another possibility, if it is software, is a problem in an interrupt handler, or some other situation where the kernel can't be interrupted and a driver never returns control to it.
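
Since a hard lock leaves nothing in the logs, one way to narrow down when it happens and what the machine was doing just before is a heartbeat file that is forced to disk on every write. The following is only a sketch under stated assumptions: Python 3 is available, the log path /var/tmp/heartbeat.log and the 60-second interval are arbitrary choices, and the function names heartbeat_line and write_heartbeat are hypothetical.

```python
import os
import time


def heartbeat_line(now: float, load: tuple) -> str:
    """Format one log line: UTC timestamp plus 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return f"{stamp} load={load[0]:.2f},{load[1]:.2f},{load[2]:.2f}\n"


def write_heartbeat(path: str) -> None:
    """Append a heartbeat line and force it to disk before returning."""
    with open(path, "a") as log:
        log.write(heartbeat_line(time.time(), os.getloadavg()))
        log.flush()
        # fsync so the line survives the hard power-down that follows a lock.
        os.fsync(log.fileno())


if __name__ == "__main__":
    # Assumed path and interval; adjust to taste.
    while True:
        write_heartbeat("/var/tmp/heartbeat.log")
        time.sleep(60)
```

After the next lockup, the last line in the file gives the time of the freeze to within a minute, and a climbing load average beforehand would point more toward software than toward a sudden hardware fault.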

--MonMotha
