Casey,
We've been seeing issues like this for probably the last year. I was never able to tie it to any particular action. We installed remote reboot hardware and called it a day.
Some of the machines showed strange activity, but across the larger group I could never find a pattern to it. It almost seems as if the machine cannot spawn any new processes.
I can't help except to say you're not alone.
Rob
Casey Allen Shobe - SeattleServer Mailing Lists wrote:
Hey all,
We're seeing occasional issues with a bunch of machines we have in a datacenter, most of which are currently running Gentoo. The machines run fine for days, weeks, even months, and then lock up solid. The box still pings and an nmap scan shows all the normal ports open, but nothing responds on any port and nothing shows up in the system logs. On the occasions when we've had console access during a lockup, a login prompt would appear, but it would hang if you tried to log in.
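One way to tell this state apart from a normal outage is to check whether a port that still accepts connections ever sends any data. The kernel completes the TCP handshake for a listening socket on its own, so a port can look open in a scan even when the daemon behind it is stuck. Below is a minimal sketch in Python; the function name `probe_service`, the result labels, and the timeout value are illustrative assumptions, and it only makes sense against services that send a banner on connect (such as SSH or SMTP).

```python
import socket


def probe_service(host, port, timeout=5.0):
    """Classify a banner-sending service (hypothetical helper).

    Returns "closed" if the TCP connection cannot be established,
    "open-silent" if the port accepts the connection but sends nothing
    within the timeout, and "responding" if any banner bytes arrive.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out during the handshake.
        return "closed"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)
        except OSError:
            # Includes socket.timeout: handshake succeeded, no banner came.
            return "open-silent"
    return "responding" if data else "open-silent"
```

Run periodically from another host, a result of "open-silent" on the SSH port would match the symptom described above: the port looks open, but the service never answers.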
This generally indicates hardware issues to me, but it has been happening across a wide array of both well-tested and new machines. In addition, it happens on machines running Red Hat 7.1 through 9.0 as well as Gentoo. The problem seems random, and there is almost always close to zero load on the machine when it locks up (only once were we actually using the machine at the time, and it locked up while uncompressing a tar file).
The Gentoo systems use the deadline I/O scheduler as it's deemed the most reliable, but this has shown up with the default anticipatory I/O scheduler as well.
The only common factor seems to be that they are all plugged into a questionable HP Procurve switch that we've been contemplating replacing. Would replacing it simply be a waste of our time (I don't think a buggy switch should be able to lock up boxes...)? Any recommendations for what to investigate at this point?
Cheers,
