I've posted several questions (under two other ids, though under the name
Chris) since March while trying to bring up a Tyan quad dual s4882. I've run it on 6.0
STABLE as of about March, 6.1 RELEASE in several flavors from May
through September and finally 6.2 PRERELEASE as of mid-October. I
found issues early on with transition states on the bge interface, found
a memory chip that was marginal and have tested and tested throughout
this period. Every time we place the system back in production, it
hangs after 4-7 days of running, with no indication of what the problem
might be.

I've tried to think of where the problems could be and it would seem that
6.x AMD64 exhibits this type of issue for many individuals who put a
server under heavy load. I've seen many unresolved posts here and
elsewhere that describe strikingly similar scenarios. When in full production,
it's running 5 websites out of a prefork non-ssl Apache 2.2.3, light
ports-installed mysql 4.19 access via perl cgi (not mod_perl) and heavy
access to perl-generated and flat html archive pages (for the sake of
discussion, I just counted 300K page views in one day on one of the sites).
This computer does not breathe hard at all; at peak hours top shows it
staying 80+% idle. I've not opened up any service to the point where it
could fill the 8GB of RAM by spawning too many processes. Process count peaks at about 180 because
it services the request backlog so quickly. Active memory is usually about
250 MB and inactive varies. The configuration is very simple and it runs
nothing else other than rsyncd and sshd. The hang seems to have nothing
to do with peak access times; in fact, it will suddenly hang at our slowest
time of the day. I ran for over a month without a hang when the
machine was relegated to low-traffic websites.

We've spent a lot to get clean dedicated power and installed a hardware
monitoring device to let us see what's going on, but it has been no help.
The temperature of the computer room is nicely down, given that it's winter
here and the facility is kept fairly cold. There is no AC, but the computer
room remains at about 70 degrees F.

I'm aware of the warning about 6.2 PR in production but the symptoms
have not deviated amongst any 6.x version and 6.2 PR was the only way
to pick up the extensive changes to the bge driver without hacking. I need
opinions on how to debug and possibly even who I should go to and pay
to take a closer look at this scenario. Here are the questions and ideas
I've thought of. Is there any validity in these, or have you other ideas?

1. I've wondered if AMD64 SMP was a bad idea. Should I be using i386
for stability? It's the one thing I've not tried.
2. Should acpi be off as a precaution, just to rule it out? It's not
blacklisted. I'd turned it off for a long time when testing but the
results were muddy.
3. Should I reduce the system to 4GB of RAM to attempt to skirt the issue?
Is 6.x less reliable over 4GB?
4. Where can I find the meanings of all vmstat -z variables? I'm dumping
them to another server every two minutes, giving the percentage change
on each sample, but am unsure if I can correlate this to much of
anything meaningful without good definitions. I've only just started this
but will need information.
5. Does mysql use linux threads and could that be the mistake that's
taking us out?
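
The sampling described in item 4 could be sketched roughly as follows. This
is an illustration under assumptions, not the script actually in use. It
assumes vmstat -z prints one zone per line as "NAME: SIZE, LIMIT, USED, FREE,
REQUESTS" (the column set may differ between releases), and the names
parse_vmstat_z, percent_change and USED_COLUMN are made up for the example.

```python
# Assumed column order after the zone name: SIZE, LIMIT, USED, FREE, REQUESTS.
USED_COLUMN = 2  # hypothetical constant: index of USED in that order


def parse_vmstat_z(text):
    """Return {zone_name: [int, ...]} from `vmstat -z` style text."""
    zones = {}
    for line in text.splitlines():
        # Zone names may contain spaces, so split on the last colon.
        name, sep, rest = line.rpartition(":")
        if not sep:
            continue  # header or blank line
        try:
            values = [int(field) for field in rest.split(",")]
        except ValueError:
            continue  # not a data line in the assumed format
        zones[name.strip()] = values
    return zones


def percent_change(prev, curr, column=USED_COLUMN):
    """Percentage change of one column between two parsed samples.

    A zone whose previous value was 0 and is now non-zero maps to None,
    since a percentage change from zero is undefined.
    """
    changes = {}
    for name, values in curr.items():
        if name not in prev:
            continue  # zone appeared between samples; nothing to compare
        old, new = prev[name][column], values[column]
        if old == 0:
            changes[name] = None if new else 0.0
        else:
            changes[name] = 100.0 * (new - old) / old
    return changes


# Made-up sample data, two minutes apart.
PREV_SAMPLE = """ITEM SIZE LIMIT USED FREE REQUESTS

UMA Kegs: 140, 0, 66, 6, 66
mbuf: 256, 0, 200, 56, 1000
"""
CURR_SAMPLE = """ITEM SIZE LIMIT USED FREE REQUESTS

UMA Kegs: 140, 0, 99, 6, 120
mbuf: 256, 0, 200, 56, 1400
"""

if __name__ == "__main__":
    prev = parse_vmstat_z(PREV_SAMPLE)
    curr = parse_vmstat_z(CURR_SAMPLE)
    for zone, pct in sorted(percent_change(prev, curr).items()):
        print(zone, pct)
```

A zone whose USED count keeps climbing from sample to sample without ever
coming back down would be the first thing to look at in output like this,
though that by itself does not say what the zone is used for.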

Even wild goose chases will be welcome at this point ;-).

Chris Pratt

