OK. This has happened before, but I thought that by removing references to
non-existent hardware (like USB), it was solved...
It has always happened in the afternoon, almost always between noon and
4pm. But, as I said, I thought it was fixed...
I really feel bad having to bring this into a public forum, but it has
finally exceeded my ability to troubleshoot...
The following is a (slightly) modified report I made to the other 2
admins of the box in question. (I CC'd this version to them, as
well. Please feel free to cc them in your
responses: [EMAIL PROTECTED])
Ok... here are the facts:
Rigor (rigor.mortis.org) is our primary server. It is a P133/48 with a
~4GB hdd as / and a 20GB hdd as /home
It is running Mandrake 7.0 (kernel 2.2.14-15mdk), and it currently
handles all our services. (httpd, sshd, sendmail, pop3, imap, and DNS)
Between 13:11:15 and 13:34:33 this afternoon Rigor crashed.
There is only one entry in /var/log/messages that indicated any errors
immediately before the crash:
May 27 13:11:15 rigor named[391]: bad referral (199.198.151.in-addr.arpa
!< 2.199.198.151.IN-ADDR.ARPA)
The 199.198.151 class C is owned entirely by Ault Foods Ltd. in Ontario, CA,
and has absolutely nothing to do with us. I can find no reference to it
in our system.
The previous entry was named clearing its cache and posting the stats,
almost an hour before that, at 12:23:47.
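
For anyone checking the timeline, here is a small sketch (Python, my choice for illustration; the helper name gap is made up) that works out the two intervals from the timestamps quoted above. It assumes both stamps fall on the same day.

```python
from datetime import datetime


def gap(earlier, later):
    """Return the time between two same-day HH:MM:SS stamps as H:MM:SS."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return str(delta)


# Quiet period between named's stats entry and the bad referral.
print(gap("12:23:47", "13:11:15"))  # 0:47:28
# Window in which the crash must have happened.
print(gap("13:11:15", "13:34:33"))  # 0:23:18
```

So the log was silent for about 47 minutes before the bad referral, and the crash falls somewhere in the 23 minutes after it.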
When I arrived, the screen showed some sort of hex dump.
Several lines of stuff like:
[<c01e00>] [<c01100669>] ... ad nauseam.
Then at the bottom I saw this: (reproduced from the scribbled notes I
took)
Code: 89 50 04 b8 01 00 00 00 eb 03 90 31 c0 c7 41 04 00 00 00 00
Aiee, killing interrupt handler
Kernel Panic: Attempted to kill the idle task!
In interrupt handler - not syncing
And that's all she wrote. No response from anything but a hard reboot.
Because I had seen this sort of thing before (I thought it was a
hardware problem), I had written a cron job to keep an hourly log of such
things as top, ps -A, netstat, and w.
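
In case it helps anyone set up something similar, here is a minimal sketch of that kind of hourly snapshot. It is not the actual script: it is written in Python for illustration, and the command flags, the function name snapshot, and the log path in the usage note are all assumptions.

```python
import subprocess
from datetime import datetime

# Assumed command list; the exact commands and flags the real job uses
# are not reproduced here.
COMMANDS = [
    ["top", "-b", "-n", "1"],  # one batch-mode iteration of top
    ["ps", "-A"],
    ["netstat"],
    ["w"],
]


def snapshot(commands, log_path):
    """Append the output of each command to log_path under a timestamp."""
    with open(log_path, "a") as log:
        log.write("==== %s ====\n" % datetime.now().isoformat())
        for cmd in commands:
            log.write("--- %s ---\n" % " ".join(cmd))
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    universal_newlines=True,
                    timeout=60,
                )
                log.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing or hung command should not stop the snapshot.
                log.write("could not run: %s\n" % exc)
```

Calling snapshot(COMMANDS, "/var/log/eyeball.log") from an hourly cron entry would give one timestamped block per run (the path is only an example). Each command is wrapped separately so that one failure still leaves the rest of the snapshot in the log.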
Now once it was back up, I looked at the most recent logs captured by my
little "eyeball" script, which had dutifully recorded what it could at
13:01. Nothing looks too much out of the ordinary. Preliminary
investigation says it all jibes with /var/log/messages in that no one
was ssh'd in at the time.
Beyond that the only thing that caught my eye was a LOT of httpd
connections from fw.tctv.ne.jp - but since I reduced the max
connections, this shouldn't have caused a kernel panic.
Looking again....
top shows something funky in the CPU states:
CPU states: 2.1% user, 1.6% system, 0.0% nice, 0.6% idle
A little math says that only accounts for 4.3%. Anyone have a clue
where the other 95.7% of our processor went? It's a bit odd to say
the least, as top itself was shown as using 6.3% CPU.
As a comparison, here's the current CPU states:
CPU states: 3.3% user, 2.7% system, 0.0% nice, 93.9% idle
Adds up to 99.9%, which is close enough after rounding.
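
The sums can be checked with a few lines of Python (a sketch for illustration; the function name cpu_total is made up, and it simply adds every percentage it finds on the line):

```python
import re


def cpu_total(line):
    """Sum every percentage found in a top 'CPU states' line."""
    values = [float(p) for p in re.findall(r"([\d.]+)%", line)]
    return round(sum(values), 1)


crash = "CPU states: 2.1% user, 1.6% system, 0.0% nice, 0.6% idle"
now = "CPU states: 3.3% user, 2.7% system, 0.0% nice, 93.9% idle"

print(cpu_total(crash))                   # 4.3
print(cpu_total(now))                     # 99.9
print(round(100 - cpu_total(crash), 1))   # 95.7 unaccounted for
```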
Next, ps -Acf showed nothing out of the ordinary that I could see...
Mailq was empty...
w showed no one logged in.
netstat showed tons of connections from fw.tctv.ne.jp but all of them
www connections... nothing funky. All connections were shown as www,
including someone surfing in from what appears to be his
nameserver: ns1.guetali.fr
I found no evidence of a core dump. (ran slocate -u and locate core and
found nothing dated from today.)
While I was rebooting, we were suffering many very minor brownouts, but
the UPS should have taken care of that.
That's all I found.
The full eyeball log is available if anyone feels like reading it.
The /var/log/messages is pretty long, but I can send more of it if
needed.
Conclusion.....
I see 2 categories of what might have happened:
1. Named's bad referral caused something to freeze.
2. Whatever caused the kernel panic left no sign before freezing.
Anyone have any ideas? I'm at a loss.
Brian
---------------------------------------------------------------
| [EMAIL PROTECTED] Spam me and DIE! |
| Http://iwww.datasquire.net |
| Co-Founder & Co-Owner of |
| Data Squire Internet Services |
---------------------------------------------------------------