Brian Kroth wrote:
I have no problems with 2.6.20-r10. I ran it for 4 hours last night and some weeks before this. 2.6.20-r6 before that, again no problems. 2.6.22-r8 and 2.6.23 both die as soon as cactid or nagios start running. I really don't think this is bad RAM anymore. I'll see if I can get an exact test for others to try. Any other kernel debug tweaks I should try?

Thanks for all your help,
Brian
I haven't found a way of reproducing this on other machines yet because it takes a lot of time to set up cacti. In playing around with cactid, though, what I've found is that the error happens /nearly/ every time I specify something like this:
cactid --verbosity=5 -f 1 -l 100

but not ever (yet) with this:

cactid --verbosity=5 -f 1 -l 10

With sec monitoring kern.log for "Bad page state in 'cactid'" and killing cactid when that happens, I've noticed that the last line of output from cactid is always something like this:
10/31/2007 10:22:32 PM - CACTID: Poller[0] Host[42] DEBUG: The POPEN returned the following File Descriptor 5
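For anyone who wants to try the same watch-and-kill step without sec, here is a minimal Python sketch of the idea. It is not the sec rule used above; the function names, the polling interval, and the idea of passing in a log path and PID are all assumptions made for illustration, and the pattern is based on the "Bad page state in process '...'" wording that the kernel log uses.

```python
import os
import re
import signal
import time

# Matches the kernel message quoted in this thread, e.g.
# "Bad page state in process 'cactid'". Captures the process name.
BAD_PAGE_RE = re.compile(r"Bad page state in process '([^']+)'")


def bad_page_process(line):
    """Return the process name from a 'Bad page state' line, else None."""
    match = BAD_PAGE_RE.search(line)
    return match.group(1) if match else None


def follow(path, poll_seconds=0.5):
    """Yield lines appended to path, roughly like `tail -f`.

    Assumption: the log is not rotated while we are watching it.
    """
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, os.SEEK_END)  # start at the current end of the log
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


def watch_and_kill(log_path, pid, target="cactid"):
    """Send SIGTERM to pid on the first bad-page report naming target."""
    for line in follow(log_path):
        if bad_page_process(line) == target:
            os.kill(pid, signal.SIGTERM)
            return line  # the log line that triggered the kill
```

Something like watch_and_kill("/var/log/kern.log", cactid_pid) would be the intended use, where both the log path and cactid_pid are placeholders for whatever applies on the machine being tested. Stopping cactid at the first report keeps its final line of output close to the moment the kernel complained.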
The kern.log shows this:

Oct 31 22:30:09 tux-mc Bad page state in process 'cactid'
Oct 31 22:30:09 tux-mc page:c14070c0 flags:0x40000001 mapping:00000000 mapcount:0 count:0
Oct 31 22:30:09 tux-mc Trying to fix it up, but a reboot is needed
Oct 31 22:30:09 tux-mc Backtrace:
Oct 31 22:30:09 tux-mc [<c044bf67>] bad_page+0x63/0x92
Oct 31 22:30:09 tux-mc [<c044c90c>] free_hot_cold_page+0x7c/0x17f
Oct 31 22:30:09 tux-mc [<c0455c24>] do_wp_page+0x223/0x3ed
Oct 31 22:30:09 tux-mc [<c0456f24>] __handle_mm_fault+0x2ad/0x305
Oct 31 22:30:09 tux-mc [<c0414616>] do_page_fault+0x1da/0x7d5
Oct 31 22:30:09 tux-mc [<c041c2d5>] do_fork+0x15d/0x217
Oct 31 22:30:09 tux-mc [<c041443c>] do_page_fault+0x0/0x7d5
Oct 31 22:30:09 tux-mc [<c06e8db5>] error_code+0x75/0x80
Oct 31 22:30:09 tux-mc [<c06e0000>] svc_defer+0xfa/0x139
Oct 31 22:30:09 tux-mc =======================

The version of cactid in portage is slightly old. After updating from 0.8.6i-r1 to 0.8.6j the problem seems to happen less frequently, but it still happens. With that in mind, might this actually be a software problem and not a kernel problem? Shouldn't PaX be preventing userland software from screwing up the page table?
I can send more kernel output if anyone's interested. Any thoughts on what else I should be doing to test this?
Thanks, Brian
