John McSwain wrote:
> Wit wrote:
>> John McSwain wrote:
>> <snip>
>>
>>> Have been trying to compile a new kernel today and I keep getting
>>> various errors no matter which kernel. Just tried to compile the
>>> same kernel still in use with same .config and get errors.
>> Are they repeatable? That includes same "place", etc? If not, what's
>> the nature of these? Kernel panics? Failures in compile, link?
>
> No, stopped in different place. Some errors said it was a gcc error and
> to report it as a bug. Others gave a line where it stopped. I did get
> this from the kernel one time:
Well, it could still be a kernel problem, but very often this would tend
to indicate an *intermittent* hardware problem. We can't rule out the
build utilities, but that's a very low probability, IMO.
>
> Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
> Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000
> mapping:00000000 mapcount:-2048 count:0
> Feb 22 12:56:39 server kernel: Trying to fix it up, but a reboot is
> needed
> Feb 22 12:56:39 server kernel: Backtrace:
> Feb 22 12:56:39 server kernel: [<c013f529>] bad_page+0x69/0xa0
> Feb 22 12:56:39 server kernel: [<c013fe35>]
> get_page_from_freelist+0x385/0x3c0
> Feb 22 12:56:39 server kernel: [<c013feda>] __alloc_pages+0x6a/0x310
> Feb 22 12:56:39 server kernel: [<c0148767>]
> __handle_mm_fault+0x747/0x890
> Feb 22 12:56:39 server kernel: [<c01775ad>] mntput_no_expire+0x2d/0xb0
> Feb 22 12:56:39 server kernel: [<c015b45c>] __fput+0x11c/0x150
> Feb 22 12:56:39 server kernel: [<c0112946>] do_page_fault+0x136/0x63f
> Feb 22 12:56:39 server kernel: [<c0112810>] do_page_fault+0x0/0x63f
> Feb 22 12:56:39 server kernel: [<c010325b>] error_code+0x4f/0x54
> Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
> Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000
> mapping:00000000 mapcount:-2048 count:0
> Feb 22 12:56:39 server kernel: Trying to fix it up, but a reboot is
> needed
> Feb 22 12:56:39 server kernel: Backtrace:
> Feb 22 12:56:39 server kernel: [<c013f529>] bad_page+0x69/0xa0
> Feb 22 12:56:39 server kernel: [<c013fe35>]
> get_page_from_freelist+0x385/0x3c0
> Feb 22 12:56:39 server kernel: [<c013feda>] __alloc_pages+0x6a/0x310
> Feb 22 12:56:39 server kernel: [<c0148767>]
> __handle_mm_fault+0x747/0x890
> Feb 22 12:56:39 server kernel: [<c01775ad>] mntput_no_expire+0x2d/0xb0
> Feb 22 12:56:39 server kernel: [<c015b45c>] __fput+0x11c/0x150
> Feb 22 12:56:39 server kernel: [<c0112946>] do_page_fault+0x136/0x63f
> Feb 22 12:56:39 server kernel: [<c0112810>] do_page_fault+0x0/0x63f
> Feb 22 12:56:39 server kernel: [<c010325b>] error_code+0x4f/0x54
Another strong indication that this is hardware related, I think. That's the
trouble with all these newfangled gadgets (anything more recent than
pencil and paper) - so much is hidden and interrelated that it becomes
impossible to diagnose without lots of tools and experience. And when
they are intermittent, it's like trying to find that occasional noise in
the rear of the car that never happens when you are looking for it.
I believe you can rest assured that the user applications (cc et al) are
not the cause of this.
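On the "are they repeatable, same place?" question: one rough way to see whether the kernel keeps complaining about the same page is to tally the page: addresses in the log. This is only a sketch. It assumes unwrapped syslog lines shaped like the ones quoted above, and tally_bad_pages is a name I made up, not a standard tool.

```shell
#!/bin/sh
# tally_bad_pages: read syslog text on stdin and print each page address
# that appears in a kernel "page:..." report, followed by how many times
# it was seen. Assumes lines such as
#   Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000
# (hypothetical helper name; adjust the pattern if your log format differs)
tally_bad_pages() {
    sed -n 's/.*kernel: page:\([0-9a-f][0-9a-f]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Usage would be something like "tally_bad_pages < /var/log/messages", where the log path depends on how your syslog is set up. If one address shows up over and over, that is at least a hint about repeatability.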
Looks like trouble in memory management land. Could be a kernel bug. Could
be a hardware MM fault: Translation Lookaside Buffer (TLB), page tables,
swap related, on-board cache (CPU or mainboard), bad memory, or a power
supply sized just a tad too small. (I had a nice case of this when I once
put a new Epox MB into a chassis I originally built on-the-cheap for an
Acer, using a PS large enough for the Acer. Random errors, apparently
memory or CPU related, abounded. Thanks to my background, I immediately
whipped out the Epox user manual and saw a nice little phrase about
minimum and recommended PS capacity, which happened to be about 50
watts more than the Acer's PS provided, *at a minimum*. A nice new 575 watt
unit with really good capacities on all the rails eliminated all
randomness! :-O )
If the problem occurs again and the new kernel exhibits similar
symptoms, you could selectively disable cache in the BIOS to try and see
if that makes a difference. Things may run noticeably slower.
Do you have a swap space? "swapon -s" will show size and usage. Might
want to do something like this after firing up a compile again.
    while [ 1 ] ; do
        /sbin/swapon -s
        sleep 15
    done
It can be interrupted with a <CTL-C>.
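If the raw numbers are hard to eyeball, a small filter can turn the "swapon -s" output into a percentage per swap area. A minimal sketch, assuming the classic column layout (Filename, Type, Size, Used, Priority); swap_percent is my own helper name.

```shell
#!/bin/sh
# swap_percent: read "swapon -s" style text on stdin and print, for each
# swap area, its name and the percentage of it currently in use.
# Assumes the columns are: Filename Type Size Used Priority, with one
# header line first. (Hypothetical helper, not part of any package.)
swap_percent() {
    awk 'NR > 1 && $3 > 0 { printf "%s %d%%\n", $1, ($4 * 100) / $3 }'
}

# Possible use inside the same kind of loop:
#   while true ; do /sbin/swapon -s | swap_percent ; sleep 15 ; done
```

The percentage is truncated to a whole number, which is plenty for spotting swap creeping toward full.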
If you see swap usage grow to very near or exceed capacity, this may be
related, although the kernel is supposed to take some preemptive action
to prevent catastrophic results from this cause. If you have a bad spot
in the swap space, it could also cause this. However, with a year old
unit, we should be able to presume the HDs will properly handle
soft-error recovery and do alternate block assignment on hard errors,
eliminating that particular spot as a problem on the next attempt to use
it. Often many spots will occur at the same time and it may take a lot
of cycles for them to finally be re-mapped and cease causing problems.
In single user mode, first turn the swap area off with swapoff, then you
might want to do a dd with if=/dev/zero and of=<your swap partition or
file> several times. If your swap is a partition, dd should end with an
error indicating it couldn't write past the last block(s). If it's a file,
BE SURE TO LIMIT THE NUMBER OF BLOCKS WRITTEN with an appropriate count=
parameter. Also, use a large block size to speed things up a bit.
Then REMEMBER TO MKSWAP on it again.
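To keep the order of those steps straight, here is a sketch that only PRINTS the commands instead of running them, since a typo in of= can ruin your day. The function name swap_rewrite_plan, the bs=1M block size and the optional count argument are my own choices. The swapoff at the front is there because you do not want to overwrite swap that is still in use.

```shell
#!/bin/sh
# swap_rewrite_plan TARGET [COUNT_MB]
# Print (do NOT execute) the commands to zero-fill a swap area and
# re-create it. TARGET is your swap partition or swap file. Give
# COUNT_MB for a swap file so dd stops at the intended size; leave it
# off for a partition, where dd stops by itself at the end of the device.
# (Hypothetical helper; review the output, then run the lines by hand.)
swap_rewrite_plan() {
    echo "swapoff $1"
    if [ -n "$2" ]; then
        echo "dd if=/dev/zero of=$1 bs=1M count=$2"
    else
        echo "dd if=/dev/zero of=$1 bs=1M"
    fi
    echo "mkswap $1"
    echo "swapon $1"
}
```

For example, "swap_rewrite_plan /dev/hda2" for a partition or "swap_rewrite_plan /swapfile 512" for a 512 MB file; both names are placeholders for whatever your system actually uses.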
>
>> As Ken said in the other post, and I suggested in my original reply,
>> memtest that booger. Note if failures are repeatable and how long it
>> takes. Does it take multiple passes (indicating slow overheating)
>> before the errors appear.
>
> After the above I did get and run memtest for a few minutes with no
> errors. Will find a period to run it longer next time.
>
>>>> <snip>
>> Were all the fans running when you powered it up *before* closing it
>> back up? Did you carefully push each memory stick, PCI card, etc. into
>> the slots? <snip>
> Fans were running and I pushed and shoved memory and cards pretty well.
> This computer is in a fairly cold area this time of year about 55
> degrees today. During the summer the temperature would often be in the
> low 80's and no problems at that time.
Hmmm ... it's amazing the way associative memory works in hum0ns. The
conversational interactions causing recall of things long forgotten.
Speaking of "pushing", I once had an MB that either had a "cold solder
joint" (less likely in today's environment) or had developed a crack in
the circuit traces (it was a cheap 2 layer board). Random errors.
Stymied. Finally noticed it seemed to be temperature related. But could
not reliably reproduce until I thought of the possibility of circuit
weakness. I opened/fired it up and began gently pushing on memory, cards
and the MB itself until it crashed and burned again, and again, ...
repeatable on-demand.
Back then, it was an expensive fix as MBs were *not* cheap.
><snip>
>> have you looked at the boot logs to see if the journaling file system
>> (ext3) reports any oddities?
>
> No oddities.
>
>> Another possibility. Weak power. Even with surge suppression strips,
>> there is a possibility some surge weakened the power supply. Or if
>> it's old and just large enough for the load, maybe it got really warm
>> one day and gave up some capacity.
>
> System is on an APC UPS
That should protect against all but the most violent surges. What
size is your PS? Does the MB manual state a recommended size? Are the
rails capable of enough amperage for whatever you have? (We haven't asked
if this is a Cadillac or Mini Cooper.) The HDs, CDs, etc. should list
power requirements if you suspect this might be an issue. A good PS will
list not only max wattage, but also have good capacity on the rails (in
today's environment, I have no idea what is "sufficient").
>
>> How old is the unit? <snip>
> Unit built by me in August 2006 and placed in current service in January
> 2007 where it has performed flawlessly until the last several weeks.
>
> As an additional note, after running memtest86 for just a few minutes I
> decided to try to build a 2.6.16.60 kernel figuring if I restarted the
> "make" enough maybe it would build. This time make ran straight through
> with no errors. When the system starts having problems again I'll boot
> to this new kernel and see what happens.
Golly, that sure makes it sound like heat related. Not ambient
temperature, which would have an effect, but component temperatures. If
you have lm-sensors installed, and an instrumented CPU (may be since
it's not that old), you can monitor CPU temperature and see if that
seems to correlate to problem occurrence. There are also some other
monitoring packages available you might try.
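To make that correlation easier, you could log the sensors output with a syslog-style timestamp on each line and compare it against the kernel messages afterwards. A rough sketch, assuming lm-sensors is installed and "sensors" prints something useful for your board; stamp_lines and the log file name are made up for illustration.

```shell
#!/bin/sh
# stamp_lines TIMESTAMP
# Prefix every line read from stdin with the given timestamp, so that
# sensor readings can later be lined up against kernel log entries.
# (Hypothetical helper name; it does not interpret the readings at all.)
stamp_lines() {
    awk -v ts="$1" '{ print ts, $0 }'
}

# Possible use while a compile is running (temps.log is a placeholder):
#   while true ; do
#       sensors | stamp_lines "$(date '+%b %d %H:%M:%S')" >> temps.log
#       sleep 15
#   done
```

The date format is chosen to resemble the one in the syslog lines quoted earlier; nothing here depends on which chips or labels sensors reports.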
Most CPU literature will tell the acceptable operating ranges. If it's
an Intel CPU, they run hotter than AMDs and use more "juice".
</ end of core dump> 8-O
<snip sig stuff>
HTH
--
Bill