Re: System based on LFS 6.2 has become unstable over time

Wit Sat, 23 Feb 2008 05:04:03 -0800

John McSwain wrote:
> Wit wrote:
>> John McSwain wrote:
>> <snip>
>>
>>> Have been trying to compile a new kernel today and I keep getting
>>> various errors no matter which kernel.  Just tried to compile the
>>> same kernel still in use with same .config and get errors.
>> Are they repeatable? That includes same "place", etc? If not, what's
>> the nature of these? Kernel panics? Failures in compile, link?
> 
> No, stopped in different place.  Some errors said it was a gcc error and 
> to report it as a bug.  Others gave a line where it stopped.  I did get 
> this from the kernel one time:


Well, it could still be a kernel problem, but very often this would tend 
to indicate an *intermittent* hardware problem. We can't rule out the 
build utilities, but that's a very low probability, IMO.
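
If you want to pin down the "repeatable or not" question, something along
these lines may help. It is only a rough sketch: repeat_build is a name I
made up, and it just reruns whatever command you give it and prints the exit
status and last output line of each run so you can compare where it stopped.

```shell
# Run a command several times; for each run print the exit status and the
# last line of its output, so stopping points can be compared across runs.
# Usage: repeat_build RUNS COMMAND [ARGS...]
# (repeat_build is a hypothetical helper, not part of any package.)
repeat_build() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        # Capture stdout and stderr together; remember the exit status.
        if out=$("$@" 2>&1); then status=0; else status=$?; fi
        last=$(printf '%s\n' "$out" | tail -n 1)
        printf 'run %s: exit=%s last="%s"\n' "$i" "$status" "$last"
        i=$((i + 1))
    done
}

# Example, from the top of the kernel source tree:
#   repeat_build 3 sh -c 'make clean >/dev/null && make'
```

If every run dies at a different spot with the same source and .config,
that points away from the source tree and toward the machine.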

> 
> Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
> Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000 
> mapping:00000000 mapcount:-2048 count:0
> Feb 22 12:56:39 server kernel: Trying to fix it up, but a reboot is 
> needed
> Feb 22 12:56:39 server kernel: Backtrace:
> Feb 22 12:56:39 server kernel:  [<c013f529>] bad_page+0x69/0xa0
> Feb 22 12:56:39 server kernel:  [<c013fe35>] 
> get_page_from_freelist+0x385/0x3c0
> Feb 22 12:56:39 server kernel:  [<c013feda>] __alloc_pages+0x6a/0x310
> Feb 22 12:56:39 server kernel:  [<c0148767>] 
> __handle_mm_fault+0x747/0x890
> Feb 22 12:56:39 server kernel:  [<c01775ad>] mntput_no_expire+0x2d/0xb0
> Feb 22 12:56:39 server kernel:  [<c015b45c>] __fput+0x11c/0x150
> Feb 22 12:56:39 server kernel:  [<c0112946>] do_page_fault+0x136/0x63f
> Feb 22 12:56:39 server kernel:  [<c0112810>] do_page_fault+0x0/0x63f
> Feb 22 12:56:39 server kernel:  [<c010325b>] error_code+0x4f/0x54
> Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
> Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000 
> mapping:00000000 mapcount:-2048 count:0
> Feb 22 12:56:39 server kernel: Trying to fix it up, but a reboot is 
> needed
> Feb 22 12:56:39 server kernel: Backtrace:
> Feb 22 12:56:39 server kernel:  [<c013f529>] bad_page+0x69/0xa0
> Feb 22 12:56:39 server kernel:  [<c013fe35>] 
> get_page_from_freelist+0x385/0x3c0
> Feb 22 12:56:39 server kernel:  [<c013feda>] __alloc_pages+0x6a/0x310
> Feb 22 12:56:39 server kernel:  [<c0148767>] 
> __handle_mm_fault+0x747/0x890
> Feb 22 12:56:39 server kernel:  [<c01775ad>] mntput_no_expire+0x2d/0xb0
> Feb 22 12:56:39 server kernel:  [<c015b45c>] __fput+0x11c/0x150
> Feb 22 12:56:39 server kernel:  [<c0112946>] do_page_fault+0x136/0x63f
> Feb 22 12:56:39 server kernel:  [<c0112810>] do_page_fault+0x0/0x63f
> Feb 22 12:56:39 server kernel:  [<c010325b>] error_code+0x4f/0x54

Another strong indication that this is hardware related, I think. That's 
the trouble with all these newfangled gadgets (anything more recent than 
pencil and paper) - so much is hidden and interrelated that it becomes 
impossible to diagnose without lots of tools and experience. And when 
they are intermittent, it's like trying to find that occasional noise in 
the rear of the car that never happens when you are looking for it.

I believe you can rest assured that the user applications (cc et al) are 
not the cause of this.
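
To see whether the kernel only ever names 'cc1' or whether other processes
show up too, you could tally the reports. A minimal sketch, assuming your
kernel messages end up in a plain-text log; the function name and the log
path in the example are my assumptions, so adjust to your setup.

```shell
# Read kernel log text on stdin and count "Bad page state" reports per
# process name, most frequent first.
# (bad_page_tally is a hypothetical helper name.)
bad_page_tally() {
    grep -o "Bad page state in process '[^']*'" \
        | sed "s/.*process '\(.*\)'/\1/" \
        | sort | uniq -c | sort -rn
}

# Example (the log path is an assumption, use wherever syslog writes
# kernel messages on your system):
#   bad_page_tally < /var/log/kern.log
```

The process named is just the one that happened to be asking for a page at
the time, so a spread of unrelated names would fit the hardware theory.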

Looks like trouble in memory management land. Could be a kernel bug. Could 
be a hardware MM fault: Translation Lookaside Buffer (TLB), page tables, 
swap related, on-board cache (CPU or mainboard), bad memory, or a power 
supply sized just a tad too small.

I had a nice case of that last one when I once put a new Epox MB into a 
chassis I originally built on-the-cheap for an Acer, using a PS large 
enough for the Acer. Random errors, apparently memory or CPU related, 
abounded. Thanks to my background, I immediately whipped out the Epox 
user manual and saw a nice little phrase about minimum and recommended 
PS capacity, which happened to be about 50 watts more than the Acer PS 
provided, *at a minimum*. A nice new 575 watt PS with really good 
capacities on all the rails eliminated all randomness! :-O

If the problem occurs again and the new kernel exhibits similar 
symptoms, you could selectively disable cache in the BIOS to try and see 
if that makes a difference. Things may run noticeably slower.

Do you have swap space? "swapon -s" will show size and usage. Might 
want to do something like this after firing up a compile again.

     while true ; do
        /sbin/swapon -s
        sleep 15
     done

It can be interrupted with Ctrl-C.

If you see swap usage grow to very near capacity, this may be 
related, although the kernel is supposed to take some preemptive action 
to prevent catastrophic results from this cause. If you have a bad spot 
in the swap space, it could also cause this. However, with a year-old 
unit, we should be able to presume the HDs will properly handle 
soft-error recovery and do alternate block assignment on hard errors, 
eliminating that particular spot as a problem on the next attempt to use 
it. Often many bad spots will appear at the same time, and it may take a 
lot of cycles for them to finally be re-mapped and cease causing problems.
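
If eyeballing the raw numbers gets tedious, the Size and Used columns can be
boiled down to a percentage. A rough sketch, assuming the usual column
layout (Filename, Type, Size, Used, Priority) and no spaces in the swap
file name; swap_percent is a name I made up.

```shell
# Read "swapon -s" (or /proc/swaps) output on stdin and print the overall
# percentage of swap in use as a whole number, or "no swap" if none is listed.
# Assumes columns: Filename Type Size Used Priority, with one header line.
# (swap_percent is a hypothetical helper name.)
swap_percent() {
    awk 'NR > 1 && $3 > 0 { size += $3; used += $4 }
         END {
             if (size > 0) printf "%d\n", used * 100 / size
             else print "no swap"
         }'
}

# Example:
#   /sbin/swapon -s | swap_percent
```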

In single user mode, you might want to do a dd with if=/dev/zero and 
of=<your swap partition or file> several times. If your swap is a 
partition, dd should end with an error indicating it couldn't write the 
last block(s); that is expected when it runs off the end of the 
partition. If it's a file, BE SURE TO LIMIT THE NUMBER OF BLOCKS 
WRITTEN with an appropriate count= parameter. Also, use a large block 
size (bs=) to speed things up a bit.

Then REMEMBER TO MKSWAP on it again.
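
For the swap *file* case, the dd step might look roughly like this. Treat it
as a sketch only: zero_fill is a made-up name, the /swapfile path and the
512 MB size in the example are placeholders, and the swapoff before writing
is my own assumption about what you would want even in single user mode.
Double-check the target before running anything like it as root.

```shell
# Overwrite the first SIZE_MB megabytes of TARGET with zeros.
# count= limits how much is written, so a file is not grown past its size;
# conv=notrunc keeps dd from truncating the existing file first.
# Usage: zero_fill TARGET SIZE_MB
# (zero_fill is a hypothetical helper name.)
zero_fill() {
    target=$1
    size_mb=$2
    dd if=/dev/zero of="$target" bs=1048576 count="$size_mb" conv=notrunc 2>/dev/null
}

# Example for a 512 MB swap file (path and size are placeholders):
#   /sbin/swapoff /swapfile
#   zero_fill /swapfile 512
#   /sbin/mkswap /swapfile
#   /sbin/swapon /swapfile
```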


> 
>> As Ken said in the other post, and I suggested in my original reply,
>> memtest that booger. Note if failures are repeatable and how long it
>> takes. Does it take multiple passes (indicating slow overheating)
>> before the errors appear.
> 
> After the above I did get and run memtest for a few minutes with no 
> errors.  Will find a period to run it longer next time.
> 
>>>> <snip>

>> Were all the fans running when you powered it up *before* closing it
>> back up? Did you carefully push each memory stick, PCI card, etc. into
>> the slots? <snip>

>  Fans were running and I pushed and shoved memory and cards pretty well. 
> This computer is in a fairly cold area this time of year about 55 
> degrees today.  During the summer the temperature would often be in the 
> low 80's and no problems at that time.

Hmmm ... it's amazing the way associative memory works in hum0ns. The 
conversational interactions causing recall of things long forgotten.

Speaking of "pushing", I once had an MB that either had a "cold solder 
joint" (less likely in today's environment) or had developed a crack in 
the circuit traces (it was a cheap 2 layer board). Random errors. 
Stymied. Finally noticed it seemed to be temperature related. But I could 
not reliably reproduce it until I thought of the possibility of circuit 
weakness. I opened it up, fired it up, and began gently pushing on 
memory, cards and the MB itself until it crashed and burned again, and 
again, ... repeatable on-demand.

Back then, it was an expensive fix as MBs were *not* cheap.

><snip>

>> have you looked at the boot logs to see if the journaling file system
>> (ext3) reports any oddities?
> 
> No oddities.
> 
>> Another possibility. Weak power. Even with surge suppression strips,
>> there is a possibility some surge weakened the power supply. Or if
>> it's old and just large enough for the load, maybe it got really warm
>> one day and gave up some capacity.
> 
> System is on an APC UPS

That should protect against all but the most violent surges. What 
size is your PS? Does the MB manual state a recommended size? Are the 
rails capable of enough amperage for whatever you have? (We haven't asked 
if this is a Cadillac or Mini Cooper.) The HDs, CDs, etc. should list 
power requirements if you suspect this might be an issue. A good PS will 
list not only max wattage, but also have good capacity on the rails (in 
today's environment, I have no idea what is "sufficient").

> 
>> How old is the unit? <snip>

> Unit built by me in August 2006 and placed in current service in January 
> 2007 where it has performed flawlessly until the last several weeks.
> 
> As an additional note, after running memtest86 for just a few minutes I 
> decided to try to build a 2.6.16.60 kernel figuring if I restarted the 
> "make" enough maybe it would build.  This time make ran straight through 
> with no errors. When the system starts having problems again I'll boot 
> to this new kernel and see what happens.

Golly, that sure makes it sound heat related. Not ambient 
temperature, though that would have an effect, but component 
temperatures. If you have lm-sensors installed, and an instrumented CPU 
(it may be, since it's not that old), you can monitor CPU temperature and 
see if that seems to correlate with problem occurrence. There are also 
some other monitoring packages available you might try.
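
With lm-sensors, a timestamped log makes it easier to line temperatures up
against the kernel messages afterwards. A minimal sketch, assuming the
"sensors" program from lm-sensors is installed and configured; log_temps
and the SENSORS_CMD override are names I invented for illustration.

```shell
# Print COUNT readings, INTERVAL seconds apart, each line prefixed with a
# syslog-style timestamp so it can be compared with kernel log entries.
# SENSORS_CMD is a hypothetical override; by default the lm-sensors
# "sensors" program is run.
# Usage: log_temps INTERVAL COUNT
log_temps() {
    interval=$1
    count=$2
    n=0
    while [ "$n" -lt "$count" ]; do
        stamp=$(date '+%b %d %H:%M:%S')
        ${SENSORS_CMD:-sensors} 2>&1 | sed "s/^/$stamp /"
        n=$((n + 1))
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: one reading every 15 seconds for an hour, saved to a file
# (the file name is a placeholder):
#   log_temps 15 240 > /tmp/temps.log
```

Start it in one terminal, fire up the kernel build in another, and compare
the times when something falls over.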

Most CPU literature will tell you the acceptable operating ranges. If 
it's an Intel CPU, in my experience those tend to run hotter than AMDs 
and use more "juice".

</ end of core dump>  8-O

<snip sig stuff>

HTH
-- 
Bill
-- 
http://linuxfromscratch.org/mailman/listinfo/lfs-support
FAQ: http://www.linuxfromscratch.org/lfs/faq.html
Unsubscribe: See the above information page
