Re: System based on LFS 6.2 has become unstable over time

John McSwain Fri, 22 Feb 2008 16:07:18 -0800

Wit wrote:
> John McSwain wrote:
> <snip>
>
>> Have been trying to compile a new kernel today and I keep getting
>> various errors no matter which kernel.  Just tried to compile the
>> same kernel still in use with the same .config and got errors.
>
> Are they repeatable? That includes same "place", etc? If not, what's
> the nature of these? Kernel panics? Failures in compile, link?


No, it stopped in a different place each time.  Some errors said it was 
a gcc error and to report it as a bug.  Others gave a line where it 
stopped.  I did get this from the kernel one time:

Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000 mapping:00000000 mapcount:-2048 count:0
Feb 22 12:56:39 server kernel: Trying to fix it up, but a reboot is needed
Feb 22 12:56:39 server kernel: Backtrace:
Feb 22 12:56:39 server kernel:  [<c013f529>] bad_page+0x69/0xa0
Feb 22 12:56:39 server kernel:  [<c013fe35>] get_page_from_freelist+0x385/0x3c0
Feb 22 12:56:39 server kernel:  [<c013feda>] __alloc_pages+0x6a/0x310
Feb 22 12:56:39 server kernel:  [<c0148767>] __handle_mm_fault+0x747/0x890
Feb 22 12:56:39 server kernel:  [<c01775ad>] mntput_no_expire+0x2d/0xb0
Feb 22 12:56:39 server kernel:  [<c015b45c>] __fput+0x11c/0x150
Feb 22 12:56:39 server kernel:  [<c0112946>] do_page_fault+0x136/0x63f
Feb 22 12:56:39 server kernel:  [<c0112810>] do_page_fault+0x0/0x63f
Feb 22 12:56:39 server kernel:  [<c010325b>] error_code+0x4f/0x54
Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
Feb 22 12:56:39 server kernel: page:c114d240 flags:0x80000000 mapping:00000000 mapcount:-2048 count:0
Feb 22 12:56:39 server kernel: Trying to fix it up, but a reboot is needed
Feb 22 12:56:39 server kernel: Backtrace:
Feb 22 12:56:39 server kernel:  [<c013f529>] bad_page+0x69/0xa0
Feb 22 12:56:39 server kernel:  [<c013fe35>] get_page_from_freelist+0x385/0x3c0
Feb 22 12:56:39 server kernel:  [<c013feda>] __alloc_pages+0x6a/0x310
Feb 22 12:56:39 server kernel:  [<c0148767>] __handle_mm_fault+0x747/0x890
Feb 22 12:56:39 server kernel:  [<c01775ad>] mntput_no_expire+0x2d/0xb0
Feb 22 12:56:39 server kernel:  [<c015b45c>] __fput+0x11c/0x150
Feb 22 12:56:39 server kernel:  [<c0112946>] do_page_fault+0x136/0x63f
Feb 22 12:56:39 server kernel:  [<c0112810>] do_page_fault+0x0/0x63f
Feb 22 12:56:39 server kernel:  [<c010325b>] error_code+0x4f/0x54
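
To see how often reports like this turn up and which process triggers 
them, the syslog can be scanned for "Bad page state" lines. A minimal 
Python sketch, assuming syslog-style lines like the ones above; the 
function name count_bad_page_reports is only an illustrative choice:

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Feb 22 12:56:39 server kernel: Bad page state in process 'cc1'
BAD_PAGE = re.compile(r"kernel: Bad page state in process '([^']+)'")


def count_bad_page_reports(lines):
    """Count "Bad page state" reports per process name."""
    counts = Counter()
    for line in lines:
        match = BAD_PAGE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It takes any iterable of lines, so an open log file can be passed in 
directly. Where the kernel messages end up depends on how syslog is 
configured on the system.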

>
> As Ken said in the other post, and I suggested in my original reply,
> memtest that booger. Note if failures are repeatable and how long it
> takes. Does it take multiple passes (indicating slow overheating)
> before the errors appear?

After the above I did get and run memtest for a few minutes with no 
errors.  Will find a period to run it longer next time.

>
>>
>>> I still think it's hardware *unless* there have been recent changes
>>> in software. Could even be file system corruption exposing a bug in
>>> apps/kernel by corrupting data or code.
>>
>> I'm beginning to think it is hardware also.  I opened the case this
>> morning, cleaned and checked but found nothing.
>
> Were all the fans running when you powered it up *before* closing it
> back up? Did you carefully push each memory stick, PCI card, etc. into
> the slots? Thermal expansion and contraction over extended periods can
> cause components to "walk" up out of fully seated position. Has the
> unit been recently moved? With today's cases, there can be a fair amount
> of flex that causes cards to tilt in the slot. I have one unit that
> I've learned to push the video card down after any handling. 'Course,
> I sometimes go on a tear and move a lot of things a lot. And set them
> on precarious perches, off level, other items on top,... well I *do*
> know better.
>
Fans were running and I pushed and shoved memory and cards pretty well. 
This computer is in a fairly cold area this time of year, about 55 
degrees today.  During the summer the temperature would often be in the 
low 80s with no problems at that time.

>>
>>> I wonder how long since fsck has been run. If the drives are "smart"
>>> capable, maybe any problems there can be seen.

I did a shutdown -r -F to force fsck.   No problems.

>>
>> I'm running ext3 on all drives.  The drives are smart capable and I
>> looked in proc but didn't really know what I was seeing.  Nothing
>> stood out at me.
>
> Have you looked at the boot logs to see if the journaling file system
> (ext3) reports any oddities?

No oddities.

>
> Another possibility. Weak power. Even with surge suppression strips,
> there is a possibility some surge weakened the power supply. Or if
> it's old and just large enough for the load, maybe it got really warm
> one day and gave up some capacity.

System is on an APC UPS.

> How old is the unit? Unbeknown to most, electronics used to "age" and
> eventually die or become extremely "flaky". I don't know if they still
> tend to do that (haven't had the desire to keep up) but on an older
> box...

Unit built by me in August 2006 and placed in current service in January 
2007 where it has performed flawlessly until the last several weeks.

As an additional note, after running memtest86 for just a few minutes I 
decided to try to build a 2.6.16.60 kernel, figuring if I restarted the 
"make" enough maybe it would build.  This time make ran straight through 
with no errors. When the system starts having problems again I'll boot 
to this new kernel and see what happens.
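
Restarting make until it gets through can be made a bit more systematic 
by recording how each attempt ends, which goes some way toward answering 
the question of whether the failures are repeatable. A rough Python 
sketch; the helper name run_repeatedly and the attempt count are 
assumptions for illustration:

```python
import subprocess


def run_repeatedly(command, attempts=5):
    """Run the same command several times and return each exit status."""
    statuses = []
    for _ in range(attempts):
        # Output stays attached to the terminal so compiler errors remain visible.
        result = subprocess.run(command)
        statuses.append(result.returncode)
    return statuses


# Example use from the top of a kernel source tree (not executed here):
#   statuses = run_repeatedly(["make"], attempts=5)
#   print(statuses)
```

A nonzero status marks a failed attempt. Since make resumes where it 
left off, a "make clean" between attempts would be needed to compare 
full builds against each other.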

>
>>
>>> --
>>> Wit

-- 
John McSwain
http://www.lakemcgregor.com/ 



-- 
http://linuxfromscratch.org/mailman/listinfo/lfs-support
FAQ: http://www.linuxfromscratch.org/lfs/faq.html
Unsubscribe: See the above information page
