Re: System based on LFS 6.2 has become unstable over time

John McSwain Sat, 23 Feb 2008 06:19:03 -0800

Wit wrote:
> John McSwain wrote:
>> Wit wrote:
<snip>
> Well, it could still be a kernel problem, but very often this would
> tend to indicate an *intermittent* hardware problem. We can't rule
> out the build utilities, but that's a very low probability, IMO.


I'm leaning more and more toward a hardware problem. This machine runs
24 hours a day and ran fine until the weather cooled off, which makes me
think metal contraction is pulling something loose.

<snip>
>
> Looks like trouble in memory management land. Could be kernel bug.
> Could be hardware MM fault - Translation Lookaside Buffer (TLB), page
> tables, swap related, on-board cache (CPU or mainboard), bad memory,
> power supply sized just a tad too small (I had a nice case of this
> when I once put a new EPOX MB into a chassis I originally built
> on-the-cheap for an Acer using a PS large enough for the Acer.
> Random errors, apparently memory or CPU related abounded. Thanks to
> my background, I immediately whipped out the Epox user manual and saw
> a nice little phrase about minimum and recommended PS capacity -
> which happened to be about 50 watts more than the Acer, *at a
> minimum*. A nice new 575 watt with really good capacities on all the
> rails eliminated all randomness! :-O  )

This machine has an AMD Sempron 1.6G on an ECS755-A2 with 1024 MB RAM, an
80 GB SATA drive (main operating system and swap space) and a 160 GB IDE
drive (home directories and swap space). It has a floppy and CD, uses the
MB network connection, and has an Enermax EG495P-UE PS, which should be
sufficient. It has an old, simple video card of some type, as I rarely do
anything with it from the monitor. This machine is a server for my
network that sits in the DMZ, providing mail, web server, SSH, NTP, etc.

>
> If the problem occurs again and the new kernel exhibits similar
> symptoms, you could selectively disable cache in the BIOS to try and
> see if that makes a difference. Things may run noticeably slower.
>
> Do you have a swap space? "swapon -s" will show size and usage. Might
> want to do something like this after firing up a compile again.
>
>     while [ 1 ] ; do
>        /sbin/swapon -s
>        sleep 15
>     done
>
> It can be interrupted with a <CTL-C>.
>
> If you see swap usage grow to very near or exceed capacity, this may
> be related, although the kernel is supposed to take some preemptive
> action to prevent catastrophic results from this cause. If you have a
> bad spot in the swap space, it could also cause this. However, with a
> year old unit, we should be able to presume the HDs will properly
> handle soft-error recovery and do alternate block assignment on hard
> errors, eliminating that particular spot as a problem on the next
> attempt to use it. Often many spots will occur at the same time and
> it may take a lot of cycles for them to finally be re-mapped and
> cease causing problems.
>
> In single user mode, you might want to do a dd with if=/dev/zero and
> of=<your swap partition or file> several times. If your swap is a
> partition, it should end with an error indicating it couldn't write
> the last block(s). If it's a file, BE SURE TO LIMIT THE NUMBER OF
> BLOCKS WRITTEN with an appropriate count= parameter. Also, use a
> block size that is large to speed things up a bit.
>
> Then REMEMBER TO MKSWAP on it again.
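
A rough sketch of the zero-fill procedure quoted above, for the swap
*file* case only. The helper name swap_scrub_plan is hypothetical, and it
only prints the commands instead of running them. It derives count= from
the file's current size so dd cannot write past the end, and adds
conv=notrunc (a standard dd option) so the file keeps its length. The
1 MiB default block size is an assumption, not something from the thread.

```shell
# Hypothetical helper: print (do not run) the commands for zero-filling a
# swap FILE and re-initialising it. Review the output before running any
# of it as root in single user mode.
swap_scrub_plan() {
    target=$1              # path to the swap file
    bs=${2:-1048576}       # block size in bytes; 1 MiB is an assumed default

    # Refuse anything that is not a regular file; a partition does not
    # need count= because dd stops at the end of the device.
    [ -f "$target" ] || { echo "not a regular file: $target" >&2; return 1; }

    size=$(wc -c < "$target")      # current size of the swap file in bytes
    count=$(( size / bs ))         # whole blocks only; any remainder is skipped

    echo "swapoff $target"
    echo "dd if=/dev/zero of=$target bs=$bs count=$count conv=notrunc"
    echo "mkswap $target"
    echo "swapon $target"
}
```

Calling it as `swap_scrub_plan /path/to/swapfile` prints four lines that
can be checked and then run by hand, repeating the dd line as many times
as wanted before the mkswap.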

Will do, thanks for above suggestions.
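
For the swap-watching loop suggested above, a small sketch that boils the
"swapon -s" table down to a single percentage, which is easier to log
during a long compile. The function name swap_used_pct is hypothetical;
it assumes the usual header line followed by one line per swap area with
Size in column 3 and Used in column 4, as "swapon -s" and /proc/swaps
print them.

```shell
# Hypothetical helper: read a "swapon -s" style table on stdin and print
# total swap in use as a whole-number percentage across all swap areas.
swap_used_pct() {
    awk 'NR > 1 { size += $3; used += $4 }   # skip the header line
         END {
             if (size > 0) printf "%d\n", used * 100 / size
             else          print 0            # no swap configured
         }'
}

# Example use (not run here), interrupted with Ctrl-C like the original:
#   while true ; do
#       echo "$(date '+%H:%M:%S') swap used: $(/sbin/swapon -s | swap_used_pct)%"
#       sleep 15
#   done
```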

<snip>

> Back then, it was an expensive fix as MBs were *not* cheap.

I'm thinking it could easily be an MB problem. I've used AMD processors
in the past, and I had another AMD machine that I always had problems
building LFS on: each compile would stop with an error but would
eventually finish. My own impression is that the problem with AMD is
not the CPU but the chipsets that others build for them. I've never had
any problem on an Intel chipset MB.

<snip>
> Golly, that sure makes it sound like heat related. Not ambient
> temperature, which would have an effect, but component temperatures.
> If you have lm-sensors installed, and an instrumented CPU (may be
> since it's not that old), you can monitor CPU temperature and see if
> that seems to correlate to problem occurrence. There are also some
> other monitoring packages available you might try.
>
> Most CPU literature will tell the acceptable operating ranges. If it's
> an Intel CPU, they run hotter than AMDs and use more "juice".
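
One way to sketch the temperature correlation idea above without parsing
"sensors" output, whose labels differ from chip to chip: kernels with
hwmon support expose readings as plain files holding millidegrees
Celsius. The helper names and the example path below are assumptions;
the actual hwmon directory and temp number depend on the board and on
which sensor driver is loaded.

```shell
# Hypothetical helpers for logging a hwmon-style temperature file.

# Print the reading in whole degrees Celsius. The file named by $1 is
# expected to hold millidegrees, e.g. 45000 for 45 C.
read_temp_c() {
    raw=$(cat "$1") || return 1
    echo $(( raw / 1000 ))
}

# Succeed (exit status 0) if the reading is at or above the limit $2,
# given in whole degrees Celsius.
temp_over_limit() {
    [ "$(read_temp_c "$1")" -ge "$2" ]
}

# Example use (not run here); the path is an assumed example only:
#   f=/sys/class/hwmon/hwmon0/temp1_input
#   while true ; do
#       echo "$(date '+%F %T') $(read_temp_c "$f") C"
#       sleep 15
#   done
```

Timestamped lines like these can be compared against the times of the
failures to see whether the two line up.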

It may be heat, but this thing is as cool as it can be. The case is a very
large server case and it's pretty darn empty. The PS sits at the top
pulling air out, and I have a case fan at the bottom that pulls air in.
Nothing inside feels hot at all.

Thanks, everyone, for all your help. You have given me some areas to
follow up on. I'd like to think that the problem will get no worse and
I can live with it like this for a while. But maybe if it gets worse I
can find the problem. Worst case scenario, I build another unit or first
replace the MB and processor in this unit. Not exactly what I want to
do though!

-- 
John McSwain
http://www.lakemcgregor.com/ 



