neysx       05/07/28 13:21:06

  Added:       xml/htdocs/doc/en/articles hardware-stability-p1.xml
                        hardware-stability-p2.xml
  Log:
  #100524 xmlified hardware stability articles

Revision  Changes    Path
1.1                  xml/htdocs/doc/en/articles/hardware-stability-p1.xml

file : 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: hardware-stability-p1.xml
===================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<!-- $Header: 
/var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p1.xml,v 1.1 
2005/07/28 13:21:06 neysx Exp $ -->

<guide link="/doc/en/articles/hardware-stability-p1.xml" lang="en">
<title>Linux hardware stability guide, Part 1</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Joshua Saddler</mail>
</author>

<abstract>
In this article, Daniel Robbins shows you how to diagnose and fix CPU
flakiness, as well as how to test your RAM for defects. By the end of this
article, you'll have the skills to ensure that your Linux system is as stable
as it possibly can be.
</abstract>

<!-- The original version of this article was first published on IBM 
developerWorks, and is property of Westtech Information Services. This 
document is an updated version of the original article, and contains
various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-28</date>

<chapter>
<title>CPU troubleshooting</title>
<section>
<body>

<note>
The original version of this article was first published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team.
</note>

<p>
Many of us in the Linux world have been bitten by nasty hardware problems.  How
many of us have set up a Linux box, installed our favorite distribution,
compiled and installed some additional apps, and gotten everything working
perfectly only to find that our new system has an (argh!) fatal hardware bug?
Whether the symptoms are random segmentation faults, data corruption, hard
locks, or lost data is irrelevant -- the hardware glitch effectively makes our
normally reliable Linux operating system barely able to stay afloat. In this
article, we'll take an in-depth look at how to detect flaky CPUs and RAM --
allowing you to replace the defective parts before they do some serious 
damage.
</p>

<p>
If you're experiencing instability problems and suspect they are hardware
related, I encourage you to test both your CPU and memory to ensure that
they're working OK. However, even if you haven't experienced these problems,
it's still a good idea to perform these CPU and memory tests. In doing so, you
may detect a hardware problem that could have bitten you at an inopportune
time, something that could have caused data loss or hours of frustration in a
frantic search for the source of the problem. The proper, proactive application
of these techniques can help you to avoid a lot of headaches, and if your
system passes the tests, you'll have the peace of mind that your system is up
to spec.
</p>

</body>
</section>
<section>
<title>CPU issues</title>
<body>

<p>
If you have a horribly defective CPU, your machine may be unable to boot Linux
or may only run for a few minutes before locking up. CPUs in this ragged state
are easy to diagnose as defective because the symptoms are so obvious. But
there are more subtle CPU defects that aren't so easy to detect; generally, the
less obvious errors are the ones that cause machines to either lock up every
now and then for no apparent reason, or cause certain processes to die
unexpectedly. Most CPU instabilities can be triggered by "exercising" the CPU
-- giving it a bunch of work to do, causing it to heat up and possibly flake
out.  Let's look at some ways to stress-test the CPU.
</p>

<p>
You may be surprised to hear that one of the best tests of CPU stability is
built into Linux -- the kernel compile. The gcc compiler is a great tool for
testing general CPU stability, and a kernel build uses gcc a whole lot. By
creating and running the following script from your <path>/usr/src/linux</path>
directory, you can give your machine an industrial-strength kernel compile
stress test:
</p>

<pre caption="The cpubuild script">
#!/bin/bash
make dep
while [ "foo" = "foo" ]
do
  make clean
  make -j2 bzImage
  if [ $? -ne 0 ]
  then
    echo OUCH OUCH OUCH OUCH
    exit 1
  fi
done
</pre>

<p>
You'll notice that this script <e>repeatedly</e> compiles the kernel.  The
reason for this is simple -- some CPUs have intermittent glitches, allowing
them to compile the kernel perfectly 95% of the time, but causing the kernel
compile to bomb out every now and then. Often this is because it takes five or
more kernel compiles before the processor heats up to the point where it
becomes unstable.
</p>

<p>
In the above script, make sure to adjust the <c>-j</c> option so that the
number following it is one greater than the number of CPUs in your system; in
other words, use "2" for uniprocessors, "3" for dual-processors, etc. The
<c>-j</c> option tells <c>make</c> to build the kernel in parallel, ensuring
that there's always at least one gcc process on deck after each source file is
compiled -- ensuring that the stress on your CPU is maximized. If your Linux
box is going to be unused for the afternoon, go ahead and run this script, and
let the machine recompile the kernel for a few hours.
</p>
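<p>
The "CPUs plus one" rule above can be computed rather than hard-coded. A
minimal sketch, assuming the coreutils <c>nproc</c> command is available (this
helper is not part of the original script):
</p>

<pre caption="Computing the -j value automatically (illustrative)">
#!/bin/bash
# Illustrative only: derive "number of CPUs plus one" for make -j.
# Falls back to 1 CPU if nproc is unavailable.
CPUS=$(nproc 2>/dev/null || echo 1)
JOBS=$((CPUS + 1))
echo "Use: make -j${JOBS} bzImage"
</pre>

<p>
You could substitute the computed value into the cpubuild script's
<c>make -j2 bzImage</c> line so the same script stresses uniprocessor and SMP
boxes alike.
</p>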

</body>
</section>
<section>
<title>Possible CPU problems</title>
<body>

<p>
If the script runs perfectly for several hours, congratulations! Your CPU has
passed the first test. However, it's possible that the above script dies
unexpectedly. How do you know you're having a CPU problem as opposed to
something else? Well, if gcc spat out an error like this, then there's a very
good possibility that your CPU is defective:
</p>

<pre caption="GCC error">
gcc: Internal compiler error: program cc1 got fatal signal 11
</pre>

<p>
At this point, there are three likely explanations for the state of your CPU:
</p>

<ul>
  <li>
    If you type <c>make bzImage</c> to resume the kernel compilation, and the
    compiler dies on the exact same file, keep typing <c>make bzImage</c> over
    and over again. If after about ten tries the build process continues to die
    on this particular file, then the problem is most likely caused by a (rare)
    gcc compiler bug that's being triggered by this particular source file,
    rather than a flaky CPU. However, these days, gcc is quite stable, so this
    isn't likely to happen.
  </li>
  <li>
    If you type <c>make bzImage</c> to resume kernel compilation, and you get
    another signal 11 a little bit later, then your CPU is most likely on its
    last legs.
  </li>
  <li>
    If you type <c>make bzImage</c> to resume kernel compilation and the kernel
    compiles successfully, this doesn't mean that your CPU is OK. Normally,
    this means that your CPU glitch only shows up every now and then, typically
    only when the CPU rises above a certain temperature (a CPU gets hotter
    when it is used for an extended period of time, and may take several
    kernel compiles to reach that critical point).
  </li>
</ul>
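<p>
The repeated <c>make bzImage</c> test in the first item can be scripted rather
than typed by hand. A minimal sketch (the <c>retry_build</c> helper name is my
own invention, not part of any standard tool); substitute the real build
command for <c>true</c>:
</p>

<pre caption="Automating the resume-and-retry test (illustrative)">
#!/bin/bash
# Illustrative only: run a command up to ten times and count failures.
# Failing every time suggests a reproducible gcc/source problem;
# failing only sometimes suggests flaky hardware.
retry_build() {
  failures=0
  for i in 1 2 3 4 5 6 7 8 9 10
  do
    if ! "$@"
    then
      failures=$((failures + 1))
    fi
  done
  echo "failures: ${failures}/10"
}

retry_build true    # on a real system: retry_build make bzImage
</pre>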

</body>
</section>
<section>
<title>Rescuing your CPU</title>
<body>

<p>
If your CPU is experiencing random intermittent errors when placed under heavy
load, it's possible that your CPU isn't defective at all -- maybe it simply
isn't being cooled properly. Here are some things that you can check:
</p>

<ul>
  <li>Is your CPU fan plugged in?</li>
  <li>Is it relatively dust-free?</li>
  <li>
    Does the fan actually spin (and spin at the proper speed) when the power is
    on?
  </li>
  <li>Is the heat sink seated properly on the CPU?</li>
  <li>Is there thermal grease between the CPU and the heat sink?</li>
  <li>Does your case have adequate ventilation?</li>
</ul>
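<p>
On a running system, some of these checks can be spot-checked without opening
the case, via the kernel's hardware-monitoring interface. A minimal sketch,
assuming a kernel that exposes sensors under <path>/sys/class/hwmon</path>
(exact paths and available sensors vary by motherboard and loaded drivers):
</p>

<pre caption="Reading temperatures and fan speeds from sysfs (illustrative)">
#!/bin/bash
# Illustrative only: print every temperature (millidegrees Celsius)
# and fan speed (RPM) input the kernel exposes via hwmon.
found=0
for f in /sys/class/hwmon/hwmon*/temp*_input /sys/class/hwmon/hwmon*/fan*_input
do
  if [ -r "$f" ]
  then
    found=1
    echo "$f: $(cat "$f")"
  fi
done
if [ "$found" -eq 0 ]
then
  echo "No hwmon sensors visible; load an appropriate sensor driver"
fi
</pre>

<p>
A fan input that reads zero while the system is powered on is a strong hint
that the fan on the checklist above has failed or is unplugged.
</p>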



1.1                  xml/htdocs/doc/en/articles/hardware-stability-p2.xml

file : 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: hardware-stability-p2.xml
===================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<!-- $Header: 
/var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p2.xml,v 1.1 
2005/07/28 13:21:06 neysx Exp $ -->

<guide link="/doc/en/articles/hardware-stability-p2.xml" lang="en">
<title>Linux hardware stability guide, Part 2</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Joshua Saddler</mail>
</author>

<abstract>
In this article, Daniel Robbins shares his experiences in getting his NVIDIA
TNT graphics card working under Linux using NVIDIA's accelerated drivers. As he
does, he'll show you how to diagnose and fix IRQ and PCI latency timer issues
-- techniques you can use to ensure that your systems don't experience
lock-ups, inconsistent behavior, or data loss.
</abstract>

<!-- The original version of this article was first published on IBM 
developerWorks, and is property of Westtech Information Services. This 
document is an updated version of the original article, and contains
various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-28</date>

<chapter>
<title>Drivers</title>
<section>
<title>The many causes of instability</title>
<body>

<note>
The original version of this article was first published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team.
</note>

<p>
A stability problem is often not caused by defective hardware, but by improper
hardware configuration or flaky drivers. My experience in this area began when
I tried to get Linux working on my Diamond Viper V550, an NVIDIA TNT-based AGP
card, using NVIDIA's own accelerated drivers.
</p>

<p>
NVIDIA has their own display drivers for Linux, a collaboration between NVIDIA,
SGI, and VA Linux. These drivers have a lot of advantages over the standard
2D-only NVIDIA drivers included with XFree86 4.0. For one, they have full
accelerated 3D support. Even better, they feature an official OpenGL 1.2
implementation, rather than just an enhanced version of Mesa. So, all in all,
these accelerated drivers are the ones you want to be using if you own an
NVIDIA-based graphics card, at least in theory.  My attempt to get them working
properly turned out to be an excellent learning experience, to say the least.
</p>

<p>
After I installed the accelerated Linux NVIDIA drivers (see <uri
link="#resources">Resources</uri> later in this article), I started up Xfree86
and began playing around with all my 3D applications, now wonderfully
accelerated as they should be. Up until that point, I had had to reboot into
Windows NT in order to take advantage of 3D acceleration. Now, while I don't
mind NT, having to reboot to use 3D apps was somewhat annoying, and I was glad
to have one less reason to leave Linux and reboot my machine. However, after
playing around for an hour or so, I experienced a fatal setback to my Linux 3D
aspirations -- my machine locked up. My mouse simply stopped moving and the
screen froze, and I had to reboot my system.
</p>

<p>
Yes, I was having some kind of stability problem. But I didn't know exactly
what was causing the problem. Did I have flaky hardware, or was the card
misconfigured? Or maybe it was a problem with the driver -- did it not like my
VIA KT133-based Athlon motherboard? Whatever the problem, I wanted to resolve
it quickly. In this article, I'm going to share with you the procedure that I
went through to fix my hardware stability problem. Although you may not be
struggling with exactly the same issue, the steps that I used to diagnose and
(mostly) fix the problem are general in nature and applicable to many different
types of Linux hardware problems.
</p>

</body>
</section>
<section>
<title>First, the hardware</title>
<body>

<p>
The first thing that crossed my mind was that I might have flaky or
under-cooled hardware. On the one hand, my Diamond Viper V550 seemed to have no
problems under Windows NT. On the other hand, maybe Linux was somehow pushing
the chip harder and triggering heat-related lock-ups. My V550 did get
<e>extremely</e> hot, and its OEM heatsink seemed at best barely adequate. The
combination of the lock-ups and the fact that this card was being marginally
cooled convinced me to head over to PC Power and Cooling (see <uri
link="#resources">Resources</uri>) to purchase a mini integrated heatsink/fan
for my V550.
</p>

<p>
So, I received my Video Cool, popped off the OEM heatsink on the video card
(voiding the warranty), cleaned off the TNT chip and affixed the Video Cool to
the top of the chip. Verdict? My video card didn't get extremely hot anymore,
but the lockups continued. The lesson I learned from this particular experience
is this -- if you ensure that your system is adequately cooled to begin with,
you'll never need to worry about components malfunctioning due to inadequate
cooling. This in itself is a good reason to invest some time and effort in
making sure that your workstations and servers run coolly. Now that I had taken
care of the heat issue, I knew that the lock-ups were most likely not due to
flaky hardware, and I began to look elsewhere.
</p>

</body>
</section>
<section>
<title>New drivers -- and a possible solution?</title>
<body>

<p>
I partly suspected that NVIDIA's drivers were themselves the cause of the
problem. Fortunately, a new version of the drivers had just been released, so
I immediately upgraded in the hope that this would solve my stability problem.
Unfortunately, it didn't, and after checking with others on the #nvidia channel
on openprojects.net, I found out that while not everyone was able to get the
driver to operate stably, it was working for a lot of people.
</p>

<p>
On #nvidia, someone suggested that I make sure that the V550 wasn't sharing an
IRQ with another card. Unlike the standard XFree86 driver, the accelerated
NVIDIA driver requires an IRQ for proper operation. To see if it had its own
dedicated IRQ, I typed <c>cat /proc/interrupts</c>, and lo and behold, my V550
was sharing an interrupt with my IDE controller. Before I explain how I solved
this particular problem, I'd like to give you a brief background on IRQs.
</p>

<p>
PCs use IRQs, and hardware interrupts in general, to allow peripheral devices,
such as the video card and the disk controllers, to signal the CPU that they
have data that's ready to be processed. In the old days before the PCI bus
existed, it was critical that each device in the machine had its own, dedicated
IRQ. If you are still using ISA peripherals in your machine, this remains true
-- all non-PCI devices should have their own dedicated IRQ.
</p>

</body>
</section>
</chapter>

<chapter>
<title>IRQs</title>
<section>
<title>IRQs and PCI</title>
<body>

<p>
However, things are a little different with the PCI bus. PCI allocates four
IRQs that can be used by the PCI/AGP cards in your system. In general, these
IRQs <e>can</e> be shared among multiple devices. (If you do this, make sure
that all the devices doing the sharing are PCI and AGP devices.) IRQ sharing is
important, especially for modern machines that may have five PCI slots and one
AGP slot. Without IRQ sharing, you would be unable to have more than four IRQ-using
cards in your system.
</p>

<p>
There are, however, some limitations to PCI IRQ sharing. While modern
motherboards' BIOS and Linux kernels generally support PCI IRQ sharing, certain
PCI cards may simply refuse to work properly when sharing an IRQ with another
device. If you're experiencing random system lockups, especially lockups that
appear to be correlated with the use of a specific hardware device, you may
want to try and get all your PCI devices to use their own IRQs, just to be on
the safe side. The first step is to see if any devices in your system are
sharing IRQs to begin with. To do this, follow these steps:
</p>

<ol>
  <li>
    Exercise the various hardware devices in your system, such as disk, sound,
    video, SCSI, etc. This ensures that the Linux kernel has handled interrupts
    for each of these devices, so that they show up in the next step.
  </li>
  <li>
    Run <c>cat /proc/interrupts</c> to display a list and count of all
    interrupts that the Linux kernel has handled so far. Look in the far right
    column in this list. If two or more devices are listed in a single row,
    then they're sharing that particular IRQ.
  </li>
</ol>
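<p>
The check in step 2 can be automated: any row of <path>/proc/interrupts</path>
whose device column contains a comma is a shared IRQ. The sketch below runs an
<c>awk</c> filter over inlined sample data so the logic is visible everywhere;
on a live system, pipe <c>cat /proc/interrupts</c> into the same <c>awk</c>
command (sample device names are hypothetical):
</p>

<pre caption="Flagging shared IRQs (illustrative)">
#!/bin/bash
# Illustrative only: flag IRQs shared by two or more devices.
# Device names at the end of each /proc/interrupts row are
# comma-separated when an IRQ is shared.
printf '%s\n' \
  "  9:    1043   XT-PIC  ide0" \
  " 11:  904387   XT-PIC  nvidia, ide1" |
awk '/,/ { irq = $1; sub(":", "", irq); print "IRQ " irq " is shared" }'
</pre>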

<p>
If one of the devices in question is a non-PCI device (ISA or other legacy
cards) then you've found yourself an IRQ conflict, which you can attempt to fix
with your BIOS, the isapnptools package, or the physical jumpers on your


