Re: [CentOS] Server hangs on CentOS 5.5

2011-03-11 Thread Alexander Arlt
On 03/11/2011 03:03 AM, Nico Kadel-Garcia wrote:
 On Thu, Mar 10, 2011 at 6:49 PM, B.J. McClure
 keepert...@bellsouth.net wrote:

 On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:

 On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager ea...@eagerm.com
 wrote:

 Previous cleanings have been with canned compressed air. Thanks
 for the caution about vacuums and static.  I may use the vacuum
 on the case fans from the outside.  The case should provide an
 adequate static shield.

 I've had good results with a damp, soft cloth or Q-tip with
 distilled water for awkward bits and filters, and that cloth for
 the case itself. It also looks noticeably newer, which helps with
 walking investors through a small machine room.

 I must respectfully disagree with any application of water,
 distilled or otherwise, to things electronic.  I was taught in the
 Navy, and my engineering career has confirmed, that cleaning of
 electronic components should be done with low pressure, dried,
 compressed air.  50 psi max.  If some solvent must be used, try
 alcohol.  Evaporates quickly, leaves no residue and has an affinity
 for water.

 Typical drug-store alcohol is rubbing alcohol, and is 30% water.

 I designed medical electronics for a dozen years. Alcohol has its
 uses, but water is much cheaper, safer, and you don't have fumes to
 deal with. Shall we discuss the effectiveness of surface etch resist
 and cladding in protecting circuit boards from damage, and the
 effects of alcohol on low cost electronic sockets?

I agree with Nico. I have been working for a large PC manufacturer in
Europe for many years, and alcohol was never a good idea for cleaning
PCBs, neither in production nor in the field.

We used either trichloroethane or trichlorotrifluoroethane for washing
and cleaning mainboards (which became a bit unpopular due to their
effects on the ozone layer...) or water-based cleaning fluids
(aka 'water'). But that was only in the production process of the PCBs,
almost never in the field, except when real repairs on the mainboard had
to be done on site (soldering).

Yes, it can be true with 'navy-strength' electronics that you actually
can use alcohol for cleaning electronic boards, but with
low-cost electronics it's a total no-go, because it dissolves the
coating of the PCBs and most often harms - as Nico wrote - the sockets
and chip packages. We're talking about low-cost electronics here...

That said, when cleaning machines in the field, I very rarely used
anything other than compressed air. Actually, I would advise
everyone not to clean the inside of a box with any kind of fluid, since
it won't do anything positive besides changing the looks.
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-11 Thread Simon Matter
 That said, when cleaning machines in the field, I very rarely used
 anything other than compressed air. Actually, I would advise
 everyone not to clean the inside of a box with any kind of fluid, since
 it won't do anything positive besides changing the looks.

After decades in the high-precision and electronics industry, I can tell
you for sure that compressed air is not seen as a good choice. It blows
the dust where it doesn't belong. That may not be a big problem with a
cheap PC, but it's not professional at all.

If you want to do it the professional way, go to an ESD-protected room,
take an ESD vac and an ESD brush, wear your ESD shoes and wrist strap, and
clean *carefully*. Compressed air may additionally be used in certain
places, but nothing more.

Simon



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-11 Thread Nico Kadel-Garcia
On Fri, Mar 11, 2011 at 8:51 AM, Simon Matter simon.mat...@invoca.ch wrote:

 After decades in the high precision and electronics industry, I can tell
 you for sure that compressed air is not seen as a good choice. It blows
 the dust where it doesn't belong. That may not be a big problem with a
 cheap PC, but it's not professional at all.

 If you want to do it the professional way, go to an ESD protected room,
 take an ESD vac and an ESD brush, wear your ESD shoes and wrist strap, and
 clean *carefully*. Compressed air may additionally be used in certain
 places, but not more.

 Simon

Is it worth a discussion here of overall PC manufacturing safety tips?
It's not really CentOS specific, but it is interesting.


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread John Hodrien
On Wed, 9 Mar 2011, compdoc wrote:

 +36C and +39C are likely your CPU and motherboard temps. You have to look at
 the temps in the CMOS setup and match them.

 The +87C is likely just a misreading by lm_sensors. Anything running that
 hot won't be stable.

In testing nVidia graphics cards to destruction (not entirely deliberately) we
found that anything up to about 110C was likely to work fine, anything past
that was likely to cause visual corruption.  Anything past 125C was pretty
much guaranteed to cause permanent damage.

But you're right, I doubt that's correct, and lm_sensors is prone to reporting
duff information.  AMD list 70C as the max recommended for that chip.  In the
past it'd also depend a lot on where the temperature probe was (so varied a
lot motherboard by motherboard), but they're on package now aren't they?

jh


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Simon Matter
 compdoc wrote:
 According to the man page, it apparently needs a kernel driver
 named OpenIPMI, which it claims is installed in standard
 distributions.  I don't find it on my system.


 lm_sensors is another, and I think installs ready to use from the repos.

 sensors says that the three temp sensors read +36C, +39C, and +87C.
 These appear to be AMD K10 temp sensors, although I might be
 misreading sensors-detect.  Low/highs are (+127/+127, +127/+90,
 +127/+127) respectively.  (I'm not sure if these are alarm set
 points or something else.)

 One fan is listed as 0 rpm.   Something to look into.

Hmm, much has been said now in this thread and I know how difficult it can
be to find such an issue. However, I suggest not throwing in too many new
tools in parallel. And be careful how you interpret any information
gathered by tools like lm_sensors. They can only report as well as the
mainboard and its sensors were designed and built, and both can be
suboptimal. I've seen all kinds of things, like temp sensors not mounted
where they should be. Of course, built-in sensors like those of a CPU should
be taken very seriously.

So, may I give some more tips on how I'd try to find what is wrong:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can
really do bad things because it is not a perfect insulator.
- If you feel you have to remove any device like the CPU, make sure you up
everything, have a good quality heat sink paste at hand and make sure
everything is seated well after mounting it again.
- For the memory part, do you have ECC? If not, that is a real
problem: if the box is used as a server, ECC is a must. If yes, then
most errors will be corrected by ECC and, more importantly, memory
errors are usually logged. You should be able to find a list of those
errors in the BIOS, where you may see how many times errors occurred and
where. Does something like that exist?
- For the temperatures, 87C is not so uncommon, but yes, it looks a little
bit high. Someone else posted 80C as the max for your CPU. That seems
correct; at least our 12-core Opterons have Caution: 75C; Critical: 80C,
but they usually run at 45C-55C under normal load. So if 87C is really
correct under normal load, that may already be too much, and then
consider what happens at peak times.
- When you look at the lm_sensors values, do they correspond with what is
shown in the BIOS (if it has this kind of diagnostics)?
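
To make the comparison in the last two tips less error-prone, the readings can be pulled out of the sensors output and checked against limits. The following is only a rough Python sketch: the line layout it expects ("temp1:  +87.0°C") is an assumption, since the real output differs between chip drivers, and the 75C/80C limits are simply the Opteron figures quoted above, not values for any particular CPU.

```python
import re

# Assumed layout of a temperature line, e.g. "temp1:  +87.0°C  (high = +70.0°C)".
# Real `sensors` output varies by driver, so adjust the pattern as needed.
READING = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C")


def parse_temps(text):
    """Return {label: degrees C} for every line that looks like a temperature."""
    temps = {}
    for line in text.splitlines():
        match = READING.match(line.strip())
        if match:
            temps[match.group("label").strip()] = float(match.group("temp"))
    return temps


def classify(temp_c, caution=75.0, critical=80.0):
    """Compare one reading with the (assumed) caution and critical limits."""
    if temp_c >= critical:
        return "critical"
    if temp_c >= caution:
        return "caution"
    return "ok"


# Hypothetical sample built from the three values reported in this thread.
SAMPLE = """\
temp1:  +36.0°C
temp2:  +39.0°C
temp3:  +87.0°C
"""

for label, temp in parse_temps(SAMPLE).items():
    print(label, temp, classify(temp))
```

A reading flagged here still has to be checked against the BIOS display, because a badly placed or unconnected sensor produces the same output as a genuinely hot part.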

Simon



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread John Hodrien
On Thu, 10 Mar 2011, Simon Matter wrote:

 - Take a vacuum cleaner and *carefully* clean the whole box. Dust can
 really do bad things because it is not a perfect insulator.

Take the wrong vacuum cleaner and static your machine to death.

jh


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Rudi Ahlers
On Thu, Mar 10, 2011 at 12:10 PM, John Hodrien j.h.hodr...@leeds.ac.uk wrote:
 On Thu, 10 Mar 2011, Simon Matter wrote:

 - Take a vacuum cleaner and *carefully* clean the whole box. Dust can
 really do bad things because it is not a perfect insulator.

 Take the wrong vacuum cleaner and static your machine to death.

 jh



I prefer to use a dust blower instead. It doesn't risk pulling loose
components with dry or loose solder joints.


-- 
Kind Regards
Rudi Ahlers
SoftDux

Website: http://www.SoftDux.com
Technical Blog: http://Blog.SoftDux.com
Office: 087 805 9573
Cell: 082 554 7532


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Alexander Arlt
On 03/10/2011 11:04 AM, Simon Matter wrote:
 - Take a vacuum cleaner and *carefully* clean the whole box. Dust can
 really do bad things because it is not a perfect insulator.

Never ever do that. Especially not inside the machine. There is a real
risk of simply vacuuming smaller components like SMD resistors off the
board. And, as already mentioned, you also risk killing
components by electrostatic discharge. Always use compressed air, even
if it is just the canned kind. Vacuuming is pretty bad advice.


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Lamar Owen
On Thursday, March 10, 2011 05:35:29 am Rudi Ahlers wrote:
 I prefer to use a dust blower instead. It doesn't risk pulling loose
 components with dry or loose soldering

I use both: antistatic canned air to blow the dust and a metal-tubed vacuum 
rested on a part of the case away from any boards to grab the dust that's being 
blown.  Works great, and you don't 'recycle' the dust.


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Michael Eager
Simon Matter wrote:

 One fan is listed as 0 rpm.   Something to look into.
 
 Hmm, much has been said now in this thread and I know how difficult it can
 be to find such an issue. However, I suggest not throwing in too many new
 tools in parallel. And be careful how you interpret any information
 gathered by tools like lm_sensors. They can only report as well as the
 mainboard and its sensors were designed and built, and both can be
 suboptimal. I've seen all kinds of things, like temp sensors not mounted
 where they should be. Of course, built-in sensors like those of a CPU should
 be taken very seriously.

Thanks for the suggestions.

 So, may I give some more tips on how I'd try to find what is wrong:
 - Take a vacuum cleaner and *carefully* clean the whole box. Dust can
 really do bad things because it is not a perfect insulator.
 - If you feel you have to remove any device like the CPU, make sure you up
 everything, have a good quality heat sink paste at hand and make sure
 everything is seated well after mounting it again.
 - For the memory part, do you have ECC? If not, that is a real
 problem: if the box is used as a server, ECC is a must. If yes, then
 most errors will be corrected by ECC and, more importantly, memory
 errors are usually logged. You should be able to find a list of those
 errors in the BIOS, where you may see how many times errors occurred and
 where. Does something like that exist?

The MB docs/website don't mention ECC support, but I presume it is supported
as part of the DDR2 spec.  I'll check whether the memory has ECC.  If not,
this is a reasonable upgrade.
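
On the logging side of the ECC question, one way to see whether the kernel is counting memory errors is to read the EDAC counters from sysfs. This is a minimal Python sketch under several assumptions: an EDAC driver for the memory controller is loaded, the counters live under /sys/devices/system/edac/mc, and Python 3 is available (it is not the stock interpreter on CentOS 5). The function name is made up for the example.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    An empty dict means no controller directory was found, which usually
    indicates that no EDAC driver is loaded or the board has no ECC.
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (corrected, uncorrected)
    return counts


for name, (ce, ue) in read_edac_counts().items():
    print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps rising between checks would point at a specific memory problem, which is exactly the kind of hint that is otherwise missing here.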

 - For the temperatures, 87C is not so uncommon, but yes, it looks a little
 bit high. Someone else posted 80C as the max for your CPU. That seems
 correct; at least our 12-core Opterons have Caution: 75C; Critical: 80C,
 but they usually run at 45C-55C under normal load. So if 87C is really
 correct under normal load, that may already be too much, and then
 consider what happens at peak times.

The most recent crash was overnight and not discovered until morning.
Probably not related to load.  But if it really is running over temp,
then almost anything can happen.

 - When you look at the lm_sensors values, do they correspond with what is
 shown in the BIOS (if it has this kind of diagnostics)?

Something I'll check when the system is taken down.

-- 
Michael Eager  ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Michael Eager
Alexander Arlt wrote:
 On 03/10/2011 11:04 AM, Simon Matter wrote:
 - Take a vacuum cleaner and *carefully* clean the whole box. Dust can
 really do bad things because it is not a perfect insulator.
 
 Never ever do that. Especially not inside the machine. There is a real
 risk of simply vacuuming smaller components like SMD resistors off the
 board. And, as already mentioned, you also risk killing
 components by electrostatic discharge. Always use compressed air, even
 if it is just the canned kind. Vacuuming is pretty bad advice.

Previous cleanings have been with canned compressed air.
Thanks for the caution about vacuums and static.  I may
use the vacuum on the case fans from the outside.  The
case should provide an adequate static shield.

-- 
Michael Eager  ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Brunner, Brian T.
centos-boun...@centos.org wrote:
 Simon Matter wrote:
 
 The MB docs/website don't mention ECC support, but I presume
 it is as part of the DDR2 spec.  I'll check whether the memory
 has ECC.  If not, this is a reasonable upgrade.

Your board does not support DDR2.
See
http://service.msicomputer.com/index.php?func=proddescmaincat_no=1cat2_no=cat3_no=prod_no=273
Support 2.5v DDR200/266/333 DDR SDRAM DIMM

That's straight old DDR.  3 slots of up to 3GB.  No ECC.

BIOS listed is A6380VMS.570

So many instrumentation suggestions have been made that I think it worth
noting:  The CPU bandwidth is rather modest, and might not support all
that instrumentation *and* its previous job load.  Also, some
instrumentation packages suggested might not support socket A
(pre-Barton) motherboards, so verify that the
VIA(r) KT333 (552 BGA) Chipset
and
VIA(r) VT8233A (376 BGA) Chipset
are supported.


Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me
***
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom
they are addressed. If you have received this email in error please
notify the system manager. This footnote also confirms that this
email message has been swept for the presence of computer viruses.
www.Hubbell.com - Hubbell Incorporated**



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Brunner, Brian T.
centos-boun...@centos.org wrote:
 On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager
 ea...@eagerm.com wrote:
 Dr. Ed Morbius wrote:
 
 If the issue is repeated but rare system failures on one of a set of
 similarly configured hosts, I'd RMA the box and get a replacement.
 End of story.
 
 I'll repeat:  this is a house-made system.  There's no vendor to RMA
 to. 
 
 I don't know where you are, 

His signature lists CA/USA.

 but in our country we can RMA anything and
 everything. Apart from CPUs. So, even a cheap desktop mobo could be
 RMA'd, as long as I can prove to the suppliers it's faulty, and it's
 within the warranty period

Here in the USA we can RMA stuff if we can show it is dysfunctional.
Michael's position is that he has no evidence of a dysfunctional part, which 
could be RMA'd.
He has evidence of a dysfunctional gestalt, comprising hardware, software, 
environment, and data stream.


Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread compdoc
Your board does not support DDR2. (url for MSI KT3 Ultra)
Support 2.5v DDR200/266/333 DDR SDRAM DIMM

The OP says this:

House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.

Somehow, info has gotten crossed...




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Brunner, Brian T.
centos-boun...@centos.org wrote:
 Your board does not support DDR2. (url for MSI KT3 Ultra)
 Support 2.5v DDR200/266/333 DDR SDRAM DIMM
 
 The OP says this:
 
 House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.
 
 Somehow, info has gotten crossed...

Possibility...  Please excuse...

Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Nico Kadel-Garcia
On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager ea...@eagerm.com wrote:

 Previous cleanings have been with canned compressed air.
 Thanks for the caution about vacuums and static.  I may
 use the vacuum on the case fans from the outside.  The
 case should provide an adequate static shield.

I've had good results with a damp, soft cloth or Q-tip with distilled
water for awkward bits and filters, and that cloth for the case
itself. It also looks noticeably newer, which helps with walking
investors through a small machine room.


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread B.J. McClure

B.J. McClure
keepert...@bellsouth.net

Sent from MacBook-Air


On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:

 On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager ea...@eagerm.com wrote:
 
 Previous cleanings have been with canned compressed air.
 Thanks for the caution about vacuums and static.  I may
 use the vacuum on the case fans from the outside.  The
 case should provide an adequate static shield.
 
 I've had good results with a damp, soft cloth or Q-tip with distilled
 water for awkward bits and filters, and that cloth for the case
 itself. It also looks noticeably newer, which helps with walking
 investors through a small machine room.

I must respectfully disagree with any application of water, distilled or
otherwise, to things electronic.  I was taught in the Navy, and my engineering
career has confirmed, that cleaning of electronic components should be done
with low pressure, dried, compressed air.  50 psi max.  If some solvent must be
used, try alcohol.  It evaporates quickly, leaves no residue and has an affinity
for water.

Just my $0.02.

Cheers,
B.J.



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-10 Thread Nico Kadel-Garcia
On Thu, Mar 10, 2011 at 6:49 PM, B.J. McClure keepert...@bellsouth.net wrote:

 B.J. McClure
 keepert...@bellsouth.net

 Sent from MacBook-Air


 On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:

 On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager ea...@eagerm.com wrote:

 Previous cleanings have been with canned compressed air.
 Thanks for the caution about vacuums and static.  I may
 use the vacuum on the case fans from the outside.  The
 case should provide an adequate static shield.

 I've had good results with a damp, soft cloth or Q-tip with distilled
 water for awkward bits and filters, and that cloth for the case
 itself. It also looks noticeably newer, which helps with walking
 investors through a small machine room.

 I must respectfully disagree with any application of water, distilled or 
 otherwise to things electronic.  I was taught in the Navy, and my engineering 
 career has confirmed, that cleaning of electronic components should be done 
 with low pressure, dried, compressed air.  50 psi max.  If some solvent must 
 be used, try alcohol.  Evaporates quickly, leaves no residue and has an 
 affinity for water.

Typical drug-store alcohol is rubbing alcohol, and is 30% water.

I designed medical electronics for a dozen years. Alcohol has its uses,
but water is much cheaper, safer, and you don't have fumes to deal
with. Shall we discuss the effectiveness of surface etch resist and
cladding in protecting circuit boards from damage, and the effects of
alcohol on low cost electronic sockets?


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Leen de Braal
 m.r...@5-cent.us wrote:
 Michael Eager wrote:

 House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.

 Any chance the problem's with the video card?

 Video is on the MB.  It doesn't seem likely that it's
 the video, since the system doesn't respond to network
 when it crashes.

 It could be anything.  That's why I'm looking for
 something that would give me a bit of a hint what
 to look at.  With an infrequent failure, it's not
 practical to replace components piecemeal.

While you have the case open, check for the bulging capacitor problem.
It will have the effect you describe, freezing up the system so that even
BIOS routines don't work (your fans).
If that's the case, replace the mainboard.


 --
 Michael Eager  ea...@eagercon.com
 1960 Park Blvd., Palo Alto, CA 94306  650-325-8077



-- 
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Rudi Ahlers
On Wed, Mar 9, 2011 at 10:24 AM, Leen de Braal l...@braha.nl wrote:
 m.r...@5-cent.us wrote:
 Michael Eager wrote:

 House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.

 Any chance the problem's with the video card?

 Video is on the MB.  It doesn't seem likely that it's
 the video, since the system doesn't respond to network
 when it crashes.

 It could be anything.  That's why I'm looking for
 something that would give me a bit of a hint what
 to look at.  With an infrequent failure, it's not
 practical to replace components piecemeal.

 While you open the case, check for the bulging capacitor problem.
 Will have the effect you describe, freezing up the system so that even
 bios routines don't work (your fans).
 If that's the case, replace mainboard.



Or replace the CAPS if you're not afraid of a soldering iron :)



-- 
Kind Regards
Rudi Ahlers
SoftDux

Website: http://www.SoftDux.com
Technical Blog: http://Blog.SoftDux.com
Office: 087 805 9573
Cell: 082 554 7532


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Leen de Braal
 On Wed, Mar 9, 2011 at 10:24 AM, Leen de Braal l...@braha.nl wrote:
 m.r...@5-cent.us wrote:
 Michael Eager wrote:

 House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.

 Any chance the problem's with the video card?

 Video is on the MB.  It doesn't seem likely that it's
 the video, since the system doesn't respond to network
 when it crashes.

 It could be anything.  That's why I'm looking for
 something that would give me a bit of a hint what
 to look at.  With an infrequent failure, it's not
 practical to replace components piecemeal.

 While you open the case, check for the bulging capacitor problem.
 Will have the effect you describe, freezing up the system so that even
 bios routines don't work (your fans).
 If that's the case, replace mainboard.



 Or replace the CAPS if you're not afraid of a soldering iron :)

That very often results in a damaged board, because you damage the vias when
pulling the caps. But it is worth a try.




-- 
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread John R Pierce
On 03/09/11 1:34 AM, Leen de Braal wrote:
 That very often results in a damaged board, because you damage the vias when
 pulling the caps. But it is worth a try.


Sure, if your time is worthless.  You can easily burn a couple of hours
recapping a motherboard, which typically exceeds the board's worth.




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Lamar Owen
On Tuesday, March 08, 2011 04:44:54 pm Dr. Ed Morbius wrote:
 I'd very strongly recommend you configure netconsole. 

Ok, now this is useful indeed.  Thanks for the information, even though I'm not
the OP.  While I suspected the facility might be there, I hadn't really dug
for it, but if this will catch things after filesystems go r/o (ext3 journal
things, ya know) it could be worth its weight in gold for catching kernel
errors from VMware guests (serial console is not really an option with the hosts I
have, although I'm sure some enterprising soul has figured out how to redirect
the VM guest serial port to something else).


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
Dr. Ed Morbius wrote:
 on 09:24 Tue 08 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?
 
 I'd very strongly recommend you configure netconsole.  Though not entirely
 clear from the name, it's actually an in-kernel network logging module,
 which is very useful for kicking out kernel panics which otherwise
 aren't logged to disk and can't be seen on a (nonresponsive) monitor.

I'll take a look at netconsole.
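
For the receiving end, netconsole sends kernel messages as plain UDP datagrams, so any second machine that can listen on a UDP port can capture them. The following Python sketch is one possible receiver, not the only way to do it (a syslog daemon or netcat would also work). The port 6666 is assumed to match whatever target port the hanging host's netconsole is configured with, and the log file name is made up for the example.

```python
import socket
import time


def format_record(sender_ip, payload, now=None):
    """Prefix one received datagram with a local timestamp and the sender."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender_ip} {text}"


def listen(port=6666, logfile="netconsole.log"):
    """Append every datagram arriving on the UDP port to logfile, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    # Line buffering, so the last messages before a hang reach the disk.
    with open(logfile, "a", buffering=1) as log:
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(4096)
            log.write(format_record(sender_ip, payload) + "\n")
```

Calling listen() on the logging machine blocks and keeps writing until interrupted. The timestamps come from the receiver's clock, which is what you want when the sender has stopped responding.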

 Alternately, a serial console which actually retains all output sent to
 it (some remote access systems support this, some don't) may help.
 
 Barring that, I'd start looking at individual HW components, starting
 with RAM.

The problem with randomly replacing various components, other than
the downtime and nuisance, is that there's no way to know that the
change actually fixed any problem.  When the base rate is one
unknown system hang every few weeks, how many weeks should I wait
without a failure to conclude that the replaced component was the
cause?  A failure which happens infrequently isn't really amenable
to a random diagnostic approach.
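
The waiting-time question can be given a rough number. As a sketch only, assume the hangs arrive independently at a constant average rate (a Poisson process), which is a modelling assumption and not something the logs have shown. Then the chance of seeing no hang for t weeks with nothing fixed is exp(-t/m), where m is the mean number of weeks between hangs.

```python
import math


def prob_no_failure(weeks_waited, mean_weeks_between_hangs):
    """Chance of a quiet stretch this long if nothing was actually fixed."""
    return math.exp(-weeks_waited / mean_weeks_between_hangs)


def weeks_needed(mean_weeks_between_hangs, confidence=0.95):
    """Quiet weeks after which luck alone explains the silence with
    probability at most 1 - confidence."""
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)


for mean in (2, 3, 4):
    print(f"mean {mean} weeks between hangs: "
          f"wait about {weeks_needed(mean):.1f} quiet weeks")
```

Under that assumption the wait for 95% confidence is about three times the mean interval between hangs for each component swapped, which puts a figure on why swapping parts one at a time is so slow for a rare failure.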

-- 
Michael Eager  ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread John Hodrien
On Wed, 9 Mar 2011, Michael Eager wrote:

 The problem with randomly replacing various components, other than
 the downtime and nuisance, is that there's no way to know that the
 change actually fixed any problem.  When the base rate is one
 unknown system hang every few weeks, how many weeks should I wait
 without a failure to conclude that the replaced component was the
 cause?  A failure which happens infrequently isn't really amenable
 to a random diagnostic approach.

So you pitch the whole thing over to being a test rig, and buy all new
hardware?

jh


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Brunner, Brian T.
centos-boun...@centos.org wrote:
 On Wed, 9 Mar 2011, Michael Eager wrote:
 
 The problem with randomly replacing various components, other than
 the downtime and nuisance, is that there's no way to know that the
 change actually fixed any problem.  When the base rate is one
 unknown system hang every few weeks, how many weeks should I wait
 without a failure to conclude that the replaced component was the
 cause?  A failure which happens infrequently isn't really amenable
 to a random diagnostic approach.
 
 So you pitch the whole thing over to being a test rig, and buy all
 new hardware? 

This would be far cheaper than the time spent troubleshooting the
running (sometimes hanging) system.
Starting with RAM and Power Supply is not random ... They're The Usual
Suspects.


Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me
***
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom
they are addressed. If you have received this email in error please
notify the system manager. This footnote also confirms that this
email message has been swept for the presence of computer viruses.
www.Hubbell.com - Hubbell Incorporated**



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Lamar Owen
On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
 This would be far cheaper than the time spent troubleshooting the
 running (sometimes hanging) system.

Let me interject here, that from a budgeting standpoint 'cheaper' has to be 
interpreted in the context of which budget the costs are coming out of.  New 
hardware is capex, and thus would come out of the capital budget, and admin 
time is opex, and thus would come out of the operating budget.  There may be 
sufficient funds in the operating budget to pay an admin $x,000 but the funds 
in the capital budget may be insufficient to buy a server costing $y,000, where 
y=x.  And if this is an educational institution, and there are grants involved, 
it may be the reverse situation.  So 'cheaper' only has meaning when the costs 
are coming out of the same budget.  So, yes, while it's easy for a 
single-budget entity to make this decision, it's not so easy when you have 
multiple budgets involved with different spending parameters and different 
funding entities. 

 Starting with RAM and Power Supply is not random ... They're The Usual
 Suspects.

This is a very true statement.  

Heat and airflow are two others.


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread m . roth
Lamar Owen wrote:
 On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
snip
 Starting with RAM and Power Supply is not random ... They're The Usual
 Suspects.

 This is a very true statement.

 Heat and airflow are two others.

Hmmm... has the a/c been changed lately? Or maybe stuff outside the rack
been moved, and so obstructed the airflow?

 mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Brunner, Brian T.
centos-boun...@centos.org wrote:
 On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
 This would be far cheaper than the time spent troubleshooting the
 running (sometimes hanging) system.
 
 Let me interject here, that from a budgeting standpoint
 'cheaper' has to be interpreted in the context of which
 budget the costs are coming out of.  

This degenerates into "Your dollars are cheaper than my dollars."

 New hardware is capex,
 and thus would come out of the capital budget, and admin time
 is opex, and thus would come out of the operating budget.

This is where mental ossification amongst bean-counters can kill a
company.
Economic Opportunity Cost should raise its head here: What would we do
with the $capex if we paid $opex vs what would we do with the $opex if
we paid $capex.  The Time Value of Money vs The Money Value of Time is
another phrasing of this point-of-view.  Unfortunately this is no longer
a CentOS topic.

 Starting with RAM and Power Supply is not random ... They're The
 Usual Suspects.
 
 This is a very true statement.
 
 Heat and airflow are two others.

RAM and PowerSupply are easy starting points: swap RAM between two
systems and see (in the next 3 months) if the problem moved.  Swapping
power supplies is a bit trickier but doable if the systems are similar
enough.  Again, several months watching to see where the problem
manifests is a test of patience and diligence.  It's possible that doing
this will make the problem stop arising (RAM and PS are both good
enough, they just don't play well together).

Heat & airflow are harder to swap (says the guy who opened an office
desktop, and vacuumed out enough hair, lint, dust, dander, and ashes to
knit a grey angora hamster (with lung cancer)).


Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
sure, if your time is worthless.  you can easily burn a couple hours
recapping a motherboard, which typically exceeds the board's worth.

Amen. It's not enough to replace the bulging caps - you need to replace all
the caps of the same brand as the damaged ones. Otherwise you'll just be
doing it again later.

And after ordering the exact replacements, and soldering them in, you've
been down for days/weeks, and you'll be lucky if it hasn't been damaged in
other ways from lack of filtered power.

Recycle the motherboard (it's hazardous waste) and buy a modern one.

By the way - don't forget to check the caps inside the PSU.






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Les Mikesell
On 3/9/2011 9:55 AM, Brunner, Brian T. wrote:

 This is where mental ossification amongst bean-counters can kill a
 company.
 Economic Opportunity Cost should raise its head here: What would we do
 with the $capex if we paid $opex vs what would we do with the $opex if
 we paid $capex.  The Time Value of Money vs The Money Value of Time is
 another phrasing of this point-of-view.  Unfortunately this is no longer
 a CentOS topic.

The admin/operator's time is usually seen as a fixed cost and keeping a 
machine working is not supposed to take unplanned time.  So, if you want 
to keep something running you really need to buy 3 of them in the first 
place.  One as primary in production, one as a backup, and one to be 
developing/testing the next version on.  In some cases you can replace 
the third one with a virtual setup, and you might be able to have one 
backup as a spare for more than one live server but you can't skimp much 
more than that.  Everything breaks, so if one thing breaking causes a 
big problem, it wasn't planned realistically.  This should be a 'swap in 
the backup' while you run extensive diagnostics or get a warranty repair 
on the broken thing.  And if you are running Centos the one thing you 
don't need is to pay for extra licenses to cover the backup/development 
instances.

-- 
   Les Mikesell
 lesmikes...@gmail.com


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Lamar Owen
On Wednesday, March 09, 2011 10:48:29 am m.r...@5-cent.us wrote:
 Lamar Owen wrote:
  Heat and airflow are two others.

 Hmmm... has the a/c been changed lately? Or maybe stuff outside the rack
 been moved, and so obstructed the airflow?

To followup a little, I had a motherboard one time, with a factory-installed 
CPU, heatsink, and fan, that would not run for more than four or five hours 
before hanging.  This motherboard was in a system that was donated to us as 
being 'flaky' so I don't know the warranty status or what the original owner 
had or had not done, but it did have a factory seal sticker strip between the 
heatsink and the CPU and the motherboard socket, and that sticker was of the 
tamper-evident type, and there had been no tampering.

I decided I would refresh the heatsink compound, since even if the system were 
still covered by the warranty, that would only have been valid for the original 
purchaser.  So I pulled the sticker strip, which left little 'voids' on things, 
and pulled the heatsink.  At that point I laughed so hard I cried, as the 
heatsink still had the clear plastic protector film between the CPU and the 
heatsink compound.  From the factory.  I pulled the film, reinstalled the 
heatsink, and that system is and has been for several years rock-solid stable.

The issue of dust buildup follows from the heat and airflow.

There is another potential culprit, though, especially if this system has been 
in a raised floor environment, that some might find odd.  That culprit, or, 
rather, those culprits, are zinc whiskers.  Also, the metal components in the 
electronics themselves can exude whiskers; see the wikipedia article on the 
subject for more information ( 
https://secure.wikimedia.org/wikipedia/en/wiki/Whisker_%28metallurgy%29 )


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Lamar Owen
On Wednesday, March 09, 2011 11:45:06 am Les Mikesell wrote:
 And if you are running Centos the one thing you 
 don't need is to pay for extra licenses to cover the backup/development 
 instances.

And this is significant, and really highlights the reasoning of the CentOS team 
in 'bug-for-bug' binary compatibility with the upstream EL.

That is, in your hypothetical 'three of everything' approach you'd run a fully 
entitled copy of the upstream on the production unit, and save costs by running 
CentOS on the backup and the backup backup.

This is another fine financial point, and I'll not use the semi-derogatory 
'bean counters' thing, because some money really is cheaper than other money, 
and I'm not making that up, it is reality.  In particular, capital can be 
donated, but rarely will opex be donation-driven.  I have quite a bit of 
donated capital here, capital that I don't have replacement capex budget for.  
Also, many grants are awarded with 'capex-only' stipulations in the awards; it 
is a violation of the grant agreement to use that grant money on opex.  
Likewise, there are some grants that have exactly the opposite stipulation, and 
there are a few that have both, and have further direct versus indirect opex 
stipulations.

The point is that CentOS saves on opex; not personnel opex, but subscription 
opex.  Support subscriptions are opex, not capex.  And while that fine a 
point might be lost on some, it is a point I deal with on virtually a daily 
basis.  I literally have to think about that distinction, and the various grant 
stipulations for monies that fund my salary, when filling out my biweekly 
timesheet; though salaried I am, that salary is funded between several grants, 
and most of those have different direct versus indirect cost budgets.

And keeping things simpler is something that CentOS has helped me with in 
significant ways.


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Leen de Braal
sure, if your time is worthless.  you can easily burn a couple hours
recapping a motherboard, which typically exceeds the board's worth.

 Amen. It's not enough to replace the bulging caps - you need to replace
 all
 the caps of the same brand as the damaged ones. Otherwise you'll just be
 doing it again later.

 And after ordering the exact replacements, and soldering them in, you've
 been down for days/weeks, and you'll be lucky if it hasn't been damaged in
 other ways from lack of filtered power.

 Recycle the motherboard (it's hazardous waste) and buy a modern one.

 By the way - don't forget to check the caps inside the PSU.

Very true. Had one server two weeks ago with a broken PSU because of caps.
It only showed after the machine was moved: it rebooted several times before
even completing POST, and then stopped completely.









-- 
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
John Hodrien wrote:
 On Wed, 9 Mar 2011, Michael Eager wrote:
 
 The problem with randomly replacing various components, other than
 the downtime and nuisance, is that there's no way to know that the
 change actually fixed any problem.  When the base rate is one
 unknown system hang every few weeks, how many weeks should I wait
 without a failure to conclude that the replaced component was the
 cause?  A failure which happens infrequently isn't really amenable
 to a random diagnostic approach.
 
 So you pitch the whole thing over to being a test rig, and buy all new
 hardware?

I'll repeat from my original post:

I don't see anything in /var/log/messages or elsewhere
to indicate any problem or offer any clue why the system
was hung.

Any suggestions where I might look for a clue?

I'm looking for diagnostics to focus on the cause of the crash.
My thanks for the several suggestions in this area.

I'm not particularly interested in a listing of the myriad of
hypothetical causes absent observable evidence and some of
which are contradicted by evidence (such as overheating).

I've encountered my share of bad power supplies, bad RAM,
poorly seated cards, etc.  I've replaced failing capacitors
in monitors (never on a motherboard).  I've replaced video
cards, hard drives, bad cables.  And so forth.  Each of these
had characteristics which pointed to the problem: kernel oops,
POST failures, flickering screens, etc.  The problem I have is
that there is a lack of diagnostic information to focus on the
cause of the server failure.

I don't mean to appear unappreciative, but suggestions which
amount to spending many hours making a series of unfocused
modifications to the server, hoping that one of these random
alterations fixes an infrequent problem, don't strike me as
useful.  At the other extreme, the suggestions that I not look
for the cause of the system failure and instead replace the
server with one or three servers don't seem to be a
useful diagnostic approach either.

During the next server downtime, I'll re-seat RAM and
cables, check for excess dust, and do normal maintenance
as folks have suggested.  I might also run a memory diag.
I'll also look at the several excellent and appreciated
suggestions (some of which I've already installed) on how
to get a better picture on the state of the server when/if
there is a future failure.

Thanks all!



-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread m . roth
Michael Eager wrote:
 John Hodrien wrote:
 On Wed, 9 Mar 2011, Michael Eager wrote:

 The problem with randomly replacing various components, other than
 the downtime and nuisance, is that there's no way to know that the
 change actually fixed any problem.  When the base rate is one
 unknown system hang every few weeks, how many weeks should I wait
 without a failure to conclude that the replaced component was the
 cause?  A failure which happens infrequently isn't really amenable
 to a random diagnostic approach.

 So you pitch the whole thing over to being a test rig, and buy all new
 hardware?

 I'll repeat from my original post:

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?

 I'm looking for diagnostics to focus on the cause of the crash.
 My thanks for the several suggestions in this area.

 I'm not particularly interested in a listing of the myriad of
 hypothetical causes absent observable evidence and some of
 which are contradicted by evidence (such as overheating).
snip
Here's one more, off-the-wall thought: do the setterm --powersave off, and
find some way to make it work, so that you can see what's on the screen
when it dies. What may be very important here is I recently had a problem
with a honkin' big server crashing... and it turned out that a user was
running a parallel processing job that kicked off three? four? dozen
threads, and towards the end of the job, every single thread wanted 10G...
on a system with 256G RAM (which size still boggles my mind). The
OOM-Killer didn't even have a chance to do its thing.  Yes, he's limited
what his job requests, and the system hasn't crashed since.
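
The arithmetic behind that crash is easy to check.  Reading "three or
four dozen" as 36 to 48 threads (an interpretation of the figures above,
not an exact count), the combined request lands well beyond the 256G of
RAM.  The helper name is invented for illustration, and swap is ignored.

```python
def memory_shortfall_gb(threads, gb_per_thread, ram_gb):
    """GB requested beyond physical RAM (0 if the request fits)."""
    return max(0, threads * gb_per_thread - ram_gb)


if __name__ == "__main__":
    for threads in (36, 48):
        shortfall = memory_shortfall_gb(threads, 10, 256)
        print(f"{threads} threads x 10G exceeds 256G by {shortfall}G")
```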

mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Les Mikesell
On 3/9/2011 11:32 AM, Michael Eager wrote:

 I'm not particularly interested in a listing of the myriad of
 hypothetical causes absent observable evidence and some of
 which are contradicted by evidence (such as overheating).

Note that overheating can be localized, or caused by a bad heat sink 
mounting or fan on a CPU.

 I've encountered my share of bad power supplies, bad RAM,
 poorly seated cards, etc.  I've replaced failing capacitors
 in monitors (never on a motherboard).  I've replaced video
 cards, hard drives, bad cables.  And so forth.  Each of these
 had characteristics which pointed to the problem: kernel oops,
 POST failures, flickering screens, etc.  The problem I have is
 that there is a lack of diagnostic information to focus on the
 cause of the server failure.

Anything that happens quickly isn't going to show up in a log.

 I don't mean to appear unappreciative, but suggestions which
 amount to spending many hours making a series of unfocused
 modifications to the server, hoping that one of these random
 alterations fixes an infrequent problem, don't strike me as
 useful.  At the other extreme, the suggestions that I not look
 for the cause of the system failure and instead replace the
 server with one or three servers don't seem to be a
 useful diagnostic approach either.

There's not really a good way to approach intermittent failures.  It may 
only break when you aren't looking.  Major component swaps or taking it 
offline for extended diagnostics hoping to catch a glimpse of the cause 
when it fails is about all you can do.

 During the next server downtime, I'll re-seat RAM and
 cables, check for excess dust, and do normal maintenance
 as folks have suggested.  I might also run a memory diag.
 I'll also look at the several excellent and appreciated
 suggestions (some of which I've already installed) on how
 to get a better picture on the state of the server when/if
 there is a future failure.

Memory diagnostics may take days to catch a problem.  Did you check for 
a newer bios for your MB?  I mentioned before that it seemed strange, 
but I've seen that fix mysterious problems even after the machines had 
previously been reliable for a long time (and even more oddly, all the 
machines in the lot weren't affected).

-- 
   Les Mikesell
 lesmikes...@gmail.com




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
m.r...@5-cent.us wrote:
 Michael Eager wrote:
 John Hodrien wrote:
 On Wed, 9 Mar 2011, Michael Eager wrote:

 The problem with randomly replacing various components, other than
 the downtime and nuisance, is that there's no way to know that the
 change actually fixed any problem.  When the base rate is one
 unknown system hang every few weeks, how many weeks should I wait
 without a failure to conclude that the replaced component was the
 cause?  A failure which happens infrequently isn't really amenable
 to a random diagnostic approach.
 So you pitch the whole thing over to being a test rig, and buy all new
 hardware?
 I'll repeat from my original post:

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?

 I'm looking for diagnostics to focus on the cause of the crash.
 My thanks for the several suggestions in this area.

 I'm not particularly interested in a listing of the myriad of
 hypothetical causes absent observable evidence and some of
 which are contradicted by evidence (such as overheating).
 snip
 Here's one more, off-the-wall thought: do the setterm --powersave off, and
 find some way to make it work, so that you can see what's on the screen
 when it dies. 

Yes, I did this.  Switched to console screen.  The correct command
is setterm -powersave off -blank off, otherwise the screen gets
blanked.  Turned the monitor off.  I hope it shows something
useful on the next fault.

 What may be very important here is I recently had a problem
 with a honkin' big server crashing... and it turned out that a user was
 running a parallel processing job that kicked off three? four? dozen
 threads, and towards the end of the job, every single thread wanted 10G...
 on a system with 256G RAM (which size still boggles my mind). The
 OOM-Killer didn't even have a chance to do its thing.  Yes, he's limited
 what his job requests, and the system hasn't crashed since.

Strange.  OOM-Killer should get priority.  That's what it's for.
Although it usually seems to kill the innocent bystanders before
it gets around to killing the offenders.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread m . roth
Michael Eager wrote:
 m.r...@5-cent.us wrote:
 Michael Eager wrote:
 John Hodrien wrote:
 On Wed, 9 Mar 2011, Michael Eager wrote:
 snip
 Here's one more, off-the-wall thought: do the setterm --powersave off,
 and find some way to make it work, so that you can see what's on the
screen
 when it dies.

 Yes, I did this.  Switched to console screen.  The correct command
 is setterm -powersave off -blank off, otherwise the screen gets
 blanked.  Turned the monitor off.  I hope it shows something
 useful on the next fault.

Best of luck. And thanks, I may try that.

 What may be very important here is I recently had a problem
 with a honkin' big server crashing... and it turned out that a user was
 running a parallel processing job that kicked off three? four? dozen
 threads, and towards the end of the job, every single thread wanted
 10G... on a system with 256G RAM (which size still boggles my mind). The
 OOM-Killer didn't even have a chance to do its thing.  Yes, he's
 limited what his job requests, and the system hasn't crashed since.

 Strange.  OOM-Killer should get priority.  That's what it's for.
 Although it usually seems to kill the innocent bystanders before
 it gets around to killing the offenders.

Yeah, but apparently too many of them hit too quickly - that's all I can
think.

  mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
Les Mikesell wrote:

 Note that overheating can be localized, or caused by a bad heat sink 
 mounting or fan on a CPU.

I'll re-seat the CPU, heatsink, and fan on the next downtime.

Heat-related problems usually present as a system which fails
and will not reboot immediately, but will after it sits for a
while to cool down.  This system doesn't do that.

I'll install sensord to log CPU temps in case this is a problem.
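
As a rough cross-check alongside sensord, temperatures can also be read
straight from the kernel's hwmon sysfs files, which report millidegrees
Celsius.  This is only a sketch: the /sys/class/hwmon layout varies by
kernel version and driver (older kernels may put the files under a
device/ subdirectory), so the glob patterns and the read_temps name are
assumptions, and it is not how sensord itself works.

```python
import glob
import os
import time


def read_temps(root="/sys/class/hwmon"):
    """Return {relative sensor path: degrees Celsius} for readable sensors."""
    temps = {}
    # Both layouts are tried; which one exists depends on the kernel.
    patterns = ("hwmon*/temp*_input", "hwmon*/device/temp*_input")
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            try:
                with open(path) as handle:
                    millidegrees = int(handle.read().strip())
            except (OSError, ValueError):
                continue  # sensor missing or unreadable; skip it
            temps[os.path.relpath(path, root)] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for name, celsius in sorted(read_temps().items()):
        print(stamp, name, f"{celsius:.1f}C")
```

Run from cron every minute or so with output appended to a file, this
would leave a temperature trail leading up to the next hang.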

 There's not really a good way to approach intermittent failures.  It may 
 only break when you aren't looking.  Major component swaps or taking it 
 offline for extended diagnostics hoping to catch a glimpse of the cause 
 when it fails is about all you can do.
 
 During the next server downtime, I'll re-seat RAM and
 cables, check for excess dust, and do normal maintenance
 as folks have suggested.  I might also run a memory diag.
 I'll also look at the several excellent and appreciated
 suggestions (some of which I've already installed) on how
 to get a better picture on the state of the server when/if
 there is a future failure.
 
 Memory diagnostics may take days to catch a problem.  Did you check for 
 a newer bios for your MB?  I mentioned before that it seemed strange, 
 but I've seen that fix mysterious problems even after the machines had 
 previously been reliable for a long time (and even more oddly, all the 
 machines in the lot weren't affected).

Yes, most memory diagnostics are not very effective.

I'll have to stop the server to find out what the installed bios version
is and see whether there is an update.  Most bios updates appear to only
change supported CPUs.  Something else for the next downtime.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
 During the next server downtime, I'll re-seat RAM


If the ram is passing memtest86+, I think reseating only serves to introduce
dust and dirt into an area where a tight connection was previously keeping
it out.

Gently press them down to make sure they're seated, sure. But pulling them
out only allows dirt to fall into the cavity, and increases chances of
damage from insertion or static electricity, etc.

Not to mention causing wear on the memory socket itself...







Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Dr. Ed Morbius
on 10:05 Wed 09 Mar, Lamar Owen (lo...@pari.edu) wrote:
 On Tuesday, March 08, 2011 04:44:54 pm Dr. Ed Morbius wrote:
  I'd very strongly recommend you configure netconsole. 
 
 Ok, now this is useful indeed.  Thanks for the information, even
 though I'm not the OP.  While I suspected the facility might be
 there, I hadn't really dug for it, but if this will catch things after
 filesystems go r/o (ext3 journal things, ya know) it could be worth
 its weight in gold for catching kernel errors from VMware guests
 (serial console not really an option with the hosts I have, 

Yep, it is.

Netconsole made me fall in love with Linux all over again.

 although I'm sure some enterprising soul has figured out how to
 redirect the VM guest serial port to something else). 

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Dr. Ed Morbius
on 07:06 Wed 09 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Dr. Ed Morbius wrote:
 on 09:24 Tue 08 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Hi --
 
 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.
 
 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.
 
 Any suggestions where I might look for a clue?
 
 I'd very strongly recommend you configure netconsole.  Though not entirely
 clear from the name, it's actually an in-kernel network logging module,
 which is very useful for kicking out kernel panics which otherwise
 aren't logged to disk and can't be seen on a (nonresponsive) monitor.
 
 I'll take a look at netconsole.
 
 Alternately, a serial console which actually retains all output sent to
 it (some remote access systems support this, some don't) may help.
 
 Barring that, I'd start looking at individual HW components, starting
 with RAM.
 
 The problem with randomly replacing various components, other than the
 downtime and nuisance, is that there's no way to know that the change
 actually fixed any problem.  When the base rate is one unknown system
 hang every few weeks, how many weeks should I wait without a failure to
 conclude that the replaced component was the cause?  A failure which
 happens infrequently isn't really amenable to a random diagnostic
 approach.

This is where vendor management/relations starts coming into the
picture.

Your architecture should also support single-point failures.

If the issue is repeated but rare system failures on one of a set of
similarly configured hosts, I'd RMA the box and get a replacement.  End
of story.

If that's not the case, well, then, I suppose YOUR problem is to figure
out when you've resolved the issue.  I've outlined the steps I'd take.
If this means weeks of uncertainty, then I'd communicate this fact, in
no uncertain terms, to my manager, along with the financial implications
of downtime.

If downtime is more expensive than system replacement costs, the
decision is pretty obvious, even if painful.

Note that most system problems /are/ single-source.  If you'd post
details of the host, more logging information, netconsole panic logs,
etc., it might be possible to narrow down possible causes.

With what you've posted to date, it's not.

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Dr. Ed Morbius
on 10:37 Wed 09 Mar, Lamar Owen (lo...@pari.edu) wrote:
 On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
  This would be far cheaper than the time spent troubleshooting the
  running (sometimes hanging) system.
 
 Let me interject here, that from a budgeting standpoint 'cheaper' has
 to be interpreted in the context of which budget the costs are coming
 out of.  New hardware is capex, and thus would come out of the capital
 budget, and admin time is opex, and thus would come out of the
 operating budget.  There may be sufficient funds in the operating
 budget to pay an admin $x,000 but the funds in the capital budget may
 be insufficient to buy a server costing $y,000, where y=x.  

That represents an accounting failure, as opex is now subsidizing capex.
Troubleshooting of known bad equipment should be an opex chargeback
against capex or some capital reserve.

This requires clueful beancounters.  Recent economic/business/finance
history suggests a significant shortage of same.  Cue supply/demand and
incentives off-topic digression.

The answer is still to communicate the issue upstream.  Estimating
replacement costs and likelihood will help in the relevant business /
organizational decision.

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread m . roth
Michael Eager wrote:
<snip>
 I'll have to stop the server to find out what the installed bios version
 is and see whether there is an update.  Most bios updates appear to only
 change supported CPUs.  Something else for the next downtime.

Nope: dmidecode, or lshw, is your friend.
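
A minimal sketch of reading the BIOS details from the running system, assuming dmidecode is installed and run as root. The helper name bios_summary and the DMIDECODE override variable are illustrative only, not anything from this thread:

```shell
#!/bin/sh
# Print BIOS and baseboard identification without rebooting (needs root).
# DMIDECODE is a made-up override so a different binary can be substituted.
DMIDECODE=${DMIDECODE:-dmidecode}

bios_summary() {
    for key in bios-vendor bios-version bios-release-date \
               baseboard-manufacturer baseboard-product-name; do
        # -s prints the single DMI string named by the keyword
        printf '%s: %s\n' "$key" "$("$DMIDECODE" -s "$key" 2>/dev/null)"
    done
}
```

Calling bios_summary as root should give the installed BIOS version and the board model, which can then be compared against what the board vendor lists for download.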

 mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Dr. Ed Morbius
on 11:52 Wed 09 Mar, Les Mikesell (lesmikes...@gmail.com) wrote:
 On 3/9/2011 11:32 AM, Michael Eager wrote:

 Memory diagnostics may take days to catch a problem.  Did you check for 
 a newer bios for your MB?  I mentioned before that it seemed strange, 
 but I've seen that fix mysterious problems even after the machines had 
 previously been reliable for a long time (and even more oddly, all the 
 machines in the lot weren't affected).

BIOS issues would tend to present similar issues on numerous systems,
especially if they're similarly configured.

Mind:  we've encountered a DSTATE bug with recent Dell PowerEdge systems
(r610, r410, r310), which has resulted in several BIOS revisions, the
latest of which simply disables the option entirely.  It's one of the
first things Dell techs mention when you call them these days (much to
our amusement).

If it's a single system (and assuming there are others similarly
configured), I'm leaning toward hardware or build-quality issues:  bad
RAM, other componentry, poor cable seating, etc.

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Les Mikesell
On 3/9/2011 12:47 PM, Dr. Ed Morbius wrote:

 That represents an accounting failure, as opex is now subsidizing capex.
 Troubleshooting of known bad equipment should be an opex chargeback
 against capex or some capital reserve.

 This requires clueful beancounters.  Recent economic/business/finance
 history suggests a significant shortage of same.  Cue supply/demand and
 incentives off-topic digression.

Statistical stuff doesn't play out well in one-off situations.  If you 
have a large number of boxes you'll know about the right amount of spare 
parts and on-hand spares you need.  But individual units are about like 
light bulbs in breaking at random and if the only one you have breaks 
today it won't matter that their average life is in years.

-- 
   Les Mikesell
lesmikes...@gmail.com





Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Dr. Ed Morbius
on 10:29 Wed 09 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Les Mikesell wrote:
 
  Note that overheating can be localized or a bad heat sink mounting or 
  fan on a CPU.
 
 I'll re-seat the CPU, heatsink, and fan on the next downtime.

Very strongly advised.  It's a simple and very cheap approach.  I'd
check /all/ cables (power, disk) as well.

Visually scan for bad caps while you're doing this.  The pandemic of the
mid 2000s seems to have abated, but they can still ruin your whole day.
 
 Heat related problems usually present as a system which fails
 and will not reboot immediately, but will after they sit for a
 while to cool down.  This system doesn't do that.

Maybe, maybe not.
 
 I'll install sensord to log CPU temps in case this is a problem.

Good call.
 
  There's not really a good way to approach intermittent failures.  It
  may only break when you aren't looking.  Major component swaps or
  taking it offline for extended diagnostics hoping to catch a glimpse
  of the cause when it fails is about all you can do.

I disagree with this statement:  you start with the bleeding obvious and
easy to do (the cheap diagnostics), same as any garage mechanic or
doctor.  You instrument and increase log scrutiny.  You make damned sure
you're logging remotely as one of the first things a hosed system does
is stop writing to disk.
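
One low-effort way to get logs off the box, sketched under the assumption that the server runs the stock CentOS 5 sysklogd and that a second machine is available to receive them. The helper add_remote_log_rule and the host name loghost.example.com are placeholders:

```shell
#!/bin/sh
# Append a forward-everything rule to a syslog.conf-style file, only once.
# $1 = config file (e.g. /etc/syslog.conf), $2 = receiving log host
add_remote_log_rule() {
    conf=$1
    host=$2
    rule="*.* @${host}"
    # -F: fixed string, -x: whole line, so reruns do not duplicate the rule
    grep -qxF -- "$rule" "$conf" 2>/dev/null || printf '%s\n' "$rule" >> "$conf"
}
```

Usage would be along the lines of add_remote_log_rule /etc/syslog.conf loghost.example.com, followed by a restart of the syslog service. The receiving syslogd has to be started with -r to accept remote messages; on CentOS 5 that is normally set through SYSLOGD_OPTIONS in /etc/sysconfig/syslog.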

 Yes, most memory diagnostics are not very effective.
 
 I'll have to stop the server to find out what the installed bios version
 is and see whether there is an update.  Most bios updates appear to only
 change supported CPUs.  Something else for the next downtime.

You haven't stated who's built this system, but many LOM / OMC systems
will provide basic information such as this.  dmidecode and lshw are
also very helpful here.

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
 I'll re-seat the CPU, heatsink, and fan on the next downtime.

Is the CPU overheating? Pointless to reseat the cpu or even remove the
heatsink, if not.




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread John R Pierce
On 03/09/11 10:29 AM, Michael Eager wrote:
 I'll re-seat the CPU, heatsink, and fan on the next downtime.

do have on hand the supplies to clean off the old heatsink goo (I use 
alcohol pads for this), and some fresh heatsink goop

check all fans when it's powered off that they spin easily.  I've seen 
fans that were still spinning but felt a little stiff, and failed not 
long thereafter.  and of course, clean out most of the dust that tends 
to collect everywhere.





Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
compdoc wrote:
 I'll re-seat the CPU, heatsink, and fan on the next downtime.
 
 Is the CPU overheating? Pointless to reseat the cpu or even remove the
 heatsink, if not.

No evidence to suggest that it is.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
m.r...@5-cent.us wrote:
 Michael Eager wrote:
 snip
 I'll have to stop the server to find out what the installed bios version
 is and see whether there is an update.  Most bios updates appear to only
 change supported CPUs.  Something else for the next downtime.
 
 Nope: dmidecode, or lshw, is your friend.

Thanks.  Looks like there might be a newer bios available,
although the vendor identifies it as 'beta'.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread m . roth
Michael Eager wrote:
 compdoc wrote:
 I'll re-seat the CPU, heatsink, and fan on the next downtime.

 Is the CPU overheating? Pointless to reseat the cpu or even remove the
 heatsink, if not.

 No evidence to suggest that it is.

Have you used ipmitool to see what the temperatures are?

 mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
Dr. Ed Morbius wrote:

 If the issue is repeated but rare system failures on one of a set of
 similarly configured hosts, I'd RMA the box and get a replacement.  End
 of story.

I'll repeat:  this is a house-made system.  There's no vendor to RMA to.
It seems obvious to me:  RMA is not a diagnostic tool.

 If you'd post
 details of the host, more logging information, netconsole panic logs,
 etc., it might be possible to narrow down possible causes.

The problem is that there are NO DIAGNOSTICS generated when the
system hangs.  There's no panic and nothing in the logs which
indicates any problem.  This is what I indicated from the get go.

 With what you've posted to date, it's not.

I could waste my time posting logs for you to tell me that they don't
point to any problem.  I'd rather skip that step.


-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
m.r...@5-cent.us wrote:
 Michael Eager wrote:
 compdoc wrote:
 I'll re-seat the CPU, heatsink, and fan on the next downtime.
 Is the CPU overheating? Pointless to reseat the cpu or even remove the
 heatsink, if not.
 No evidence to suggest that it is.
 
 Have you used ipmitool to see what the temperatures are?

No, I'm not familiar with ipmitool.   I just installed it and
the man page will take some time to read.  It looks like it
does everything and then more.

According to the man page, it apparently needs a kernel driver
named OpenIPMI, which it claims is installed in standard
distributions.  I don't find it on my system.   Running
ipmitool sdr type Temperature results in an error message
saying that it could not open /dev/ipmi0, etc.
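
For reference, the device node ipmitool looks for is created by the kernel's IPMI modules, so a rough sketch of getting past that error looks like the following. It assumes the board actually has a BMC (many home-built systems do not, in which case ipmi_si will refuse to load), and load_ipmi is a made-up helper name:

```shell
#!/bin/sh
# Load the in-kernel IPMI drivers, then check for a device node (needs root).
load_ipmi() {
    for mod in ipmi_msghandler ipmi_devintf ipmi_si; do
        modprobe "$mod" || return 1
    done
    # ipmitool's default "open" interface uses one of these device nodes
    [ -e /dev/ipmi0 ] || [ -e /dev/ipmi/0 ] || [ -e /dev/ipmidev/0 ]
}
```

If load_ipmi succeeds, ipmitool sdr type Temperature should have something to talk to. I believe the init script shipped in the OpenIPMI package loads the same modules, but have not checked it on 5.5.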


-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Rudi Ahlers
On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager ea...@eagerm.com wrote:
 Dr. Ed Morbius wrote:

 If the issue is repeated but rare system failures on one of a set of
 similarly configured hosts, I'd RMA the box and get a replacement.  End
 of story.

 I'll repeat:  this is a house-made system.  There's no vendor to RMA to.



I don't know where you are, but in our country we can RMA anything and
everything. Apart from CPU's. So, even a cheap desktop mobo could be
RMA'd, as long as I can prove to the suppliers it's faulty, and it's
within the warranty period

-- 
Kind Regards
Rudi Ahlers
SoftDux

Website: http://www.SoftDux.com
Technical Blog: http://Blog.SoftDux.com
Office: 087 805 9573
Cell: 082 554 7532


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
 compdoc wrote:
 I'll re-seat the CPU, heatsink, and fan on the next downtime.
 
 Is the CPU overheating? Pointless to reseat the cpu or even remove the
 heatsink, if not.

No evidence to suggest that it is.


As much as I love telling anecdotes, I have none to tell you concerning cpu
reseating. I've never seen it fix a problem.

Maybe that was something they needed to do back in 1998, but cpu and ram
sockets are a reliable technology these days.

Removing and then reinserting is likely to do more damage than it will fix.

I think you're on the right track - use diagnostic tools and see what you
can find. The more poking around you do the better.

I do agree about bad caps - even one with a bulging top can cause
crashing/rebooting. They need to be checked both on the motherboard and
inside the PSU.

However, if the motherboard is 2 years old or less, capacitor problems on
the motherboard will become less likely the newer it is. They've been making
some excellent low cost boards with solid caps for a while.

The older boards with that problem are still around but most have died by
now. Cheaper PSUs have a cap problem even these days, though.

Oh, and both the motherboard and PSU circuit board should be examined for
burned components. We have some hellacious lightning strikes here in Denver,
and stuff blows up.

Hey, I did manage an anecdote after all!






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
According to the man page, it apparently needs a kernel driver
named OpenIPMI, which it claims is installed in standard
distributions.  I don't find it on my system.


lm_sensors is another, and I think installs ready to use from the repos.

Failing that, you should reboot and look in the motherboard's bios/cmos. It
should display all that good stuff: fan speeds, voltage levels, temps.






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
Rudi Ahlers wrote:
 On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager ea...@eagerm.com wrote:
 Dr. Ed Morbius wrote:

 If the issue is repeated but rare system failures on one of a set of
 similarly configured hosts, I'd RMA the box and get a replacement.  End
 of story.
 I'll repeat:  this is a house-made system.  There's no vendor to RMA to.
 
 
 
 I don't know where you are, but in our country we can RMA anything and
 everything. Apart from CPU's. So, even a cheap desktop mobo could be
 RMA'd, as long as I can prove to the suppliers it's faulty, and it's
 within the warranty period

I responded to Dr. Morbius' suggestion that I RMA the box.
There is no vendor to RMA the box to.

If I knew that it was a motherboard problem, I could RMA it.
Or disk, or PSU, or network card, or whatever.  But, as I've mentioned,
there's no indication what causes the system to hang.  There is no
way at this point to prove that it is a defective motherboard.


-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Don Krause
On Mar 9, 2011, at 3:06 PM, compdoc wrote:

 compdoc wrote:
 I'll re-seat the CPU, heatsink, and fan on the next downtime.
 
 Is the CPU overheating? Pointless to reseat the cpu or even remove the
 heatsink, if not.
 
 No evidence to suggest that it is.
 
 
 As much as I love telling anecdotes, I have none to tell you concerning cpu
 reseating. I've never seen it fix a problem.


Funny, we actually had a whole stack of HP 4600s that needed the cpus 
reinstalled in order to function.

When we removed the heatsinks, the cpus came up with them, even though the 
socket lever was down in the lock position. 

We had to twist the CPU off the bottom of the heatsink, reinstall it in the 
socket, reinstall the heatsink, and the machines were fine.

--
Don Krause   









Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Dr. Ed Morbius
on 14:31 Wed 09 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Dr. Ed Morbius wrote:
 
 If the issue is repeated but rare system failures on one of a set of
 similarly configured hosts, I'd RMA the box and get a replacement.  End
 of story.
 
 I'll repeat:  this is a house-made system.  There's no vendor to RMA to.
 It seems obvious to me:  RMA is not a diagnostic tool.

You fab your own silicon?

I saw your reference to a homebrew machine after I'd posted.  You'd
neglected to provide this information initially.

Knowing some basic stuff like:  CPU architecture, memory allocation,
disk subsystem, kernel modules, etc., would help.
 
 If you'd post
 details of the host, more logging information, netconsole panic logs,
 etc., it might be possible to narrow down possible causes.
 
 The problem is that there are NO DIAGNOSTICS generated when the
 system hangs.  There's no panic and nothing in the logs which
 indicates any problem.  This is what I indicated from the get go.

uname -a
/proc/cpuinfo
/proc/meminfo
lspci
lsmod
/proc/mounts
/proc/scsi/scsi
/proc/partitions
dmidecode

... would be useful for starters.
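
A rough sketch of gathering all of the above into a single file for posting. The function name collect_hostinfo and the default file name hostinfo.txt are made up; commands that are missing or need root simply leave a note in the report:

```shell
#!/bin/sh
# Collect basic host details into one report file.
# $1 = output file (defaults to ./hostinfo.txt)
collect_hostinfo() {
    out=${1:-hostinfo.txt}
    : > "$out"
    for cmd in "uname -a" "lspci" "lsmod" "dmidecode"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        # Unquoted on purpose so "uname -a" splits into command + argument
        $cmd >> "$out" 2>&1 || echo "(command failed or not installed)" >> "$out"
    done
    for f in /proc/cpuinfo /proc/meminfo /proc/mounts \
             /proc/scsi/scsi /proc/partitions; do
        printf '===== %s =====\n' "$f" >> "$out"
        cat "$f" >> "$out" 2>&1 || echo "(not readable)" >> "$out"
    done
}
```

Run as root (dmidecode needs it), e.g. collect_hostinfo /tmp/hostinfo.txt, and skim the result for anything you would rather not publish before attaching it.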

If you've built your own kernel, your config options (if you're running
stock, we can get that from the package itself).

As would wiring up netconsole as I initially suggested.
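
A minimal sketch of the netconsole side, assuming the stock kernel ships the netconsole module and that a second machine on the same LAN can listen for the messages. The helper netconsole_param is hypothetical and all addresses are placeholders; the ports are the kernel defaults:

```shell
#!/bin/sh
# Build the netconsole= module parameter from its parts.
# Args: src_ip dev tgt_ip [tgt_mac]
# Format: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=${4:-}
    printf '6665@%s/%s,6666@%s/%s\n' "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}
```

As root, something like modprobe netconsole netconsole="$(netconsole_param 192.168.1.10 eth0 192.168.1.20)" would start the sender. The receiver only needs something listening on UDP port 6666 (netcat, or a syslogd accepting remote messages); leaving the MAC empty makes the kernel use the broadcast address. Raising the console log level with dmesg -n 8 helps make sure everything gets sent.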


If I can clarify:  YOU are the person with the problem.  WE are the
people you're turning to for assistance.  YOU are getting pissy.  YOU
should be focusing on providing relevant information, or noting that
it's not available.

You're NOT obliged to repeat information you've already posted (e.g.:
home-brew system), but it's helpful to front-load data rather than have
us tease it out of you.
 
 With what you've posted to date, it's not.
 
 I could waste my time posting logs for you to tell me that they don't
 point to any problem.  I'd rather skip that step.

Krell forfend you should post relevant and useful information which
might be useful in actually diagnosing your problem (or pointing to
likely candidates and/or further tests).

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
When we removed the heatsinks, the
cpus came up with them, even though
the socket lever was down in the lock position.

I've seen that in HP desktops too - the thermal paste became a hardened glue
and the cpu gets pulled right out .

Another reason to leave the heat sink on.






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
compdoc wrote:
 According to the man page, it apparently needs a kernel driver
 named OpenIPMI, which it claims is installed in standard
 distributions.  I don't find it on my system.
 
 
 lm_sensors is another, and I think installs ready to use from the repos.

sensors says that the three temp sensors read +36C, +39C, and +87C.
These appear to be AMD K10 temp sensors, although I might be
misreading sensors-detect.  Low/highs are (+127/+127, +127/+90,
+127/+127) respectively.  (I'm not sure if these are alarm set
points or something else.)

One fan is listed as 0 rpm.   Something to look into.
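
Readings like these can be checked automatically. The following is only a sketch: it assumes the usual "label: value" layout of sensors output, the 80 C default limit is an arbitrary choice, and flag_sensors is a made-up name:

```shell
#!/bin/sh
# Read `sensors` output on stdin; report hot temperatures and stopped fans.
# $1 = temperature limit in degrees C (default 80)
flag_sensors() {
    limit=${1:-80}
    awk -F: -v limit="$limit" '
        # Temperature lines: the value after the colon looks like "+87 C"
        /^[^:]+:[ \t]*[+-]?[0-9.]+[^0-9]*C/ {
            v = $2; sub(/^[ \t]*[+]?/, "", v)
            if (v + 0 >= limit) printf "HOT  %s = %d C\n", $1, v + 0
        }
        # Fan lines: "fan1:   0 RPM ..."
        /^[^:]+:[ \t]*[0-9]+ RPM/ {
            v = $2; sub(/^[ \t]*/, "", v)
            if (v + 0 == 0) printf "STOPPED  %s\n", $1
        }
    '
}
```

Piping sensors into flag_sensors from cron, with the output appended to a log on another machine, would leave a record of the last readings before a hang. A 0 RPM fan may also just be an unconnected header, so it is worth matching the labels against what is physically plugged in.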

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Rudi Ahlers
On Thu, Mar 10, 2011 at 1:17 AM, Michael Eager ea...@eagerm.com wrote:
 Rudi Ahlers wrote:
 On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager ea...@eagerm.com wrote:
 Dr. Ed Morbius wrote:

 If the issue is repeated but rare system failures on one of a set of
 similarly configured hosts, I'd RMA the box and get a replacement.  End
 of story.
 I'll repeat:  this is a house-made system.  There's no vendor to RMA to.



 I don't know where you are, but in our country we can RMA anything and
 everything. Apart from CPU's. So, even a cheap desktop mobo could be
 RMA'd, as long as I can prove to the suppliers it's faulty, and it's
 within the warranty period

 I responded to Dr. Morbius' suggestion that I RMA the box.
 There is no vendor to RMA the box to.

 If I knew that it was a motherboard problem, I could RMA it.
 Or disk, or PSU, or network card, or whatever.  But, as I've mentioned,
 there's no indication what causes the system to hang.  There is no
 way at this point to prove that it is a defective motherboard.


 --
 Michael Eager    ea...@eagercon.com
 1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

As far as I can see you were given a bucket load of advice, which you
haven't even bothered to follow yet. You're the only one who could
actually do anything about the problem.

No amount of suggestions made on this list will fix the problem for
you. You need to actually take apart the server and see what's going
on.


-- 
Kind Regards
Rudi Ahlers
SoftDux

Website: http://www.SoftDux.com
Technical Blog: http://Blog.SoftDux.com
Office: 087 805 9573
Cell: 082 554 7532


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Don Krause
On Mar 9, 2011, at 3:26 PM, compdoc wrote:

 When we removed the heatsinks, the
 cpus came up with them, even though
 the socket lever was down in the lock position.
 
 I've seen that in HP desktops too - the thermal paste became a hardened glue
 and the cpu gets pulled right out .
 
 Another reason to leave the heat sink on.
 


Umm, actually, that was a great reason to take the heatsink off. The machines 
wouldn't boot in that condition, reseating the cpus fixed them all.
Yes, we could have shipped them back, (they were brand new, broken out of the 
box) but didn't have the time to deal with that.

--
Don Krause   












Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
+36C and +39C are likely your cpu and motherboard temps. You have to look at
the temps in the cmos and match them.

The +87C is likely just a misreading by lm_sensors. Anything running that
hot won't be stable.

I use AMD as well, and lm_sensors tells me something is 128°C.

heh




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
Err, that should read 128C

-----Original Message-----
From: centos-boun...@centos.org [mailto:centos-boun...@centos.org] On Behalf
Of compdoc
Sent: Wednesday, March 09, 2011 4:50 PM
To: 'CentOS mailing list'
Subject: Re: [CentOS] Server hangs on CentOS 5.5

+36C and +39C are likely your cpu and motherboard temps. You have to look at
the temps in the cmos and match them.

The +87C is likely just a misreading by lm_sensors. Anything running that
hot won't be stable.

I use AMD as well, and lm_sensors tells me something is 1280C.

heh






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread John R Pierce
On 03/09/11 2:31 PM, Michael Eager wrote:
 I'll repeat:  this is a house-made system.  There's no vendor to RMA to.
 It seems obvious to me:  RMA is not a diagnostic tool.



you built it, you get to fix it. sometimes the initial savings in 
capital can come back and bite you in time wasted.




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
Rudi Ahlers wrote:

 As far as I can see you were given a bucket load of advice, which you
 haven't even bothered to follow yet. You're the only one who could
 actually do anything about the problem.

I have followed quite a bit of the advice, which I have
appreciated and noted.  I've set up the monitor so that it
will not be blanked on a crash, installed monitoring software,
and checked a number of conditions which people have suggested.

No, I have not responded to the philosophical discussions
about vendor management, nor to the suggestions to RMA
something to somebody for unknown reasons.  No, I'm not
going to replace RAM or capacitors here and there on the off
chance that something might be bad.  (But I will look for
capacitors which show signs of bulging or leaking.)

 No amount of suggestions made on this list will fix the problem for
 you. You need to actually take apart the server and see what's going
 on.

I wasn't interested in anyone fixing the server for me.
I did ask for suggestions on how to improve the diagnostics
for the problem, which several people have responded to.
Again, I appreciate their suggestions greatly.

As I've said, I have a list of things to check when the
server is next taken down.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
compdoc wrote:
 Err, that should read 128C
 
 -----Original Message-----
 From: centos-boun...@centos.org [mailto:centos-boun...@centos.org] On Behalf
 Of compdoc
 Sent: Wednesday, March 09, 2011 4:50 PM
 To: 'CentOS mailing list'
 Subject: Re: [CentOS] Server hangs on CentOS 5.5
 
 +36C and +39C are likely your cpu and motherboard temps. You have to look at
 the temps in the cmos and match them.
 
 The +87C is likely just a misreading by lm_sensors. Anything running that
 hot won't be stable.
 
 I use AMD as well, and lm_sensors tells me something is 1280C.

I'll compare the values from lm_sensors with the bios
temps to see if they are in line.

1280C is about the melting point of iron.  Wow!


-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread compdoc
1280C is about the melting point of iron.  Wow!

The degree symbol was converted to text after pasting into the email and
became an '0'

It actually shows 128C in lm_sensors.

Great little program, tho.







Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread John R Pierce
On 03/09/11 4:06 PM, Michael Eager wrote:
 I'll compare the values from lm_sensors with the bios
 temps to see if they are in line.

I find lm_sensors tends to be pretty useless on server grade hardware, 
as opposed to desktop.   server hardware tends to have an IPMI 
management processor, which is accessed over the network (after you 
configure it) and can be centrally managed, this includes temp+fan+power 
monitoring as well as remote power and console.






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-09 Thread Michael Eager
Dr. Ed Morbius wrote:

 
 You're NOT obliged to repeat information you've already posted (e.g.:
 home-brew system), but it's helpful to front-load data rather than have
 us tease it out of you.

No intention to have anyone tease information out of me.

The subject line says that the system is CentOS 5.5.  The other
info has been forthcoming, as much as I have been able to provide.
Sorry it wasn't all at the same time -- I didn't think that saying
the server was not a Dell or HP box was important.

 With what you've posted to date, it's not.
 I could waste my time posting logs for you to tell me that they don't
 point to any problem.  I'd rather skip that step.
 
 Krell forfend you should post relevant and useful information which
 might be useful in actually diagnosing your problem (or pointing to
 likely candidates and/or further tests).

The logs are uninformative.  No messages for hours before the crash.

Thanks for the help.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


[CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
Hi --

I'm running a server which is usually stable, but every
once in a while it hangs.  The server is used as a file
store using NFS and to run VMware machines.

I don't see anything in /var/log/messages or elsewhere
to indicate any problem or offer any clue why the system
was hung.

Any suggestions where I might look for a clue?

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Les Mikesell
On 3/8/2011 11:24 AM, Michael Eager wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?

Probably something hardware related.  Bad memory, overheating, power 
supply, etc.  I've even seen some rare cases where a bios update would 
fix it although it didn't make much sense for a machine to run for 
years, then need a firmware change.

-- 
   Les Mikesell
lesmikes...@gmail.com


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread compdoc
I'm running a server which is usually stable, but every
once in a while it hangs.


There can be many reasons for that. One thing I'm curious about - try
looking at the reallocated sector count, and current pending sector count
for your drives with smartctl.
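
A small sketch of pulling just those two counters out of the attribute table, assuming the standard ten-column layout that smartctl -A prints, where the raw value is the last column. The filter name smart_sector_counts is made up:

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin; print the raw values of the
# reallocated and pending sector attributes as NAME=VALUE lines.
smart_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
        print $2 "=" $10
    }'
}
```

Usage would be smartctl -A /dev/sda | smart_sector_counts, repeated for each drive. Non-zero values that grow over time are the thing to watch for.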






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Brian Mathis
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager ea...@eagerm.com wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?

Please be more specific when you say it hangs.  Does it just pause
for a minute and then continue working, or does it freeze completely
until you reboot it?  Does it respond to a soft reboot like
Ctrl-Alt-Del, or do you need to hard power it off?

Since this is an NFS server I'm going to guess there might be a lot of
IO.  Maybe there is some large IO load going on, like maybe all your
VMs are running anti-virus scan at the same time, or something like
that.

To troubleshoot, I recommend installing the 'sar' utilities (yum
install sysstat) and then reviewing the collected data using the
'ksar' utility (http://sourceforge.net/projects/ksar/).  sar/ksar are
good for tracking down acute problems.
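
Once sysstat has been collecting for a while, the window just before a hang can be pulled out of the daily data file. A sketch, assuming the default data directory /var/log/sa used by the CentOS sysstat package; sar_window is a made-up helper name:

```shell
#!/bin/sh
# Show CPU, run queue, memory and I/O figures for a time window.
# Args: day-of-month (two digits), start HH:MM:SS, end HH:MM:SS
sar_window() {
    day=$1
    start=$2
    end=$3
    file=/var/log/sa/sa$day
    for opt in -u -q -r -b; do
        sar "$opt" -f "$file" -s "$start" -e "$end"
    done
}
```

For example, sar_window 08 02:00:00 03:00:00 would show the hour before a hang noticed at 3 AM on the 8th. The stock cron job samples every ten minutes, so the last entry only bounds when the system stopped responding.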


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
compdoc wrote:
 I'm running a server which is usually stable, but every
 once in a while it hangs.
 
 
 There can be many reasons for that. One thing I'm curious about - try
 looking at the reallocated sector count, and current pending sector count
 for your drives with smartctl.

Thanks for the suggestions.  All disks show zero realloc sectors
and pending sectors.  Smartctl says no failures.  Also, max temp
was 48 C or less.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
Brian Mathis wrote:
 On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager ea...@eagerm.com wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?
 
 Please be more specific when you say it hangs.  Does it just pause
 for a minute and then continue working, or does it freeze completely
 until you reboot it?  Does it respond to a soft reboot like
 Ctrl-Alt-Del, or do you need to hard power it off?

System is unresponsive.  Monitor blank, no response to keyboard,
no response to remote ssh.  Hit reset to reboot.

The only indication that I had that there was a problem (other
than attached systems not accessing files) was that the fan(s)
on the server were louder than normal.

 Since this is an NFS server I'm going to guess there might be a lot of
 IO.  Maybe there is some large IO load going on, like maybe all your
 VMs are running anti-virus scan at the same time, or something like
 that.

At the time, NFS load should have been very low.

 To troubleshoot, I recommend installing the 'sar' utilities (yum
 install sysstat) and then reviewing the collected data using the
 'ksar' utility (http://sourceforge.net/projects/ksar/).  sar/ksar are
 good for tracking down acute problems.

Thanks for the suggestion.  I'll look into sar.
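
A rough sketch of what looking into sar could involve, assuming the stock sysstat package, which normally samples every ten minutes from cron and keeps one binary file per day of the month under /var/log/sa. The helper name sa_file_for_day is invented for this example.

```shell
# Build the path of the sysstat data file for a given day of the month.
# A leading zero is stripped so that both "8" and "08" are accepted.
sa_file_for_day() {
    printf '/var/log/sa/sa%02d\n' "${1#0}"
}

# One-time setup, as root:
#   yum install sysstat
#
# After a hang, review the samples leading up to it (day 8 as an example):
#   sar -u -f "$(sa_file_for_day 8)"    # CPU utilisation
#   sar -r -f "$(sa_file_for_day 8)"    # memory and swap usage
#   sar -b -f "$(sa_file_for_day 8)"    # I/O transfer rates
#   sar -q -s 09:00:00 -e 10:00:00 -f "$(sa_file_for_day 8)"   # load, one-hour window
```

With ten-minute samples, the last record before the reset only shows the trend going into the hang, not the moment of failure itself.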


-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
Les Mikesell wrote:
 On 3/8/2011 11:24 AM, Michael Eager wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?
 
 Probably something hardware related.  Bad memory, overheating, power 
 supply, etc.  I've even seen some rare cases where a bios update would 
 fix it although it didn't make much sense for a machine to run for 
 years, then need a firmware change.

The system is on a UPS and temps seem reasonable.
Locating a transient memory problem is time consuming.
Identifying a power supply which sometimes spikes is
even more difficult.  I'd like to have a clue about the
likely problem before shutting down the server for an
extended period.

I'll set up sar and sensord to periodically log system
status and see if this gives me a clue for the next
time this happens.
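
One way the periodic logging could be sketched, as an alternative to sensord and assuming the `sensors` command from lm_sensors is installed and configured. The function name log_snapshot and the log path in the usage comment are hypothetical.

```shell
# Append the output of a command to a log file, one timestamped line per
# output line, then flush to disk so the entries survive a hard reset.
log_snapshot() {
    logfile=$1
    shift
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    "$@" 2>&1 | while IFS= read -r line; do
        printf '%s %s\n' "$stamp" "$line"
    done >> "$logfile"
    sync
}

# Example use from a once-a-minute cron job (path is an assumption):
#   log_snapshot /var/log/hang-watch.log sensors
```

The sync call is the point of the exercise: without it, the last readings before a hang may still be sitting in the page cache when the reset button is pressed.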


-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread compdoc
The only indication that I had that there was a problem (other
than attached systems not accessing files) was that the
fan(s) on the server were louder than normal.

Are you saying the fans were running faster than normal while it was hung?
Or are they louder than usual even while it's running?

Fans making noise can mean the fan isn't spinning as fast as it should
because the bearing is failing. Be a good time to open the case to check to
see that all fans are working...






Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread m . roth
Michael Eager wrote:
 Brian Mathis wrote:
 On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager ea...@eagerm.com wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?
snip
 System is unresponsive.  Monitor blank, no response to keyboard,
 no response to remote ssh.  Hit reset to reboot.

Suggestion 1: -from the console-, run
setterm --powersave off
That way, even if you connect a monitor (in our, uh, computer labs, we
have a monitor-on-a-stick), you'll still see what's on the screen at the
end, not the power save blanking.

 The only indication that I had that there was a problem (other
 than attached systems not accessing files) was that the fan(s)
 on the server were louder than normal.

Um. Um. What make is the server? We had that on some new Suns, where after
working on them, the fans would spin up and *not* spin down to normal. The
answer to that was, after powering them down, pull all the plugs, and
leave them out for 20 sec or so

  mark




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Les Mikesell
On 3/8/2011 12:31 PM, Michael Eager wrote:

 Any suggestions where I might look for a clue?

 Probably something hardware related.  Bad memory, overheating, power
 supply, etc.  I've even seen some rare cases where a bios update would
 fix it although it didn't make much sense for a machine to run for
 years, then need a firmware change.

 The system is on a UPS and temps seem reasonable.
 Locating a transient memory problem is time consuming.
 Identifying a power supply which sometimes spikes is
 even more difficult.  I'd like to have a clue about the
 likely problem before shutting down the server for an
 extended period.

 I'll set up sar and sensord to periodically log system
 status and see if this gives me a clue for the next
 time this happens.


The times I've seen things like that it would happen too quickly to log 
anything.  One other possibility is an individual bad CPU fan, but then 
you might have to shut down completely for a while to wake it up.

-- 
   Les Mikesell
lesmikes...@gmail.com



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
compdoc wrote:
 The only indication that I had that there was a problem (other
 than attached systems not accessing files) was that the
 fan(s) on the server were louder than normal.
 
 Are you saying the fans were running faster than normal while it was hung?
 Or are they louder than usual even while it's running?

They were louder than normal when hung, but
returned to being quiet after the reboot.

 Fans making noise can mean the fan isn't spinning as fast as it should
 because the bearing is failing. Be a good time to open the case to check to
 see that all fans are working...

Good idea.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
m.r...@5-cent.us wrote:
 Michael Eager wrote:
 Brian Mathis wrote:
 On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager ea...@eagerm.com wrote:
 Hi --

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?
 snip
 System is unresponsive.  Monitor blank, no response to keyboard,
 no response to remote ssh.  Hit reset to reboot.
 
 Suggestion 1: -from the console-, run
 setterm --powersave off
 That way, even if you connect a monitor (in our, uh, computer labs, we
 have a monitor-on-a-stick), you'll still see what's on the screen at the
 end, not the power save blanking.

I get the message "cannot (un)set powersave mode".

I'll add this to .xinitrc.

 The only indication that I had that there was a problem (other
 than attached systems not accessing files) was that the fan(s)
 on the server were louder than normal.
 
 Um. Um. What make is the server? We had that on some new Suns, where after
 working on them, the fans would spin up and *not* spin down to normal. The
 answer to that was, after powering them down, pull all the plugs, and
 leave them out for 20 sec or so

House-built, Gigabyte MB, AMD Phenom II X6, 6 GB RAM.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
Scott Silva wrote:

 Did you try the obvious stuff for older equipment? Remove and reseat ALL cards
 and memory, several times, to clean off any oxidation from contacts.
 Blow out any dust and collected lint.
 reseat drive cables.

Not yet, but that's always a good idea.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
Michael Eager wrote:
 m.r...@5-cent.us wrote:

 Suggestion 1: -from the console-, run
 setterm --powersave off
 That way, even if you connect a monitor (in our, uh, computer labs, we
 have a monitor-on-a-stick), you'll still see what's on the screen at the
 end, not the power save blanking.
 
 I get a message cannot (un)set powersave mode.
 
 I'll add this to .xinitrc.

Or better, CTRL-ALT-F1 to switch to a virtual (text) console
and run setterm -powersave off there.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread m . roth
Michael Eager wrote:
 m.r...@5-cent.us wrote:
 Michael Eager wrote:
 Brian Mathis wrote:
 On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager ea...@eagerm.com
 wrote:

 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.

 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.

 Any suggestions where I might look for a clue?
 snip
 System is unresponsive.  Monitor blank, no response to keyboard,
 no response to remote ssh.  Hit reset to reboot.

 Suggestion 1: -from the console-, run
 setterm --powersave off
 That way, even if you connect a monitor (in our, uh, computer labs, we
 have a monitor-on-a-stick), you'll still see what's on the screen at the
 end, not the power save blanking.

 I get a message cannot (un)set powersave mode.

Did you do it from the console? It won't work (or at least neither my
manager nor I have figured out how to do it) remotely.

 I'll add this to .xinitrc.

Um. This isn't X, it's below that.

 The only indication that I had that there was a problem (other
 than attached systems not accessing files) was that the fan(s)
 on the server were louder than normal.

 Um. Um. What make is the server? We had that on some new Suns, where
 after working on them, the fans would spin up and *not* spin down to
 normal. The answer to that was, after powering them down, pull all the
 plugs, and leave them out for 20 sec or so

 House-built, Gigabyte MB, AMD Phenom II X6, 6 GB RAM.

Any chance the problem's with the video card?

 mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Michael Eager
m.r...@5-cent.us wrote:
 Michael Eager wrote:

 House-built, Gigabyte MB, AMD Phenom II X6, 6 GB RAM.
 
 Any chance the problem's with the video card?

Video is on the MB.  It doesn't seem likely that it's
the video, since the system doesn't respond to network
when it crashes.

It could be anything.  That's why I'm looking for
something that would give me a bit of a hint what
to look at.  With an infrequent failure, it's not
practical to replace components piecemeal.

-- 
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread John R Pierce

 Video is on the MB.  It doesn't seem likely that it's
 the video, since the system doesn't respond to network
 when it crashes.

bad video hardware or drivers can easily crash the system

If it's running an X Window display of any sort, I'd suggest trying it 
in text-only mode.   in /etc/inittab, set the default runlevel to 3 
instead of 5.   this leaves the video in plain VGA text mode which is 
far less likely to crash the system.

 id:3:initdefault:

bonus, if this is a server, and that's a shared memory video system, 
disabling the graphic modes reduces the memory bus contention, speeding 
up the whole system by some percentage.
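
A small sketch of making that inittab change from the shell, for those who prefer not to edit by hand. The function name set_default_runlevel is made up here, and it assumes the stock single-digit `id:N:initdefault:` line. A backup copy is left next to the file with a .bak suffix.

```shell
# Rewrite the initdefault line of an inittab-style file.
#   $1 = new runlevel (e.g. 3 for text mode)
#   $2 = path to the file (normally /etc/inittab)
set_default_runlevel() {
    sed -i.bak "s/^id:[0-9]:initdefault:/id:$1:initdefault:/" "$2"
}

# Typical use, as root:
#   set_default_runlevel 3 /etc/inittab
#   telinit 3    # optional: switch now instead of waiting for the next boot
```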




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread m . roth
John R Pierce wrote:

 Video is on the MB.  It doesn't seem likely that it's
 the video, since the system doesn't respond to network
 when it crashes.

 bad video hardware or drivers can easily crash the system

 If it's running an X Window display of any sort, I'd suggest trying it
 in text-only mode.   in /etc/inittab, set the default runlevel to 3
 instead of 5.   this leaves the video in plain VGA text mode which is
 far less likely to crash the system.

  id:3:initdefault:

Seconded. If it's a server, it doesn't really need X running anyway.

   mark



Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Dr. Ed Morbius
on 09:24 Tue 08 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Hi --
 
 I'm running a server which is usually stable, but every
 once in a while it hangs.  The server is used as a file
 store using NFS and to run VMware machines.
 
 I don't see anything in /var/log/messages or elsewhere
 to indicate any problem or offer any clue why the system
 was hung.
 
 Any suggestions where I might look for a clue?

I'd very strongly recommend you configure netconsole.  Though not entirely
clear from the name, it's actually an in-kernel network logging module,
which is very useful for kicking out kernel panics which otherwise
aren't logged to disk and can't be seen on a (nonresponsive) monitor.

Alternately, a serial console which actually retains all output sent to
it (some remote access systems support this, some don't) may help.

Barring that, I'd start looking at individual HW components, starting
with RAM.

The trick is in passing the appropriate parameters to the module at load
time.  I found it helpful to have an @boot cronjob to do this.

You'll need to pass the local port, local system IP, local network
device, remote syslog UDP port, remote syslog IP, and the /gateway/ MAC
address, where the gateway is the syslog host itself (if it is on the
same ethernet segment), or your network gateway host, if not.  Some
parsing magic can determine these values for you.
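
To make the parameter list concrete, a minimal sketch: the netconsole module takes a single option string of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac, and the helper below (netconsole_param is a name invented for this example) simply assembles it. All addresses in the usage comment are placeholders, not values from this thread.

```shell
# Assemble the netconsole= option string.
# Arguments, in order: local port, local IP, local device,
#                      remote port, remote IP, remote (or gateway) MAC
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Typical use, as root (every value here is a placeholder):
#   modprobe netconsole \
#     netconsole="$(netconsole_param 6665 192.0.2.10 eth0 514 192.0.2.20 00:11:22:33:44:55)"
#
# The MAC of a host on the same segment can usually be read from the ARP
# table after pinging it:
#   ping -c 1 192.0.2.20 && arp -n 192.0.2.20
```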

Good article describing configuration:

http://www.cyberciti.biz/tips/linux-netconsole-log-management-tutorial.html


If you're not already remote-logging all other activity, I'd do that as
well.  You might catch the start of the hang, if not all of it.

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!




Re: [CentOS] Server hangs on CentOS 5.5

2011-03-08 Thread Dr. Ed Morbius
on 10:31 Tue 08 Mar, Michael Eager (ea...@eagerm.com) wrote:
 Les Mikesell wrote:
  On 3/8/2011 11:24 AM, Michael Eager wrote:
  Hi --
 
  I'm running a server which is usually stable, but every
  once in a while it hangs.  The server is used as a file
  store using NFS and to run VMware machines.
 
  I don't see anything in /var/log/messages or elsewhere
  to indicate any problem or offer any clue why the system
  was hung.
 
  Any suggestions where I might look for a clue?
  
  Probably something hardware related.  Bad memory, overheating, power 
  supply, etc.  I've even seen some rare cases where a bios update would 
  fix it although it didn't make much sense for a machine to run for 
  years, then need a firmware change.
 
 The system is on a UPS and temps seem reasonable.
 Locating a transient memory problem is time consuming.

Disable or remove half your RAM.  If the problem persists, replace that
RAM and remove the other half.  If the problem resolves, the issue is
likely in the half of the RAM you've removed.  You can binary search
through it, or RMA the lot if it is under warranty.

 Identifying a power supply which sometimes spikes is even more
 difficult.  

Same drill.  Replace the power supply, or on a dual-PS system, disable
one, then the other.  Follow procedure as for RAM.

 I'd like to have a clue about the likely problem before shutting down
 the server for an extended period.

If the server is critical, get a vendor loaner and bench-test the
equipment until the fault can be identified.
 
 I'll set up sar and sensord to periodically log system status and see
 if this gives me a clue for the next time this happens.

At best, sar will tell you whether or not you're experiencing resource
exhaustion.  It's a valuable tool, but fairly coarse-grained.  Cacti
will give you better resolution and visualization (particularly on
CentOS) than sar (some distros now include sar graphing utilities,
CentOS to the best of my recollection does not).

-- 
Dr. Ed Morbius, Chief Scientist /|
  Robot Wrangler / Staff Psychologist| When you seek unlimited power
Krell Power Systems Unlimited|  Go to Krell!