Re: low-level format before install?

2009-04-08 Thread Geoff Fritz
On Tue, Apr 07, 2009 at 05:41:27PM -0400, John Almberg wrote:
> Thanks for all the tips. At least I have something to start with.
>
> The guys in the data center reinstalled FreeBSD (the filesystem was
> totally corrupted again), and then ran what they called SMART test,
> which might be smartctl, and said the hard drives look good.
>
> I am now able to get back in.
>
> So the system ran fine until I put a load on it with the database
> (many transactions a second). This corrupted the file system again.
>
> So I guess I need to load it enough to produce error messages
> (hopefully) but not enough to destroy the file system again.

I've had issues with a few hosted servers, and more often than not, it was
a bad PSU on the server and/or rack.

Assuming that you can't get these folks to run a good hardware diag for
you, there are a few things you can do.  You can beat up the RAM/CPU with
various burn-in programs (I like benchmarks/stream for its simplicity --
you'll need to "make extract", customize, then "make install" for your own
memory size).  You can thrash the disks pretty well with either dd or
badblocks from sysutils/e2fsprogs; both can be run non-destructively.
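
For anyone who wants the read-only disk pass in script form, here is a
minimal sketch in Python. It is roughly what reading a disk with dd into
/dev/null does, not a replacement for dd or badblocks. The function name
read_pass and the block_size and max_errors parameters are made up for
this example; the path is whatever file or disk device you choose.

```python
import os

def read_pass(path, block_size=1024 * 1024, max_errors=100):
    """Read `path` front to back without writing anything.

    Returns (bytes_read, failed_offsets). Hypothetical helper for
    illustration; `path` can be a regular file or a raw disk device.
    """
    bytes_read = 0
    failed_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while len(failed_offsets) < max_errors:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # Unreadable block: note where it was and move on.
                failed_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:
                break  # end of file or device
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, failed_offsets
```

Any offsets that come back in failed_offsets are places the kernel
returned an I/O error, which should also show up in the system logs.
Since it only reads, it puts load on the disk and controller without
touching the file system.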

-- Geoff
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: low-level format before install?

2009-04-07 Thread John Almberg


On Apr 7, 2009, at 3:37 PM, Chuck Swiger wrote:

> On Apr 7, 2009, at 12:15 PM, John Almberg wrote:
>> Well, I've got real problems with that database server that lost
>> power over the weekend. We reloaded FreeBSD from scratch and then
>> reinstalled mysql, and pf. I loaded up my database and switched
>> over all my customer's websites. The database server ran fine for
>> about 2 minutes, and then died. At the moment, I can't even ssh
>> into the machine, although they can get into it using a
>> keyboard/monitor at the data center. In other words, sshd is not
>> working.
>
> That sounds like either a hardware problem (i.e. CPU overheating or
> marginal PSU failing under production load), or less likely, some
> kind of software misconfiguration.  System logs would be useful to
> see whether any signs of trouble are being mentioned.


Apparently, power was fluctuating drastically before they decided to  
cut power, so a hardware problem is a definite possibility. A PSU  
failure would not surprise me in the circumstances.


Assuming I can ever ssh in again, what log would hardware failures be  
reported to?


-- John


Re: low-level format before install?

2009-04-07 Thread Roland Smith
On Tue, Apr 07, 2009 at 03:15:59PM -0400, John Almberg wrote:
> Well, I've got real problems with that database server that lost
> power over the weekend. We reloaded FreeBSD from scratch and then
> reinstalled mysql, and pf. I loaded up my database and switched over
> all my customer's websites. The database server ran fine for about 2
> minutes, and then died. At the moment, I can't even ssh into the
> machine, although they can get into it using a keyboard/monitor at
> the data center. In other words, sshd is not working.
>
> I am now wondering what kind of format the FreeBSD install process
> does by default, and if it is possible to do a low level format,
> first, to block out any bad sectors (not sure if this is the right
> terminology).

What you could do is run a shell from the install CD, then fill the disk
with zeros using 'dd if=/dev/zero of=/dev/yourdisk bs=2m'.

As I understand it, modern hard disks cannot be low-level formatted by
the user. It is done at the factory. And bad blocks are re-allocated by
the built-in controller without user intervention. You'll see
re-allocated blocks in the smartctl -a output (as
Reallocated_Sector_Ct). If that count keeps growing, or the drive has
exhausted its spare sectors, you'd better replace it, because it is
failing.
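
If you want to watch Reallocated_Sector_Ct over time rather than eyeball
it, a small parser is enough. This is only a sketch: it assumes the usual
ATA attribute table that smartctl prints (name in the second column,
RAW_VALUE in the tenth), the function name smart_raw_values is made up,
and the sample text is illustrative, not output from the server in this
thread.

```python
def smart_raw_values(smartctl_output, names=("Reallocated_Sector_Ct",)):
    """Return {attribute_name: raw_value} for the requested attributes.

    Assumes the classic ATA attribute table layout: attribute name in
    column 2, RAW_VALUE in column 10.
    """
    wanted = set(names)
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in wanted:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer; skip it
    return found

# Illustrative sample only, not real output from the machine discussed.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(smart_raw_values(SAMPLE))
```

Feed it the saved text of a smartctl run taken before and after a load
test and compare the numbers; a count that climbs between runs points at
the drive.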

> I'm starting to get real depressed about this machine... You would
> think a top-tier data center could keep the power on...

Are you sure that the hardware isn't crapping out on you? At least run
smartctl -a on your disks to see if they failed any self test, and a
monitoring program like mbmon to check on temperatures and voltage
levels.

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)




Re: low-level format before install?

2009-04-07 Thread Chuck Swiger

On Apr 7, 2009, at 12:15 PM, John Almberg wrote:
> Well, I've got real problems with that database server that lost
> power over the weekend. We reloaded FreeBSD from scratch and then
> reinstalled mysql, and pf. I loaded up my database and switched over
> all my customer's websites. The database server ran fine for about 2
> minutes, and then died. At the moment, I can't even ssh into the
> machine, although they can get into it using a keyboard/monitor at
> the data center. In other words, sshd is not working.

That sounds like either a hardware problem (i.e. CPU overheating or
marginal PSU failing under production load), or less likely, some kind
of software misconfiguration.  System logs would be useful to see
whether any signs of trouble are being mentioned.

> I am now wondering what kind of format the FreeBSD install process
> does by default, and if it is possible to do a low level format,
> first, to block out any bad sectors (not sure if this is the right
> terminology).
>
> I'm starting to get real depressed about this machine... You would
> think a top-tier data center could keep the power on...


SCSI drives support a standard mechanism called FORMAT UNIT to do a
low-level format; ATA and SATA drives do not have a standard
mechanism, but you might be able to find a utility from the
manufacturer which can do such a thing.  I would not expect it to
help, though, as any modern drive has automatic mechanisms to replace
bad sectors with spares transparently, at least until the drive has
run out of spare sectors (in which case the entire drive is likely to
be toast soon, anyway, and should be replaced ASAP).


However, if you do suspect drive problems, try installing and running  
smartctl from /usr/ports/sysutils/smartmontools, and do a self-test or  
two.


Regards,
--
-Chuck



Re: low-level format before install?

2009-04-07 Thread Chuck Swiger

On Apr 7, 2009, at 12:44 PM, John Almberg wrote:
>> That sounds like either a hardware problem (i.e. CPU overheating or
>> marginal PSU failing under production load), or less likely, some
>> kind of software misconfiguration.  System logs would be useful to
>> see whether any signs of trouble are being mentioned.
>
> Apparently, power was fluctuating drastically before they decided to
> cut power, so a hardware problem is a definite possibility. A PSU
> failure would not surprise me in the circumstances.
>
> Assuming I can ever ssh in again, what log would hardware failures
> be reported to?


Start with /var/log/messages and the output of the dmesg command.  Doing
an ls -ltr /var/log and looking at the other logs which have changed
recently would also be advisable...
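
The same "what changed most recently" listing can be scripted if you
want to collect it from cron or over a flaky ssh session. A minimal
sketch, assuming Python is installed on the box; the function name
logs_by_mtime is made up for this example, and it only looks at regular
files directly under the given directory.

```python
import os
import time

def logs_by_mtime(log_dir="/var/log"):
    """Return [(mtime, path)] for regular files in log_dir, oldest first.

    Same ordering as `ls -ltr`: the most recently changed file is last.
    """
    entries = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        try:
            if os.path.isfile(path):
                entries.append((os.path.getmtime(path), path))
        except OSError:
            continue  # file vanished or is unreadable; skip it
    entries.sort()
    return entries

def format_listing(entries):
    """Render the entries as 'YYYY-MM-DD HH:MM:SS  path' lines."""
    return [
        "%s  %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(m)), p)
        for m, p in entries
    ]
```

Printing the lines from format_listing(logs_by_mtime()) puts the logs
that were written to just before the crash at the bottom of the list.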


Regards,
--
-Chuck



Re: low-level format before install?

2009-04-07 Thread Roland Smith
On Tue, Apr 07, 2009 at 03:44:20PM -0400, John Almberg wrote:
> Apparently, power was fluctuating drastically before they decided to
> cut power, so a hardware problem is a definite possibility. A PSU
> failure would not surprise me in the circumstances.
>
> Assuming I can ever ssh in again, what log would hardware failures be
> reported to?

Often hardware problems can lock up or reboot the machine without any
warning in the logs. :-( It is next to impossible for PC class hardware
to catch hardware failures. But sysutils/healthd or sysutils/mbmon might
help in that they monitor vital motherboard parameters, which can then
be logged. 

Some systems log thermal events through the ACPI system or via the
coretemp driver, in which case devd(8) should get them. See devd.conf(5)
in a recent 7-STABLE; this manpage was recently enhanced by yours truly.

Big programs like compilers randomly dying with a signal 11 (SIGSEGV,
segmentation violation) can be a sign of memory problems.
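
To see whether signal 11 deaths are piling up, you can tally them from
the log. The sketch below assumes the kernel writes lines of the form
"pid 1234 (cc1), uid 0: exited on signal 11 (core dumped)" to
/var/log/messages; check that against your own log before trusting the
counts. The function name count_signal_exits is made up for this example.

```python
import re
from collections import Counter

# Assumed line shape (verify against your own /var/log/messages):
#   ... kernel: pid 1234 (cc1), uid 0: exited on signal 11 (core dumped)
SIGNAL_EXIT = re.compile(r"pid \d+ \(([^)]+)\), .*exited on signal (\d+)")

def count_signal_exits(lines, signal=11):
    """Count, per program name, how often it exited on `signal`."""
    counts = Counter()
    for line in lines:
        match = SIGNAL_EXIT.search(line)
        if match and int(match.group(2)) == signal:
            counts[match.group(1)] += 1
    return counts
```

Pass it an open log file, for example
count_signal_exits(open("/var/log/messages")). Many unrelated programs
dying this way is more suspicious than one program crashing repeatedly,
which could just be a bug in that program.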

If someone has access to the machine, have them make sure there are no
loose connectors and that any expansion cards are properly seated.

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)




Re: low-level format before install?

2009-04-07 Thread John Almberg

Thanks for all the tips. At least I have something to start with.

The guys in the data center reinstalled FreeBSD (the filesystem was  
totally corrupted again), and then ran what they called SMART test,  
which might be smartctl, and said the hard drives look good.


I am now able to get back in.

So the system ran fine until I put a load on it with the database  
(many transactions a second). This corrupted the file system again.


So I guess I need to load it enough to produce error messages  
(hopefully) but not enough to destroy the file system again.


Sounds like fun :-(

This is an Intel server, not a crummy white box, so hopefully it is  
smart enough to monitor its own hardware at least a bit. We'll see.


-- John


Re: low-level format before install?

2009-04-07 Thread Michael Powell
John Almberg wrote:

> Thanks for all the tips. At least I have something to start with.
>
> The guys in the data center reinstalled FreeBSD (the filesystem was
> totally corrupted again), and then ran what they called SMART test,
> which might be smartctl, and said the hard drives look good.
>
> I am now able to get back in.
>
> So the system ran fine until I put a load on it with the database
> (many transactions a second). This corrupted the file system again.
>
> So I guess I need to load it enough to produce error messages
> (hopefully) but not enough to destroy the file system again.
>
> Sounds like fun :-(
>
> This is an Intel server, not a crummy white box, so hopefully it is
> smart enough to monitor its own hardware at least a bit. We'll see.
 

Just a tidbit or two. If it has an ICHR type South Bridge with what Intel 
calls Matrix RAID, there have been reported problems with trying to use the 
RAID functionality. If you are not using the RAID, make sure the data center 
guys are turning this off in the BIOS. 

Whenever I see these kinds of reports about data corruption correlating with 
SMART saying the drives are good, I think "disk controller". It does seem 
strange if the problem was not present before the power fluctuations. 
But where hardware damage occurs can be funky. At least with the box I once 
had that took a direct lightning strike, it was interesting to see where the 
lightning bounced around inside.

If this is a 1U pizza box with only one power supply I would suspect the 
power supply of being damaged from the power problem. If it is a relatively 
low wattage unit then the damage sustained may have created a situation 
where it doesn't have enough overhead to provide regulated pure DC when 
under full load. 

I remember a software company where I worked for a few years stuck one of 
the old WORM drives in an HP Vectra desktop that only had a 135 watt power 
supply. You could see the power go all wonky with an oscilloscope as soon 
as that WORM drive started up, but the box worked well up until that point. 

At any rate, this all sounds like hardware to me. If it wasn't doing any of 
this before the so-called power event then I believe there has been 
hardware damage. Unless you are co-locating your own hardware it is the 
responsibility of the data center to provide you with functional hardware. 
After the first go around and the same problem resurfacing they should have 
yanked the box and just replaced it. Put a good one in service and 
troubleshoot the bad one off line. If they can't hold up their end of the 
deal you need to be looking somewhere else.

-Mike



