"Guido Doornberg" <[EMAIL PROTECTED]> posted
[EMAIL PROTECTED], excerpted
below, on  Sun, 03 Dec 2006 14:40:11 +0100:

> Well, I downloaded and started a fresh 2006.1 livecd, repartitioned the
> hdd, started mke2fs and this time with the -c option.
> 
> So, it started checking and after about 15 minutes this kept on showing up
> on my screen:
> 
> ata1: error=0x40 {uncorrectable error} ata1: translated ATA stat/err
> 0x51/40 SCSI SK/ASC/ASCQ 0x3/11/04
> 
> after a while i got a couple of other messages, and now it keeps on
> talking about Buffer I/O error on device sda3, and after that various
> sectors and blocks are called.
> 
> I did check my power supply and I'm 99% sure that's not the
> problem. So, correct me if I'm wrong, but that would mean my hard disk is
> the problem? But how then is it possible that I can use it normally if I
> don't let fsck check it?
> 
> I know this isn't really gentoo specific anymore, but if anyone knows what
> to do i'm happy to hear it.

Your suspicions seem correct to me as well.

I've had several hard drives go partially bad over the last several years.
The last one, I know, was due to heat: I'm in Phoenix, AZ, with summer
highs approaching 50 C (122 F), and my AC went out.  Since it followed the
same basic pattern as an earlier failure, I expect heat was the problem
that time too, tho I'm not positive.

What happens when a drive overheats is that the platters expand and the
heads crash into them, digging grooves in the platters (which I could see
when I took the drive apart later).  The data is of course destroyed on
those cylinders, basically wherever the head sought while the platter was
hot enough to crash it, but the rest of the drive is recoverable and, in
my experience, somewhat stable, provided the drive doesn't overheat again.
Due to the way I have my system set up (see below) and what was damaged, I
was actually able to continue using the system for some time.
Nevertheless, get anything you want saved off it ASAP, preferably leaving
it shut off until you can, just in case.  After that you can work around
the problem if you wish, marking badblocks and using the disk only for
temporary or always-backed-up data from then on.

Drives used in mobile applications in particular can suffer similar head
crashes from dropping the laptop or whatever, and there may be other ways
to produce the same damage pattern as well.

How to work around the issue?  First, as I said, back up the disk, or at
least anything of value on it.  That likely won't apply here, since you
were setting up a new system on it anyway, but for completeness...  If
you run into areas that won't easily copy and you want to recover the
data if possible, there's a package, sys-apps/dd-rescue, and it should be
available on any good recovery LiveCD.  (I doubt it's on the Gentoo
install CDs, but you can check.)  dd_rescue is the same idea as the
normal Unix dd utility, but the rescue version is designed to copy from
the beginning of a partition forward until it runs into problems, then
from the end backward, then in the middle of anything still left unread,
until it has copied as much of the partition as possible.  You can then
fsck the recovered copy and see what can be repaired.  Note, however,
that this will take a while, hours, possibly days, depending on how much
of the disk is damaged: the drive itself retries each failed read several
times, and when it still fails, the software has it try /again/ several
times.  Depending on your i/o system, you aren't likely to be able to do
much else with the machine while this is going on, as the
try-fail-try-again phase tends to lock things up pretty badly, and it's
repeated for every bad block, so it WILL take a while if more than a
handful of blocks are damaged.  Recovery of all the data is obviously not
guaranteed in any case, and you may simply decide it's not worth the
hassle.  Google or see the dd_rescue manpage for details.
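To make the idea concrete, here's a minimal sketch.  The dd_rescue
invocations are from memory with hypothetical device names, so check the
manpage; the runnable part uses a scratch file standing in for the
partition, with plain dd's cruder single-pass error handling:

```shell
# Real-world recovery, run from a LiveCD with the bad disk unmounted
# (hypothetical devices/paths; verify flags against the dd_rescue manpage):
#   dd_rescue -v    /dev/sda3 /mnt/usb/sda3.img   # forward pass
#   dd_rescue -v -r /dev/sda3 /mnt/usb/sda3.img   # -r: resume from the end, backward
#
# Plain dd can do a cruder one-pass version: conv=noerror keeps going past
# read errors, conv=sync pads the unreadable blocks with zeros.
# Demonstrated here on a scratch file standing in for the partition:
dd if=/dev/zero of=fake_part.img bs=4k count=16 2>/dev/null
dd if=fake_part.img of=recovered.img bs=4k conv=noerror,sync 2>/dev/null
cmp -s fake_part.img recovered.img && echo "copy intact"
```

You'd then fsck or loop-mount the recovered image rather than hammering
the failing drive again.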

It should be noted that dd_rescue reports the badblocks as it goes, so if
you use it to recover existing data, you can save its badblocks report
for reuse later and skip the badblocks-mapping step below.

If you skip data recovery, or simply want to test any disk before you use
it, you'll want another tool, badblocks, likely installed already as part
of sys-fs/e2fsprogs.  badblocks can scan the disk in read-only mode, in
non-destructive read-write mode (read each block, write test patterns,
read them back, compare, then restore the original data), or in
destructive write-pattern/read-back/compare mode.  Do NOT use the
destructive mode if there's anything on the partition you want to keep,
as it WILL overwrite it.
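As a sketch (device name hypothetical; badblocks also accepts a plain
file, which makes for a harmless dry run):

```shell
# Read-only scan of a partition, progress (-s) and verbose (-v),
# writing any bad block numbers to bad.txt:
#   badblocks -sv -o bad.txt /dev/sda3
# Non-destructive read-write test: add -n.  DESTRUCTIVE pattern test: -w.
#
# Dry run against a 1 MB scratch file instead of a real device:
dd if=/dev/zero of=scratch.img bs=1024 count=1024 2>/dev/null
badblocks -o bad.txt scratch.img && echo "scan done"
wc -l < bad.txt   # a healthy "device" yields an empty list
```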

However you generate the badblocks report, whether from dd_rescue or
badblocks, you then use that information when setting up the disk again.
It's probably wise to create multiple partitions, leaving the large bad
areas unpartitioned.  For smaller bad areas of just a handful of blocks,
one of the parameters you can feed mkfs is a badblocks list.  Again,
check the manpages or google for the details, but when you're done you
should be left with a working, fsck-able set of partitions once again:
the badblocks are either excluded from the areas you partitioned, or
listed in the superblock area of the filesystem you created with that
mkfs parameter, and therefore avoided.
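For ext2/3 the parameter is mke2fs -l, which takes a file listing one bad
block number per line.  A sketch, run here against a scratch file (-F)
with made-up block numbers rather than a real device:

```shell
# Hypothetical bad block numbers, one per line, as badblocks or dd_rescue
# would report them (in units of the filesystem block size):
printf '%s\n' 2048 2049 2050 > bad.txt
# On the real disk this would be:  mke2fs -l bad.txt /dev/sda3
# Scratch-file dry run: 8 MB image, 1 KB blocks, -F to accept a non-device:
dd if=/dev/zero of=fs.img bs=1024 count=8192 2>/dev/null
mke2fs -q -F -b 1024 -l bad.txt fs.img
# dumpe2fs -b lists the bad blocks the new filesystem will avoid:
dumpe2fs -b fs.img 2>/dev/null
```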

---
*  For reliability, I had my system set up with multiple copies of most
of my partitions.  The idea was that periodically, when the system seemed
stable, I'd back up my main working copy of all the critical partitions,
and could therefore boot a not-too-old backup in the event something
broke on the working copy.  All it took (and all it continues to take) is
appending a different root= parameter to the kernel command line to boot
the rootmirror.  When portions of the drive were damaged, they were
naturally the portions the head had sought to while the drive was
overheated, which means they were in the partitions mounted at the time.
The unmounted partitions were therefore undamaged, so after finding the
system crashed from the overheating, once I cooled things back down, I
could boot the backup partitions and resume from there.  As it happened,
only a couple of my working partitions were damaged, and I was able to
keep using the working copies of all the others.
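With the GRUB of that era, that amounts to two menu entries differing
only in root= (device names purely illustrative):

```
# /boot/grub/grub.conf
title Gentoo (working root)
  root (hd0,0)
  kernel /boot/kernel root=/dev/sda3

title Gentoo (backup root)
  root (hd0,0)
  kernel /boot/kernel root=/dev/sda5
```

For a one-off you can also just edit the kernel line interactively at the
boot menu.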

In terms of partitioning strategy... with my old system I made the
mistake of separating /var and /usr onto their own partitions and then
trying to mix and match backup partitions with working-copy partitions.
That didn't work so well, because portage's records of what was installed
came from the backup, and therefore outdated, /var partition, while /usr
and root were the working copies, so portage had the wrong package
versions listed as installed.  Since I had used FEATURES=buildpkg and had
all the packages available in binary form, it was easy enough to
reinstall everything from them, updating the portage database, but
because that database wasn't accurate, portage couldn't unmerge the
no-longer-existing old versions, so I ended up with a bunch of stale and
orphaned files strewn around.

When I upgraded from that disk, which I did as soon as I could since I
didn't trust it even tho it was working, I therefore set things up a bit
differently.  What I'd suggest today is keeping /var and /usr on your
root partition, but putting /var/log and /var/tmp and /usr/portage and
/usr/src, as well as stuff such as /home, on other partitions.  (You can
use a single partition and either mount --move or simply symlink, if you
want to put several dirs from different places in the tree on the same
partition.)
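A sketch of the symlink variant, built under a scratch directory so it's
safe to run (on the real system you'd make the links from / itself):

```shell
# Scratch stand-ins for the root filesystem and the big shared partition:
mkdir -p root/usr root/var bigpart/portage bigpart/src bigpart/log
# Point several tree locations at dirs on the one big partition:
ln -s ../../bigpart/portage root/usr/portage
ln -s ../../bigpart/src     root/usr/src
ln -s ../../bigpart/log     root/var/log
ls -l root/usr
# The mount --move alternative (real system, during boot, hypothetical dev):
#   mount /dev/sda5 /mnt/big
#   mount --move /mnt/big/log /var/log
```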

Basically, anything that portage installs stuff to, along with its
database in /var/db, should be kept on the same partition, so every backup
of that partition will have the portage database in sync with what's
actually installed, since it's the same partition.

Here, my / partition and its backup snapshots are 10 GB each.  That's
room to spare for me, since less than two GB are actually used.  I'd
recommend a total of three copies: the working or main copy, plus two
snapshot backups of exactly the same partition size.  The idea is that
you alternate backups, so even if the working copy dies right after
you've erased one backup in preparation for copying over a new snapshot,
leaving that backup empty or incomplete, you'll still have the other
backup to fall back on.
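The alternation can be as simple as flip-flopping the copy target each
round.  A toy sketch with directories standing in for partitions (on real
partitions you'd mke2fs and cp -a, or rsync, the same way):

```shell
mkdir -p work snap1 snap2
echo "generation 1" > work/data

backup() {   # $1 = the snapshot "partition" to overwrite this round
    rm -rf "$1"
    mkdir "$1"
    cp -a work/. "$1"/
}

backup snap1                      # round 1: overwrite the older snapshot
echo "generation 2" >> work/data
backup snap2                      # round 2: overwrite the other one
# If work/ dies while snap2 is being rewritten, snap1 still holds gen 1.
cat snap1/data
```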

Similarly with partitions such as /home and /usr/local that hold data I
want to be sure to keep: 2-3 copies of each, a working copy and 1-2
backups.  /var/log you probably don't need a copy of.  Same with wherever
you keep your portage tree, since you can always just sync to get another
if it's destroyed, and with /tmp and /var/tmp, since that's temporary
data anyway and doesn't need a redundant copy.

Actually, while that scheme works well on one or two disks, I got tired
of hard drive problems, and I'm now running four-disk kernel-based SATA
RAID, on Seagate drives with a 5-yr warranty, altho they aren't quite as
fast as some of the others you can buy.  Booting requires RAID-1, so I
have a small RAID-1 partition mirrored across all four drives; that's
/boot.  Most of my system is RAID-6, which on a four-drive array is
effectively a two-way stripe plus two parity stripes, so I can lose any
two of the four drives and anything on the RAID-6 will still be
recoverable.  Stuff like /tmp and the portage tree, that's either easily
redownloaded off the net or temporary anyway, is on a 4-way RAID-0 for
speed.  If any of the four drives goes down, all that data is lost, but
that's fine, since it's either temporary or easily recovered.  Likewise,
my swap is four-way striped.

Disk read/write speed on the four-way striped area is incredibly fast
(for hard drive access): drives are so much slower than the bus
connecting them to the system that the system can keep the bus busy doing
i/o to all four devices instead of sitting and waiting on one slow drive.
The problem with RAID-0, however, is that while it's far faster, it's
also far riskier, since you lose it all if you lose any component device.
Fortunately, the data that's easiest to replace is also generally the
most speed-critical, so it works out quite well. =8^)  So: RAID-1 for
/boot, RAID-6 for safety for most of the system, and RAID-0 for speed
where I don't care if the data dies.  On top of that, for the parts of
the system I really care about, I keep several snapshots around on the
RAID-6, protecting me both from fat-finger deletions (where RAID won't
help, unfortunately) via the multiple snapshots, and from device failure
via the RAID-6.

As an added bonus, since I'm running kernel RAID, it's not hardware
specific: if the SATA chip dies, all I have to do is buy a new 4-way SATA
board, plug the existing drives into it, and compile a new kernel (from a
LiveCD or whatever) with the appropriate new SATA drivers, and I'm up and
running again.  If I had gone hardware RAID and the controller died, I'd
have to find another one like it to plug into if I wanted to recover my
data, something I don't have to worry about with kernel RAID. =8^)
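For reference, a layout like that can be created with mdadm along these
lines (all device names and partition layouts hypothetical, and the root
array still needs kernel autodetect or an initramfs to boot from):

```
# Small RAID-1 /boot, mirrored across all four drives:
mdadm --create /dev/md0 --level=1 --raid-devices=4 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
# RAID-6 for the main system: survives any two drive failures:
mdadm --create /dev/md1 --level=6 --raid-devices=4 \
    /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
# 4-way RAID-0 stripe for /tmp, the portage tree, swap:
mdadm --create /dev/md2 --level=0 --raid-devices=4 \
    /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```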

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
