At 2004-07-31T23:15:55+1200, Volker Kuhlmann wrote:
> smartctl -a /dev/hda
That's really specific. I'll assume you're referring to the
'Reallocated_Sector_Ct' value.
> That's not how I understand it, but resolving this wouldn't really help.
Got a reference? The information I've come across in casual reading
indicates that the value you're quoting is the count of sector
remappings performed by the SMART scan process... I haven't gone as far
as hunting down the SMART specifications and reading them yet.
It also seems that drives generally don't remap sectors automatically
on read failures (preferring instead to push the error up the software
stack for handling), but will usually attempt a remap on write
failures.
> That would require another change. Hm. Ok I think there was another
> Linux kernel upgrade. Let's not chase the Yeti.
I have no idea what things you've considered before posting on the list,
so I'm not going to assume that you've thought of everything. So far
you've mentioned, after prodding, a kernel upgrade and a smartmontools
upgrade... who knows what other changes you haven't told us about.
> Sure, but that doesn't mean it's impossible. Or undesirable :)
Right, but the point is that it's not a trivial problem, and for the
amount of effort involved in writing such a specialised tool, it's not
likely to be of much use other than for specific problems like this.
As I've mentioned, it's fairly trivial to work out which file(s) are
affected on ext[23] filesystems using debugfs. I guess Namesys have had
neither the time nor the inclination to provide a similar feature--though
I'm sure they'd do it, just for you, if you asked real nice... at the
right price, of course.
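
For the ext[23] case, the usual first step is turning the failing LBA
from the kernel log or SMART log into a filesystem block number that
debugfs understands. A rough sketch of that arithmetic follows. It
assumes 512-byte sectors, and the partition start and filesystem block
size are values you would read off fdisk and tune2fs yourself; the
numbers in the example call are made up purely for illustration.

```python
SECTOR_SIZE = 512  # assumption: classic 512-byte IDE sectors


def lba_to_fs_block(lba, partition_start, fs_block_size=4096):
    """Map an absolute disk LBA to a block number inside one partition's filesystem."""
    if lba < partition_start:
        raise ValueError("LBA lies before the start of the partition")
    # Offset of the bad sector from the start of the partition, in bytes.
    byte_offset = (lba - partition_start) * SECTOR_SIZE
    # Integer division: several sectors share one filesystem block.
    return byte_offset // fs_block_size


# Hypothetical values, for illustration only.
block = lba_to_fs_block(lba=1000063, partition_start=63, fs_block_size=4096)
print(block)  # 125000
```

With the block number in hand, debugfs's "icheck <block>" gives the
owning inode (if any) and "ncheck <inode>" gives the path name.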
> Anyways, I did a dd if=/dev/hda of=/dev/null, and there are 2 errors
> on the disk, one recoverable, the other not. Location as recorded in
> the syslog is exactly the same as recorded by smartmontools (ok, the
> disk itself). Seems a clear case now. Unfortunately I don't think the
> manufacturer will replace hard disks just because there's one
> unrecoverable read error on it. Might have to thrash it a bit...
I would back up the data on the drive and look into replacement. If you
want to be sure the drive has a hardware problem before you send it
back, use dd to write zeros across the surface of the disk (or use a
vendor-specific disk formatting tool like IBM's DFT); that wipes
everything on it, hence the backup first. Once you work out which
file(s) are affected by the bad sectors, you could try writing zeros to
just those sectors to force a sector remapping.
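
If you'd rather not get the dd seek/count arithmetic wrong, here is a
minimal sketch of the "write zeros to one sector" idea in Python. It
assumes 512-byte sectors and that you already know the LBA; the function
name is my own invention, and whether the drive actually remaps the
sector on this write is up to its firmware. It destroys whatever was in
the target sectors, so try it on a scratch file or disk image first.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def zero_sector(path, lba, count=1):
    """Overwrite `count` sectors starting at `lba` with zeros. Destroys the data there."""
    payload = bytes(SECTOR_SIZE * count)  # bytes(n) is n zero bytes
    fd = os.open(path, os.O_WRONLY)
    try:
        # pwrite writes at an absolute byte offset without moving the file position.
        written = os.pwrite(fd, payload, lba * SECTOR_SIZE)
        os.fsync(fd)  # push the write out of the page cache to the device
    finally:
        os.close(fd)
    return written
```

On the real disk you would run this as root against the whole-disk
device node with the LBA from the log, then re-read the sector and check
the SMART attributes again.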
> This also shows the stupidity the Linux kernel exhibits when trying to
> be smart with hard disks which may not be capable of DMA: turn off DMA
No. DMA is disabled for all sorts of reasons during attempts at error
recovery. Note that the kernel is telling you the IDE bus has been
reset as part of the recovery process.
> permanently and see if it works better. Fine during boot, plain stupid
It's not always permanently disabled. It depends on the type of error,
and the success of the recovery attempt.
> a few hours later. And gee thanks for turning DMA off on my dvdrom
> drive too. 2.6 is no better than 2.4 here. In some aspects Linux can
DMA was disabled on both devices because they share an IDE channel and
the IDE bus was reset. If you want your IDE devices to work without
being affected by other devices, put each of them on a separate
channel. That holds if you want decent performance, and it holds if
you want to isolate IDE-level failures.
2.6 is no "better" because there's no problem--also, the IDE layer is
pretty similar between 2.4 and 2.6. The kernel is attempting to
execute a safe error recovery strategy. The problem here is your
hardware.
> compete quite well with Redmond. Joe Newbie would now complain about
> an almost unusably slow system which can't burn CDs any more (or
> something like that).
...and once Joe complains, someone will point him to the kernel log and
say "your hardware is naffed, that's the problem".
> Here the log which goes with dd terminating with I/O error:
Did you try any of the things (badblocks, reiserfsck, etc.) I suggested?
I'm hoping one of those will help you find the affected file(s) without
having to go digging through the filesystem with debugreiserfs. Another
thing you could try (rather than going as far as debugreiserfs) is
copying all of the data, file by file, off the disk to /dev/null,
logging the copy, and reviewing the log for I/O failures--that will take
quite a while though.
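
Something along these lines would do the file-by-file read and collect
the failures for you. It's only a sketch: the function name and the
"opener" parameter are mine (the latter exists just so the thing can be
exercised without a failing disk), and files already sitting in the
page cache may read back fine without touching the bad sector, so run
it on a freshly mounted filesystem.

```python
import os


def find_unreadable(root, chunk_size=1 << 20, opener=open):
    """Read every regular file under `root`; return [(path, error), ...] for failures."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, device nodes, FIFOs
            try:
                with opener(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass  # discard the data, like copying to /dev/null
            except OSError as exc:
                # A bad sector normally surfaces here as EIO (Input/output error).
                failures.append((path, exc))
    return failures
```

Call it with the mount point of the suspect filesystem and print the
returned list; any path that shows up with an I/O error is a candidate
for restoring from backup.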
Cheers,
-mjg
--
Matthew Gregan |/
/| [EMAIL PROTECTED]