> That's really specific.  I'll assume you're referring to the
> 'Reallocated_Sector_Ct' value.

Yes. That's pretty much the only disk-internal info available.

> Got a reference?  The information I've come across in casual reading
> indicates that the value you're quoting is the count of sector
> remappings performed by the SMART scan process... I haven't gone as far
> as hunting down the SMART specifications and reading them yet.

I don't know, but it would make more sense to count the total
remappings as it's an indication of pending disk failure. 'suppose I
could test it out :)
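One cheap way to test it would be to watch the raw value of attribute 5 before and after forcing a rewrite of a bad sector. A sketch with made-up numbers (the table line below is illustrative, not from a real run) - on actual hardware the table comes from `smartctl -A /dev/hda`:

```shell
# Illustrative SMART attribute lines (invented values); on real hardware
# this file would be the output of:  smartctl -A /dev/hda
cat > sample-smart.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
EOF
# pull out the raw reallocation count (last column)
awk '/Reallocated_Sector_Ct/ { print $NF }' sample-smart.txt
```

If the raw count goes up after a rewrite, the attribute is counting actual remaps, not just what a SMART self-test found.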

It would be possible for the disk to remap on read problems, but not on
a final read failure. It's not necessarily possible to remap on write
problems unless the disk did a read afterwards, but that would seriously
impair performance.

> I have no idea what things you've considered before posting on the list,
> so I'm not going to assume that you've thought of everything.  So far
> you've mentioned, after prodding, a kernel upgrade and a smartmontools
> upgrade... who knows what other changes you haven't told us about.

I've mentioned what was relevant. I can't tell from the logs exactly
what day the disk started to play up and on what day I updated the
kernel (security update from 2.4 to 2.4). In any case an updated kernel
seems an unlikely cause for disk surface errors. In real life I don't
spend lots of time investigating the 0.0001% probabilities first. I had
made clear that the problem started before I upgraded to SuSE 9.1.

> Right, but the point is that it's not a trivial problem, and for the
> amount of effort involved in writing such a specialised tool, it's not
> likely to be of much use other than for specific problems like this.

As I recall, I only asked whether there was such a tool.

The reason I haven't pursued finding the file is that it's not that
important and I have more important things to do. Your info on how to
do it in principle is valuable to the list; whether I find my file is
not. The further course of action in this kind of disk case is more or
less obvious (and dependent on individual factors too).
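For the archives, the principle reduces to simple arithmetic: convert the reported LBA into the block number that a filesystem debugger wants. A sketch - all three numbers below are illustrative, not from my machine:

```shell
# Turn the LBA from the kernel's error message into a filesystem block
# number. Illustrative values only:
ERROR_LBA=12345678   # sector number from the kernel log / SMART error log
PART_START=63        # first sector of the partition (see fdisk -lu)
FS_BLOCK=4096        # filesystem block size in bytes
# sectors are 512 bytes; integer division gives the containing fs block
echo $(( (ERROR_LBA - PART_START) * 512 / FS_BLOCK ))
```

That block number is what you'd then feed to the filesystem's debug tool to find the owning file.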

> Once you work out
> what file(s) are affected by the bad sectors, you could try writing
> zeros to those sectors to force a sector remapping.

I don't need to spend heaps of time with debuganyfs, or in fact any
time, to find that file first - the disk was kind enough to tell me the
LBA before I even started. My exercise with dd was to verify that
number, to learn what the typical symptoms of disk failure are, and how
they show up in the logs and with smartmontools. And yes, I can work
out myself when I should back up my data... :)
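For anyone wanting to repeat the exercise, the dd incantations are roughly as follows. Here a scratch image file stands in for the disk; on a real machine substitute /dev/hda and the LBA from your kernel log, and triple-check the seek= before writing anywhere:

```shell
# A 1 MiB image file plays the disk for this demo (on real hardware:
# /dev/hda and the LBA printed in the kernel log).
dd if=/dev/zero of=disk.img bs=512 count=2048 2>/dev/null
# Read back the single suspect sector; on a failing disk this is the
# read that produces the I/O error in the logs.
dd if=disk.img of=/dev/null bs=512 skip=1000 count=1 2>/dev/null
# Writing the sector gives the drive a chance to remap it.
dd if=/dev/zero of=disk.img bs=512 seek=1000 count=1 conv=notrunc 2>/dev/null
```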

> Did you try any of the things (badblocks, reiserfsck, etc.) I suggested?

No; I'd thought of them too, but so far I've decided it wasn't worth it.
(I now know the dead block is on a scratch partition. If it had been a
2-minute job to find the file I'd have done it out of curiosity.)

Re. kernel turning off DMA: whether I put more than one device on an
IDE bus is my decision. Yes it affects performance. Yes drives can be a
mutual hindrance. Yes it can make IDE bus debugging more difficult. And
yes, it makes zero difference when there's a problem with reading
something from a magnetic surface.

Turning DMA off during read errors when initially booting the system is
smart, because if the hardware can't do DMA the kernel won't boot
otherwise. Turning DMA off because, after hours of uptime, your IDE bus
suddenly kicks the bucket is rather questionable but irrelevant - if
the IDE bus hardware is stuffed, no amount of software playing will fix
it up. DMA on or off - you're stuffed either way. Turning DMA off
because the disk has had a media I/O error is as stupid as turning DMA
off on the serial port (if there was one) because your modem lost its
carrier. Turning the IDE back to the stone age is *never* going to fix a
media problem. If your busses were working fine before, they're working
fine now. If they're not working fine now, the other problems you have
are much bigger than whether DMA is off.

I've seen many occasions where the kernel decided to pull DMA from
underneath me, never for a valid reason. Attempting to read blocks past
the end of media (because of kernel bugs) or media errors are not valid
reasons for shutting the bus down.

This happens with faulty CD-ROM discs too - if the drive can't read
some particular block (e.g. because it's scratched beyond recovery)
things can go bananas - though I haven't played enough to be certain
that it isn't the CD-ROM drive putting silly things on the bus during
read errors.

Volker

-- 
Volker Kuhlmann                 is possibly list0570 with the domain in header
http://volker.dnsalias.net/             Please do not CC list postings to me.