> That's really specific. I'll assume you're referring to the
> 'Reallocated_Sector_Ct' value.
Yes. That's pretty much the only disk-internal info available.

> Got a reference? The information I've come across in casual reading
> indicates that the value you're quoting is the count of sector
> remappings performed by the SMART scan process... I haven't gone as far
> as hunting down the SMART specifications and reading them yet.

I don't know, but it would make more sense to count the total
remappings, as that is an indication of pending disk failure. 'suppose
I could test it out :)

It would be possible for the disk to remap on read problems, but not
on a final read failure. It's not necessarily possible to remap on
write problems unless the disk did a read-back afterwards, and that
would seriously impair performance.

> I have no idea what things you've considered before posting on the list,
> so I'm not going to assume that you've thought of everything. So far
> you've mentioned, after prodding, a kernel upgrade and a smartmontools
> upgrade... who knows what other changes you haven't told us about.

I've mentioned what was relevant. I can't tell from the logs exactly
what day the disk started to play up and on what day I updated the
kernel (a security update within the 2.4 series). In any case, an
updated kernel seems an unlikely cause for disk surface errors. In
real life I don't spend lots of time investigating the 0.0001%
probabilities first. I had made clear that the problem started before
I upgraded to SuSE 9.1.

> Right, but the point is that it's not a trivial problem, and for the
> amount of effort involved in writing such a specialised tool, it's not
> likely to be of much use other than for specific problems like this.

As I recall, I only asked whether there was such a tool. The reason I
haven't pursued finding the file is that it's not that important and I
have more important things to do. Your info on how to do it in
principle is valuable to the list; whether I find my file is not.
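For reference, the raw values can be pulled straight out of the
smartmontools attribute table; a minimal sketch, assuming the drive is
/dev/hda and that smartctl prints the standard table layout:

```shell
# Sketch only -- /dev/hda is an assumption, substitute your device.
# Attribute 5 (Reallocated_Sector_Ct) is the drive's lifetime count of
# remapped sectors; print its raw value (the last column).
smartctl -A /dev/hda | awk '$1 == 5 { print $NF }'

# Attribute 197 (Current_Pending_Sector) counts sectors the drive has
# failed to read and will try to remap on the next successful write.
smartctl -A /dev/hda | awk '$1 == 197 { print $NF }'
```

Watching attribute 197 drop back to zero after a successful write to
the bad spot is one way to test the remap-on-write theory above.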
The further course of action in this kind of disk case is more or less
obvious (and dependent on individual factors too).

> Once you work out
> what file(s) are affected by the bad sectors, you could try writing
> zeros to those sectors to force a sector remapping.

I don't need to spend heaps of time with debuganyfs, or in fact any
time, to find that file first - the disk was kind enough to tell me
the LBA before I even started. My exercise with dd was to verify that
number, to learn what the typical symptoms of disk failure are, and
how they show up in the logs and with smartmontools. And yes, I can
work out for myself when I should back up my data... :)

> Did you try any of the things (badblocks, reiserfsck, etc.) I suggested?

No, I'd thought of them too, but so far I've decided it wasn't worth
it. (I now know the dead block is on a scratch partition. If it had
been a 2-minute job to find the file I'd have done it out of
curiosity.)

Re. the kernel turning off DMA: whether I put more than one device on
an IDE bus is my decision. Yes, it affects performance. Yes, drives
can be a mutual hindrance. Yes, it can make IDE bus debugging more
difficult. And yes, it makes zero difference when there's a problem
reading something from a magnetic surface.

Turning DMA off during read errors while initially booting the system
is smart, because if the hardware can't do DMA the kernel won't boot
otherwise. Turning DMA off because, after hours of uptime, your IDE
bus suddenly kicks the bucket is rather questionable but irrelevant -
if the IDE bus hardware is stuffed, no amount of software fiddling
will fix it. DMA on or off, you're stuffed either way. Turning DMA
off because the disk has had a media I/O error is as stupid as turning
DMA off on the serial port (if there was one) because your modem lost
its carrier. Turning the IDE bus back to the stone age is *never*
going to fix a media problem. If your busses were working fine before,
they're working fine now.
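For the record, the dd spot-check described above looks roughly like
the following; the device name and LBA are placeholders (use the
number from your own kernel logs), and the second command is
destructive:

```shell
# Read back the single 512-byte sector the drive reported as bad.
# LBA 1234567 and /dev/hda are placeholders, not real values from this
# thread. An I/O error here confirms the kernel/SMART report.
dd if=/dev/hda of=/dev/null bs=512 skip=1234567 count=1

# DESTRUCTIVE: overwrite that one sector with zeros so the drive can
# remap it from its spare pool. Only do this once you know what file
# (if any) occupies the sector, or that it sits on scratch space.
dd if=/dev/zero of=/dev/hda bs=512 seek=1234567 count=1
```

Note dd uses skip= for the input offset and seek= for the output
offset, both counted in bs-sized blocks, so bs=512 makes them plain
LBA numbers.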
If they're not working fine now, the other problems you have are much
bigger than whether DMA is off. I've seen many cases where the kernel
decided to pull DMA out from underneath me, never for a valid reason.
Attempting to read blocks past the end of the media (because of kernel
bugs) or media errors are not valid reasons for shutting the bus down.
This happens with faulty CD-ROM disks too - if the drive can't read
some particular block (e.g. because it's scratched beyond recovery)
things can go bananas as well - though I haven't played enough to be
certain that it isn't the CD-ROM drive putting silly things on the bus
during read errors.

Volker

--
Volker Kuhlmann is possibly list0570 with the domain in header
http://volker.dnsalias.net/
Please do not CC list postings to me.
