DJA wrote:
On 5/10/05, Ralph Shumaker <[EMAIL PROTECTED]> wrote:
Responsiveness returned and the error message was no longer repeating. Here's the error (assuming that I copied it correctly since I could not figure out how to cut and paste from console 1 to the GUI):
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=22206449, high=1, low=5429333, sector=84880
end_request: I/O error, dev 21:06 (hde), sector 84880
Does this suggest that there is something wrong with my HDD? Should I just disable dma (or just cripple it)?
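For anyone who wants to read that message bit by bit, here is a rough Python sketch that decodes the status and error bytes and the device numbers from the text quoted above. The bit names follow the ATA status and error register layout as the old Linux IDE driver prints them; treat the exact spellings as my assumption. It also assumes the "dev 21:06" pair is major:minor printed in hex, and the function names are made up for illustration.

```python
import re

# ATA status register bits (names as the old IDE driver prints them; assumed).
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# ATA error register bits (names are approximate; assumed).
ERROR_BITS = {
    0x80: "BadSector",
    0x40: "UncorrectableError",
    0x20: "MediaChanged",
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of every bit in `table` that is set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def parse_dma_intr(text):
    """Pull the status byte, error byte, LBA sector and device numbers out of the log text."""
    status = re.search(r"status=0x([0-9a-fA-F]+)", text)
    error = re.search(r"error=0x([0-9a-fA-F]+)", text)
    lba = re.search(r"LBAsect=(\d+)", text)
    dev = re.search(r"dev ([0-9a-fA-F]{2}):([0-9a-fA-F]{2})", text)
    return {
        "status": decode(int(status.group(1), 16), STATUS_BITS) if status else [],
        "error": decode(int(error.group(1), 16), ERROR_BITS) if error else [],
        "lba": int(lba.group(1)) if lba else None,
        # Assumption: major and minor are printed in hexadecimal.
        "dev": (int(dev.group(1), 16), int(dev.group(2), 16)) if dev else None,
    }


line = (
    "hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } "
    "hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=22206449, "
    "high=1, low=5429333, sector=84880 "
    "end_request: I/O error, dev 21:06 (hde), sector 84880"
)
print(parse_dma_intr(line))
```

If the hex assumption holds, 21:06 is major 33, minor 6. Major 33 is the third IDE interface (hde/hdf), so minor 6 would point at /dev/hde6 rather than the swap partition.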
Not so quick. Lots of motherboards (in fact most) have bugs in their HDD controllers and chipsets affecting DMA.
I'm using a Promise controller (which is why it is hde instead of hda through hdd). I originally purchased the Promise controller for an 80G HDD which I purchased for a different motherboard (one with known BIOS limitations above 33G or so). Because of a stupid mistake, I fried the controller on that drive as well as the motherboard. The Promise card was in between them but appeared to be the only thing still working.
In this context, "appeared to be ... still working" is pretty weak evidence that you didn't damage it as well.
[Note: this assumes I understand the scenario: you had the Promise controller in a box which you subsequently "fried". You determined that both the motherboard and hard drive were dead. But the Promise controller was not. Have I got that right?]
I figured that the hardware on the Promise card probably could give faster performance than the controller built into the new motherboard (newer than the fried one), so I slapped it in between the newer motherboard (P2 267) and the new 160G HDD (which I set up as hde). The whole system has been sailing smoothly ever since, until now, about 6 or 7 months later.
Which is a pretty pitiful lifetime for a hard drive. If you want to cut to the chase, assuming that the drive is bad, take it back. Of course, if the replacement shows the same symptoms, the merchant is not going to like you very much when you bring it back too.
I am also assuming that you are using the same RAM which was installed in the "fried" motherboard. And what about the CPU? I generally don't consider CPUs because, in my experience, they're either on or off. But RAM is a bit more forgiving; that is, it is more prone to dying a slow death.
Recently, I tried switching the HDD system...
By this, you mean the Promise controller and hard drive?
...over to a yet newer motherboard (P2 300, still ancient, but faster) given to me by Josh.
And the reason for this was?
I had really weird lockups, first with Open Office Writer (IIRC), then with Mozilla, as well as others.
Well, see now you've thrown in yet another unknown factor: another motherboard. Best to make one change at a time if you really want to narrow things down.
Since I did not have memtest86, and could not run long enough to download it, I switched back to the 267.
So now we're back to the original system.
Then I had to figure out why I was still unable to launch Mozilla (some lock file that had to be hunted down and destroyed). Then, things were back to normal, for a while.
Mozilla can and does do this all on its own without any help from errant hardware. OO.o puking starts to look suspicious. Suspiciously like RAM. Both programs like lots of healthy RAM.
Later, about a month ago, I moved the system to a different location. When I booted it up, it claimed to have been shut down uncleanly. It had been several days between shutting it off and booting back up so I couldn't be sure. It would not boot up because of some error. I didn't log what I did, but I recall doing something with mke2fs (or maybe it was e2fsck or something), crossing my fingers, and waiting for it to do its thing. (It took a while.) After it was done, it seemed that all was well, until this dma error (which only happened when I told Mozilla to search messages for something, which would suggest to me that the error (if truly a HDD error) is either on /dev/hde9 (swap) or on /dev/hde6 which has everything except swap and /boot).
It tells me that the one thing in common with all your problems is the Promise card (and maybe the RAM). From my end, that's the guy at the top of my suspect list. It's been in every system you've described, and each one has given you trouble. If you are using the same RAM, the RAM moves to the top of the list with the Promise controller second.
I don't think Josh would knowingly give you a bad mobo, you have a relatively new hard drive, and you seem to have reasons to believe the 267 mobo is fine. The Promise controller is the only part to have been in a wreck. And I am still not clear about the RAM.
Test everything. Pull the Promise card. Then test everything again. In fact, I'd pull the card first.
1) Test the RAM (full 11-test suite, ~24 hours for 512 MB)
2) Test HDD w/o Promise
3) Test HDD w/ Promise
Steps two and three assume previous tests passed.
Like Carl said, first back up your data. But before you go wiping the drive, here are a few other things I would do (especially important if you are running an AMD CPU):
o Run Memtest86. Bad memory can precipitate all kinds of errors, including drive errors. I had a bad memory controller go south on one of my boxes, which masqueraded as both a bad hard drive and bad RAM.
I'm trying to get that set up right now. The instructions explain how to do it with lilo, but I cannot figure out how to adapt it to grub. For now, I will dig for a floppy and get it going that way if I can. But, I /would/ like to set it up in grub. Any ideas?
Yes. Forget putting it in the boot manager menu. At least for now. Run it properly from a dedicated floppy. Having Memtest86 on your menu is fine for something portable like Knoppix - it's a good utility to have around when you're trying to diagnose someone else's box and you don't have your box of utilities at hand.
But if you have a desktop box at home, which is not likely going anywhere, then you should also have a floppy close by with Memtest86 on it. Having the memory tester on a desktop box sounds cool and geeky and all, but in practice is not all that useful.
(And don't anyone go off on me about "but floppies get lost or go bad", cuz if that happens to you, you've got more than just hardware problems! :^) ).
o Make sure there is not some other problem causing the behavior: All cables tight? Box interior not overheating? CPU not overheating? PSU not flaking out (still running one of the marginal units which came in the case you bought for the PII-400 before you upgraded to the XP3500+ and FX6800)?
Cables checked. Box is open. PSU and motherboard (P2 267) came together.
I don't think "PSU" means what you think it means. PSU = Power Supply Unit. The big hunk of iron that came with the case, not the CPU. The last component that anyone ever suspects and the one component that can literally burn up everything in your box (or even house, if put to it).
If the PSU came with the case (with few brand name exceptions), I can almost guarantee that it's a piece of crap. But then you'll have to go to my brother for that lecture. ;^)
Maybe you're talking about the heatsink-fan which came with the CPU? If the CPU's still running, then the heatsink-fan is fine.
o Download and use the manufacturer's diagnostic software for your drive. AFAIK, all of the major brands (IBM, Maxtor, WD, etc.) have such a program available on their Websites.
I'll check on this after memtest has its run.
Good.
o Check the mobo vendor's website for BIOS updates which may address the problem.
I'll check this before going with the drive diagnostic program.
I'd check the HDD first.
o Do some research (Google) on the specific chipset on your motherboard. For instance, VIA chipsets are notorious for having DMA problems on Athlon motherboards. Also research similar problems for your specific hard drive (i.e. for your make and model).
I'll check this one also before the drive diagnostics.
Really, it's okay, you can run the drive diags any time. No need to wait. ;^)
Check it first without the Promise controller in the system. Don't worry about booting problems: the diagnostic software runs only off of floppy, or in some cases CD-ROM.
If the drive passes w/o the Promise card, run it again with the card.
If using DMA with the hard drive is a known problem for your motherboard, there are boot options which can be used in Grub or LILO to mitigate the problem (such as disabling DMA altogether).
I discovered that root has 187 mail messages that I never checked. Some of them give some details about dma errors and the like.
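Rather than reading all 187 messages by hand, something like the following Python sketch could pick out the ones that mention DMA errors. It assumes root's mail sits in a standard mbox file; the function name and the default search string are made up for illustration, and the mailbox path varies from one distribution to the next.

```python
import mailbox


def messages_mentioning(mbox_path, needle="dma_intr"):
    """Return (subject, matching_lines) for each message whose text contains `needle`."""
    hits = []
    # create=False: fail loudly if the mailbox is missing instead of creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            text = msg.as_string()
            lines = [ln for ln in text.splitlines() if needle in ln]
            if lines:
                hits.append((msg.get("Subject", "(no subject)"), lines))
    finally:
        box.close()
    return hits
```

Called as root with the mailbox path (often /var/spool/mail/root or /var/mail/root, but that is a guess about your system), it returns one entry per matching message, which makes it easy to see when the errors started and whether the reported sectors repeat.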
I haven't used a Promise controller since, ohh, back when VLB was popular, so I can't give any reasons why DMA might or might not be problematic on those cards. Paul? Where are you when I need you?
That'll get you closer to the truth before just blindly assuming that the hard drive is going bad, and it might save you both time and money in replacing the wrong parts for the wrong reasons.
Thank you very much for this valuable checklist!!!
Anytime. Just don't ask. ;^)
--
Best Regards,
~DJA.
