Hi,

Setting up and testing my new system (after wasting nearly 1 month
with bad RAM modules), I got this error today:

[48055.741389] ata3.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
[48055.741393] ata3.00: failed command: READ FPDMA QUEUED
[48055.741398] ata3.00: cmd 60/20:08:38:15:03/01:00:18:00:00/40 tag 1 ncq 147456 in
[48055.741400]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[48055.741402] ata3.00: status: { DRDY }
[48055.741405] ata3: hard resetting link
[48056.198746] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[48056.210514] ata3.00: configured for UDMA/133
[48056.210518] ata3.00: device reported invalid CHS sector 0
[48056.210523] ata3: EH complete

I really don't understand what it means, but the "timeout", "hard
resetting link" and "invalid CHS sector 0" look scary to me...
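
For reference, the "cmd" line above can be decoded by hand. The following is
a minimal sketch, assuming libata prints the taskfile registers in the order
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device,
and that for NCQ commands the sector count sits in the feature registers and
the tag in the upper bits of nsect. The function name and the 512-byte sector
size are illustrative assumptions, not something from the log itself.

```python
import re

# Assumed register order of the libata "cmd" line (see lead-in).
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")


def decode_ncq_cmd(taskfile, sector_size=512):
    """Decode the hex taskfile of an NCQ command (hypothetical helper)."""
    values = [int(x, 16) for x in re.split(r"[/:]", taskfile.strip())]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d hex fields" % len(FIELDS))
    tf = dict(zip(FIELDS, values))

    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
           | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])

    # For NCQ commands the sector count travels in the feature registers.
    sectors = tf["hob_feature"] << 8 | tf["feature"]

    return {
        "command": tf["command"],   # 0x60 is READ FPDMA QUEUED
        "tag": tf["nsect"] >> 3,    # NCQ tag lives in bits 7:3 of nsect
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


info = decode_ncq_cmd("60/20:08:38:15:03/01:00:18:00:00/40")
print(info)
```

Under these assumptions the decoded byte count and tag agree with the
"tag 1 ncq 147456 in" part of the log line, which suggests the command was an
ordinary queued read that simply never got an answer before the timeout.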

Initial bootup messages for this device were:
Mar 25 22:02:32 [kernel] [    4.496102] ata3: SATA max UDMA/133 abar m2...@0xfbffc000 port 0xfbffc200 irq 34
Mar 25 22:02:32 [kernel] [    8.519169] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 25 22:02:32 [kernel] [    8.536681] ata3.00: ATA-8: SAMSUNG HD203WI, 1AN10002, max UDMA/133
Mar 25 22:02:32 [kernel] [    8.548388] ata3.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Mar 25 22:02:32 [kernel] [    8.566100] ata3.00: configured for UDMA/133

That disk is part of an md RAID5, but I was at work when this error
happened, so I didn't see whether the RAID repaired itself or whatever
else would happen in this case (I don't have mdadm monitoring configured
yet). Right now all RAID disks are up and healthy.
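
Until mdadm monitoring is configured, array health can be read from
/proc/mdstat. The following is a minimal sketch, assuming the usual
"[n/m] [UU_]" status notation, where n is the number of member devices and m
the number currently active. The function name and the sample text are
hypothetical; the sample is not output from this system.

```python
import re


def md_array_states(mdstat_text):
    """Return {array: state} parsed from /proc/mdstat-style text."""
    states = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[3/3] [UUU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            states[current] = {
                "expected": expected,
                "active": active,
                "degraded": active < expected or "_" in flags,
            }
    return states


# Hypothetical sample: md0 is healthy, md1 has lost one member.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[2] sdc1[1] sdb1[0]
      3907026944 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid5 sdg1[2] sdf1[1]
      3907026944 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]

unused devices: <none>
"""

states = md_array_states(SAMPLE)
for name, state in sorted(states.items()):
    print(name, "DEGRADED" if state["degraded"] else "ok")
```

On a live system the same function could be fed the contents of
/proc/mdstat instead of the sample string.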

I googled it, but most of the results are pastebin snippets. I'm using
kernel 2.6.33 and the ahci driver for the SATA controllers.

The libata documentation, in the section about timeout errors, says:
"Most often this is due to an unrelated interrupt subsystem bug (try
booting with 'pci=nomsi' or 'acpi=off' or 'noapic'), which failed to
deliver an interrupt when we were expecting one from the hardware."

I really don't know the potential implications of disabling MSI or
APIC, but in /proc/interrupts I do see AHCI entries on both MSI and
APIC rows. So at least I know both are active right now.
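
The check against /proc/interrupts can be scripted. The following is a
minimal sketch, assuming rows carry a type column starting with "PCI-MSI" or
"IO-APIC" and name the driver at the end of the line. The function name, IRQ
numbers, and counts in the sample are hypothetical, not copied from this
machine.

```python
def ahci_interrupt_modes(interrupts_text):
    """Return [(irq, type)] for every /proc/interrupts row mentioning ahci."""
    modes = []
    for line in interrupts_text.splitlines():
        if "ahci" not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        # The delivery type is the column starting with PCI-MSI or IO-APIC.
        kind = next(
            (f for f in fields if f.startswith(("PCI-MSI", "IO-APIC"))),
            "unknown",
        )
        modes.append((irq, kind))
    return modes


# Hypothetical sample with one AHCI controller on each delivery type.
SAMPLE = """\
           CPU0       CPU1
  0:        127          0   IO-APIC-edge      timer
 19:      12345          0   IO-APIC-fasteoi   ahci
 34:     987654          0   PCI-MSI-edge      ahci
"""

modes = ahci_interrupt_modes(SAMPLE)
for irq, kind in modes:
    print("irq %s: %s" % (irq, kind))
```

Comparing this output before and after booting with 'pci=nomsi' would show
whether the controller behind ata3 actually switched away from MSI.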

Temperatures in my system are good; hddtemp says the drive in question
is at 21 °C right now.

Another possibility is that I need to increase a voltage on the
motherboard, since it is running six HDDs and one DVD-ROM. I'll have to
research which voltage is related to this (it is an X58 motherboard).

Thanks in advance if anyone has any knowledge about this; otherwise I'll
go into trial-and-hopefully-no-error mode. :)

Paul
