Re: Scary Intel SATA problem: "frozen"

Tejun Heo Tue, 28 Nov 2006 16:58:02 -0800

Jonas Lundgren wrote:
[--snip--]

Also, it doesn't matter if I enable AHCI in the BIOS (But with AHCI
enabled the disks spin down/power down when I boot, just to power up
again a few seconds after. The boot progress freezes until the disks
have spun up again. (This happens when the kernel probes the sata
controller ports at bootup, the disks spin down at the same time, but
spin up one by one as they're getting probed))


Likely fix is pending for this problem.

I've tried changing I/O scheduler, only noticable diffrence is when I
use "noop". Then I get like 20mb/sec write instead of 4mb/sec. I have no
idea why this is :P

Example of what I mean with crappy performance:
dd if=/dev/zero of=test232 bs=1M count=100; time sync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.130424 s, 804 MB/s
real 0m21.104s
user 0m0.000s
sys 0m0.011s

21 seconds to do a seq write of 100mb.. And during this time ALL other
disk IO gets starved, I can't do anything that uses disk IO for the
duration.. (not even `ls`)

What does the kernel say during this writing? Can you post the resultof the following?


1. reboot
2. dmesg -c
3. time dd if=/dev/zero.. blah
4. dmesg

Also, does 'mount -o remount,barrier=0 /' change anything?

Yet, a hdparm shows a decent read
hdparm -tT /dev/md4
/dev/md4:
Timing cached reads: 8060 MB in 1.99 seconds = 4042.19 MB/sec
Timing buffered disk reads: 400 MB in 3.00 seconds = 133.28 MB/sec

dd if=1GBzeroFile of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 11.4335 s, 91.7 MB/s

This is the cpu usage stats I get from top when running the dd write:
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 99.0%wa, 0.5%hi, 0.5%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Pretty crappy read speeds compared to what I got on my previous mobo
(around 140mb/sec), but still alot better than the 4mb/sec I get when
writing..

Which controller did you use on your previous mobo? If you're usingata_piix and hook two hard drives as primary and secondary on the samechannel, some level of performance degradation is expected. ata_piixcan only issue command to only one of the two drives at once. Is theread performance still bad in ahci mode?


[--snip--]

Dmesg output from the error(s): (sda and sdb are 2 * 74GB raptor SATA
drives in a Linux software raid0)

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x20)
ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)

This might be a missed interrupt. It's a write. DMA engine is donefinishing transferring all data. Device is ready for the next commandbut the interrupt has never arrived.

ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: soft resetting port
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs

But this is weird. If it were a missed interrupt, softreset should haverecovered it instantly. Something fishy is going on.


[--snip--]

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)


Same thing for read.

ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)


Again, pre-reset wait times out.  Weird.

ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete

[--snip--]

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)


Again, for read.

Most of the time when I get these errors the system will recover after
anything from 10 seconds to 10 minutes of unresponsiveness (no disk
I/O), and sometimes hang.

Yeap, libata needs stricter timing constraints for recovery. That'shigh on to-do list.

IF the system does recover, I start getting
the extremly low disk write speeds that I reported above, and only a
reboot will get the performance back to regular.

Please full dmesg after your computer got really slow. I suspect libatadecided to switch to PIO mode.

I don't know what causes it, but most of the times when I've gotten it
my system has been under heavy load (compiling, downloading torrents in
11mb/sec etc). Please let me know if you want any additional info, want
me to try something out, or whatever. My recent hardware upgrade for
around $1200 (to a core2duo system, i965 mobo) is just going to waste
because of this problem. :/

Heh, nice machine you got there. When you look at the dmesg, do theerror messages occur only on one of the two drives? Or are bothaffected? If only one is affected,

1. swap the two. you'll probably have to dance a little bit with bootloader but md should handle that fine once the kernel is loaded. doesthe errors persist? on which device do they occur? do they follow thedrive or stay on the mobo port?

2. try different cable / port. if you change port, again, you need todance w/ boot loader. who's carrying the error messages with it?


3. try different power plug from different power lane.

I just got so glad when I saw the post of this on linux-ide, I've been
searching like crazy to find another person having the same problem (and
possibly a solution) for the past 2-3 weeks or so.

My first guess is frequent transmission errors. Please report the testresults. Thanks.


--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Scary Intel SATA problem: "frozen"

Reply via email to