Re: Scary Intel SATA problem: frozen

2006-11-29 Thread Linus Torvalds


On Wed, 29 Nov 2006, Tejun Heo wrote:
 
 You pushed your box really hard and the kernel can't get the memory it wants.
 Not really relevant to SATA problem.

And it's not even really a bug - the caller is supposed to be ok with it. 
It's a warning message that the kernel spits out just because we've had 
problems in the past with callers that did _not_ handle an allocation 
error gracefully, so the warning is spit out to (a) let us know something 
happened and (b) if there's a subsequent oops due to dereferencing a NULL 
pointer, it becomes easier to pinpoint what the sequence of events was.

So it's an atomic allocation that happens on the receive path in the 
network when you've run out of pages (because you're getting enough 
network traffic that earlier receives have used up all buffers, and so 
much disk IO that we haven't had time to clean any new pages yet), and 
getting an allocation failure there really is normal, it's just very 
unusual.

So that particular dump _looks_ scary, but it happens to be totally a 
non-issue unless something else happens afterwards to imply that the 
caller had trouble with the allocation failure.

It's also a sign of trouble if you can trigger it _easily_. It should be 
something that only triggers under very high load and under unusual 
circumstances.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Mark Lord

Linus Torvalds wrote:
[ You may or may not have gotten my previous email. The kernel stayed 
  working, but due to the IO errors the filesystem got re-mounted 
  read-only, and I'm not sure that the email I sent out in that state 
  actually ever made it out. I suspect it didn't. ]


Jeff,
 I just had a scary thing on my nice new Intel i965 box (all Intel 
chipsets apart from some strange Marvell IDE interface that I'm not using 
and that no driver even detected, and a TI firewire thing that I'm 
similarly not using).


The machine basically froze for about a minute or so (well, things worked 
surprisingly well, considering that apparently no disk IO happened - I 
initially thought it was just firefox that had frozen up, since my mail 
session seemed to be fine), and after it came back the filesystem was 
mounted read-only and nothing really worked any more..


I have no idea what status 0xD0 means: it looks like ATA_BUSY + ATA_DRDY + 
bit#4, but what is bit#4?


Bit #4, when actually implemented, is a rotational seek indicator,
which can be used for timing purposes.

But when BUSY (bit #7) is set, the rest are generally nonsense.


And clearly, the soft-reset isn't doing squat.


Tejun ?



Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Sergei Shtylyov

Hello.

Mark Lord wrote:


Bit #4, when actually implemented, is a rotational seek indicator,
which can be used for timing purposes.


   Hm, I thought it was DSC (drive seek complete) set by the SEEK command 
completion, and it's always implemented. Didn't you mean IDX (bit 1, IIRC)?
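
To make the bit positions under discussion concrete, here is a minimal sketch that splits a status byte into named bits. The names follow the classic ATA status register layout; the `DSC/SERV` label for bit 4 is an assumption that reflects its changing meaning across spec revisions, and `decode_ata_status` is a hypothetical helper, not a libata function.

```python
# Classic ATA status register layout, most significant bit first.
# Bit 4 has carried more than one meaning over the years, hence the label.
ATA_STATUS_BITS = [
    (7, "BSY"),
    (6, "DRDY"),
    (5, "DF"),
    (4, "DSC/SERV"),
    (3, "DRQ"),
    (2, "CORR"),
    (1, "IDX"),
    (0, "ERR"),
]


def decode_ata_status(status):
    """Return the names of the bits set in an 8-bit ATA status value."""
    return [name for bit, name in ATA_STATUS_BITS if status & (1 << bit)]


if __name__ == "__main__":
    for value in (0xD0, 0x50, 0x40):
        print(f"0x{value:02X}: {' '.join(decode_ata_status(value))}")
```

With this layout 0xD0 comes out as BSY + DRDY + bit 4, which matches the reading given earlier in the thread.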



But when BUSY (bit #7) is set, the rest are generally nonsense.


   Indeed...

WBR, Sergei


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Linus Torvalds


On Tue, 28 Nov 2006, Alan wrote:

 On Tue, 28 Nov 2006 09:31:51 -0800 (PST)
 Linus Torvalds [EMAIL PROTECTED] wrote:
 
   I just had a scary thing on my nice new Intel i965 box (all Intel 
  chipsets apart from some strange Marvell IDE interface that I'm not using 
  and that no driver even detected, and a TI firewire thing that I'm 
 
 Mr Morton has the Marvell libata driver in his tree waiting to head your
 way.

Well, I don't actually personally want it (I have nothing connected to it, 
nor any intention of connecting anything in the future), I just want my 
bog-standard PIIX driver to not do the scary things to me.

Mommy, mommy, the IDE messages/behaviour is scaring me!

I just mentioned the Marvell chip because apart from those two (unused) 
chips, the box is absolutely and utterly bog-standard Intel-everything. 
The i965 may still be somewhat unusual right now, but that's going to 
change, and if there's something strange going on, we should try to fix it 
asap.

It could be a one-off thing (knock wood), but on the other hand, I've only 
been using this machine for a couple of weeks now, and I can't remember 
seeing anything even remotely similar on my other machines (including the 
earlier-generation i945 SATA setup that I've had a lot longer). So I worry 
that it's something i965-specific, and that will be a _very_ common 
chipset soon enough.

One data-point that may or may not be relevant: the afore-mentioned i945 
machine that I've had longer is otherwise reasonably similar, but the DVD 
drive on that one is in legacy mode. Not that I see why it should matter 
(the problem happened on the harddisk, not the DVD)...

Linus


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Eric D. Mudama

On 11/28/06, Sergei Shtylyov [EMAIL PROTECTED] wrote:

Hello.

Mark Lord wrote:

 Bit #4, when actually implemented, is a rotational seek indicator,
 which can be used for timing purposes.

Hm, I thought it was DSC (drive seek complete) set by the SEEK command
completion, and it's always implemented. Didn't you mean IDX (bit 1, IIRC)?


0x50 is the standard "device is ready" status for a non-queueing
device.  Those bits used to have the special meanings above, but
they're pretty obsolete today as I understand it.

0x40 is used for queueing, because bit 4 was the service bit for PATA TCQ.


 But when BUSY (bit #7) is set, the rest are generally nonsense.

Indeed...

WBR, Sergei


Typically, 0x80 as the busy state indicates the device is in POR
reset.  Once the firmware is up and running in the device, it often
switches from 0x80 to 0xD0 during POR.

0xD0 is the busy state you'd get to if you were 0x50 and received a
command, so this is reported typically after the device is up and
running.

0x7F usually is hardware indicating nothing is attached to the port,
and isn't supposed to imply a non-busy state.

You're right, while not meaningful according to spec, you can derive
some information from the reported status even when you're only
supposed to look at one bit.
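
The rules of thumb above can be written down as a small lookup. This is only a sketch of the heuristics described in this message, not anything the ATA spec guarantees (the spec only defines BSY while BSY is set), and `describe_status` is a hypothetical name.

```python
def describe_status(status):
    """Map a raw ATA status byte to the rule-of-thumb reading given above.

    These are field heuristics, not spec-defined meanings.
    """
    hints = {
        0x50: "ready (non-queueing device)",
        0x40: "ready (queueing device; bit 4 was the PATA TCQ service bit)",
        0x80: "busy, device probably still in power-on reset",
        0xD0: "busy after firmware is up (was 0x50, then received a command)",
        0x7F: "probably nothing attached to the port",
    }
    if status in hints:
        return hints[status]
    # Fall back to the only bit that is meaningful on its own.
    if status & 0x80:
        return "busy, remaining bits undefined"
    return "no rule of thumb for this value"


if __name__ == "__main__":
    for value in (0x50, 0x40, 0x80, 0xD0, 0x7F):
        print(f"0x{value:02X}: {describe_status(value)}")
```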


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Sergei Shtylyov

Hello.

Eric D. Mudama wrote:

 Bit #4, when actually implemented, is a rotational seek indicator,
 which can be used for timing purposes.


Hm, I thought it was DSC (drive seek complete) set by the SEEK command
completion, and it's always implemented. Didn't you mean IDX (bit 1, IIRC)?



0x50 is the standard "device is ready" status for a non-queueing
device.  Those bits used to have the special meanings above, but
they're pretty obsolete today as I understand it.


   Erm, some status bits may be obsolete but I've never heard that the status 
*values* were specified to mean anything special anywhere...



0x40 is used for queueing, because bit 4 was the service bit for PATA TCQ.


  I know. This meaning (SERVICE) actually came from ATAPI.


 But when BUSY (bit #7) is set, the rest are generally nonsense.



Indeed...



WBR, Sergei



Typically, 0x80 as the busy state indicates the device is in POR
reset.  Once the firmware is up and running in the device, it often
switches from 0x80 to 0xD0 during POR.


   Oh, I guess it's completely up to the disk makers what other status to 
show with BSY=1.



0xD0 is the busy state you'd get to if you were 0x50 and received a
command, so this is reported typically after the device is up and
running.



0x7F usually is hardware indicating nothing is attached to the port,
and isn't supposed to imply a non-busy state.


   Ha, *never* seen that one. It has always been 0xFF since PC people 
didn't ever bother themselves with silly pulldowns. :-)



You're right, while not meaningful according to spec, you can derive
some information from the reported status even when you're only
supposed to look at one bit.


  Well, to some extent...

WBR, Sergei


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Linus Torvalds


On Tue, 28 Nov 2006, Jeff Garzik wrote:
 
 Does jgarzik/libata-dev.git#upstream (don't pull, just test) work for you?

Well, since I can't really test, I don't know. This problem has happened 
just once in the couple of weeks I've used that machine, and I wasn't even 
doing anything strange when it triggered (no heavy IO, no special 
programs, no nothing - I was literally just reading email and I think 
trying to browse over to news.com or something..)

So I was more hoping that you'd say that it's a known issue, and already 
fixed, or that the status bits would give you some clue and make you say 
"Ahh, we don't handle that case". I have nothing to test. The thing 
seems to work, and I have no known way to trigger the problem...

 I'm pretty sure this is already fixed, by the polling IDENTIFY for ata_piix
 patchset.

Hmm. That sounds like it should just affect the bootup identification, 
which has always worked fine for me. Would it fix the softreset too?

Anyway, I can certainly try your current upstream branch, but as 
mentioned, the standard kernel works fine for me generally, so I don't 
really know what I can offer (except if upstream simply doesn't work at 
all, in which case I'll certainly let you know ;)

Linus


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Jonas Lundgren
I've been monitoring the linux-ide list to try and find a solution to my
problem with my intel box (i965) and SATA disks.. I sent a mail to the
maintainer of the ata_piix driver and cc'd the linux-ide ML, but got no
responses. I'm not used to mailing MLs, so please excuse me if I did
something wrong in replying to this mail, CC'ing the wrong people, etc. :)



Here's what I wrote in my last mail to linux-ide:

I've got some big performance related problem with my Abit AB9 pro mobo,
the ICH8 controller and my SATA disks.. I've got 2 64GB WD raptor disks
in a raid0(These are the disks I have used dd/hdparm on in the commands
below), and a 2x250GB WD disk raid0, and I used to get around
130-140mb/sec seq write with them, but now with my new mobo I'm lucky if
I get 10mb/sec. During heavy disk activity the system locks up, until
the write is completed (Ie, no other read or write is being made, it's
like heavy IO completely starves all other processes until it's finished)..

Running 2.6.19-rc5-mm2 atm, but I've tried a few different kernels, same
thing.

Also, it doesn't matter if I enable AHCI in the BIOS (But with AHCI
enabled the disks spin down/power down when I boot, just to power up
again a few seconds after. The boot progress freezes until the disks
have spun up again. (This happens when the kernel probes the sata
controller ports at bootup, the disks spin down at the same time, but
spin up one by one as they're getting probed))

I've tried changing I/O scheduler, only noticeable difference is when I
use noop. Then I get like 20mb/sec write instead of 4mb/sec. I have no
idea why this is :P

Example of what I mean with crappy performance:
dd if=/dev/zero of=test232 bs=1M count=100; time sync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.130424 s, 804 MB/s
real 0m21.104s
user 0m0.000s
sys 0m0.011s

21 seconds to do a seq write of 100mb.. And during this time ALL other
disk IO gets starved, I can't do anything that uses disk IO for the
duration.. (not even `ls`)
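
For reference, the 804 MB/s that dd prints only covers copying into the page cache; the data reaches the disks during the sync. A rough sketch of the arithmetic, using the numbers from the output above (`effective_mb_per_s` is just an illustrative helper name):

```python
def effective_mb_per_s(bytes_written, dd_seconds, sync_seconds):
    """Throughput in MB/s (10**6 bytes), counting the write plus the flush."""
    return bytes_written / (dd_seconds + sync_seconds) / 1e6


if __name__ == "__main__":
    # 104857600 bytes, 0.130424 s for dd, 21.104 s for "time sync".
    rate = effective_mb_per_s(104857600, 0.130424, 21.104)
    print(f"{rate:.2f} MB/s")
```

That works out to a bit under 5 MB/s, in line with the 4mb/sec figure quoted for writes.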

Yet, a hdparm shows a decent read
hdparm -tT /dev/md4
/dev/md4:
Timing cached reads: 8060 MB in 1.99 seconds = 4042.19 MB/sec
Timing buffered disk reads: 400 MB in 3.00 seconds = 133.28 MB/sec

dd if=1GBzeroFile of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 11.4335 s, 91.7 MB/s

This is the cpu usage stats I get from top when running the dd write:
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 99.0%wa, 0.5%hi, 0.5%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Pretty crappy read speeds compared to what I got on my previous mobo
(around 140mb/sec), but still a lot better than the 4mb/sec I get when
writing..

I've also googled this for many hours, I've searched the lkml, checked
the gentoo forums, as well as other distro forums, I just don't know
what else to do. I'll appreciate any help or hints I can get.




Dmesg output from the error(s): (sda and sdb are 2 * 74GB raptor SATA
drives in a Linux software raid0)

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x20)
ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: soft resetting port
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: soft resetting port
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ata1.00: failed to IDENTIFY (I/O error, 

Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Linus Torvalds


On Tue, 28 Nov 2006, Jonas Lundgren wrote:
 
 Example of what I mean with crappy performance:
 dd if=/dev/zero of=test232 bs=1M count=100; time sync
 100+0 records in
 100+0 records out
 104857600 bytes (105 MB) copied, 0.130424 s, 804 MB/s
 real 0m21.104s

Ok, that's definitely not the same thing I see.

I get

real 0m2.673s

for the time sync part of your example, so the chipset can definitely do 
better than your 4-5 MB/s.

And your read performance seems fine. Strange.

I suspect it's related to your RAID usage. I've only got a single disk in 
my system. Maybe there is something problematic in sending commands to 
alternating SATA ports on the same controller with the i965 thing?

The switching between SATA ports thing might actually be a clue, because 
while I've had this thing for a few weeks, I only used the DVD drive for 
the first time the day before yesterday, and didn't actually even have the 
SCSI CD-ROM support compiled in until then (copied a config from another 
machine that had the DVD-rom on the legacy side, so it used the more 
common IDE-CD thing).

So maybe these _are_ related somehow, and my problem showed up because I 
actually had concurrent access to my DVD drive (some KDE media daemon 
checking to see if I inserted a music CD or something?). Jeff, Tejun, is 
there any reason to believe that the two channels on a PIIX ata controller 
are somehow tied together and it could be problematic for concurrent 
accesses?

Jonas definitely has the same error messages:

 Dmesg output from the error(s): (sda and sdb are 2 * 74GB raptor SATA
 drives in a Linux software raid0)
 
 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
 ata1.00: (BMDMA stat 0x20)
 ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
 ata1: port is slow to respond, please be patient
 ata1: port failed to respond (30 secs)
 ata1: soft resetting port
 ATA: abnormal status 0xD0 on port 0xFA07
 ATA: abnormal status 0xD0 on port 0xFA07
 ATA: abnormal status 0xD0 on port 0xFA07
 ATA: abnormal status 0xD0 on port 0xFA07
 ATA: abnormal status 0xD0 on port 0xFA07
 ATA: abnormal status 0xD0 on port 0xFA07

That all looks exactly like mine did.

Except:

 ata1: EH complete
 SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
 sda: Write Protect is off
 sda: Mode Sense: 00 3a 00 00
 SCSI device sda: drive cache: write back

Jonas' disks came back.

So while Jonas' behaviour/problems otherwise don't seem to match mine at 
all, there might be some underlying commonality..

Linus


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Jeff Garzik

Linus Torvalds wrote:
So maybe these _are_ related somehow, and my problem showed up because I 
actually had concurrent access to my DVD drive (some KDE media daemon 
checking to see if I inserted a music CD or something?). Jeff, Tejun, is 
there any reason to believe that the two channels on a PIIX ata controller 
are somehow tied together and it could be problematic for concurrent 
accesses?



I was sorta wondering in that direction too.  If it's in legacy mode 
(PATA and SATA smushed together), that's a possibility.  But native or 
AHCI modes, the channels are pretty independent (which is the nature of 
SATA).


Historical note:  ata_piix is IMO more complicated than ahci, because 
the silicon is emulating the PATA interface using an internal (probably 
huge) state machine, converting PATA behavior to sending/receiving SATA 
packets.  There are classes of problems that just don't exist on ahci, 
simply because we can directly talk to the sata phy, rather than having 
to guess what the emulation state machine is doing.


Jeff





Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Linus Torvalds


On Tue, 28 Nov 2006, Jeff Garzik wrote:
 
 I was sorta wondering in that direction too.  If it's in legacy mode (PATA and
 SATA smushed together), that's a possibility.  But native or AHCI modes, the
 channels are pretty independent (which is the nature of SATA).

Well, what I was more wondering about is whether perhaps the legacy mode 
emulation - even when it isn't actually used - means that there is simply 
some shared state (read: chipset bug that nobody noticed).

 Historical note:  ata_piix is IMO more complicated than ahci, because the
 silicon is emulating the PATA interface using an internal (probably huge)
 state machine, converting PATA behavior to sending/receiving SATA packets.

Well, there's bound to be the same big state machine working the other 
way, and maybe the chip simply internally gets confused. Or, as you say, 
simply because the emulation state machinery has to be taken into account, 
and _that_ ends up being shared between the two otherwise independent 
channels..

How hard would it be to just force a shared spinlock between two sata 
channels on the same controller? It sounds like Jonas has a very 
repeatable setup, so even if I can't repeat my problem, if the performance 
degradation on writes is related, he can check his thing..
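
As a userspace analogy only (this is not kernel code, and the `Channel` class and `run_demo` function are made up for illustration), forcing a shared lock between two channels would look roughly like this: each channel takes the same lock for the duration of a command, so the two can never have commands in flight at once.

```python
import threading


class Channel:
    """Toy stand-in for one ATA channel; not a libata structure."""

    def __init__(self, name, host_lock, log):
        self.name = name
        self.host_lock = host_lock  # one lock shared by every channel on the host
        self.log = log

    def issue(self, command):
        # Holding the shared lock for the whole command serializes the
        # channels: a command on one channel cannot overlap one on the other.
        with self.host_lock:
            self.log.append((self.name, command, "start"))
            self.log.append((self.name, command, "done"))


def worker(channel, count):
    for i in range(count):
        channel.issue(i)


def run_demo(commands_per_channel=100):
    host_lock = threading.Lock()
    log = []
    channels = [Channel("ata1", host_lock, log), Channel("ata2", host_lock, log)]
    threads = [
        threading.Thread(target=worker, args=(ch, commands_per_channel))
        for ch in channels
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    events = run_demo()
    print(f"{len(events) // 2} commands issued, none overlapping")
```

The cost of this kind of serialization is lost concurrency between the channels, which is why it would only make sense as a test or an opt-in workaround.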

Linus


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Linus Torvalds


On Tue, 28 Nov 2006, Jeff Garzik wrote:
 
 ap->host (struct ata_host) already has a spinlock for precisely just that...
 :)

Right, but do we actually take it? I'm not seeing any spin_lock's in 
ata_piix.c, but I don't know the SATA layers enough to say whether upper 
layers take it or not..

Linus


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Tejun Heo

Jonas Lundgren wrote:
[--snip--]

Also, it doesn't matter if I enable AHCI in the BIOS (But with AHCI
enabled the disks spin down/power down when I boot, just to power up
again a few seconds after. The boot progress freezes until the disks
have spun up again. (This happens when the kernel probes the sata
controller ports at bootup, the disks spin down at the same time, but
spin up one by one as they're getting probed))


Likely fix is pending for this problem.


I've tried changing I/O scheduler, only noticeable difference is when I
use noop. Then I get like 20mb/sec write instead of 4mb/sec. I have no
idea why this is :P

Example of what I mean with crappy performance:
dd if=/dev/zero of=test232 bs=1M count=100; time sync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.130424 s, 804 MB/s
real 0m21.104s
user 0m0.000s
sys 0m0.011s

21 seconds to do a seq write of 100mb.. And during this time ALL other
disk IO gets starved, I can't do anything that uses disk IO for the
duration.. (not even `ls`)


What does the kernel say during this writing?  Can you post the result 
of the following?


1. reboot
2. dmesg -c
3. time dd if=/dev/zero.. blah
4. dmesg

Also, does 'mount -o remount,barrier=0 /' change anything?


Yet, a hdparm shows a decent read
hdparm -tT /dev/md4
/dev/md4:
Timing cached reads: 8060 MB in 1.99 seconds = 4042.19 MB/sec
Timing buffered disk reads: 400 MB in 3.00 seconds = 133.28 MB/sec

dd if=1GBzeroFile of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 11.4335 s, 91.7 MB/s

This is the cpu usage stats I get from top when running the dd write:
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 99.0%wa, 0.5%hi, 0.5%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Pretty crappy read speeds compared to what I got on my previous mobo
(around 140mb/sec), but still a lot better than the 4mb/sec I get when
writing..


Which controller did you use on your previous mobo?  If you're using 
ata_piix and hook two hard drives as primary and secondary on the same 
channel, some level of performance degradation is expected.  ata_piix 
can issue a command to only one of the two drives at a time.  Is the 
read performance still bad in ahci mode?


[--snip--]

Dmesg output from the error(s): (sda and sdb are 2 * 74GB raptor SATA
drives in a Linux software raid0)

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x20)
ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)


This might be a missed interrupt.  It's a write.  The DMA engine has 
finished transferring all data and the device is ready for the next 
command, but the interrupt never arrived.
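
For anyone reading these logs, a small sketch of how the "tag ... cmd ..." line can be pulled apart. The opcode names are the standard ATA command names for the three opcodes that show up in this log (0xc8, 0xca, 0xec); the regular expression only assumes the line format seen above, and `parse_error_line` is a hypothetical helper.

```python
import re

# Opcodes that appear in the log above, with their ATA command names.
ATA_COMMANDS = {0xC8: "READ DMA", 0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}

# Matches lines shaped like:
#   ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
LINE_RE = re.compile(
    r"(?P<dev>ata\d+\.\d+): tag (?P<tag>\d+) cmd (?P<cmd>0x[0-9a-fA-F]+) "
    r"Emask (?P<emask>0x[0-9a-fA-F]+) stat (?P<stat>0x[0-9a-fA-F]+) "
    r"err (?P<err>0x[0-9a-fA-F]+) \((?P<reason>[^)]*)\)"
)


def parse_error_line(line):
    """Return a dict describing the failed command, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    cmd = int(match.group("cmd"), 16)
    return {
        "device": match.group("dev"),
        "command": ATA_COMMANDS.get(cmd, f"unknown (0x{cmd:02x})"),
        "status": int(match.group("stat"), 16),
        "reason": match.group("reason"),
    }


if __name__ == "__main__":
    sample = "ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)"
    print(parse_error_line(sample))
```

Here cmd 0xca decodes to WRITE DMA and stat 0x40 is DRDY with BSY clear, i.e. the device looks idle and ready even though the command timed out.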



ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: soft resetting port
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ATA: abnormal status 0xD0 on port 0xFA07
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs


But this is weird.  If it were a missed interrupt, softreset should have 
recovered it instantly.  Something fishy is going on.


[--snip--]

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)


Same thing for read.


ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)


Again, pre-reset wait times out.  Weird.


ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete

[--snip--]

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)


Again, for read.


Most of the time when I get these errors the system will recover after
anything from 10 seconds to 10 minutes of unresponsiveness (no disk
I/O), and sometimes hang.


Yeap, libata needs stricter timing constraints for recovery.  That's 
high on to-do list.



IF the system does recover, I start getting
the extremely low disk write speeds that I reported above, and only a
reboot will get the performance back to regular.


Please post the full dmesg from after your computer got really slow.  I 
suspect libata decided to switch to PIO mode.



I don't know what causes it, but most of the times when I've gotten it
my system has been under heavy load (compiling, downloading torrents in
11mb/sec etc). Please let me know if you want any additional info, want
me to try something out, or whatever. My recent hardware upgrade for
around $1200 (to a core2duo system, i965 mobo) is just going to waste
because of this problem. :/


Heh, nice machine you got there.  When you look at the dmesg, do the 
error messages occur 

Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Tejun Heo

Mark Lord wrote:

Linus Torvalds wrote:
[ You may or may not have gotten my previous email. The kernel stayed 
  working, but due to the IO errors the filesystem got re-mounted 
  read-only, and I'm not sure that the email I sent out in that state 
  actually ever made it out. I suspect it didn't. ]


Jeff,
 I just had a scary thing on my nice new Intel i965 box (all Intel 
chipsets apart from some strange Marvell IDE interface that I'm not 
using and that no driver even detected, and a TI firewire thing that 
I'm similarly not using).


The machine basically froze for about a minute or so (well, things 
worked surprisingly well, considering that apparently no disk IO 
happened - I initially thought it was just firefox that had frozen up, 
since my mail session seemed to be fine), and after it came back the 
filesystem was mounted read-only and nothing really worked any more..


I have no idea what status 0xD0 means: it looks like ATA_BUSY + 
ATA_DRDY + bit#4, but what is bit#4?


Bit #4, when actually implemented, is a rotational seek indicator,
which can be used for timing purposes.

But when BUSY (bit #7) is set, the rest are generally nonsense.


And clearly, the soft-reset isn't doing squat.


I dunno.  My first suspect is transient transmission error and yeah they 
do occur from time to time even on otherwise stable setup.  For example, 
my machine is nvidia ck804 which has pretty weak error handling (at 
least used to) and stays up 24/7 and I've seen such unrecovered 
transmission error just once during last 6+ months.


My experience is that if something is weird (say, power fluctuation or 
electro-magnetic interference), SATA is the first thing to give out and 
that's why we need good EH w/ SATA much more than we do with PATA.


Drives (controllers too) sometimes fall into weird state after such 
errors and softreset is often not enough, so we need hardreset.  ICH8 
can do hardreset even in ata_piix mode.  I'll work on it.


Linus, I'll follow up with Jonas as his problem seems reproducible but 
I'm a bit skeptical about it being a driver issue.  Even w/ all its 
kinks, ata_piix is just a sff IDE controller and libata has been doing 
it for a long time.  I would be really surprised if the driver or 
controller has any such issue in the usual r/w path.  AHCI should be 
able to recover from most error conditions unless drive firmware is 
completely stuck requiring physical power off.


--
tejun


Re: Scary Intel SATA problem: frozen

2006-11-28 Thread Mark Lord
How hard would it be to just force a shared spinlock between two sata 
channels on the same controller? It sounds like Jonas has a very 
repeatable setup, so even if I can't repeat my problem, if the performance 
degradation on writes is related, he can check his thing..


Kinda like the ide0=serialize flag for the IDE subsystem, I suppose.

We certainly don't want it to be the default, as millions of ata_piix
systems already out there seem to be working just fine (not all of them
running Linux, but enough of them to extrapolate to the zillions of
identical models).

It must be something new with ICH8 that we're not doing correctly yet.

???