Strong elcheapo-ZFS-disk recommendation [Was: Re: ahcich timeouts, only with ahci, not with ataahci]

Harald Schmalzbauer Thu, 25 Mar 2010 16:06:16 -0700

Harald Schmalzbauer wrote on 14.03.2010 12:12 (localtime):

Harald Schmalzbauer wrote on 13.03.2010 22:27 (localtime):

On 03.03.2010 12:06, Jeremy Chadwick wrote:

On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote:

Alexander Motin wrote on 03.03.2010 09:18 (localtime):

Harald Schmalzbauer wrote:
Alexander Motin wrote on 23.02.2010 16:10 (localtime):
Harald Schmalzbauer wrote:
I'm frequently getting my machine locked with ahcichX timeouts:
ahcich2: Timeout on slot 0
ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
ahcich2: Timeout on slot 8
ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
ahcich2: Timeout on slot 8
ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
...
Given that is (interrupt status) is zero and `rs == cs | ss` (the
running-command bitmasks in the driver and in the hardware), the controller
isn't reporting command completion. With TFD status 0xc0, which has the BUSY
bit set, I would suppose that either the disk is stuck in command processing
for some reason, or the controller missed the command completion status.
Have you noticed a 30-second pause (the default ATA timeout) before the
timeout message is printed? I just want to be sure that the driver waited
long enough before giving up.
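
The register reading described above can be checked mechanically. What follows is only a minimal sketch, not part of the ahci driver: the function name `decode_timeout_line` and the result keys are made up for illustration, and it assumes the register dump sits on a single line as the driver prints it. It tests the three conditions named above (is is zero, rs equals cs | ss, BSY set in the TFD status byte) and lists the slots still marked as running.

```python
import re

# Expects the whole register dump on one line, as the driver prints it
# (mail clients may wrap the trailing "serr" value onto the next line).
REG_RE = re.compile(
    r"is (?P<istat>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) ss (?P<ss>[0-9a-f]+) "
    r"rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) serr (?P<serr>[0-9a-f]+)"
)

ATA_STATUS_BSY = 0x80  # BSY is bit 7 of the ATA status byte


def decode_timeout_line(line):
    """Return a summary dict for one register dump, or None if no match.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    m = REG_RE.search(line)
    if m is None:
        return None
    regs = {name: int(value, 16) for name, value in m.groupdict().items()}
    return {
        # "is" non-zero would mean the controller raised an interrupt
        "interrupt_pending": regs["istat"] != 0,
        # driver's running mask vs. what the hardware still shows as active
        "rs_equals_cs_or_ss": regs["rs"] == (regs["cs"] | regs["ss"]),
        # device still busy according to the task file status
        "busy": bool(regs["tfd"] & ATA_STATUS_BSY),
        # slot numbers with a bit set in rs
        "running_slots": [s for s in range(32) if (regs["rs"] >> s) & 1],
    }


line = ("ahcich2: is 00000000 cs 00000100 ss 00000000 "
        "rs 00000100 tfd c0 serr 00000000")
print(decode_timeout_line(line))
```

For the sample line, the only running slot is 8, which matches the "Timeout on slot 8" message that precedes it in the log.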
This happens when a backup over GbE overloads ZFS/HDD capabilities.
I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
up almost immediately, but it still happens.
When I use ataahci (the old driver, if I understand things correctly)
instead of ahci, I also see the ZFS burst write congestion, but it doesn't
lead to controller timeouts that block the machine.
Sometimes the machine recovers from the disk lock, but most often I have
to reboot.
What does it look like when it doesn't? Can you send me the full log messages?
Hello, this morning I had a stall, but the machine recovered after about
one minute. Here's what I got from the kernel:
ahcich2: Timeout on slot 29
ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr 00000000
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
ahcich2: Timeout on slot 10
ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr 00000000
ahcich2: Timeout on slot 18
ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr 00000000
ahcich2: Timeout on slot 2
ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr 00000000
ahcich2: Timeout on slot 2
ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr 00000000

Does this tell you something useful?
It doesn't. Looking at the logged register contents, commands are indeed
still running and no interrupts are requested. It's interesting to see the
em1 watchdog timeout there. Aren't they related somehow?

    dmesg | grep "irq 18":
uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq 18 at device 26.0 on pci0
uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq 18 at device 29.2 on pci0
em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18 at device 2.0 on pci3
ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0

They don't share the same IRQ, at least.
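
The same IRQ check can be done over the whole dmesg output instead of one grep per IRQ. This is a rough sketch under stated assumptions: the helper names `devices_by_irq` and `share_irq` are invented for illustration, each probe message is assumed to be on one line (unwrapped, as dmesg prints it), and the device name `ahci0` in the usage lines is hypothetical, since the thread only shows the channel name ahcich2.

```python
import re
from collections import defaultdict

# Matches probe lines such as
#   "em1: <...> port ... irq 18 at device 2.0 on pci3"
# and captures the device name and the IRQ number.
PROBE_RE = re.compile(r"^(\w+): <.*>.*?\birq (\d+)\b", re.MULTILINE)


def devices_by_irq(dmesg_text):
    """Map each IRQ number to the device names that report it."""
    table = defaultdict(list)
    for device, irq in PROBE_RE.findall(dmesg_text):
        table[int(irq)].append(device)
    return dict(table)


def share_irq(table, dev_a, dev_b):
    """True if both devices appear under the same IRQ in the table."""
    return any(dev_a in devs and dev_b in devs for devs in table.values())


# Two of the probe lines quoted above, unwrapped.
sample = (
    "uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df "
    "irq 18 at device 26.0 on pci0\n"
    "em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port 0x1000-0x103f "
    "mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff "
    "irq 18 at device 2.0 on pci3\n"
)
table = devices_by_irq(sample)
print(table)
print(share_irq(table, "em1", "ahci0"))  # "ahci0" is a hypothetical name
```

A device that never appears in the table simply shares no IRQ with anything, which is the situation described above for the AHCI controller and em1.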

...
For the record: I replaced the Samsung F2 1.5TB 5200rpm EcoGreen drives.

In my dreams, that should have improved my 3-disk RAIDZ from an average of 33MB/s (for transfers >5G) to about 60MB/s. In reality, it improved it to 90MB/s _and_ completely eliminated the ahcich timeouts, as well as the burst writes during which the whole machine was stuck while ZFS flushed/wrote transaction groups. So the difference between the disks under ZFS is far beyond my imagination.

I can highly recommend the:
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK1174YAH9ZH7W
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Mar 25 23:48:13 2010 CET

Some TB restored so far, no errors, no oddities, no problems at all. Same server, same FreeBSD, but ahci.ko enabled again (so with NCQ, thanks mav and friends).

I can confirm that the Samsung F2 drives worked fine with the old ata driver (i.e. without NCQ enabled) and ZFS. They did their job for 2 weeks without any error in that time, but reproducibly showed ahcich timeouts (with the newer ahci.ko) if the load was higher than about 50MB/s on a RAIDZ with 3 disks (same ICH9). So I got my problem solved by replacing my HDDs (even the old ones had the latest firmware) and also got triple the performance :))


Just to share the info.

Thanks,

-Harry
