Am 03.03.2010 12:06, schrieb Jeremy Chadwick: > On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote: >> Alexander Motin schrieb am 03.03.2010 09:18 (localtime): >>> Harald Schmalzbauer wrote: >>>> Alexander Motin schrieb am 23.02.2010 16:10 (localtime): >>>>> Harald Schmalzbauer wrote: >>>>>> I'm frequently getting my machine locked with ahcichX timeouts: >>>>>> ahcich2: Timeout on slot 0 >>>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr >>>>>> 00000000 >>>>>> ahcich2: Timeout on slot 8 >>>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr >>>>>> 00000000 >>>>>> ahcich2: Timeout on slot 8 >>>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr >>>>>> 00000000 >>>>>> ... >>>>> Looking that is (Interrupt status) is zero and `rs == cs | ss` (running >>>>> command bitmasks in driver and hardware), controller doesn't report >>>>> command completion. Looking on TFD status 0xc0 with BUSY bit set, I >>>>> would suppose that either disk stuck in command processing for some >>>>> reason, or controller missed command completion status. >>>>> >>>>> Have you noticed 30 second (default ATA timeout) pause before timeout >>>>> message printed? Just want to be sure that driver waited enough before >>>>> give up. >>>>> >>>>>> This happens when backup over GbE overloads ZFS/HDD capabilities. >>>>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking >>>>>> up almost immediately, but from it still happens. >>>>>> When I don't use ahci but ataahci (the old driver if I understand things >>>>>> correct) I also see the ZFS burst write congestion, but this doesn't >>>>>> lead to controller timeouts, thus blocking the machine. >>>>>> >>>>>> Sometimes the machine recovers from the disk lock, but most often I have >>>>>> to reboot. >>>>> How it looks when it doesn't? Can you send me full log messages? >>>> Hello, this morning I had a stall, but the machine recovered after about >>>> one Minute. Here's what I got from the kernel: >>>> ahcich2: Timeout on slot 29 >>>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr >>>> 00000000 >>>> em1: watchdog timeout -- resetting >>>> em1: watchdog timeout -- resetting >>>> ahcich2: Timeout on slot 10 >>>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr >>>> 00000000 >>>> ahcich2: Timeout on slot 18 >>>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr >>>> 00000000 >>>> ahcich2: Timeout on slot 2 >>>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr >>>> 00000000 >>>> ahcich2: Timeout on slot 2 >>>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr >>>> 00000000 >>>> >>>> Does this tell you something useful? >>> >>> It doesn't. Looking on logged register content - commands are indeed >>> still running and no interrupts requested. Interesting to see em1 >>> watchdog timeout there. Aren't they related somehow? >> >> dmesg | grep "irq 18": >> uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq >> 18 at device 26.0 on pci0 >> uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq >> 18 at device 29.2 on pci0 >> em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port >> 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18 >> at device 2.0 on pci3 >> ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f >> mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0 >> >> The don't share the same IRQ at least. >> dmesg | grep "irq 21" >> uhci1: <Intel 82801I (ICH9) USB controller> port 0x20a0-0x20bf irq >> 21 at device 26.1 on pci0 >> ahci0: <Intel ICH9 AHCI SATA controller> port >> 0x2408-0x240f,0x2414-0x2417,0x2400-0x2407,0x2410-0x2413,0x2020-0x203f >> mem 0xe1a21000-0xe1a217ff irq 21 at device 31.2 on pci0 >> >> The em1 has no cable attached. I get many of these em watchdog >> timeouts. Never thought they could be related to ahci. I'll see if >> the em watchdog timeouts happens in any relation to disk usage. > > Please provide output from the commands I provided. dmesg|grep is not > sufficient for helping track this down, specifically with regards to the > em1 watchdog timeouts.
Sorry for the delay, here's the details:
hos...@pci0:0:0:0: class=0x060000 card=0x34d08086 chip=0x29f08086
rev=0x00 hdr=0x00
vendor = 'Intel Corporation'
device = '3200 Chipset (Bearlake) Processor to I/O Controller'
class = bridge
subclass = HOST-PCI
e...@pci0:0:25:0: class=0x020000 card=0x34d08086 chip=0x10bd8086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = 'Intel 82566DM Gigabit Ethernet Adapter (82566DM)'
class = network
subclass = ethernet
uh...@pci0:0:26:0: class=0x0c0300 card=0x34d08086 chip=0x29378086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller'
class = serial bus
subclass = USB
uh...@pci0:0:26:1: class=0x0c0300 card=0x34d08086 chip=0x29388086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller'
class = serial bus
subclass = USB
eh...@pci0:0:26:7: class=0x0c0320 card=0x34d08086 chip=0x293c8086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Controller'
class = serial bus
subclass = USB
pc...@pci0:0:28:0: class=0x060400 card=0x29408086 chip=0x29408086
rev=0x02 hdr=0x01
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) PCIe Root Port 1'
class = bridge
subclass = PCI-PCI
pc...@pci0:0:28:4: class=0x060400 card=0x29488086 chip=0x29488086
rev=0x02 hdr=0x01
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) PCIe Root Port 5'
class = bridge
subclass = PCI-PCI
uh...@pci0:0:29:0: class=0x0c0300 card=0x34d08086 chip=0x29348086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller'
class = serial bus
subclass = USB
uh...@pci0:0:29:1: class=0x0c0300 card=0x34d08086 chip=0x29358086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller'
class = serial bus
subclass = USB
uh...@pci0:0:29:2: class=0x0c0300 card=0x34d08086 chip=0x29368086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB Universal Host Controller'
class = serial bus
subclass = USB
eh...@pci0:0:29:7: class=0x0c0320 card=0x34d08086 chip=0x293a8086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Controller'
class = serial bus
subclass = USB
pc...@pci0:0:30:0: class=0x060401 card=0x34d08086 chip=0x244e8086
rev=0x92 hdr=0x01
vendor = 'Intel Corporation'
device = '82801 Family (ICH2/3/4/5/6/7/8/9,63xxESB) Hub
Interface to PCI Bridge'
class = bridge
subclass = PCI-PCI
is...@pci0:0:31:0: class=0x060100 card=0x34d08086 chip=0x29168086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IR (ICH9R) LPC Interface Controller'
class = bridge
subclass = PCI-ISA
ah...@pci0:0:31:2: class=0x010601 card=0x34d08086 chip=0x29228086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) 6 port SATA AHCI Controller'
class = mass storage
subclass = SATA
ichs...@pci0:0:31:3: class=0x0c0500 card=0x34d08086 chip=0x29308086
rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '82801IB/IR/IH (ICH9 Family) SMBus Controller'
class = serial bus
subclass = SMBus
vgap...@pci0:2:0:0: class=0x030000 card=0x01018086 chip=0x0522102b
rev=0x02 hdr=0x00
vendor = 'Matrox Electronic Systems Ltd.'
device = 'Matrox G200e (ServerEngines) - English (G200e)'
class = display
subclass = VGA
e...@pci0:3:2:0: class=0x020000 card=0x34d08086 chip=0x10768086 rev=0x05
hdr=0x00
vendor = 'Intel Corporation'
device = 'Gigabit Ethernet Controller (82541EI)'
class = network
subclass = ethernet
banana:~>20: vmstat -i
interrupt total rate
irq4: uart0 27392 0
irq18: em1 uhci0++ 178 0
irq19: uhci3 22 0
irq23: uhci2 ehci1 2 0
cpu0: timer 231629257 1917
irq256: em0 2559286 21
irq257: ahci0 1602196 13
cpu1: timer 231628629 1917
Total 467446962 3869
Looks very similar to the problem here:
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2010-03/msg00143.html
Thanks,
-Harry
signature.asc
Description: OpenPGP digital signature
