Re: [ATA] and re(4) stability issues

2008-12-15 Thread Victor Balada Diaz
On Fri, Dec 12, 2008 at 01:13:09PM +0100, Victor Balada Diaz wrote:
 On Thu, Dec 11, 2008 at 10:50:21AM +0100, Victor Balada Diaz wrote:
  On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
   
   I've reverted r185756 which caused GMII access issues on some
   controllers. If you are brave enough to try beta code, you can
   get latest re(4) in the following URL. Note, I don't have PCIe
   based RealTek controllers so the code was not tested at all.
   
   http://people.freebsd.org/~yongari/re/if_re.c
   http://people.freebsd.org/~yongari/re/if_rlreg.h
  
  I've recompiled the kernel with the first file in sys/dev/re/
  and the second one in sys/pci/. I'm still testing with MSI enabled.
  
  So far tried rebooting using nextboot(8) (just in case i lost the
  network card i could boot again) and the card seems to work
  but i'll continue stress testing the machine with stress + dd +
  iperf and see if i can take it down. I'll let you know how it goes.
 
 After a day of stress testing the machine haven't got errors, interrupt
 storms or interface up/down problems. Everything seems fine.
 I'll continue stress testing the machine during the weekend, but
 i would say that this time it's fixed.

I stopped stress testing this morning. After a whole weekend of testing, it
seems the re(4) problems are fixed: not a single interface up/down error.
netstat -i reports no errors and everything is fine. Thanks a lot!

I'm going to deploy the patches on our production machines.

I've been able to trigger interrupt storms with ATA code, though.

-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [ATA] and re(4) stability issues

2008-12-15 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 10:55:35AM +0100, Søren Schmidt wrote:
 On 10Dec, 2008, at 10:11 , Victor Balada Diaz wrote:
 
 Thanks for explaining me what the flags do. I'm not skilled enough to create
 the DMA quirks but if you could give me some patches i'll test them. Also
 if you have any other idea on what could i test or how can i debug this
 it would be more than welcome.
 
 
 Comment out the following two lines in ata_ahci_dmainit():
 
 if (ATA_INL(ctlr->r_res2, ATA_AHCI_CAP) & ATA_AHCI_CAP_64BIT)
     ch->dma->max_address = BUS_SPACE_MAXADDR;
 
 And you will not use 64bit DMA even if the chipset supports it.  
 However I have not seen any chipsets supporting this fail, YMMV as  
 usual :)
 
Hello Søren,

I'm triggering interrupt storms with this chipset after a few
days of stressing the HD by running sysutils/stress as stress -d 10 -i 10
and, in another terminal, doing:

 while true; do dd if=/dev/zero of=BAH bs=1M count=1024; done;

Right now systat -vmstat reports 578k interrupts on atapci
while the machine is idle. Do you have any idea how I could debug
this? Any advice would be more than welcome.
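
One way to put a number on a suspected storm is to sample the per-device counter twice and look at the rate. The following is only a sketch: the function names and the threshold passed to looks_like_storm() are hypothetical, and the two totals are assumed to be copied by hand from two runs of vmstat -i.

```c
#include <stdint.h>

/*
 * Interrupts per second between two samples of a per-device total.
 * Returns -1.0 when the interval is not positive or the counter went
 * backwards (for example after a reboot between the samples).
 */
double
irq_rate(uint64_t first_total, uint64_t second_total, double seconds)
{
	if (seconds <= 0.0 || second_total < first_total)
		return (-1.0);
	return ((double)(second_total - first_total) / seconds);
}

/*
 * Hypothetical helper: the caller picks the threshold; it is not a
 * value taken from the kernel.
 */
int
looks_like_storm(double rate, double threshold)
{
	return (rate >= 0.0 && rate > threshold);
}
```

A disk controller on an idle machine would normally be expected to stay at a low rate, so a figure in the hundreds of thousands, like the 578k reported here, stands out whatever threshold is chosen.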

Thanks a lot.
Regards.



Re: [ATA] and re(4) stability issues

2008-12-15 Thread Victor Balada Diaz
On Mon, Dec 15, 2008 at 10:02:07AM +0100, Victor Balada Diaz wrote:
 Stopped stress testing this morning. After all the weekend testing
 seems the re(4) problems were fixed. No single interface up/down error.
 netstat -i reports no errors and everything is fine. Thanks a lot!
 
 I'm going to deploy the patches on our production machines.
 
 I've been able to trigger interrupt storms with ATA code, though.

After deploying it on various machines last night I found
messages like this one in the logs:

re0: watchdog timeout (missed Tx interrupts) -- recovering

I know you told me this is harmless, so this is just to let you
know it's happening.

Regards.



Re: [ATA] and re(4) stability issues

2008-12-15 Thread Pyun YongHyeon
On Tue, Dec 16, 2008 at 08:19:19AM +0100, Victor Balada Diaz wrote:
  On Mon, Dec 15, 2008 at 10:02:07AM +0100, Victor Balada Diaz wrote:
   Stopped stress testing this morning. After all the weekend testing
   seems the re(4) problems were fixed. No single interface up/down error.
   netstat -i reports no errors and everything is fine. Thanks a lot!
   
   I'm going to deploy the patches on our production machines.
   
   I've been able to trigger interrupt storms with ATA code, though.
  
  After deploying it in various machines this night i've found in the
  logs messages like this one:
  
  re0: watchdog timeout (missed Tx interrupts) -- recovering
  
  I know you told me this is harmless, so this is just so you

Yes, it's not a real watchdog timeout as long as re(4) still works
correctly.

  know it's happening.
  

Ok. I'll update re(4) when I find spare time.

-- 
Regards,
Pyun YongHyeon


Re: [ATA] and re(4) stability issues

2008-12-12 Thread Victor Balada Diaz
On Thu, Dec 11, 2008 at 10:50:21AM +0100, Victor Balada Diaz wrote:
 On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
  
  I've reverted r185756 which caused GMII access issues on some
  controllers. If you are brave enough to try beta code, you can
  get latest re(4) in the following URL. Note, I don't have PCIe
  based RealTek controllers so the code was not tested at all.
  
  http://people.freebsd.org/~yongari/re/if_re.c
  http://people.freebsd.org/~yongari/re/if_rlreg.h
 
 I've recompiled the kernel with the first file in sys/dev/re/
 and the second one in sys/pci/. I'm still testing with MSI enabled.
 
 So far tried rebooting using nextboot(8) (just in case i lost the
 network card i could boot again) and the card seems to work
 but i'll continue stress testing the machine with stress + dd +
 iperf and see if i can take it down. I'll let you know how it goes.

After a day of stress testing, the machine has had no errors, interrupt
storms or interface up/down problems. Everything seems fine.
I'll continue stress testing the machine over the weekend, but
I would say that this time it's fixed.

There seems to have been a lot of testing of this driver lately. Is
there any chance of it being in 7.1, or being MFCed to RELENG_7 after
the release?

Thanks a lot.
Regards.



Re: [ATA] and re(4) stability issues

2008-12-12 Thread Pyun YongHyeon
On Fri, Dec 12, 2008 at 01:13:09PM +0100, Victor Balada Diaz wrote:
  On Thu, Dec 11, 2008 at 10:50:21AM +0100, Victor Balada Diaz wrote:
   On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:

I've reverted r185756 which caused GMII access issues on some
controllers. If you are brave enough to try beta code, you can
get latest re(4) in the following URL. Note, I don't have PCIe
based RealTek controllers so the code was not tested at all.

http://people.freebsd.org/~yongari/re/if_re.c
http://people.freebsd.org/~yongari/re/if_rlreg.h
   
   I've recompiled the kernel with the first file in sys/dev/re/
   and the second one in sys/pci/. I'm still testing with MSI enabled.
   
   So far tried rebooting using nextboot(8) (just in case i lost the
   network card i could boot again) and the card seems to work
   but i'll continue stress testing the machine with stress + dd +
   iperf and see if i can take it down. I'll let you know how it goes.
  
  After a day of stress testing the machine haven't got errors, interrupt
  storms or interface up/down problems. Everything seems fine.
  I'll continue stress testing the machine during the weekend, but
  i would say that this time it's fixed.
  

Thanks for testing!

  Seems lately there have been a lot of testing of this driver. Is
  there any chance of it being on 7.1 or being MFCed after the release
  to RELENG_7?
  

It's too early to promise an MFC, but I think the MFC would be done after
7.1-RELEASE is out, if all goes well.

-- 
Regards,
Pyun YongHyeon


Re: [ATA] and re(4) stability issues

2008-12-11 Thread Victor Balada Diaz
On Thu, Dec 11, 2008 at 05:05:59AM +1100, Peter Jeremy wrote:
 On 2008-Dec-10 10:55:35 +0100, Søren Schmidt [EMAIL PROTECTED] wrote:
 And you will not use 64bit DMA even if the chipset supports it.  
 However I have not seen any chipsets supporting this fail, YMMV as  
 usual :)
 
 There's a reference in wikipedia pointing to
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg06694.html
 that claims the AMD/ATI SB600 lies about supporting 64-bit DMA in AHCI
 mode.  I have a SB600 but it doesn't have 4GB to test on.

I have 6 GB of RAM and can test patches, so once I'm done with the re(4)
side of things I'll try commenting out the code Søren suggested and see
if that improves the situation.

Regards.



Re: [ATA] and re(4) stability issues

2008-12-11 Thread Victor Balada Diaz
On Thu, Dec 11, 2008 at 08:57:07AM +0100, Victor Balada Diaz wrote:
 On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
  On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
Also i didn't see any problem with interfaces going up and down,
but that usually happen after some hours of uptime, so i'll let
you know if the error happens again.

 
 After writing to the HD with dd for a few hours and using
 stress -i 10 -d 10 the machine lost connectivity. I waited until
 today to be sure if the machine hung, paniced or just lost network
 connectivity. I don't have local access or serial access, so this
 is the only way i could do it. I've seen in the logs during the
 night various messages of:
 
 
 Dec 10 00:33:49 yac kernel: re0: watchdog timeout
 Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
 Dec 10 00:33:52 yac kernel: re0: link state changed to UP
 
 The interface never recovered and i wasn't able to ping the machine
 until i rebooted. Nagios was checking all the time and no recovery
 happened.
 
 The netstat -i in daily scripts shows just one Oerrs. I'm used to
 have a lot of them, but seems this time the card didn't recover from
 the only one. I also want to say that this is not a regression, as
 it happened before with 7.1 -BETA 2 code.
 
 Is there anything more i can try?

Sorry, it's too early in the morning and I thought today was the 10th
instead of the 11th. I don't even know what day it is.

Looking at today's log I don't see those link state changed messages,
but I do see these other messages, which started at more or
less the same time I lost connectivity to the server:

Dec 10 18:20:32 yac kernel: re0: link state changed to DOWN
Dec 10 18:20:32 yac kernel: re0: PHY read failed

Sorry for the noise.

Regards.


Re: [ATA] and re(4) stability issues

2008-12-11 Thread Pyun YongHyeon
On Thu, Dec 11, 2008 at 09:10:45AM +0100, Victor Balada Diaz wrote:
  On Thu, Dec 11, 2008 at 08:57:07AM +0100, Victor Balada Diaz wrote:
   On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
  Also i didn't see any problem with interfaces going up and down,
  but that usually happen after some hours of uptime, so i'll let
  you know if the error happens again.
  
   
   After writing to the HD with dd for a few hours and using
   stress -i 10 -d 10 the machine lost connectivity. I waited until
   today to be sure if the machine hung, paniced or just lost network
   connectivity. I don't have local access or serial access, so this
   is the only way i could do it. I've seen in the logs during the
   night various messages of:
   
   
   Dec 10 00:33:49 yac kernel: re0: watchdog timeout
   Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
   Dec 10 00:33:52 yac kernel: re0: link state changed to UP
   
   The interface never recovered and i wasn't able to ping the machine
   until i rebooted. Nagios was checking all the time and no recovery
   happened.
   
   The netstat -i in daily scripts shows just one Oerrs. I'm used to
   have a lot of them, but seems this time the card didn't recover from
   the only one. I also want to say that this is not a regression, as
   it happened before with 7.1 -BETA 2 code.
   
   Is there anything more i can try?
  
  Sorry it's too early in the morning and i thought today was 10
  instead of 11. I don't even know the day i'm today.
  
  Looking at today's log i see no link state changed messages
  but i see this other messages that started happening more or
  less at the same time i lost connectivity to the server:
  
  Dec 10 18:20:32 yac kernel: re0: link state changed to DOWN
  Dec 10 18:20:32 yac kernel: re0: PHY read failed
  

I've reverted r185756, which caused GMII access issues on some
controllers. If you are brave enough to try beta code, you can
get the latest re(4) from the following URLs. Note that I don't have
PCIe based RealTek controllers, so the code was not tested at all.

http://people.freebsd.org/~yongari/re/if_re.c
http://people.freebsd.org/~yongari/re/if_rlreg.h

-- 
Regards,
Pyun YongHyeon


Re: [ATA] and re(4) stability issues

2008-12-11 Thread Victor Balada Diaz
On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
 On Thu, Dec 11, 2008 at 09:10:45AM +0100, Victor Balada Diaz wrote:
   On Thu, Dec 11, 2008 at 08:57:07AM +0100, Victor Balada Diaz wrote:
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
 On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
   Also i didn't see any problem with interfaces going up and down,
   but that usually happen after some hours of uptime, so i'll let
   you know if the error happens again.
   

After writing to the HD with dd for a few hours and using
stress -i 10 -d 10 the machine lost connectivity. I waited until
today to be sure if the machine hung, paniced or just lost network
connectivity. I don't have local access or serial access, so this
is the only way i could do it. I've seen in the logs during the
night various messages of:


Dec 10 00:33:49 yac kernel: re0: watchdog timeout
Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
Dec 10 00:33:52 yac kernel: re0: link state changed to UP

The interface never recovered and i wasn't able to ping the machine
until i rebooted. Nagios was checking all the time and no recovery
happened.

The netstat -i in daily scripts shows just one Oerrs. I'm used to
have a lot of them, but seems this time the card didn't recover from
the only one. I also want to say that this is not a regression, as
it happened before with 7.1 -BETA 2 code.

Is there anything more i can try?
   
   Sorry it's too early in the morning and i thought today was 10
   instead of 11. I don't even know the day i'm today.
   
   Looking at today's log i see no link state changed messages
   but i see this other messages that started happening more or
   less at the same time i lost connectivity to the server:
   
   Dec 10 18:20:32 yac kernel: re0: link state changed to DOWN
   Dec 10 18:20:32 yac kernel: re0: PHY read failed
   
 
 I've reverted r185756 which caused GMII access issues on some
 controllers. If you are brave enough to try beta code, you can
 get latest re(4) in the following URL. Note, I don't have PCIe
 based RealTek controllers so the code was not tested at all.
 
 http://people.freebsd.org/~yongari/re/if_re.c
 http://people.freebsd.org/~yongari/re/if_rlreg.h

I've recompiled the kernel with the first file in sys/dev/re/
and the second one in sys/pci/. I'm still testing with MSI enabled.

So far I've tried rebooting using nextboot(8) (so that I could boot
again if I lost the network card) and the card seems to work,
but I'll continue stress testing the machine with stress + dd +
iperf and see if I can take it down. I'll let you know how it goes.

Regards.



Re: [ATA] and re(4) stability issues

2008-12-10 Thread Andrey V. Elsukov

Victor Balada Diaz wrote:

Digging at linux source code i've found that they do some special things
for this chipset that i've been unable to find on our code. This is
linux code for my chipset:

AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
             AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
             AHCI_HFLAG_SECT255),

File and the rest of the code in here[3].

As i saw AHCI_HFLAG_NO_MSI i tried doing the easiest thing i could
think of, switching MSI and MSI-x off for the whole system, so
i added to /boot/loader.conf this tunables:


FreeBSD's ata(4) driver doesn't support MSI. In Linux's libata this flag is used in:

if ((hpriv->flags & AHCI_HFLAG_NO_MSI) || pci_enable_msi(pdev))
    pci_intx(pdev, 1);

In FreeBSD's code we have the same:

/* enable PCI interrupt */
pci_write_config(dev, PCIR_COMMAND,
    pci_read_config(dev, PCIR_COMMAND, 2) & ~0x0400, 2);

The AHCI_HFLAG_IGN_SERR_INTERNAL flag is meant to ignore SERR_INTERNAL errors.
FreeBSD's ata(4) driver ignores them too.

The AHCI_HFLAG_32BIT_ONLY flag limits the controller to 32-bit DMA.
In FreeBSD, if the AHCI CAP register reports that the controller supports
64-bit DMA, the driver will use 64-bit.
So I think a quirk could be added for you, but I'm not sure that the problem
is here.

The AHCI_HFLAG_SECT255 flag limits I/O operations to 255 sectors; FreeBSD uses
a limit of 128 by default.
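
The pci_write_config() call quoted above clears bit 10 (0x0400) of the PCI command register, which is the INTx "interrupt disable" bit, so legacy interrupts stay enabled. A minimal sketch of that read-modify-write on a plain 16-bit value follows; the macro and function names are made up for illustration and are not the driver's own.

```c
#include <stdint.h>

/* Bit 10 of the PCI command register: INTx interrupt disable. */
#define	CMD_INTX_DISABLE	0x0400u

/* Return the command word with legacy INTx interrupts enabled. */
uint16_t
enable_intx(uint16_t cmd)
{
	return ((uint16_t)(cmd & ~CMD_INTX_DISABLE));
}

/* Non-zero when the command word leaves INTx enabled. */
int
intx_enabled(uint16_t cmd)
{
	return ((cmd & CMD_INTX_DISABLE) == 0);
}
```

All other bits of the command word are left as they were, which is why the register is read before it is written back.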

--
WBR, Andrey V. Elsukov



Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
 On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
   Hello,
   
   I got various machines[1] at hetzner.de and I've been having problems
   with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
   been trying to narrow the problem so someone more knowledgeable than me
   is able to fix it. This mail is an other attempt to ask a question
   with regards ATA code to see if this time i got something.
   
   For the ones that don't actually know what happened:
   
   With FreeBSD 7.0 -RELEASE for amd64 and default kernel
   the system shared re0 interrupt with OHCI and this caused
   re(4) to corrupt packets and create interrupt storms. Tried
 
 re(4) in 7.0-RELEASE had bus_dma(9) bug which could be easily
 triggered on systems with > 4GB memory. But I don't know whether
 this is related with interrupt storms.
 
   updating to 7.1 -BETA2 and still had some problems with it.
   
   I've opened the PR kern/128287[2] and Remko quickly answered
   with a workaround: that workaround was removing USB support from
   my kernel. I did it and re(4) wasn't sharing interrupts any longer,
   and the interrupt storms were gone. Now sometime later the interface
   goes up and down from time to time, but less often. Also sometimes
   the machine loses the network interface but continues to work.
   
 
 It seems that your controller supports MSI so you can set a tunable
 hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
 interrupt sharing(e.g. add hw.re.msi_disable=0 to
 /boot/loader.conf file.) However there were several issues on re(4)
 w.r.t MSI so it was off by default.

This is undocumented and I can't find the tunable with sysctl -a. Is this
a HEAD feature or is it also in 7.1 -BETA2? Should I add
hw.re_msi_disable=0 to /boot/loader.conf?

The interrupt was shared with USB. Does USB need any special MSI handling,
or is re(4) using MSI enough to stop sharing the interrupt?


 
   I know it continues to work because some days later i can see that
   it tried to deliver the status reports but was unable to resolve the
   aliases hostnames. I can't ping the machine and i know the network
   is OK. If i reboot the machine everything is working again.
   
 
 Recently I've made small changes to re(4) which may help to detect
 link state change event. Would you try re(4) in HEAD?

Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
or do i need to test the whole HEAD kernel?

Regards.



Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 11:58:12AM +0300, Andrey V. Elsukov wrote:
 Victor Balada Diaz wrote:
 Digging at linux source code i've found that they do some special things
 for this chipset that i've been unable to find on our code. This is
 linux code for my chipset:
 
 AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
              AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
              AHCI_HFLAG_SECT255),
 
 File and the rest of the code in here[3].
 
 As i saw AHCI_HFLAG_NO_MSI i tried doing the easiest thing i could
 think of, switching MSI and MSI-x off for the whole system, so
 i added to /boot/loader.conf this tunables:
 
 FreeBSD's ata(4) driver doesn't support MSI. This flag in linux's libata used in
 
 if ((hpriv->flags & AHCI_HFLAG_NO_MSI) || pci_enable_msi(pdev))
     pci_intx(pdev, 1);
 
 In FreeBSD's code we have the same:
 
 /* enable PCI interrupt */
 pci_write_config(dev, PCIR_COMMAND,
     pci_read_config(dev, PCIR_COMMAND, 2) & ~0x0400, 2);
 
 AHCI_HFLAG_IGN_SERR_INTERNAL flag targeted to ignore SERR_INTERNAL errors.
 FreeBSD's ata(4) driver ignores they too.
 
 AHCI_HFLAG_32BIT_ONLY flag limits to use 32-bit DMA only.
 If AHCI CAP register reports that controller supports 64-bit DMA driver will
 use 64-bit.
 So i think there can be added one quirk for you, but i'm not sure that problem
 is here..
 
 AHCI_HFLAG_SECT255 flag limits I/O operation to 255 sectors, FreeBSD uses
 128-limit by default.

Thanks for explaining to me what the flags do. I'm not skilled enough to create
the DMA quirks, but if you could give me some patches I'll test them. Also,
if you have any other ideas on what I could test or how I can debug this,
they would be more than welcome.

Thanks.
Regards.



Re: [ATA] and re(4) stability issues

2008-12-10 Thread Søren Schmidt

On 10Dec, 2008, at 10:11 , Victor Balada Diaz wrote:


Thanks for explaining me what the flags do. I'm not skilled enough to create
the DMA quirks but if you could give me some patches i'll test them. Also
if you have any other idea on what could i test or how can i debug this
it would be more than welcome.



Comment out the following two lines in ata_ahci_dmainit():

if (ATA_INL(ctlr->r_res2, ATA_AHCI_CAP) & ATA_AHCI_CAP_64BIT)
    ch->dma->max_address = BUS_SPACE_MAXADDR;

And you will not use 64bit DMA even if the chipset supports it.  
However I have not seen any chipsets supporting this fail, YMMV as  
usual :)
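
The effect of commenting out those two lines can be sketched as a small decision function. This is an illustration only: the constants mirror the 64-bit addressing bit of the AHCI CAP register (bit 31) and the usual 32-bit and 64-bit address limits, force_32bit is a hypothetical quirk flag that does not exist in the driver, and the fallback to a 32-bit limit is an assumption about the driver's default.

```c
#include <stdint.h>

#define	CAP_64BIT	0x80000000u		/* CAP bit 31: 64-bit addressing */
#define	MAXADDR_32BIT	0xFFFFFFFFull
#define	MAXADDR_64BIT	0xFFFFFFFFFFFFFFFFull

/*
 * Highest address the channel may use for DMA.  With force_32bit set
 * (the hypothetical quirk) the capability bit is ignored and the
 * assumed 32-bit default stays in place.
 */
uint64_t
dma_max_address(uint32_t cap, int force_32bit)
{
	if (!force_32bit && (cap & CAP_64BIT) != 0)
		return (MAXADDR_64BIT);
	return (MAXADDR_32BIT);
}
```

A per-chipset quirk along these lines would let a single controller be held to 32-bit DMA without changing the behaviour for every other AHCI chipset.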


-Søren








Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
 On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
   On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
  Hello,
  
  I got various machines[1] at hetzner.de and I've been having problems
  with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
  been trying to narrow the problem so someone more knowledgeable than me
  is able to fix it. This mail is an other attempt to ask a question
  with regards ATA code to see if this time i got something.
  
  For the ones that don't actually know what happened:
  
  With FreeBSD 7.0 -RELEASE for amd64 and default kernel
  the system shared re0 interrupt with OHCI and this caused
  re(4) to corrupt packets and create interrupt storms. Tried

re(4) in 7.0-RELEASE had bus_dma(9) bug which could be easily
triggered on systems with > 4GB memory. But I don't know whether
this is related with interrupt storms.

  updating to 7.1 -BETA2 and still had some problems with it.
  
  I've opened the PR kern/128287[2] and Remko quickly answered
  with a workaround: that workaround was removing USB support from
  my kernel. I did it and re(4) wasn't sharing interrupts any longer,
  and the interrupt storms were gone. Now sometime later the interface
  goes up and down from time to time, but less often. Also sometimes
  the machine loses the network interface but continues to work.
  

It seems that your controller supports MSI so you can set a tunable
hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
interrupt sharing(e.g. add hw.re.msi_disable=0 to
/boot/loader.conf file.) However there were several issues on re(4)
w.r.t MSI so it was off by default.
   
   This is undocumented and with sysctl -a i can't find the tunable. Is this
   a HEAD feature or it's also in 7.1 -BETA2? Should i add
 
 Yeah it's an undocumented feature. But most drivers written by me
 have similar knobs. Both HEAD and stable/7 including 7.1 BETA2 have
 the tunable.

I think it would be great if you could document it, or at least
have it show up with a small description in sysctl -ad.

 
   hw.re_msi_disable=0 to /boot/loader.conf?
Should be hw.re.msi_disable=0 (a dot after "re", not an underscore)
   
 
 Yes, just add it to /boot/loader.conf. Note, you should not disable
 system-wide MSI control(e.g. hw.pci.enable_msi == 1).
 
   This was sharing interrupt with USB, does USB need any special MSI handling
   or with re using MSI is enough to not share the interrupt?
 
 If re(4) can use MSI, you don't need to worry about interrupt
 sharing with USB. Check the output of vmstat -i. You normally get
 an irq256 or higher for MSI enabled driver.
 
   
   

  I know it continues to work because some days later i can see that
  it tried to deliver the status reports but was unable to resolve the
  aliases hostnames. I can't ping the machine and i know the network
  is OK. If i reboot the machine everything is working again.
  

Recently I've made small changes to re(4) which may help to detect
link state change event. Would you try re(4) in HEAD?
   
   Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
 
 Yes, you can. It should build without problems. Just replace re(4) on
 stable/7 with HEAD version.
 
   or do i need to test the whole HEAD kernel?
   
 
 No, you don't have to do that.

While backporting the changes I found that it didn't compile, so in
the end I took the following files from HEAD:

base/head/sys/dev/re/if_re.c
base/head/sys/pci/if_rl.c
base/head/sys/pci/if_rlreg.h

After that i've recompiled 7.1 -BETA2 GENERIC kernel and enabled
the knob you suggested in /boot/loader.conf.

With the new kernel and MSI the interrupts are like this:

# vmstat -i
interrupt                          total       rate
irq9: acpi0                            1          0
irq16: ohci0                           1          0
irq17: ohci1 ohci3                     1          0
irq18: ohci2 ohci4                     1          0
irq22: atapci0                     19215         15
cpu0: timer                      2502718       1998
irq256: re0                      4967726       3967
cpu1: timer                      2502525       1998
Total                            9992188       7980

The high interrupt counts are there because I've been running iperf to
check that everything is fine, not because of interrupt storms. So far
I haven't seen any interrupt storms related to USB or the re(4) driver,
but while doing the tests I found this error:

re0: watchdog timeout (missed Tx interrupts) -- recovering

This didn't create any error on the interfaces (netstat -i).
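
Pyun's rule of thumb from earlier in the thread (an MSI-enabled driver normally shows up as irq256 or higher in vmstat -i) can be checked mechanically. The sketch below parses the first column of a vmstat -i line; the function names are hypothetical and the 256 cut-off is taken from that remark, not from any kernel header.

```c
#include <stdio.h>

/*
 * Extract the IRQ number from a vmstat -i source such as
 * "irq256: re0".  Returns -1 for lines that do not start with
 * "irq<number>", for example "cpu0: timer".
 */
int
irq_number(const char *source)
{
	int n;

	if (sscanf(source, "irq%d", &n) != 1)
		return (-1);
	return (n);
}

/* Non-zero when the source looks like an MSI vector (irq256 or higher). */
int
looks_like_msi(const char *source)
{
	return (irq_number(source) >= 256);
}
```

Applied to the output above, "irq256: re0" is classified as MSI while "irq22: atapci0" and the ohci lines are not, which is consistent with re(4) no longer sharing an interrupt with the OHCI controllers.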

Also I didn't see any problems with the interface going up and down,
but that usually happens after some hours of uptime, so I'll let
you know if the error happens again.

Re: [ATA] and re(4) stability issues

2008-12-10 Thread Pyun YongHyeon
On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
  On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
   On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
 On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
  On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
Hello,

I got various machines[1] at hetzner.de and I've been having problems
with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
been trying to narrow the problem so someone more knowledgeable than me
is able to fix it. This mail is an other attempt to ask a question
with regards ATA code to see if this time i got something.

For the ones that don't actually know what happened:

With FreeBSD 7.0 -RELEASE for amd64 and default kernel
the system shared re0 interrupt with OHCI and this caused
re(4) to corrupt packets and create interrupt storms. Tried
  
  re(4) in 7.0-RELEASE had bus_dma(9) bug which could be easily
  triggered on systems with > 4GB memory. But I don't know whether
  this is related with interrupt storms.
  
updating to 7.1 -BETA2 and still had some problems with it.

I've opened the PR kern/128287[2] and Remko quickly answered
with a workaround: that workaround was removing USB support from
my kernel. I did it and re(4) wasn't sharing interrupts any longer,
and the interrupt storms were gone. Now sometime later the interface
goes up and down from time to time, but less often. Also sometimes
the machine loses the network interface but continues to work.

  
  It seems that your controller supports MSI so you can set a tunable
  hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
  interrupt sharing(e.g. add hw.re.msi_disable=0 to
  /boot/loader.conf file.) However there were several issues on re(4)
  w.r.t MSI so it was off by default.
 
 This is undocumented and with sysctl -a i can't find the tunable. Is this
 a HEAD feature or it's also in 7.1 -BETA2? Should i add
   
   Yeah it's an undocumented feature. But most drivers written by me
   have similar knobs. Both HEAD and stable/7 including 7.1 BETA2 have
   the tunable.
  
  I think it could be great if you could document it or at least
  show it by default when you do sysctl -ad with a small description.
  

If MSI worked as expected I would have documented it as I did
in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc.
Using MSI on RealTek does not seem to be stable. I tried hard to fix
that but some users still reported watchdog timeouts. Working
without documentation and hardware also made it hard to complete
the work. This was the main reason why MSI was disabled on re(4).

   
 hw.re_msi_disable=0 to /boot/loader.conf?
  ^
  Should be hw.re.msi_disable=0
 
   
   Yes, just add it to /boot/loader.conf. Note, you should not disable
   system-wide MSI control(e.g. hw.pci.enable_msi == 1).
   
 This was sharing interrupt with USB, does USB need any special MSI handling
 or with re using MSI is enough to not share the interrupt?
   
   If re(4) can use MSI, you don't need to worry about interrupt
   sharing with USB. Check the output of vmstat -i. You normally get
   an irq256 or higher for MSI enabled driver.
   
 
 
  
I know it continues to work because some days later i can see that
it tried to deliver the status reports but was unable to resolve the
aliases hostnames. I can't ping the machine and i know the network
is OK. If i reboot the machine everything is working again.

  
  Recently I've made small changes to re(4) which may help to detect
  link state change event. Would you try re(4) in HEAD?
 
 Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
   
   Yes, you can. It should build without problems. Just replace re(4) on
   stable/7 with HEAD version.
   
 or do i need to test the whole HEAD kernel?
 
   
   No, you don't have to do that.
  
  Backporting the changes i've found that it didn't compile so in
  the end i got from HEAD the following files:
  
  base/head/sys/dev/re/if_re.c
  base/head/sys/pci/if_rl.c
  base/head/sys/pci/if_rlreg.h
  

Ah, sorry about that. There were some changes recently; I forgot
about them.

  After that i've recompiled 7.1 -BETA2 GENERIC kernel and enabled
  the knob you suggested in /boot/loader.conf.
  
  With the new kernel and MSI the interrupts are like this:
  
  # vmstat -i
  interrupt                          total       rate
  irq9: acpi0                            1          0
  irq16: ohci0                           1          0
  irq17: ohci1 ohci3                     1          0
  irq18: ohci2 ohci4                     1          0
  

Re: [ATA] and re(4) stability issues

2008-12-10 Thread Arnaud Houdelette

Victor Balada Diaz wrote:

Hello,

I got various machines[1] at hetzner.de and I've been having problems
with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
been trying to narrow the problem so someone more knowledgeable than me
is able to fix it. This mail is an other attempt to ask a question
with regards ATA code to see if this time i got something.

For the ones that don't actually know what happened:

With FreeBSD 7.0 -RELEASE for amd64 and default kernel
the system shared re0 interrupt with OHCI and this caused
re(4) to corrupt packets and create interrupt storms. Tried
updating to 7.1 -BETA2 and still had some problems with it.

I've opened the PR kern/128287[2] and Remko quickly answered
with a workaround: that workaround was removing USB support from
my kernel. I did it and re(4) wasn't sharing interrupts anylonger,
and the interrupt storms were gone. Now sometime later the interface
goes up and down from time to time, but less often. Also sometimes
the machine losts the network interface but continues to work.

I know it continues to work because some days later i can see that
it tried to deliver the status reports but was unable to resolve the
aliases hostnames. I can't ping the machine and i know the network
is OK. If i reboot the machine everything is working again.

When switched from 7.0 to 7.1 BETA2 i also found that under load
after some hours the machine created interrupt storms on ATA disks.

Digging at linux source code i've found that they do some special things
for this chipset that i've been unable to find on our code. This is
linux code for my chipset:

AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
             AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
             AHCI_HFLAG_SECT255),

File and the rest of the code in here[3].

As i saw AHCI_HFLAG_NO_MSI i tried doing the easiest thing i could
think of, switching MSI and MSI-x off for the whole system, so
i added to /boot/loader.conf this tunables:

hw.pci.enable_msix=0
hw.pci.enable_msi=0

And then rebooted the machine. After various hours of doing almost nothing
i've found that the machine answered ping but was unable to answer any
request (eg, ssh, nagios nrpe, etc). The machine recovered itself after
some minutes and when i was able to ssh into i saw the following in dmesg:

ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158

and a lot more errors like that. I didn't get this errors with MSI enabled.
I see WRITE_DMA48 and in linux code i saw AHCI_HFLAG_32BIT_ONLY which is later
used for DMA related things. Could someone who is more knowledgeable check
if we're doing the right thing?

I've attached verbose dmesg of a machine that's like this one with
7.1 -BETA2, MSI enabled and GENERIC kernel minus USB and firewrire.

Also, please, could someone give me a hand on how could i continue debugging
this interrupt issues? I'm a bit lost and digging code and posting each
time i think i've found something is not going to go anywhere.

I would also like to say that i've seen reports of this kind of problems
on amd64 machines in the lists since various years ago, so i don't think
this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital)
on the lists


Thanks in advance for any help.
Regards.


[1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
[2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
[3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369
  


Sorry I didn't take the time to read the whole thread, but I had a similar
problem with the same IXP600 chipset.
Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one.
The symptoms were similar: interrupt 22 was shared between the SATA
controller and the wireless card, and I got interrupt storms at random
times when using the wireless network.


No problem since I removed the ral(4) NIC (got a real access point now).
You might not want to point the finger at the re(4) driver too fast.

Arnaud Houdelette




Re: [ATA] and re(4) stability issues

2008-12-10 Thread Pyun YongHyeon
On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
  On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
   On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
 Hello,
 
 I got various machines[1] at hetzner.de and I've been having problems
 with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
 been trying to narrow the problem so someone more knowledgeable than me
 is able to fix it. This mail is an other attempt to ask a question
 with regards ATA code to see if this time i got something.
 
 For the ones that don't actually know what happened:
 
 With FreeBSD 7.0 -RELEASE for amd64 and default kernel
 the system shared re0 interrupt with OHCI and this caused
 re(4) to corrupt packets and create interrupt storms. Tried
   
   re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily
   triggered on systems with > 4GB memory. But I don't know whether
   this is related to the interrupt storms.
   
 updating to 7.1 -BETA2 and still had some problems with it.
 
 I've opened the PR kern/128287[2] and Remko quickly answered
 with a workaround: that workaround was removing USB support from
 my kernel. I did it and re(4) wasn't sharing interrupts anylonger,
 and the interrupt storms were gone. Now sometime later the interface
 goes up and down from time to time, but less often. Also sometimes
 the machine losts the network interface but continues to work.
 
   
   It seems that your controller supports MSI so you can set a tunable
   hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
   interrupt sharing(e.g. add hw.re.msi_disable=0 to
   /boot/loader.conf file.) However there were several issues on re(4)
   w.r.t MSI so it was off by default.
  
  This is undocumented and with sysctl -a i can't find the tunable. Is this
  a HEAD feature or it's also in 7.1 -BETA2? Should i add

Yeah it's an undocumented feature. But most drivers written by me
have similar knobs. Both HEAD and stable/7 including 7.1 BETA2 have
the tunable.

  hw.re_msi_disable=0 to /boot/loader.conf?
   ^
   Should be hw.re.msi_disable=0
  

Yes, just add it to /boot/loader.conf. Note, you should not disable
the system-wide MSI control (i.e. keep hw.pci.enable_msi set to 1).

  This was sharing interrupt with USB, does USB need any special MSI handling
  or with re using MSI is enough to not share the interrupt?

If re(4) can use MSI, you don't need to worry about interrupt
sharing with USB. Check the output of vmstat -i. You normally get
an irq256 or higher for MSI enabled driver.
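
The vmstat -i check can be scripted. This is a minimal sketch, assuming
each interrupt line has the form "irqN: dev [dev ...] total rate"; the
function name irq_mode is made up for illustration, and the 256 threshold
is the rule of thumb given above.

```shell
# Classify the interrupt used by a device from `vmstat -i` output on stdin.
# Prints "msi" when the irq number is 256 or higher, "legacy" when it is
# lower, and "absent" when the device is not listed at all.
# Assumption: lines look like "irqN: dev [dev ...] total rate".
irq_mode() {
    dev="$1"
    awk -v dev="$dev" '
        /^irq[0-9]+:/ {
            irq = $1
            sub(/^irq/, "", irq)
            sub(/:$/, "", irq)
            # Fields between the irq label and the last two columns are devices.
            for (i = 2; i <= NF - 2; i++) {
                if ($i == dev) {
                    found = 1
                    mode = (irq + 0 >= 256) ? "msi" : "legacy"
                    print mode
                    exit
                }
            }
        }
        END { if (!found) print "absent" }
    '
}
```

Typical use would be: vmstat -i | irq_mode re0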

  
  
   
 I know it continues to work because some days later i can see that
 it tried to deliver the status reports but was unable to resolve the
 aliases hostnames. I can't ping the machine and i know the network
 is OK. If i reboot the machine everything is working again.
 
   
   Recently I've made small changes to re(4) which may help to detect
   link state change event. Would you try re(4) in HEAD?
  
  Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that

Yes, you can. It should build without problems. Just replace re(4) on
stable/7 with HEAD version.

  or do i need to test the whole HEAD kernel?
  

No, you don't have to do that.

-- 
Regards,
Pyun YongHyeon


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Oliver Peter
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
 Hello,
 
 I got various machines[1] at hetzner.de and I've been having problems
 with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
 been trying to narrow the problem so someone more knowledgeable than me
 is able to fix it. This mail is an other attempt to ask a question
 with regards ATA code to see if this time i got something.

Just want to add a quick note and say that I'm having the same problem
with my 7.0-RELEASE-p6/amd64 hetzner machine:

 http://lists.freebsd.org/pipermail/freebsd-acpi/2008-September/005095.html

I would be happy to test patches as well.  Thanks.

-- 
Oliver PETER, email: [EMAIL PROTECTED], ICQ# 113969174
If it feels good, you're doing something wrong.
  -- Coach McTavish


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 12:08:40PM +, Oliver Peter wrote:
 On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
  Hello,
  
  I got various machines[1] at hetzner.de and I've been having problems
  with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
  been trying to narrow the problem so someone more knowledgeable than me
  is able to fix it. This mail is an other attempt to ask a question
  with regards ATA code to see if this time i got something.
 
 Just want to add a quick note and say that I'm having the same problem
 with my 7.0-RELEASE-p6/amd64 hetzner machine:
 
  
 http://lists.freebsd.org/pipermail/freebsd-acpi/2008-September/005095.html
 
 I would be happy to test patches as well.  Thanks.

Hello Oliver,

What I did so far, which improved things a lot, was:

1) Upgrade at least the if_re code to RELENG_7. This fixes the
   packet corruption on ssh sessions.

2) Delete USB and FireWire from your kernel config. This prevents
   the Realtek interrupt from being shared.

After this, with 7.1 -BETA2 the systems are more or less stable, but
after a while the ATA controller starts to create interrupt storms.
I wasn't able to find out why.

With the help that I've received in this thread from Pyun
YongHyeon (Thanks!!) I'm also trying these suggestions:

3) Backport these 3 files from current to 7.1 -BETA2:

base/head/sys/dev/re/if_re.c
base/head/sys/pci/if_rl.c
base/head/sys/pci/if_rlreg.h

You can fetch them from http://svn.freebsd.org/. With them and
adding to /boot/loader.conf this tunable:

hw.re.msi_disable=0

I can use the GENERIC kernel again (i.e. USB enabled) and so far
I haven't found any problems. No more interface up/down problems
and no more interrupt storms. I must say that I haven't tested
this enough, because the interrupt storms in the ATA code start to
happen after a few days of uptime under load, but at least the problems
with the Realtek seem to be gone.
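
Setting the tunable can be made repeatable across machines. This is a
rough sketch, assuming a loader.conf-style file of key=value lines;
set_tunable is a hypothetical helper, not an existing tool.

```shell
# Set key=value in a loader.conf-style file, replacing any existing
# assignment of the same key or appending one if none exists.
# Running it twice leaves a single assignment in the file.
set_tunable() {
    file="$1"
    key="$2"
    value="$3"
    tmp="${file}.tmp.$$"
    [ -f "$file" ] || : > "$file"
    # Keep every line that does not start with "key=" (literal match,
    # so the dots in the key are not treated as regex wildcards).
    awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}
```

For example: set_tunable /boot/loader.conf hw.re.msi_disable 0
The change only takes effect after a reboot, since loader tunables are
read at boot time.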

If you upgrade to 7.1 -BETA2 you'll also get SATA support for
the IXP card. With 7.0 it will work as ATA 33 in compatibility mode.

Maybe someone with write access to the wiki could add it somewhere
so that other hetzner users that are having the same problems
could use the same workarounds :)

I hope this helps you.

Regards.

-- 
La prueba más fehaciente de que existe vida inteligente en otros
planetas, es que no han intentado contactar con nosotros. 


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Gary Jennejohn
On Wed, 10 Dec 2008 21:07:19 +0900
Pyun YongHyeon [EMAIL PROTECTED] wrote:
 On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:

   As these seems to improve the current situation, is there any
   chance of merging -current driver in 7.1 before release?
  

 I think re(4) in HEAD needs more testing. As you might know RealTek
 produced too many chipsets. :-(


FYI I've now turned MSI on in HEAD and will see what happens.  Before
my re0 was sharing interrupts with 3 USB controllers.  Now it's all
by itself on irq256.

I'm running amd64 with

re0: RealTek 8168/8168B/8168C/8168CP/8168D/8111B/8111C/8111CP PCIe
Gigabit Ethernet port 0xde00-0xdeff mem 0xfdaff000-0xfdaf,
0xfdae-0xfdae irq 18 at device 0.0 on pci2

---
Gary Jennejohn


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
 On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
   On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
  On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
   On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
 Hello,
 
 I got various machines[1] at hetzner.de and I've been having problems
 with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
 been trying to narrow the problem so someone more knowledgeable than me
 is able to fix it. This mail is an other attempt to ask a question
 with regards ATA code to see if this time i got something.
 
 For the ones that don't actually know what happened:
 
 With FreeBSD 7.0 -RELEASE for amd64 and default kernel
 the system shared re0 interrupt with OHCI and this caused
 re(4) to corrupt packets and create interrupt storms. Tried
   
   re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily
   triggered on systems with > 4GB memory. But I don't know whether
   this is related to the interrupt storms.
   
 updating to 7.1 -BETA2 and still had some problems with it.
 
 I've opened the PR kern/128287[2] and Remko quickly answered
 with a workaround: that workaround was removing USB support from
 my kernel. I did it and re(4) wasn't sharing interrupts anylonger,
 and the interrupt storms were gone. Now sometime later the interface
 goes up and down from time to time, but less often. Also sometimes
 the machine losts the network interface but continues to work.
 
   
   It seems that your controller supports MSI so you can set a tunable
   hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
   interrupt sharing(e.g. add hw.re.msi_disable=0 to
   /boot/loader.conf file.) However there were several issues on re(4)
   w.r.t MSI so it was off by default.
  
  This is undocumented and with sysctl -a i can't find the tunable. Is this
  a HEAD feature or it's also in 7.1 -BETA2? Should i add

Yeah it's an undocmented feature. But most drivers written by me
have similar kobs. Both HEAD and stable/7 including 7.1 BETA2 have
the tunable.
   
   I think it could be great if you could document it or at least
   show it by default when you do sysctl -ad with a small description.
   
 
 If MSI worked as expected I would have documented it as I did
 in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc.
 Using MSI on RealTek does not seem to stable. I tried hard to fix
 that but some users still reported watchdog timeouts. Working
 without documentation and hardware also made it hard to complete
 the work. This was the main reason why MSI was disabled on re(4).

What do you think about adding a note to the man page saying that
it's experimental, and that in some cases it could improve the situation
but in others it will give errors?

 

  hw.re_msi_disable=0 to /boot/loader.conf?
   ^
   Shoule be hw.re.msi_disable=0
  

Yes, just add it to /boot/loader.conf. Note, you should not disable
system-wide MSI control(e.g. hw.pci.enable_msi == 1).

  This was sharing interrupt with USB, does USB need any special MSI handling
  or with re using MSI is enough to not share the interrupt?

If re(4) can use MSI, you don't need to worry about interrupt
sharing with USB. Check the output of vmstat -i. You normally get
an irq256 or higher for MSI enabled driver.

  
  
   
 I know it continues to work because some days later i can see that
 it tried to deliver the status reports but was unable to resolve the
 aliases hostnames. I can't ping the machine and i know the network
 is OK. If i reboot the machine everything is working again.
 
   
   Recently I've made small changes to re(4) which may help to detect
   link state change event. Would you try re(4) in HEAD?
  
  Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that

Yes, you can. It should build without problems. Just replace re(4) on
stable/7 with HEAD version.

  or do i need to test the whole HEAD kernel?
  

No you don't have to that.
   
   Backporting the changes i've found that it didn't compile so in
   the end i got from HEAD the following files:
   
   base/head/sys/dev/re/if_re.c
   base/head/sys/pci/if_rl.c
   base/head/sys/pci/if_rlreg.h
   
 
 Ah,, sorry about that. Recently there was some changes. I forgot
 that.
 
   After that i've recompiled 7.1 -BETA2 GENERIC kernel and enabled
   the knob you suggested in /boot/loader.conf.
   
   With 

Re: [ATA] and re(4) stability issues

2008-12-10 Thread Oliver Peter
On Wed, Dec 10, 2008 at 03:01:30PM +0100, Victor Balada Diaz wrote:
 On Wed, Dec 10, 2008 at 12:08:40PM +, Oliver Peter wrote:
  On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
...
 I can use GENERIC kernel again (ie, USB enabled) and so far
 i didn't find any problem yet. No more interface up/down problems
 and no more interrupt storms. I must say that i haven't tested
 this enough, because the interrupt storms in ATA code start to
 happen after a few days of uptime load, but at least the problems
 with the realtek seem to be gone. 

I found out that I'm able to 'force' the interrupt storm by provoking
higher disk I/O.  Just let dd write to a file in a loop for some hours
and watch vmstat:

while true; do dd if=/dev/zero of=BLA bs=1M count=1000; done

First you'll see that the throughput will decrease, and a few
hours later you'll have /var/log/messages / dmesg full of
interrupt storm messages.
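
To see which interrupt source is storming, the log lines can be tallied.
This is a small sketch, assuming the kernel message has the form
'interrupt storm detected on "irqN:"; throttling interrupt source' (the
exact wording is an assumption and may differ between releases);
storm_summary is a hypothetical name.

```shell
# Count interrupt storm messages per interrupt source.
# Reads the named log files, or stdin when no file is given, and prints
# "count source" lines with the most frequent source first.
# Assumption: the source name appears in double quotes after
# "interrupt storm detected on".
storm_summary() {
    grep 'interrupt storm detected on' "$@" |
        sed 's/.*interrupt storm detected on "\([^"]*\)".*/\1/' |
        sort | uniq -c | sort -rn
}
```

For example: storm_summary /var/log/messages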
 
 If you upgrade to 7.1 -BETA2 you'll also get SATA support for
 the IXP card. With 7.0 it will work as ATA 33 in compatibility mode.

Wow!  That's good to hear as well.  I'll definitely switch to
-STABLE or 7.1-PRERELEASE sooner or later.  I'll just give it a try
on my other machines first.
 
 I hope this helps you.

Absolutely, cheers mate.  I owe you one!

~ollie

-- 
Oliver PETER, email: [EMAIL PROTECTED], ICQ# 113969174
If it feels good, you're doing something wrong.
  -- Coach McTavish


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 01:18:00PM +0100, Arnaud Houdelette wrote:
 Victor Balada Diaz wrote:
 Hello,
 
 I got various machines[1] at hetzner.de and I've been having problems
 with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
 been trying to narrow the problem so someone more knowledgeable than me
 is able to fix it. This mail is an other attempt to ask a question
 with regards ATA code to see if this time i got something.
 
 [...] 
 
 Sorry I didn't take the time to read the whole thread, but I had a similar
 problem with the same IXP600 chipset.
 Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one.
 The symptoms were similar: interrupt 22 was shared between the SATA
 controller and the wireless card, and I got interrupt storms at random
 times when using the wireless network.
 
 No problem since I removed the ral(4) NIC (got a real access point now).
 You might not want to point the finger at the re(4) driver too fast.
 
 Arnaud Houdelette
Hello Arnaud,

I didn't say the problem was just because of re(4). Actually I think
there were two problems, one with re(4) and another with ata(4). The reason
why I talked about both of them in the same mail is that I thought
that, as two drivers were affected, maybe the problem was in another part
of the operating system, and that this could help the developers debug the
problem.

My re(4) card isn't sharing the interrupt with the IXP600; it's sharing
the interrupt with the USB controller. In this case I think the problem
is fixed by the advice from Pyun YongHyeon (backporting the driver
from HEAD and using MSI for interrupts).

I think the problems with ata(4) code will appear again after a few
days of load, as they always do, so i'll keep trying to debug them.

Regards.

-- 
La prueba más fehaciente de que existe vida inteligente en otros
planetas, es que no han intentado contactar con nosotros. 


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Peter Jeremy
On 2008-Dec-10 10:55:35 +0100, Søren Schmidt [EMAIL PROTECTED] wrote:
And you will not use 64bit DMA even if the chipset supports it.  
However I have not seen any chipsets supporting this fail, YMMV as  
usual :)

There's a reference in wikipedia pointing to
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06694.html
that claims the AMD/ATI SB600 lies about supporting 64-bit DMA in AHCI
mode.  I have a SB600 but it doesn't have 4GB to test on.

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.




Re: [ATA] and re(4) stability issues

2008-12-10 Thread Pyun YongHyeon
On Wed, Dec 10, 2008 at 03:08:24PM +0100, Victor Balada Diaz wrote:
  On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:

[...]

 It seems that your controller supports MSI so you can set a tunable
 hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
 interrupt sharing(e.g. add hw.re.msi_disable=0 to
 /boot/loader.conf file.) However there were several issues on re(4)
 w.r.t MSI so it was off by default.

This is undocumented and with sysctl -a i can't find the tunable. Is this
a HEAD feature or it's also in 7.1 -BETA2? Should i add
  
  Yeah it's an undocmented feature. But most drivers written by me
  have similar kobs. Both HEAD and stable/7 including 7.1 BETA2 have
  the tunable.
 
 I think it could be great if you could document it or at least
 show it by default when you do sysctl -ad with a small description.
 
   
   If MSI worked as expected I would have documented it as I did
   in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc.
   Using MSI on RealTek does not seem to stable. I tried hard to fix
   that but some users still reported watchdog timeouts. Working
   without documentation and hardware also made it hard to complete
   the work. This was the main reason why MSI was disabled on re(4).
  
  What do you think about adding a note in the man page telling that
  it's experimental and in some cases it could improve the situation
  but in others it will give errors? 

Based on your testing I have an idea how to mitigate the missing
Tx completion interrupt. If all goes well, re(4) could reliably take
advantage of MSI on RealTek controllers. If that fails miserably I
will do as you suggested.

   
   I think re(4) in HEAD needs more testing. As you might know RealTek
   produced too many chipsets. :-(
  
  Ok, i'll use the backported driver as it works better for me :-)
  
  If i can help you testing any patches i'm more than welcome to do it.
  
  Thanks a lot for your help Pyun YongHyeon.
  

You're welcome.
-- 
Regards,
Pyun YongHyeon


Re: [ATA] and re(4) stability issues

2008-12-10 Thread Victor Balada Diaz
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
 On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
   Also i didn't see any problem with interfaces going up and down,
   but that usually happen after some hours of uptime, so i'll let
   you know if the error happens again.
   

After writing to the HD with dd for a few hours and using
stress -i 10 -d 10 the machine lost connectivity. I waited until
today to be sure whether the machine hung, panicked or just lost network
connectivity. I don't have local access or serial access, so this
is the only way I could do it. During the night the logs show
several messages like:


Dec 10 00:33:49 yac kernel: re0: watchdog timeout
Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
Dec 10 00:33:52 yac kernel: re0: link state changed to UP

The interface never recovered and i wasn't able to ping the machine
until i rebooted. Nagios was checking all the time and no recovery
happened.
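
Counting these events per log makes it easier to compare runs. This is
a minimal sketch, assuming log lines in the format shown above;
nic_events is a hypothetical helper.

```shell
# Count watchdog timeouts and link state changes for one interface.
# Reads the named log files, or stdin when no file is given.
# Assumption: messages look like "re0: watchdog timeout" and
# "re0: link state changed to DOWN|UP".
nic_events() {
    ifname="$1"
    shift
    awk -v ifn="$ifname" '
        index($0, ifn ": watchdog timeout")           { wd++ }
        index($0, ifn ": link state changed to DOWN") { down++ }
        index($0, ifn ": link state changed to UP")   { up++ }
        END { printf "watchdog=%d down=%d up=%d\n", wd, down, up }
    ' "$@"
}
```

For example: nic_events re0 /var/log/messages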

The netstat -i in the daily scripts shows just one Oerrs. I'm used to
seeing a lot of them, but it seems this time the card didn't recover from
the only one. I also want to say that this is not a regression, as
it happened before with the 7.1 -BETA2 code.

Is there anything more i can try?

Regards.
-- 
La prueba más fehaciente de que existe vida inteligente en otros
planetas, es que no han intentado contactar con nosotros. 


[ATA] and re(4) stability issues

2008-12-09 Thread Victor Balada Diaz
Hello,

I got various machines[1] at hetzner.de and I've been having problems
with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
been trying to narrow the problem so someone more knowledgeable than me
is able to fix it. This mail is another attempt to ask a question
regarding the ATA code to see if this time I get somewhere.

For the ones that don't actually know what happened:

With FreeBSD 7.0 -RELEASE for amd64 and default kernel
the system shared re0 interrupt with OHCI and this caused
re(4) to corrupt packets and create interrupt storms. Tried
updating to 7.1 -BETA2 and still had some problems with it.

I've opened the PR kern/128287[2] and Remko quickly answered
with a workaround: that workaround was removing USB support from
my kernel. I did it and re(4) wasn't sharing interrupts any longer,
and the interrupt storms were gone. Now, some time later, the interface
goes up and down from time to time, but less often. Also, sometimes
the machine loses the network interface but continues to work.

I know it continues to work because some days later i can see that
it tried to deliver the status reports but was unable to resolve the
aliases hostnames. I can't ping the machine and i know the network
is OK. If i reboot the machine everything is working again.

When I switched from 7.0 to 7.1 BETA2 I also found that under load,
after some hours, the machine created interrupt storms on the ATA disks.

Digging through the Linux source code I've found that they do some special
things for this chipset that I've been unable to find in our code. This is
the Linux code for my chipset:

AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
             AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
             AHCI_HFLAG_SECT255),

File and the rest of the code in here[3].

As I saw AHCI_HFLAG_NO_MSI I tried doing the easiest thing I could
think of, switching MSI and MSI-X off for the whole system, so
I added these tunables to /boot/loader.conf:

hw.pci.enable_msix=0
hw.pci.enable_msi=0

And then rebooted the machine. After several hours of doing almost nothing
I found that the machine answered ping but was unable to answer any
other request (e.g. ssh, nagios nrpe, etc). The machine recovered by itself
after some minutes, and when I was able to ssh in I saw the following in dmesg:

ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158

and a lot more errors like that. I didn't get these errors with MSI enabled.
I see WRITE_DMA48, and in the Linux code I saw AHCI_HFLAG_32BIT_ONLY, which is
later used for DMA-related things. Could someone who is more knowledgeable
check if we're doing the right thing?
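
A note on the two names, with a small sketch: WRITE_DMA48 is the ATA
write command that carries a 48-bit LBA (a sector address on the disk),
whereas a 32-bit-only DMA flag concerns the memory addresses the
controller can reach, so the two are probably unrelated. The LBA in the
log above simply lies beyond what a 28-bit LBA can express. The helper
names below are hypothetical, and 512-byte sectors are assumed.

```shell
# Succeeds when a sector number cannot be expressed as a 28-bit LBA,
# i.e. when it is 2^28 (268435456) or larger.
lba_needs_48bit() {
    [ "$1" -ge 268435456 ]
}

# Convert a sector number to a disk offset in whole GiB, assuming
# 512-byte sectors (2^30 / 512 = 2097152 sectors per GiB).
lba_to_gib() {
    echo $(( $1 / 2097152 ))
}
```

For the LBA in the log, lba_needs_48bit 1463123158 succeeds, so a
48-bit command is expected there regardless of how DMA is addressed.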

I've attached the verbose dmesg of a machine that's like this one, with
7.1 -BETA2, MSI enabled and a GENERIC kernel minus USB and FireWire.

Also, please, could someone give me a hand on how I could continue debugging
these interrupt issues? I'm a bit lost, and digging through code and posting
each time I think I've found something is not going to get anywhere.

I would also like to say that I've seen reports of this kind of problem
on amd64 machines on the lists for several years, so I don't think
this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital).


Thanks in advance for any help.
Regards.


[1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
[2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
[3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369
-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.
Copyright (c) 1992-2008 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-BETA2 #1: Wed Oct 22 13:19:14 CEST 2008
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/NOUSB
Preloaded elf kernel /boot/kernel/kernel at 0x80c12000.
Preloaded elf obj module /boot/kernel/geom_mirror.ko at 0x80c121a8.
Preloaded elf obj module /boot/kernel/accf_data.ko at 0x80c12818.
Preloaded elf obj module /boot/kernel/accf_http.ko at 0x80c12cc8.
Preloaded elf obj module /boot/kernel/k8temp.ko at 0x80c13238.
Preloaded elf obj module /boot/kernel/geom_journal.ko at 0x80c13720.
Calibrating clock(s) ... i8254 clock: 1193242 Hz
CLK_USE_I8254_CALIBRATION not specified - using default frequency

Re: [ATA] and re(4) stability issues

2008-12-09 Thread Pyun YongHyeon
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
  Hello,
  
  I have several machines[1] at hetzner.de and I've been having problems
  with interrupts on FreeBSD 7.0 and now FreeBSD 7.1-BETA2 on amd64. I've
  been trying to narrow the problem down so someone more knowledgeable than
  me is able to fix it. This mail is another attempt to ask a question
  regarding the ATA code, to see if this time I've got something.
  
  For the ones that don't actually know what happened:
  
  With FreeBSD 7.0-RELEASE for amd64 and the default kernel,
  the system shared the re0 interrupt with OHCI and this caused
  re(4) to corrupt packets and create interrupt storms. Tried

re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily
triggered on systems with > 4GB memory. But I don't know whether
this is related to the interrupt storms.

  updating to 7.1-BETA2 and still had some problems with it.
  
  I've opened the PR kern/128287[2] and Remko quickly answered
  with a workaround: removing USB support from my kernel. I did it
  and re(4) wasn't sharing interrupts any longer, and the interrupt
  storms were gone. Now, some time later, the interface goes up and
  down from time to time, but less often. Also, sometimes the machine
  loses the network interface but continues to work.
  

It seems that your controller supports MSI, so you can set the tunable
hw.re.msi_disable to 0 to enable MSI (e.g. add hw.re.msi_disable=0 to
/boot/loader.conf). With MSI you can get rid of interrupt sharing.
However, there were several issues with MSI on re(4), so it is off by
default.
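
Since both the system-wide PCI tunables and the per-driver tunable come up
in this thread, here is a small Python sketch that reads loader.conf-style
text and reports which MSI setting would apply to re(4). It is only an
illustration: the function names are hypothetical, the defaults are taken
from what is said in this thread (hw.re.msi_disable on by default) plus the
assumption that hw.pci.enable_msi defaults to 1, and it assumes the
system-wide switch overrides the driver one.

```python
def parse_loader_conf(text):
    """Return {name: value} from loader.conf-style 'name=value' lines.

    Simplification: '#' always starts a comment, even inside quotes.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def re_msi_status(settings):
    """Describe the MSI state for re(4) under the assumptions stated above."""
    # Assumption: MSI is allowed system-wide unless explicitly set to 0.
    if settings.get("hw.pci.enable_msi", "1") == "0":
        return "MSI disabled system-wide (hw.pci.enable_msi=0)"
    # Per this thread, re(4) keeps MSI off unless hw.re.msi_disable=0.
    if settings.get("hw.re.msi_disable", "1") == "0":
        return "MSI requested for re(4)"
    return "MSI off for re(4) (driver default)"


if __name__ == "__main__":
    example = 'hw.re.msi_disable="0"   # enable MSI for re(4)\n'
    print(re_msi_status(parse_loader_conf(example)))
```

The point of checking the system-wide tunable first is that a leftover
hw.pci.enable_msi=0 line would presumably make the re(4) setting moot.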

  I know it continues to work because some days later I can see that
  it tried to deliver the status reports but was unable to resolve the
  hostnames of the aliases. I can't ping the machine and I know the
  network is OK. If I reboot the machine everything works again.
  

Recently I've made small changes to re(4) which may help to detect
link state change events. Would you try re(4) in HEAD?

-- 
Regards,
Pyun YongHyeon