Re: [ATA] and re(4) stability issues
On Fri, Dec 12, 2008 at 01:13:09PM +0100, Victor Balada Diaz wrote:
> On Thu, Dec 11, 2008 at 10:50:21AM +0100, Victor Balada Diaz wrote:
>> On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
>>> I've reverted r185756, which caused GMII access issues on some controllers. If you are brave enough to try beta code, you can get the latest re(4) at the following URLs. Note, I don't have PCIe based RealTek controllers so the code was not tested at all.
>>> http://people.freebsd.org/~yongari/re/if_re.c
>>> http://people.freebsd.org/~yongari/re/if_rlreg.h
>>
>> I've recompiled the kernel with the first file in sys/dev/re/ and the second one in sys/pci/. I'm still testing with MSI enabled. So far I've tried rebooting using nextboot(8) (just in case I lost the network card I could boot again) and the card seems to work, but I'll continue stress testing the machine with stress + dd + iperf and see if I can take it down. I'll let you know how it goes.
>
> After a day of stress testing, the machine hasn't had errors, interrupt storms or interface up/down problems. Everything seems fine. I'll continue stress testing the machine during the weekend, but I would say that this time it's fixed.

Stopped stress testing this morning. After testing all weekend, it seems the re(4) problems are fixed. Not a single interface up/down error. netstat -i reports no errors and everything is fine. Thanks a lot! I'm going to deploy the patches on our production machines.

I've been able to trigger interrupt storms with the ATA code, though.

-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 10:55:35AM +0100, Søren Schmidt wrote:
> On 10Dec, 2008, at 10:11 , Victor Balada Diaz wrote:
>> Thanks for explaining to me what the flags do. I'm not skilled enough to create the DMA quirks, but if you could give me some patches I'll test them. Also, if you have any other idea of what I could test or how I can debug this, it would be more than welcome.
>
> Comment out the following two lines in ata_ahci_dmainit():
>
>     if (ATA_INL(ctlr->r_res2, ATA_AHCI_CAP) & ATA_AHCI_CAP_64BIT)
>         ch->dma->max_address = BUS_SPACE_MAXADDR;
>
> And you will not use 64bit DMA even if the chipset supports it. However I have not seen any chipsets supporting this fail, YMMV as usual :)

Hello Søren,

I'm triggering interrupt storms with this chipset after a few days of stressing the HD, calling sysutils/stress with

    stress -d 10 -i 10

and, in another terminal, doing:

    while true; do dd if=/dev/zero of=BAH bs=1M count=1024; done;

Right now, as reported by systat -vmstat, I have 578k interrupts in atapci and the machine is idle.

Do you have any idea how I could debug this? Any advice would be more than welcome.

Thanks a lot.
Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On Mon, Dec 15, 2008 at 10:02:07AM +0100, Victor Balada Diaz wrote:
> Stopped stress testing this morning. After testing all weekend, it seems the re(4) problems are fixed. Not a single interface up/down error. netstat -i reports no errors and everything is fine. Thanks a lot! I'm going to deploy the patches on our production machines.
>
> I've been able to trigger interrupt storms with the ATA code, though.

After deploying it on various machines last night, I've found messages like this one in the logs:

    re0: watchdog timeout (missed Tx interrupts) -- recovering

I know you told me this is harmless, so this is just so you know it's happening.

Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On Tue, Dec 16, 2008 at 08:19:19AM +0100, Victor Balada Diaz wrote:
> On Mon, Dec 15, 2008 at 10:02:07AM +0100, Victor Balada Diaz wrote:
>> Stopped stress testing this morning. After testing all weekend, it seems the re(4) problems are fixed. Not a single interface up/down error. netstat -i reports no errors and everything is fine. Thanks a lot! I'm going to deploy the patches on our production machines.
>>
>> I've been able to trigger interrupt storms with the ATA code, though.
>
> After deploying it on various machines last night, I've found messages like this one in the logs:
>
>     re0: watchdog timeout (missed Tx interrupts) -- recovering
>
> I know you told me this is harmless, so this is just so you

Yes, it's not a real watchdog timeout as long as re(4) still works correctly.

> know it's happening.

Ok. I'll update re(4) when I find spare time.

-- 
Regards,
Pyun YongHyeon
Re: [ATA] and re(4) stability issues
On Thu, Dec 11, 2008 at 10:50:21AM +0100, Victor Balada Diaz wrote:
> On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
>> I've reverted r185756, which caused GMII access issues on some controllers. If you are brave enough to try beta code, you can get the latest re(4) at the following URLs. Note, I don't have PCIe based RealTek controllers so the code was not tested at all.
>> http://people.freebsd.org/~yongari/re/if_re.c
>> http://people.freebsd.org/~yongari/re/if_rlreg.h
>
> I've recompiled the kernel with the first file in sys/dev/re/ and the second one in sys/pci/. I'm still testing with MSI enabled. So far I've tried rebooting using nextboot(8) (just in case I lost the network card I could boot again) and the card seems to work, but I'll continue stress testing the machine with stress + dd + iperf and see if I can take it down. I'll let you know how it goes.

After a day of stress testing, the machine hasn't had errors, interrupt storms or interface up/down problems. Everything seems fine. I'll continue stress testing the machine during the weekend, but I would say that this time it's fixed.

It seems there has been a lot of testing of this driver lately. Is there any chance of it making it into 7.1, or being MFCed to RELENG_7 after the release?

Thanks a lot.
Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On Fri, Dec 12, 2008 at 01:13:09PM +0100, Victor Balada Diaz wrote:
> On Thu, Dec 11, 2008 at 10:50:21AM +0100, Victor Balada Diaz wrote:
>> On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
>>> I've reverted r185756, which caused GMII access issues on some controllers. If you are brave enough to try beta code, you can get the latest re(4) at the following URLs. Note, I don't have PCIe based RealTek controllers so the code was not tested at all.
>>> http://people.freebsd.org/~yongari/re/if_re.c
>>> http://people.freebsd.org/~yongari/re/if_rlreg.h
>>
>> I've recompiled the kernel with the first file in sys/dev/re/ and the second one in sys/pci/. I'm still testing with MSI enabled. So far I've tried rebooting using nextboot(8) (just in case I lost the network card I could boot again) and the card seems to work, but I'll continue stress testing the machine with stress + dd + iperf and see if I can take it down. I'll let you know how it goes.
>
> After a day of stress testing, the machine hasn't had errors, interrupt storms or interface up/down problems. Everything seems fine. I'll continue stress testing the machine during the weekend, but I would say that this time it's fixed.

Thanks for testing!

> It seems there has been a lot of testing of this driver lately. Is there any chance of it making it into 7.1, or being MFCed to RELENG_7 after the release?

It's too early to promise an MFC, but I think the MFC would be done after 7.1-RELEASE is out, if all goes well.

-- 
Regards,
Pyun YongHyeon
Re: [ATA] and re(4) stability issues
On Thu, Dec 11, 2008 at 05:05:59AM +1100, Peter Jeremy wrote:
> On 2008-Dec-10 10:55:35 +0100, Søren Schmidt [EMAIL PROTECTED] wrote:
>> And you will not use 64bit DMA even if the chipset supports it. However I have not seen any chipsets supporting this fail, YMMV as usual :)
>
> There's a reference in wikipedia pointing to http://www.mail-archive.com/[EMAIL PROTECTED]/msg06694.html that claims the AMD/ATI SB600 lies about supporting 64-bit DMA in AHCI mode. I have a SB600 but it doesn't have 4GB to test on.

I have 6 GB of RAM and can test patches, so once I'm done with the re(4) side of things I'll try commenting out the code Søren suggested and see if that improves the situation.

Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On Thu, Dec 11, 2008 at 08:57:07AM +0100, Victor Balada Diaz wrote:
> On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
>> On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
>>> Also I didn't see any problem with interfaces going up and down, but that usually happens after some hours of uptime, so I'll let you know if the error happens again.
>
> After writing to the HD with dd for a few hours and using stress -i 10 -d 10 the machine lost connectivity. I waited until today to be sure whether the machine hung, panicked or just lost network connectivity. I don't have local access or serial access, so this is the only way I could do it.
>
> I've seen various messages like these in the logs during the night:
>
>     Dec 10 00:33:49 yac kernel: re0: watchdog timeout
>     Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
>     Dec 10 00:33:52 yac kernel: re0: link state changed to UP
>
> The interface never recovered and I wasn't able to ping the machine until I rebooted. Nagios was checking all the time and no recovery happened. The netstat -i in the daily scripts shows just one Oerrs. I'm used to having a lot of them, but it seems this time the card didn't recover from the only one.
>
> I also want to say that this is not a regression, as it happened before with 7.1 -BETA 2 code.
>
> Is there anything more I can try?

Sorry, it's too early in the morning and I thought today was the 10th instead of the 11th. I don't even know what day it is. Looking at today's log I see no link state changed messages, but I see these other messages that started happening at more or less the same time I lost connectivity to the server:

    Dec 10 18:20:32 yac kernel: re0: link state changed to DOWN
    Dec 10 18:20:32 yac kernel: re0: PHY read failed

Sorry for the noise.

Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On Thu, Dec 11, 2008 at 09:10:45AM +0100, Victor Balada Diaz wrote:
> On Thu, Dec 11, 2008 at 08:57:07AM +0100, Victor Balada Diaz wrote:
>> On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
>>> On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
>>>> Also I didn't see any problem with interfaces going up and down, but that usually happens after some hours of uptime, so I'll let you know if the error happens again.
>>
>> After writing to the HD with dd for a few hours and using stress -i 10 -d 10 the machine lost connectivity. I waited until today to be sure whether the machine hung, panicked or just lost network connectivity. I don't have local access or serial access, so this is the only way I could do it.
>>
>> I've seen various messages like these in the logs during the night:
>>
>>     Dec 10 00:33:49 yac kernel: re0: watchdog timeout
>>     Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
>>     Dec 10 00:33:52 yac kernel: re0: link state changed to UP
>>
>> The interface never recovered and I wasn't able to ping the machine until I rebooted. Nagios was checking all the time and no recovery happened. The netstat -i in the daily scripts shows just one Oerrs. I'm used to having a lot of them, but it seems this time the card didn't recover from the only one.
>>
>> I also want to say that this is not a regression, as it happened before with 7.1 -BETA 2 code.
>>
>> Is there anything more I can try?
>
> Sorry, it's too early in the morning and I thought today was the 10th instead of the 11th. I don't even know what day it is. Looking at today's log I see no link state changed messages, but I see these other messages that started happening at more or less the same time I lost connectivity to the server:
>
>     Dec 10 18:20:32 yac kernel: re0: link state changed to DOWN
>     Dec 10 18:20:32 yac kernel: re0: PHY read failed

I've reverted r185756, which caused GMII access issues on some controllers. If you are brave enough to try beta code, you can get the latest re(4) at the following URLs. Note, I don't have PCIe based RealTek controllers so the code was not tested at all.

http://people.freebsd.org/~yongari/re/if_re.c
http://people.freebsd.org/~yongari/re/if_rlreg.h

-- 
Regards,
Pyun YongHyeon
Re: [ATA] and re(4) stability issues
On Thu, Dec 11, 2008 at 06:00:56PM +0900, Pyun YongHyeon wrote:
> On Thu, Dec 11, 2008 at 09:10:45AM +0100, Victor Balada Diaz wrote:
>> On Thu, Dec 11, 2008 at 08:57:07AM +0100, Victor Balada Diaz wrote:
>>> On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
>>>> On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
>>>>> Also I didn't see any problem with interfaces going up and down, but that usually happens after some hours of uptime, so I'll let you know if the error happens again.
>>>
>>> After writing to the HD with dd for a few hours and using stress -i 10 -d 10 the machine lost connectivity. I waited until today to be sure whether the machine hung, panicked or just lost network connectivity. I don't have local access or serial access, so this is the only way I could do it.
>>>
>>> I've seen various messages like these in the logs during the night:
>>>
>>>     Dec 10 00:33:49 yac kernel: re0: watchdog timeout
>>>     Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
>>>     Dec 10 00:33:52 yac kernel: re0: link state changed to UP
>>>
>>> The interface never recovered and I wasn't able to ping the machine until I rebooted. Nagios was checking all the time and no recovery happened. The netstat -i in the daily scripts shows just one Oerrs. I'm used to having a lot of them, but it seems this time the card didn't recover from the only one.
>>>
>>> I also want to say that this is not a regression, as it happened before with 7.1 -BETA 2 code.
>>>
>>> Is there anything more I can try?
>>
>> Sorry, it's too early in the morning and I thought today was the 10th instead of the 11th. I don't even know what day it is. Looking at today's log I see no link state changed messages, but I see these other messages that started happening at more or less the same time I lost connectivity to the server:
>>
>>     Dec 10 18:20:32 yac kernel: re0: link state changed to DOWN
>>     Dec 10 18:20:32 yac kernel: re0: PHY read failed
>
> I've reverted r185756, which caused GMII access issues on some controllers. If you are brave enough to try beta code, you can get the latest re(4) at the following URLs. Note, I don't have PCIe based RealTek controllers so the code was not tested at all.
>
> http://people.freebsd.org/~yongari/re/if_re.c
> http://people.freebsd.org/~yongari/re/if_rlreg.h

I've recompiled the kernel with the first file in sys/dev/re/ and the second one in sys/pci/. I'm still testing with MSI enabled. So far I've tried rebooting using nextboot(8) (just in case I lost the network card I could boot again) and the card seems to work, but I'll continue stress testing the machine with stress + dd + iperf and see if I can take it down. I'll let you know how it goes.

Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
Victor Balada Diaz wrote:
> Digging in the Linux source code I've found that they do some special things for this chipset that I've been unable to find in our code. This is the Linux code for my chipset:
>
>     AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
>                  AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
>                  AHCI_HFLAG_SECT255),
>
> File and the rest of the code in here[3].
>
> As I saw AHCI_HFLAG_NO_MSI I tried doing the easiest thing I could think of, switching MSI and MSI-X off for the whole system, so I added these tunables to /boot/loader.conf:

FreeBSD's ata(4) driver doesn't support MSI. This flag is used in Linux's libata as follows:

    if ((hpriv->flags & AHCI_HFLAG_NO_MSI) || pci_enable_msi(pdev))
        pci_intx(pdev, 1);

In FreeBSD's code we have the same:

    /* enable PCI interrupt */
    pci_write_config(dev, PCIR_COMMAND,
        pci_read_config(dev, PCIR_COMMAND, 2) & ~0x0400, 2);

The AHCI_HFLAG_IGN_SERR_INTERNAL flag is meant to ignore SERR_INTERNAL errors. FreeBSD's ata(4) driver ignores them too.

The AHCI_HFLAG_32BIT_ONLY flag limits the controller to 32-bit DMA only. If the AHCI CAP register reports that the controller supports 64-bit DMA, the driver will use 64-bit. So I think a quirk could be added for you, but I'm not sure the problem is here.

The AHCI_HFLAG_SECT255 flag limits I/O operations to 255 sectors; FreeBSD uses a limit of 128 by default.

-- 
WBR, Andrey V. Elsukov
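The mask in the FreeBSD snippet above clears bit 10 of the PCI command register, the "interrupt disable" bit defined by the PCI specification, and clearing it is what allows legacy INTx delivery. A minimal user-space sketch of that bit manipulation follows. The function and macro names are hypothetical, introduced only for illustration, and no real configuration-space access takes place.

```c
#include <stdint.h>

/* Hypothetical name for bit 10 of the PCI command register. */
#define SKETCH_PCI_CMD_INTX_DISABLE 0x0400u

/* Clear the interrupt-disable bit so the device may assert INTx. */
uint16_t
cmd_enable_intx(uint16_t command)
{
    return (uint16_t)(command & ~SKETCH_PCI_CMD_INTX_DISABLE);
}

/* Report whether INTx delivery is currently allowed (bit 10 clear). */
int
cmd_intx_enabled(uint16_t command)
{
    return (command & SKETCH_PCI_CMD_INTX_DISABLE) == 0;
}
```

All other bits of the command value pass through untouched, which is why the driver reads the register first and writes back the masked result.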
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
> On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
>> Hello,
>>
>> I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 on amd64. I've been trying to narrow down the problem so someone more knowledgeable than me is able to fix it. This mail is another attempt to ask a question regarding the ATA code to see if this time I get something.
>>
>> For those who don't know what happened: With FreeBSD 7.0 -RELEASE for amd64 and the default kernel, the system shared the re0 interrupt with OHCI, and this caused re(4) to corrupt packets and create interrupt storms. Tried
>
> re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily triggered on systems with 4GB memory. But I don't know whether this is related to the interrupt storms.
>
>> updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts any longer, and the interrupt storms were gone.
>>
>> Now, some time later, the interface goes up and down from time to time, but less often. Also, sometimes the machine loses the network interface but continues to work.
>
> It seems that your controller supports MSI so you can set a tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing (e.g. add hw.re.msi_disable=0 to the /boot/loader.conf file). However there were several issues on re(4) w.r.t. MSI so it was off by default.

This is undocumented and with sysctl -a I can't find the tunable. Is this a HEAD feature or is it also in 7.1 -BETA2? Should I add hw.re_msi_disable=0 to /boot/loader.conf?

This was sharing an interrupt with USB. Does USB need any special MSI handling, or is re using MSI enough to not share the interrupt?

>> I know it continues to work because some days later I can see that it tried to deliver the status reports but was unable to resolve the aliases' hostnames. I can't ping the machine and I know the network is OK. If I reboot the machine everything works again.
>
> Recently I've made small changes to re(4) which may help to detect link state change events. Would you try re(4) in HEAD?

Can I just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that, or do I need to test the whole HEAD kernel?

Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 11:58:12AM +0300, Andrey V. Elsukov wrote:
> Victor Balada Diaz wrote:
>> Digging in the Linux source code I've found that they do some special things for this chipset that I've been unable to find in our code. This is the Linux code for my chipset:
>>
>>     AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
>>                  AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
>>                  AHCI_HFLAG_SECT255),
>>
>> File and the rest of the code in here[3].
>>
>> As I saw AHCI_HFLAG_NO_MSI I tried doing the easiest thing I could think of, switching MSI and MSI-X off for the whole system, so I added these tunables to /boot/loader.conf:
>
> FreeBSD's ata(4) driver doesn't support MSI. This flag is used in Linux's libata as follows:
>
>     if ((hpriv->flags & AHCI_HFLAG_NO_MSI) || pci_enable_msi(pdev))
>         pci_intx(pdev, 1);
>
> In FreeBSD's code we have the same:
>
>     /* enable PCI interrupt */
>     pci_write_config(dev, PCIR_COMMAND,
>         pci_read_config(dev, PCIR_COMMAND, 2) & ~0x0400, 2);
>
> The AHCI_HFLAG_IGN_SERR_INTERNAL flag is meant to ignore SERR_INTERNAL errors. FreeBSD's ata(4) driver ignores them too.
>
> The AHCI_HFLAG_32BIT_ONLY flag limits the controller to 32-bit DMA only. If the AHCI CAP register reports that the controller supports 64-bit DMA, the driver will use 64-bit. So I think a quirk could be added for you, but I'm not sure the problem is here.
>
> The AHCI_HFLAG_SECT255 flag limits I/O operations to 255 sectors; FreeBSD uses a limit of 128 by default.

Thanks for explaining to me what the flags do. I'm not skilled enough to create the DMA quirks, but if you could give me some patches I'll test them. Also, if you have any other idea of what I could test or how I can debug this, it would be more than welcome.

Thanks.
Regards.
-- 
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On 10Dec, 2008, at 10:11 , Victor Balada Diaz wrote:
> Thanks for explaining me what the flags do. I'm not skilled enough to create the DMA quirks but if you could give me some patches i'll test them. Also if you have any other idea on what could i test or how can i debug this it would be more than welcome.

Comment out the following two lines in ata_ahci_dmainit():

    if (ATA_INL(ctlr->r_res2, ATA_AHCI_CAP) & ATA_AHCI_CAP_64BIT)
        ch->dma->max_address = BUS_SPACE_MAXADDR;

And you will not use 64bit DMA even if the chipset supports it. However I have not seen any chipsets supporting this fail, YMMV as usual :)

-Søren
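For illustration, the decision those two lines make, plus the kind of per-chipset override discussed elsewhere in the thread, can be sketched in user-space C. This is a minimal sketch under stated assumptions, not the driver code: the function name, the force_32bit parameter and the SKETCH_ constants are all hypothetical. The bit position follows the AHCI specification, where bit 31 of the CAP register (S64A) advertises 64-bit addressing.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel constants named in the thread. */
#define SKETCH_AHCI_CAP_64BIT 0x80000000u           /* CAP bit 31 (S64A) */
#define SKETCH_MAXADDR_32BIT  0xFFFFFFFFull         /* highest 32-bit address */
#define SKETCH_MAXADDR        0xFFFFFFFFFFFFFFFFull /* no address limit */

/*
 * Return the highest bus address DMA may target.
 * force_32bit models a per-chipset quirk; it is an assumption made for
 * this sketch, not an existing driver flag.
 */
uint64_t
pick_dma_max_address(uint32_t cap, int force_32bit)
{
    if (!force_32bit && (cap & SKETCH_AHCI_CAP_64BIT))
        return SKETCH_MAXADDR;
    return SKETCH_MAXADDR_32BIT;
}
```

Commenting out the two lines, as suggested above, corresponds to always taking the 32-bit branch regardless of what the controller reports.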
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
> On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
>> On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
>>> On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
>>>> Hello,
>>>>
>>>> I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 on amd64. I've been trying to narrow down the problem so someone more knowledgeable than me is able to fix it. This mail is another attempt to ask a question regarding the ATA code to see if this time I get something.
>>>>
>>>> For those who don't know what happened: With FreeBSD 7.0 -RELEASE for amd64 and the default kernel, the system shared the re0 interrupt with OHCI, and this caused re(4) to corrupt packets and create interrupt storms. Tried
>>>
>>> re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily triggered on systems with 4GB memory. But I don't know whether this is related to the interrupt storms.
>>>
>>>> updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts any longer, and the interrupt storms were gone.
>>>>
>>>> Now, some time later, the interface goes up and down from time to time, but less often. Also, sometimes the machine loses the network interface but continues to work.
>>>
>>> It seems that your controller supports MSI so you can set a tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing (e.g. add hw.re.msi_disable=0 to the /boot/loader.conf file). However there were several issues on re(4) w.r.t. MSI so it was off by default.
>>
>> This is undocumented and with sysctl -a I can't find the tunable. Is this a HEAD feature or is it also in 7.1 -BETA2? Should I add
>
> Yeah, it's an undocumented feature. But most drivers written by me have similar knobs. Both HEAD and stable/7, including 7.1 BETA2, have the tunable.

I think it would be great if you could document it, or at least show it by default with a small description when you do sysctl -ad.

>> hw.re_msi_disable=0 to /boot/loader.conf?
>       ^
> Should be hw.re.msi_disable=0
> Yes, just add it to /boot/loader.conf. Note, you should not disable the system-wide MSI control (i.e. keep hw.pci.enable_msi == 1).
>
>> This was sharing an interrupt with USB. Does USB need any special MSI handling, or is re using MSI enough to not share the interrupt?
>
> If re(4) can use MSI, you don't need to worry about interrupt sharing with USB. Check the output of vmstat -i. You normally get an irq256 or higher for an MSI enabled driver.
>
>>>> I know it continues to work because some days later I can see that it tried to deliver the status reports but was unable to resolve the aliases' hostnames. I can't ping the machine and I know the network is OK. If I reboot the machine everything works again.
>>>
>>> Recently I've made small changes to re(4) which may help to detect link state change events. Would you try re(4) in HEAD?
>>
>> Can I just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
>
> Yes, you can. It should build without problems. Just replace re(4) on stable/7 with the HEAD version.
>
>> or do I need to test the whole HEAD kernel?
>
> No, you don't have to do that.

While backporting the changes I found that it didn't compile, so in the end I took the following files from HEAD:

    base/head/sys/dev/re/if_re.c
    base/head/sys/pci/if_rl.c
    base/head/sys/pci/if_rlreg.h

After that I recompiled the 7.1 -BETA2 GENERIC kernel and enabled the knob you suggested in /boot/loader.conf.

With the new kernel and MSI the interrupts are like this:

    # vmstat -i
    interrupt                          total       rate
    irq9: acpi0                            1          0
    irq16: ohci0                           1          0
    irq17: ohci1 ohci3                     1          0
    irq18: ohci2 ohci4                     1          0
    irq22: atapci0                     19215         15
    cpu0: timer                      2502718       1998
    irq256: re0                      4967726       3967
    cpu1: timer                      2502525       1998
    Total                            9992188       7980

The high interrupt numbers are because I've been running iperf to check that everything is fine, not because of interrupt storms.

So far I haven't found any interrupt storms related to USB or the re(4) driver, but while doing the tests I found this error:

    re0: watchdog timeout (missed Tx interrupts) -- recovering

This didn't create any error on the interfaces (netstat -i).

Also I didn't see any problem with interfaces going up and down, but that usually happens after some hours
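Pyun's hint about irq256 can be turned into a quick check on a line of vmstat -i output. The following is a minimal sketch, assuming lines of the form "irqN: name ..." as in the output quoted in this message, and taking the "irq256 or higher" rule of thumb from the thread rather than from any documented guarantee. The function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the vmstat -i line mentions `dev` and its IRQ number is
 * 256 or higher (the thread's rule of thumb for MSI), 0 otherwise.
 */
int
line_shows_msi(const char *line, const char *dev)
{
    int irq;
    char rest[128];

    /* Expect "irq<N>: <names and counters>"; "cpu0: timer" fails here. */
    if (sscanf(line, "irq%d: %127[^\n]", &irq, rest) != 2)
        return 0;
    /* Plain substring match, so "re0" would also match a name like "re01". */
    if (strstr(rest, dev) == NULL)
        return 0;
    return irq >= 256;
}
```

Applied to the table in this message, the re0 line would pass the check while the atapci0 line on irq22 would not, which matches the observation that ata(4) does not use MSI.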
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
> On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
>> On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
>>> On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
>>>> On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
>>>>> Hello,
>>>>>
>>>>> I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 on amd64. I've been trying to narrow down the problem so someone more knowledgeable than me is able to fix it. This mail is another attempt to ask a question regarding the ATA code to see if this time I get something.
>>>>>
>>>>> For those who don't know what happened: With FreeBSD 7.0 -RELEASE for amd64 and the default kernel, the system shared the re0 interrupt with OHCI, and this caused re(4) to corrupt packets and create interrupt storms. Tried
>>>>
>>>> re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily triggered on systems with 4GB memory. But I don't know whether this is related to the interrupt storms.
>>>>
>>>>> updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts any longer, and the interrupt storms were gone.
>>>>>
>>>>> Now, some time later, the interface goes up and down from time to time, but less often. Also, sometimes the machine loses the network interface but continues to work.
>>>>
>>>> It seems that your controller supports MSI so you can set a tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing (e.g. add hw.re.msi_disable=0 to the /boot/loader.conf file). However there were several issues on re(4) w.r.t. MSI so it was off by default.
>>>
>>> This is undocumented and with sysctl -a I can't find the tunable. Is this a HEAD feature or is it also in 7.1 -BETA2? Should I add
>>
>> Yeah, it's an undocumented feature. But most drivers written by me have similar knobs. Both HEAD and stable/7, including 7.1 BETA2, have the tunable.
>
> I think it would be great if you could document it, or at least show it by default with a small description when you do sysctl -ad.

If MSI worked as expected I would have documented it as I did in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc. Using MSI on RealTek does not seem to be stable. I tried hard to fix that but some users still reported watchdog timeouts. Working without documentation and hardware also made it hard to complete the work. This was the main reason why MSI was disabled on re(4).

>>> hw.re_msi_disable=0 to /boot/loader.conf?
>>       ^
>> Should be hw.re.msi_disable=0
>> Yes, just add it to /boot/loader.conf. Note, you should not disable the system-wide MSI control (i.e. keep hw.pci.enable_msi == 1).
>>
>>> This was sharing an interrupt with USB. Does USB need any special MSI handling, or is re using MSI enough to not share the interrupt?
>>
>> If re(4) can use MSI, you don't need to worry about interrupt sharing with USB. Check the output of vmstat -i. You normally get an irq256 or higher for an MSI enabled driver.
>>
>>>>> I know it continues to work because some days later I can see that it tried to deliver the status reports but was unable to resolve the aliases' hostnames. I can't ping the machine and I know the network is OK. If I reboot the machine everything works again.
>>>>
>>>> Recently I've made small changes to re(4) which may help to detect link state change events. Would you try re(4) in HEAD?
>>>
>>> Can I just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
>>
>> Yes, you can. It should build without problems. Just replace re(4) on stable/7 with the HEAD version.
>>
>>> or do I need to test the whole HEAD kernel?
>>
>> No, you don't have to do that.
>
> While backporting the changes I found that it didn't compile, so in the end I took the following files from HEAD:
>
>     base/head/sys/dev/re/if_re.c
>     base/head/sys/pci/if_rl.c
>     base/head/sys/pci/if_rlreg.h

Ah, sorry about that. Recently there were some changes. I forgot that.

> After that I recompiled the 7.1 -BETA2 GENERIC kernel and enabled the knob you suggested in /boot/loader.conf.
>
> With the new kernel and MSI the interrupts are like this:
>
>     # vmstat -i
>     interrupt                          total       rate
>     irq9: acpi0                            1          0
>     irq16: ohci0                           1          0
>     irq17: ohci1 ohci3                     1          0
>     irq18: ohci2 ohci4                     1          0
Re: [ATA] and re(4) stability issues
Victor Balada Diaz a écrit : Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something. For the ones that don't actually know what happened: With FreeBSD 7.0 -RELEASE for amd64 and default kernel the system shared re0 interrupt with OHCI and this caused re(4) to corrupt packets and create interrupt storms. Tried updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: that workaround was removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts anylonger, and the interrupt storms were gone. Now sometime later the interface goes up and down from time to time, but less often. Also sometimes the machine losts the network interface but continues to work. I know it continues to work because some days later i can see that it tried to deliver the status reports but was unable to resolve the aliases hostnames. I can't ping the machine and i know the network is OK. If i reboot the machine everything is working again. When switched from 7.0 to 7.1 BETA2 i also found that under load after some hours the machine created interrupt storms on ATA disks. Digging at linux source code i've found that they do some special things for this chipset that i've been unable to find on our code. This is linux code for my chipset: 371 AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL | 372 AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI | 373 AHCI_HFLAG_SECT255), File and the rest of the code in here[3]. 
As i saw AHCI_HFLAG_NO_MSI i tried doing the easiest thing i could think of, switching MSI and MSI-x off for the whole system, so i added to /boot/loader.conf this tunables: hw.pci.enable_msix=0 hw.pci.enable_msi=0 And then rebooted the machine. After various hours of doing almost nothing i've found that the machine answered ping but was unable to answer any request (eg, ssh, nagios nrpe, etc). The machine recovered itself after some minutes and when i was able to ssh into i saw the following in dmesg: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158 and a lot more errors like that. I didn't get this errors with MSI enabled. I see WRITE_DMA48 and in linux code i saw AHCI_HFLAG_32BIT_ONLY which is later used for DMA related things. Could someone who is more knowledgeable check if we're doing the right thing? I've attached verbose dmesg of a machine that's like this one with 7.1 -BETA2, MSI enabled and GENERIC kernel minus USB and firewrire. Also, please, could someone give me a hand on how could i continue debugging this interrupt issues? I'm a bit lost and digging code and posting each time i think i've found something is not going to go anywhere. I would also like to say that i've seen reports of this kind of problems on amd64 machines in the lists since various years ago, so i don't think this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital) on the lists Thanks in advance for any help. Regards. 
[1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
[2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
[3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369

Sorry, I didn't take the time to read the whole thread, but I had a similar problem with the same IXP600 chipset. Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one. The symptoms were similar: interrupt 22 was shared between the SATA controller and the wireless card, and I got interrupt storms at random times when using the wireless network. No problems since I removed the ral(4) NIC (I have a real access point now). You might not want to point the finger at the re(4) driver too fast.

Arnaud Houdelette
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote: On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote: On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote: Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something. For the ones that don't actually know what happened: With FreeBSD 7.0 -RELEASE for amd64 and default kernel the system shared re0 interrupt with OHCI and this caused re(4) to corrupt packets and create interrupt storms. Tried re(4) in 7.0-RELEASE had bus_dma(9) bug which could be easily triggered on systems with 4GB memory. But I dont' know whether this is related with interrupt storms. updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: that workaround was removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts anylonger, and the interrupt storms were gone. Now sometime later the interface goes up and down from time to time, but less often. Also sometimes the machine losts the network interface but continues to work. It seems that your controller supports MSI so you can set a tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing(e.g. add hw.re.msi_disable=0 to /boot/loader.conf file.) However there were several issues on re(4) w.r.t MSI so it was off by default. This is undocumented and with sysctl -a i can't find the tunable. Is this a HEAD feature or it's also in 7.1 -BETA2? Should i add Yeah it's an undocmented feature. But most drivers written by me have similar kobs. Both HEAD and stable/7 including 7.1 BETA2 have the tunable. hw.re_msi_disable=0 to /boot/loader.conf? 
^ Should be hw.re.msi_disable=0

Yes, just add it to /boot/loader.conf. Note that you should not disable system-wide MSI control (i.e. hw.pci.enable_msi should stay at 1).

This was sharing interrupt with USB, does USB need any special MSI handling or with re using MSI is enough to not share the interrupt?

If re(4) can use MSI, you don't need to worry about interrupt sharing with USB. Check the output of vmstat -i. You normally get irq256 or higher for an MSI-enabled driver.

I know it continues to work because some days later i can see that it tried to deliver the status reports but was unable to resolve the aliases hostnames. I can't ping the machine and i know the network is OK. If i reboot the machine everything is working again.

Recently I've made small changes to re(4) which may help to detect link state change events. Would you try re(4) in HEAD?

Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that

Yes, you can. It should build without problems. Just replace re(4) on stable/7 with the HEAD version.

or do i need to test the whole HEAD kernel?

No, you don't have to do that.

--
Regards, Pyun YongHyeon
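The check described in this message (an MSI-enabled driver normally shows up as irq256 or higher in vmstat -i) can be scripted. What follows is only a minimal sketch: the function names irq_for_device and uses_msi are made up for illustration, and the parsing assumes the usual "irqN: device ... total rate" layout of vmstat -i output.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FreeBSD or of this thread.

# Print the IRQ number that vmstat-style text on stdin reports for a device.
irq_for_device() {
    # Lines look like "irq17: ohci1 ohci3   1   0": device names follow the
    # "irqN:" field and come before the total and rate columns.
    awk -v dev="$1" '
        $1 ~ /^irq[0-9]+:$/ {
            for (i = 2; i <= NF; i++) {
                if ($i == dev) {
                    irq = $1
                    sub(/^irq/, "", irq)
                    sub(/:$/, "", irq)
                    print irq
                    exit
                }
            }
        }'
}

# Succeed when the device sits on irq256 or higher, which the message above
# describes as the normal sign of an MSI-enabled driver.
uses_msi() {
    irq=$(irq_for_device "$1")
    [ -n "$irq" ] && [ "$irq" -ge 256 ]
}

# Assumed usage on a live system:
#   vmstat -i | uses_msi re0 && echo "re0 is using MSI"
```

The threshold of 256 is taken from the message itself; a device that still appears on a low, shared IRQ after a reboot probably did not pick up the tunable.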
Re: [ATA] and re(4) stability issues
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:

Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something.

Just want to add a quick note and say that I'm having the same problem with my 7.0-RELEASE-p6/amd64 hetzner machine:
http://lists.freebsd.org/pipermail/freebsd-acpi/2008-September/005095.html

I would be happy to test patches as well. Thanks.

--
Oliver PETER, email: [EMAIL PROTECTED], ICQ# 113969174
If it feels good, you're doing something wrong. -- Coach McTavish
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 12:08:40PM +, Oliver Peter wrote:
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:

Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something.

Just want to add a quick note and say that I'm having the same problem with my 7.0-RELEASE-p6/amd64 hetzner machine:
http://lists.freebsd.org/pipermail/freebsd-acpi/2008-September/005095.html

I would be happy to test patches as well. Thanks.

Hello Oliver,

What I have done so far, and what improved things a lot, was:

1) Upgrade at least the if_re code to RELENG_7. This fixes the packet corruption in ssh sessions.
2) Delete USB and FireWire from your kernel config. This prevents the Realtek interrupt from being shared.

After this, with 7.1-BETA2 the systems are more or less stable, but after a while the ATA controller starts to create interrupt storms. I wasn't able to find out why.

With the help I've received in this thread from Pyun YongHyeon (thanks!!) I'm also trying these suggestions:

3) Backport these 3 files from -CURRENT to 7.1-BETA2:

base/head/sys/dev/re/if_re.c
base/head/sys/pci/if_rl.c
base/head/sys/pci/if_rlreg.h

You can fetch them from http://svn.freebsd.org/. With them, and with this tunable added to /boot/loader.conf:

hw.re.msi_disable=0

I can use the GENERIC kernel again (i.e. with USB enabled) and so far I haven't found any problems. No more interface up/down problems and no more interrupt storms. I must say that I haven't tested this enough, because the interrupt storms in the ATA code only start to happen after a few days of uptime under load, but at least the problems with the Realtek seem to be gone.

If you upgrade to 7.1-BETA2 you'll also get SATA support for the IXP card. With 7.0 it works as ATA 33 in compatibility mode.

Maybe someone with write access to the wiki could add this somewhere, so that other hetzner users with the same problems can use the same workarounds :)

I hope this helps you.

Regards.

--
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
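The loader.conf step in the workaround above is easy to get wrong by appending the same tunable twice, or by leaving an older conflicting value in place. A minimal sketch of an idempotent update follows; set_loader_tunable is a hypothetical helper name, not an existing FreeBSD tool, and it assumes one unquoted name=value assignment per line, as written in this thread.

```shell
#!/bin/sh
# Hypothetical helper: set NAME=VALUE in a loader.conf-style file, replacing
# an existing assignment of NAME instead of appending a duplicate.
set_loader_tunable() {
    file=$1
    name=$2
    value=$3
    tmp="${file}.tmp.$$"
    touch "$file"
    # Keep every line that does not start with "NAME=" (exact string match,
    # so the dots in the tunable name are not treated as wildcards).
    awk -v n="$name" 'index($0, n "=") != 1' "$file" > "$tmp"
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Assumed usage, as root, followed by a reboot:
#   set_loader_tunable /boot/loader.conf hw.re.msi_disable 0
```

Running it twice leaves a single hw.re.msi_disable line, which makes it safe to use from a deployment script across several machines.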
Re: [ATA] and re(4) stability issues
On Wed, 10 Dec 2008 21:07:19 +0900 Pyun YongHyeon [EMAIL PROTECTED] wrote:
On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:

As this seems to improve the current situation, is there any chance of merging the -current driver into 7.1 before release?

I think re(4) in HEAD needs more testing. As you might know RealTek produced too many chipsets. :-(

FYI I've now turned MSI on in HEAD and will see what happens. Before, my re0 was sharing interrupts with 3 USB controllers. Now it's all by itself on irq256. I'm running amd64 with

re0: RealTek 8168/8168B/8168C/8168CP/8168D/8111B/8111C/8111CP PCIe Gigabit Ethernet port 0xde00-0xdeff mem 0xfdaff000-0xfdaf, 0xfdae-0xfdae irq 18 at device 0.0 on pci2

--- Gary Jennejohn
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote: On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote: On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote: On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote: On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote: On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote: Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something. For the ones that don't actually know what happened: With FreeBSD 7.0 -RELEASE for amd64 and default kernel the system shared re0 interrupt with OHCI and this caused re(4) to corrupt packets and create interrupt storms. Tried re(4) in 7.0-RELEASE had bus_dma(9) bug which could be easily triggered on systems with 4GB memory. But I dont' know whether this is related with interrupt storms. updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: that workaround was removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts anylonger, and the interrupt storms were gone. Now sometime later the interface goes up and down from time to time, but less often. Also sometimes the machine losts the network interface but continues to work. It seems that your controller supports MSI so you can set a tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing(e.g. add hw.re.msi_disable=0 to /boot/loader.conf file.) However there were several issues on re(4) w.r.t MSI so it was off by default. This is undocumented and with sysctl -a i can't find the tunable. Is this a HEAD feature or it's also in 7.1 -BETA2? 
Should i add Yeah it's an undocmented feature. But most drivers written by me have similar kobs. Both HEAD and stable/7 including 7.1 BETA2 have the tunable. I think it could be great if you could document it or at least show it by default when you do sysctl -ad with a small description. If MSI worked as expected I would have documented it as I did in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc. Using MSI on RealTek does not seem to stable. I tried hard to fix that but some users still reported watchdog timeouts. Working without documentation and hardware also made it hard to complete the work. This was the main reason why MSI was disabled on re(4). What do you think about adding a note in the man page telling that it's experimental and in some cases it could improve the situation but in others it will give errors? hw.re_msi_disable=0 to /boot/loader.conf? ^ Shoule be hw.re.msi_disable=0 Yes, just add it to /boot/loader.conf. Note, you should not disable system-wide MSI control(e.g. hw.pci.enable_msi == 1). This was sharing interrupt with USB, does USB need any special MSI handling or with re using MSI is enough to not share the interrupt? If re(4) can use MSI, you don't need to worry about interrupt sharing with USB. Check the output of vmstat -i. You normally get an irq256 or higher for MSI enabled driver. I know it continues to work because some days later i can see that it tried to deliver the status reports but was unable to resolve the aliases hostnames. I can't ping the machine and i know the network is OK. If i reboot the machine everything is working again. Recently I've made small changes to re(4) which may help to detect link state change event. Would you try re(4) in HEAD? Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that Yes, you can. It should build without problems. Just replace re(4) on stable/7 with HEAD version. or do i need to test the whole HEAD kernel? No you don't have to that. 
Backporting the changes i've found that it didn't compile so in the end i got from HEAD the following files: base/head/sys/dev/re/if_re.c base/head/sys/pci/if_rl.c base/head/sys/pci/if_rlreg.h Ah,, sorry about that. Recently there was some changes. I forgot that. After that i've recompiled 7.1 -BETA2 GENERIC kernel and enabled the knob you suggested in /boot/loader.conf. With
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 03:01:30PM +0100, Victor Balada Diaz wrote:
On Wed, Dec 10, 2008 at 12:08:40PM +, Oliver Peter wrote:
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:

...

I can use GENERIC kernel again (ie, USB enabled) and so far i didn't find any problem yet. No more interface up/down problems and no more interrupt storms. I must say that i haven't tested this enough, because the interrupt storms in ATA code start to happen after a few days of uptime load, but at least the problems with the realtek seem to be gone.

I found out that I'm able to 'force' the interrupt storm by provoking higher disk I/O. Just let dd write to a file in a loop for some hours and watch vmstat:

while true; do dd if=/dev/zero of=BLA bs=1M count=1000; done

First you'll see the throughput decrease, and a few hours later you'll have /var/log/messages and dmesg full of interrupt storm messages.

If you upgrade to 7.1 -BETA2 you'll also get SATA support for the IXP card. With 7.0 it will work as ATA 33 in compatibility mode.

Wow! That's good to hear as well. I'll definitely switch to -STABLE or 7.1-PRERELEASE sooner or later. I'll just give it a try on my other machines first.

I hope this helps you.

Absolutely, cheers mate. I owe you one!

~ollie

--
Oliver PETER, email: [EMAIL PROTECTED], ICQ# 113969174
If it feels good, you're doing something wrong. -- Coach McTavish
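The dd loop quoted above runs until it is interrupted. For repeatable tests it can help to bound it. The sketch below is one possible wrapper, under stated assumptions: the function name disk_write_passes and the file name stress.bin are invented for illustration, and the block size is given in bytes so it does not depend on which size suffixes a particular dd accepts.

```shell
#!/bin/sh
# Hypothetical wrapper around the dd loop from this thread: write COUNT_MB
# megabytes of zeros to DIR/stress.bin, PASSES times in a row, then stop.
disk_write_passes() {
    dir=$1
    passes=$2
    count_mb=$3
    i=0
    while [ "$i" -lt "$passes" ]; do
        # bs is 1 MiB expressed in bytes; dd's transfer summary on stderr
        # is discarded so that only failures are visible to the caller.
        dd if=/dev/zero of="$dir/stress.bin" bs=1048576 count="$count_mb" \
            2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Assumed usage, roughly matching the loop above for a fixed number of passes:
#   disk_write_passes /var/tmp 500 1000
```

Watching vmstat -i in a second terminal while this runs shows whether the atapci interrupt rate climbs, which is how the storm was noticed in this thread.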
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 01:18:00PM +0100, Arnaud Houdelette wrote:
Victor Balada Diaz wrote:

Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something.

[...]

Sorry, I didn't take the time to read the whole thread, but I had a similar problem with the same IXP600 chipset. Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one. The symptoms were similar: interrupt 22 was shared between the SATA controller and the wireless card, and I got interrupt storms at random times when using the wireless network. No problems since I removed the ral(4) NIC (I have a real access point now). You might not want to point the finger at the re(4) driver too fast.

Arnaud Houdelette

Hello Arnaud,

I didn't say the problem was only because of re(4). Actually I think there were two problems, one with re(4) and another with ata(4). The reason I talked about both of them in the same mail is that, as two drivers were affected, I thought the problem might be in some other part of the operating system, and that this could help the developers debug it.

My re(4) card isn't sharing its interrupt with the IXP600, it's sharing it with the USB controller. In this case I think the problem is fixed by the advice from Pyun YongHyeon (backporting the driver from HEAD and using MSI for interrupts).

I think the problems with the ata(4) code will appear again after a few days of load, as they always do, so I'll keep trying to debug them.

Regards.

--
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: [ATA] and re(4) stability issues
On 2008-Dec-10 10:55:35 +0100, Søren Schmidt [EMAIL PROTECTED] wrote:

And you will not use 64bit DMA even if the chipset supports it. However I have not seen any chipsets supporting this fail, YMMV as usual :)

There's a reference in Wikipedia pointing to http://www.mail-archive.com/[EMAIL PROTECTED]/msg06694.html that claims the AMD/ATI SB600 lies about supporting 64-bit DMA in AHCI mode. I have an SB600 but it doesn't have 4GB to test on.

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement an MTA that is either RFC2821-compliant or matches their claimed behaviour.
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 03:08:24PM +0100, Victor Balada Diaz wrote:
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:

[...]

It seems that your controller supports MSI so you can set a tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing (e.g. add hw.re.msi_disable=0 to the /boot/loader.conf file). However there were several issues on re(4) w.r.t. MSI so it was off by default.

This is undocumented and with sysctl -a I can't find the tunable. Is this a HEAD feature or is it also in 7.1-BETA2? Should I add

Yeah, it's an undocumented feature. But most drivers written by me have similar knobs. Both HEAD and stable/7, including 7.1-BETA2, have the tunable.

I think it would be great if you could document it, or at least show it by default with a small description when you do sysctl -ad.

If MSI worked as expected I would have documented it as I did in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc. Using MSI on RealTek does not seem to be stable. I tried hard to fix that but some users still reported watchdog timeouts. Working without documentation and hardware also made it hard to complete the work. This was the main reason why MSI was disabled on re(4).

What do you think about adding a note to the man page saying that it's experimental, and that in some cases it could improve the situation but in others it will give errors?

Based on your testing I have an idea for how to mitigate the missing Tx completion interrupt. If all goes well, re(4) could reliably take advantage of MSI on RealTek controllers. If that fails miserably, I will do as you suggested.

I think re(4) in HEAD needs more testing. As you might know RealTek produced too many chipsets. :-(

Ok, I'll use the backported driver as it works better for me :-) If I can help you by testing any patches, I'm more than happy to do it. Thanks a lot for your help, Pyun YongHyeon.

You're welcome.
--
Regards, Pyun YongHyeon
Re: [ATA] and re(4) stability issues
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:

Also i didn't see any problem with interfaces going up and down, but that usually happen after some hours of uptime, so i'll let you know if the error happens again.

After writing to the HD with dd for a few hours and running stress -i 10 -d 10, the machine lost connectivity. I waited until today to find out whether the machine had hung, panicked or just lost network connectivity. I don't have local or serial access, so this was the only way I could do it.

In the logs from the night I've seen several messages like:

Dec 10 00:33:49 yac kernel: re0: watchdog timeout
Dec 10 00:33:49 yac kernel: re0: link state changed to DOWN
Dec 10 00:33:52 yac kernel: re0: link state changed to UP

The interface never recovered and I wasn't able to ping the machine until I rebooted. Nagios was checking all the time and no recovery happened.

The netstat -i in the daily scripts shows just one Oerrs. I'm used to seeing a lot of them, but it seems this time the card didn't recover from the only one.

I also want to say that this is not a regression, as it happened before with the 7.1-BETA2 code.

Is there anything more I can try?

Regards.

--
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
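Since the machine in this report has no console access, the only evidence is what ends up in the logs. A small summary of the watchdog and link-state messages shown above can make before/after comparisons between driver versions easier. This is a sketch only: link_event_summary is a hypothetical name, and it matches nothing beyond the three message texts quoted in this message.

```shell
#!/bin/sh
# Hypothetical helper: count re(4)-style trouble messages for one interface
# in a syslog-style file. Prints "IFACE watchdog=N down=N up=N".
link_event_summary() {
    iface=$1
    log=$2
    awk -v ifc="$iface" '
        index($0, ifc ": watchdog timeout")           { w++ }
        index($0, ifc ": link state changed to DOWN") { d++ }
        index($0, ifc ": link state changed to UP")   { u++ }
        END { printf "%s watchdog=%d down=%d up=%d\n", ifc, w, d, u }
    ' "$log"
}

# Assumed usage:
#   link_event_summary re0 /var/log/messages
```

A watchdog count that stays at zero over several days of load would support the conclusion that a given driver version fixed the problem, although it cannot prove it.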
[ATA] and re(4) stability issues
Hello,

I have several machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1-BETA2 on amd64. I've been trying to narrow the problem down so that someone more knowledgeable than me can fix it. This mail is another attempt to ask a question about the ATA code, to see if this time I've found something.

For those who don't know what happened: with FreeBSD 7.0-RELEASE for amd64 and the default kernel, re0 shared its interrupt with OHCI, and this caused re(4) to corrupt packets and create interrupt storms. I tried updating to 7.1-BETA2 and still had some problems with it. I opened PR kern/128287[2] and Remko quickly answered with a workaround: removing USB support from my kernel. I did that, re(4) no longer shared interrupts, and the interrupt storms were gone.

Now, some time later, the interface still goes up and down from time to time, but less often. Sometimes the machine also loses the network interface but continues to work. I know it continues to work because some days later I can see that it tried to deliver the status reports but was unable to resolve the aliases' hostnames. I can't ping the machine and I know the network is OK. If I reboot the machine, everything works again.

When I switched from 7.0 to 7.1-BETA2 I also found that under load, after some hours, the machine created interrupt storms on the ATA disks. Digging through the Linux source code, I found that it does some special things for this chipset that I have been unable to find in our code. This is the Linux code for my chipset:

    AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
                 AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
                 AHCI_HFLAG_SECT255),

The file and the rest of the code are here[3].

As I saw AHCI_HFLAG_NO_MSI, I tried the easiest thing I could think of, switching MSI and MSI-X off for the whole system, so I added these tunables to /boot/loader.conf:

    hw.pci.enable_msix=0
    hw.pci.enable_msi=0

and then rebooted the machine. After several hours of doing almost nothing, I found that the machine answered ping but was unable to answer any other request (e.g. ssh, Nagios NRPE). The machine recovered by itself after some minutes, and when I was able to ssh in I saw the following in dmesg:

    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
    ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158

and a lot more errors like that. I didn't get these errors with MSI enabled.

I see WRITE_DMA48, and in the Linux code I saw AHCI_HFLAG_32BIT_ONLY, which is later used for DMA-related things. Could someone who is more knowledgeable check if we're doing the right thing?

I've attached the verbose dmesg of a machine like this one, running 7.1-BETA2 with MSI enabled and a GENERIC kernel minus USB and FireWire.

Also, please, could someone give me a hand with how to continue debugging these interrupt issues? I'm a bit lost, and digging through code and posting each time I think I've found something is not going to get anywhere.

I would also like to say that I've seen reports of this kind of problem on amd64 machines on the lists for several years, so I don't think this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital).

Thanks in advance for any help.

Regards.

[1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
[2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
[3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369

--
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Copyright (c) 1992-2008 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 7.1-BETA2 #1: Wed Oct 22 13:19:14 CEST 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/NOUSB Preloaded elf kernel /boot/kernel/kernel at 0x80c12000. Preloaded elf obj module /boot/kernel/geom_mirror.ko at 0x80c121a8. Preloaded elf obj module /boot/kernel/accf_data.ko at 0x80c12818. Preloaded elf obj module /boot/kernel/accf_http.ko at 0x80c12cc8. Preloaded elf obj module /boot/kernel/k8temp.ko at 0x80c13238. Preloaded elf obj module /boot/kernel/geom_journal.ko at 0x80c13720. Calibrating clock(s) ... i8254 clock: 1193242 Hz CLK_USE_I8254_CALIBRATION not specified - using default frequency
Re: [ATA] and re(4) stability issues
On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:

Hello, I got various machines[1] at hetzner.de and I've been having problems with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've been trying to narrow the problem so someone more knowledgeable than me is able to fix it. This mail is an other attempt to ask a question with regards ATA code to see if this time i got something. For the ones that don't actually know what happened: With FreeBSD 7.0 -RELEASE for amd64 and default kernel the system shared re0 interrupt with OHCI and this caused re(4) to corrupt packets and create interrupt storms.

re(4) in 7.0-RELEASE had a bus_dma(9) bug which could easily be triggered on systems with 4GB of memory. But I don't know whether this is related to the interrupt storms.

Tried updating to 7.1 -BETA2 and still had some problems with it. I've opened the PR kern/128287[2] and Remko quickly answered with a workaround: that workaround was removing USB support from my kernel. I did it and re(4) wasn't sharing interrupts anylonger, and the interrupt storms were gone. Now sometime later the interface goes up and down from time to time, but less often. Also sometimes the machine losts the network interface but continues to work.

It seems that your controller supports MSI, so you can set the tunable hw.re.msi_disable to 0 to enable MSI. With MSI you can remove interrupt sharing (e.g. add hw.re.msi_disable=0 to the /boot/loader.conf file). However, there were several issues on re(4) w.r.t. MSI, so it was off by default.

I know it continues to work because some days later i can see that it tried to deliver the status reports but was unable to resolve the aliases hostnames. I can't ping the machine and i know the network is OK. If i reboot the machine everything is working again.

Recently I've made small changes to re(4) which may help to detect link state change events. Would you try re(4) in HEAD?
--
Regards, Pyun YongHyeon