Re: sk98lin for 2.6.23-rc1
On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:

There are several different problems in this thread:

1. The removal of old sk98lin driver caused some users to be forced to use skge. These users have uncovered issues with the dual port fiber based versions of the board.

Short term: The sk98lin driver should be restored to previous state, and the PCI table should be used to limit the usage to only fiber systems. If Adrian doesn't do it, I'll do it when I return from Germany.
...

No problem with this, but since it was Jeff's patch it should better be him who reverts it (and he's anyway one step nearer to Linus).

But the underlying general problem still remains: How can we get people to test and report bugs with the new drivers before removing the old driver?

That's a question especially for the people who now had problems after sk98lin was removed.

cu
Adrian

--
Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said.
Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: sk98lin for 2.6.23-rc1
Adrian Bunk wrote:

On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:

There are several different problems in this thread:

1. The removal of old sk98lin driver caused some users to be forced to use skge. These users have uncovered issues with the dual port fiber based versions of the board.

Short term: The sk98lin driver should be restored to previous state, and the PCI table should be used to limit the usage to only fiber systems. If Adrian doesn't do it, I'll do it when I return from Germany.
...

No problem with this, but since it was Jeff's patch it should better be him who reverts it (and he's anyway one step nearer to Linus).

But the underlying general problem still remains: How can we get people to test and report bugs with the new drivers before removing the old driver?

Sorry for a long answer, I'm trying to provide insight on two recent cases.

Thinking back to several drivers, when e100 was new I tried it because I had problems with eepro100 in the area of multiple cards, multiple cables on a single card, and jumbo packets. For a while I used both, until e100 worked where I needed it. So I initially tried it because it had features I needed, and then dropped to older driver just to avoid having to decide.

With sk98lin, the driver worked flawlessly with all (3-4) systems, so I had no reason to try any other. When removing sk98lin was first proposed, I tried skge, first measurements showed it was 5-8% slower, NOT what I want, so I went back. For me there was no reliability issue, but I never tried it in a system with more than one NIC on the driver. Would "it's a little slower" be a valid bug report? Or would I have gotten "works fine for me" from people not beating it over Gbit?

I didn't try sky2 until you suggested it, and I have reported my results previously, just stops working. Could it be my hardware? I tried it on one system, so yes, but sk98lin works for months.
That's a question especially for the people who now had problems after sk98lin was removed.

So if you want people to try a new driver, I think it really has to have some benefits to the users, in terms of performance, reliability, or features. Cleaner design doesn't motivate, and it does raise the question of why the old driver wasn't just cleaned up. I've been doing software for decades, I appreciate why, but users in general just want to use their system. Which raises the question of why to delete drivers which work for many or even most users?

Testing a new kernel is no longer a drop in and boot operation if modprobe.conf must be edited to get the network up, and the typical user isn't going to write that shell script to try one or the other driver.

Honestly, new drivers which offer little benefit to most users are the exception rather than the rule, so this may be a corner case. I would like to see sk98lin back in the kernel, for a while I can build my own kernels and patch it in, but until other drivers are drop-in, I probably won't change.

Separate but related: why keep skge and sky2? Are we going through this again in a year? Is the benefit worth the effort?

Hope some of this is helpful.

--
bill davidsen [EMAIL PROTECTED]
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
Re: sk98lin for 2.6.23-rc1
On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:

Adrian Bunk wrote:

On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:

There are several different problems in this thread:

1. The removal of old sk98lin driver caused some users to be forced to use skge. These users have uncovered issues with the dual port fiber based versions of the board.

Short term: The sk98lin driver should be restored to previous state, and the PCI table should be used to limit the usage to only fiber systems. If Adrian doesn't do it, I'll do it when I return from Germany.
...

No problem with this, but since it was Jeff's patch it should better be him who reverts it (and he's anyway one step nearer to Linus).

But the underlying general problem still remains: How can we get people to test and report bugs with the new drivers before removing the old driver?

Sorry for a long answer, I'm trying to provide insight on two recent cases.

Thinking back to several drivers, when e100 was new I tried it because I had problems with eepro100 in the area of multiple cards, multiple cables on a single card, and jumbo packets. For a while I used both, until e100 worked where I needed it. So I initially tried it because it had features I needed, and then dropped to older driver just to avoid having to decide.

With sk98lin, the driver worked flawlessly with all (3-4) systems, so I had no reason to try any other. When removing sk98lin was first proposed, I tried skge, first measurements showed it was 5-8% slower, NOT what I want, so I went back. For me there was no reliability issue, but I never tried it in a system with more than one NIC on the driver. Would "it's a little slower" be a valid bug report? Or would I have gotten "works fine for me" from people not beating it over Gbit?
...

If you get less throughput that is a regression, and it should be reported and fixed. I doubt anybody would have told you otherwise.

Is this bug still present as of 2.6.23-rc6?
That's a question especially for the people who now had problems after sk98lin was removed.

So if you want people to try a new driver, I think it really has to have some benefits to the users, in terms of performance, reliability, or features. Cleaner design doesn't motivate, and it does raise the question of why the old driver wasn't just cleaned up. I've been doing software for decades, I appreciate why, but users in general just want to use their system. Which raises the question of why to delete drivers which work for many or even most users?

As I already explained, there is a long term advantage for all users if there is only one driver in the kernel. Therefore all users should switch away from obsolete drivers to the replacement drivers, and the obsolete driver will be removed at some point in time. The only question is how to do it.

Testing a new kernel is no longer a drop in and boot operation if modprobe.conf must be edited to get the network up, and the typical user isn't going to write that shell script to try one or the other driver.

The typical user will let his distribution handle this. And MODULE_ALIAS can also handle this.

Honestly, new drivers which offer little benefit to most users are the exception rather than the rule, so this may be a corner case. I would like to see sk98lin back in the kernel, for a while I can build my own kernels and patch it in, but until other drivers are drop-in, I probably won't change.

That a new driver offers benefits that cause most users to switch isn't realistic. You mention e100 as an example - well, I'm using this driver in my computer, but I doubt anything would be worse for me if I'd use the obsolete eepro100 driver instead since I'm not using any of the fancy e100 features you mentioned as advantages.

There is a long term advantage for all users if there is only one driver in the kernel.
Therefore all users should switch away from obsolete drivers to the replacement drivers, and the obsolete driver will be removed at some point in time. The only question is how to do it.

Separate but related: why keep skge and sky2? Are we going through this again in a year? Is the benefit worth the effort?
...

skge and sky2 support distinct hardware.

cu
Adrian

--
Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said.
Pearl S. Buck - Dragon Seed
Re: sk98lin for 2.6.23-rc1
On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote:

On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:

So if you want people to try a new driver, I think it really has to have some benefits to the users, in terms of performance, reliability, or features. Cleaner design doesn't motivate, and it does raise the question of why the old driver wasn't just cleaned up. I've been doing software for decades, I appreciate why, but users in general just want to use their system. Which raises the question of why to delete drivers which work for many or even most users?

As I already explained, there is a long term advantage for all users if there is only one driver in the kernel.

Not only that. You have to place the switch in its context with history. Stephen, please correct me if I'm wrong, but sk98lin has been randomly working for a very long time. Not 100% the driver's fault, because it has had to work around a lot of chip bugs. The fact that this driver supports *all* chips in the family makes it harder to identify whether problems are caused by the hardware or by the driver because it is bloated with tons of if/else.

I've personally encountered random data corruption on the receive path with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes it would require one terabyte of data, sometimes just a few hundred megs. On other hardware (skge now), UDP would simply stop being sent and some TCP traffic was necessary to restart UDP! One guy at Marvell once asked me for more information, but it was not easy to provide much more, given the randomness of the problems!

Stephen has done an excellent (and thankless) job at restarting from scratch, and the idea to separate the two chips was a good one IMHO.
The problem is that he might have thought that most of the bugs were in the driver, while most of them are in the hardware, and this requires a lot of workarounds, which do not always work the same for everybody (I remember having tried to disable flow control with sk98lin because it helped with sky2).

In parallel, sk98lin has improved on the vendor's site. v8 exhibited all the problems I explained above, but v10 has fixed a lot of them, making the new sk98lin more reliable.

In parallel, sky2 and skge had got wider acceptance and testing. The nastiest hardware bugs will slowly surface, a good deal of driver bugs have been detected too (and that's expected from any new driver). It is possible that after 2 or 3 patches, a lot of the remaining problems will suddenly vanish. But it's also possible that the driver will still not work for 1% of people for 1 or 2 years because of some obscure hardware combinations which trigger some obscure hardware bugs.

Therefore all users should switch away from obsolete drivers to the replacement drivers, and the obsolete driver will be removed at some point in time. The only question is how to do it.

Desktop users generally have no problem experimenting with multiple kernels or drivers. They can report feedback too, but generally, they're not very good at downloading alternative drivers and patching their kernel with those.

Server users cannot experiment for a long time. After 2 or 3 losses of service, they *have* to provide a definitive solution. For some of them when sky2 fails, it may very well be to switch over to sk98lin. Downloading from the vendor's site and patching is not a problem for those users, but it causes them the trouble of updating the kernel for security fixes, so the old driver must be shipped with the kernel.

However, I remember something which might constitute a solution. In 2.4, there's a small bug in the kbuild process on alpha. One question is always asked during make oldconfig.
Its saved value is ignored because of the way it is computed. I don't know if we could do this with 2.6 kbuild. It would then be nice to always set sk98lin to unset if it was set to Y or M, so that at each build, the user has to explicitly state he wants it. It's annoying enough to give the other one a try once in a while, without causing too much trouble to people who really have no other choice right now.

What we need with this driver is people being fed up with it, not them being unable to use it as a last resort. Also, given that it has improved over the last years (probably due to competition pressure from sky2/skge), users will even less understand why there is such incentive to remove it.

Another trick for obsolete drivers would be to simply remove them from the usual build system, but have them being available for explicit build. E.g.: make modules will not build them, but make obsolete-modules would do.

Testing a new kernel is no longer a drop in a boot operation if modprobe.conf must be edited to get the network up, and the typical user isn't going to write that shell script to try one or the other driver.

The typical user will let his distribution handle this. And MODULE_ALIAS can also
Re: sk98lin for 2.6.23-rc1
If anyone still has hang issues with sky2, please send me the hardware information (lspci, dmesg output), and capture the debugfs state after hang.

At present the known open skge, sky2 issues are:
1) Skge doesn't work with dual port fiber
2) Sky2 (and vendor sk98lin) don't work on some MSI motherboards due to PCI issues
3) Skge and sky2 trigger problems with motherboards that don't do 4GB DMA correctly. Not a driver but a PCI quirk problem.
4) Sky2 does polling for lost irq even when interface not up
5) Sky2 polling rate for lost irq may be greater than TCP timeout so connection may be lost even after link recovers
6) Sky2 Yukon extreme support is still experimental and fragile, but this hardware isn't in the wild yet
7) Sky2 Yukon EC-U not powering up PHY correctly on some revisions
8) Sky2 suspend to ram not working right on some chips

Hardware nuisances that are being masked:
a) hardware can't DMA to unaligned address
b) polling for lost irq is a hack, but until the root cause is found it will have to stay.
Re: sk98lin for 2.6.23-rc1
On Fri, 27 Jul 2007 08:58:45 -0400 Jeff Garzik [EMAIL PROTECTED] wrote:

Stephen Hemminger wrote:

If anyone still has hang issues with sky2, please send me the hardware information (lspci, dmesg output), and capture the debugfs state after hang.

At present the known open skge, sky2 issues are:

Did you add Chris Stromsoe's skge-hangs problem to the list?

Jeff

That is #1 on the list, it is the dual port fiber problem. It seems to be specific to dual port, since I can't reproduce it on my only skge fiber board (single port) connected to e1000.
Re: sk98lin for 2.6.23-rc1
Stephen Hemminger wrote:

If anyone still has hang issues with sky2, please send me the hardware information (lspci, dmesg output), and capture the debugfs state after hang.

At present the known open skge, sky2 issues are:

Did you add Chris Stromsoe's skge-hangs problem to the list?

Jeff
Re: sk98lin for 2.6.23-rc1
If anyone still has hang issues with sky2, please send me the hardware information (lspci, dmesg output), and capture the debugfs state after hang.

At present the known open skge, sky2 issues are:
[snip]

There is still this active issue which I believe is orthogonal from the interrupt hang. I just hit it again. The IRQ-hang recovery logic doesn't seem to kick in, so it's perhaps another issue:

# ifconfig lan0
lan0      Link encap:Ethernet  HWaddr 00:03:2D:05:9C:27
          inet addr:192.168.0.250  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13304841 errors:1 dropped:1 overruns:0 frame:2
          TX packets:7493765 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:232720755 (221.9 MiB)  TX bytes:3964088142 (3.6 GiB)
          Interrupt:16

# ethtool -S lan0
NIC statistics:
     tx_bytes: 44342166229
     rx_bytes: 28807483162
     tx_broadcast: 20071
     rx_broadcast: 3309
     tx_multicast: 0
     rx_multicast: 31
     tx_unicast: 45095221
     rx_unicast: 43331790
     tx_mac_pause: 0
     rx_mac_pause: 278
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 1815009
     rx_65_to_127_byte_packets: 18626585
     rx_128_to_255_byte_packets: 788459
     rx_256_to_511_byte_packets: 4880174
     rx_512_to_1023_byte_packets: 1016831
     rx_1024_to_1518_byte_packets: 16208350
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 1632797
     tx_65_to_127_byte_packets: 11211285
     tx_128_to_255_byte_packets: 1628446
     tx_256_to_511_byte_packets: 2042646
     tx_512_to_1023_byte_packets: 637342
     tx_1024_to_1518_byte_packets: 27962776
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

# dmesg
[snip]
sky2 :01:00.0: v1.16 addr 0xdfbfc000 irq 16 Yukon-EC (0xb6) rev 1
sky2 eth1: addr 00:03:2d:05:9c:27
sky2 lan0: enabling interface
sky2 lan0: ram buffer 48K
[snip]
sky2 lan0: rx error, status 0xad78ad78 length 0
lan0: hw csum failure.
[stack-trace snipped]
lan0: hw csum failure.
...
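As a sanity check on the `ethtool -S` counters quoted above, the following Python sketch parses the `name: value` pairs and compares the per-size packet histogram with the unicast/broadcast/multicast totals. The helper names are hypothetical and the counter values are copied from the output above. That the receive histogram also counts MAC pause frames is only what this one sample suggests, not a documented property of the hardware.

```python
# Sketch only: helper names are hypothetical; counter names and values are
# copied from the `ethtool -S lan0` output in the message.

SAMPLE = """\
NIC statistics:
tx_broadcast: 20071
rx_broadcast: 3309
tx_multicast: 0
rx_multicast: 31
tx_unicast: 45095221
rx_unicast: 43331790
tx_mac_pause: 0
rx_mac_pause: 278
rx_64_byte_packets: 1815009
rx_65_to_127_byte_packets: 18626585
rx_128_to_255_byte_packets: 788459
rx_256_to_511_byte_packets: 4880174
rx_512_to_1023_byte_packets: 1016831
rx_1024_to_1518_byte_packets: 16208350
rx_1518_to_max_byte_packets: 0
tx_64_byte_packets: 1632797
tx_65_to_127_byte_packets: 11211285
tx_128_to_255_byte_packets: 1628446
tx_256_to_511_byte_packets: 2042646
tx_512_to_1023_byte_packets: 637342
tx_1024_to_1518_byte_packets: 27962776
tx_1519_to_max_byte_packets: 0
"""


def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict of ints; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def histogram_total(stats, direction):
    """Sum the per-size packet counters for 'rx' or 'tx'."""
    prefix = direction + "_"
    return sum(v for k, v in stats.items()
               if k.startswith(prefix) and k.endswith("_byte_packets"))


def frame_total(stats, direction):
    """Sum the unicast, broadcast and multicast counters for 'rx' or 'tx'."""
    return sum(stats[f"{direction}_{kind}"]
               for kind in ("unicast", "broadcast", "multicast"))


if __name__ == "__main__":
    stats = parse_ethtool_stats(SAMPLE)
    for direction in ("rx", "tx"):
        diff = histogram_total(stats, direction) - frame_total(stats, direction)
        print(direction, "histogram minus frame total:", diff)
```

For this sample the transmit side matches exactly, and the receive side differs by 278, which equals `rx_mac_pause`. The hardware counters therefore look internally consistent, which fits the report that the problem is separate from the interrupt hang, though it does not prove it.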
# cat /debug/sky2/lan0
IRQ src=0 mask=c01d control=0
Status ring (empty)
Tx ring pending=424...424 report=424 done=424
Rx ring hw get=252 put=405 last=1023

# cat /debug/sky2/lan0
IRQ src=0 mask=c01d control=0
Status ring (empty)
Tx ring pending=432...432 report=432 done=432
Rx ring hw get=252 put=415 last=1023

# cat /debug/sky2/lan0
IRQ src=0 mask=c01d control=0
Status ring (empty)
Tx ring pending=451...451 report=451 done=451
Rx ring hw get=316 put=440 last=1023

--
Daniel J Blueman
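To compare successive debugfs snapshots like the three above, a small Python sketch can extract the `Rx ring` indices and report how far `get` and `put` moved between reads. The function names are hypothetical, and treating `last + 1` as the ring size (so that wrap-around is handled with a modulo) is an assumption about the debugfs format, not something stated in the thread.

```python
import re

# Sketch only: function names are hypothetical. The sample lines are copied
# from the three `cat /debug/sky2/lan0` snapshots in the message.
SAMPLE = """\
Rx ring hw get=252 put=405 last=1023
Rx ring hw get=252 put=415 last=1023
Rx ring hw get=316 put=440 last=1023
"""

RX_RING = re.compile(r"Rx ring hw get=(\d+) put=(\d+) last=(\d+)")


def rx_snapshots(text):
    """Return a list of (get, put, last) tuples, one per 'Rx ring' line."""
    return [tuple(int(g) for g in m.groups()) for m in RX_RING.finditer(text)]


def ring_deltas(snaps):
    """Return (get_advance, put_advance) between consecutive snapshots.

    Assumption: indices wrap at last + 1, so differences are taken modulo
    that value.
    """
    deltas = []
    for (get0, put0, last), (get1, put1, _) in zip(snaps, snaps[1:]):
        size = last + 1
        deltas.append(((get1 - get0) % size, (put1 - put0) % size))
    return deltas


if __name__ == "__main__":
    for get_adv, put_adv in ring_deltas(rx_snapshots(SAMPLE)):
        print(f"get advanced {get_adv}, put advanced {put_adv}")
```

In these snapshots `put` moves on every read while `get` is unchanged between the first two and advances before the third. How to interpret that for this particular failure is a question for the driver maintainer; the sketch only makes the movement easy to see.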