Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread Adrian Bunk
On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>
> There are several different problems in this thread:
> 1. The removal of old sk98lin driver caused some users to be forced to use
>    skge. These users have uncovered issues with the dual port fiber based
>    versions of the board.
>    Short term: The sk98lin driver should be restored to previous state,
>    and the PCI table should be used to limit the usage to only fiber
>    systems.
>    If Adrian doesn't do it, I'll do it when I return from Germany.
> ...

No problem with this, but since it was Jeff's patch, it would be better
if he reverted it (he's one step nearer to Linus anyway).

But the underlying general problem still remains:

How can we get people to test and report bugs with the new drivers 
before removing the old driver?

That's a question especially for the people who now had problems after 
sk98lin was removed.

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread Bill Davidsen

Adrian Bunk wrote:
> On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>> There are several different problems in this thread:
>> 1. The removal of old sk98lin driver caused some users to be forced to use
>>    skge. These users have uncovered issues with the dual port fiber based
>>    versions of the board.
>>    Short term: The sk98lin driver should be restored to previous state,
>>    and the PCI table should be used to limit the usage to only fiber systems.
>>    If Adrian doesn't do it, I'll do it when I return from Germany.
>> ...
>
> No problem with this, but since it was Jeff's patch it should better be
> him who reverts it (and he's anyway one step nearer to Linus).
>
> But the underlying general problem still remains:
>
> How can we get people to test and report bugs with the new drivers
> before removing the old driver?

Sorry for a long answer, I'm trying to provide insight on two recent cases.

Thinking back to several drivers: when e100 was new I tried it because I 
had problems with eepro100 in the area of multiple cards, multiple 
cables on a single card, and jumbo packets. For a while I used both, 
until e100 worked where I needed it. So I initially tried it because it 
had features I needed, and then dropped the older driver just to avoid 
having to decide.


With sk98lin, the driver worked flawlessly with all (3-4) systems, so I 
had no reason to try any other. When removing sk98lin was first 
proposed, I tried skge; first measurements showed it was 5-8% slower, 
NOT what I want, so I went back. For me there was no reliability issue, 
but I never tried it in a system with more than one NIC on the driver. 
Would "it's a little slower" be a valid bug report? Or would I have 
gotten "works fine for me" from people not beating it over Gbit? I 
didn't try sky2 until you suggested it, and I have reported my results 
previously: it just stops working. Could it be my hardware? I tried it on 
one system, so yes, but sk98lin works for months.

> That's a question especially for the people who now had problems after
> sk98lin was removed.


So if you want people to try a new driver, I think it really has to have 
some benefits to the users, in terms of performance, reliability, or 
features. Cleaner design doesn't motivate, and it does raise the 
question of why the old driver wasn't just cleaned up. I've been doing 
software for decades, I appreciate why, but users in general just want 
to use their system. Which raises the question of why delete drivers 
that work for many or even most users? Testing a new kernel is no 
longer a drop-in-and-boot operation if modprobe.conf must be edited to get 
the network up, and the typical user isn't going to write that shell 
script to try one or the other driver.


Honestly, new drivers which offer little benefit to most users are the 
exception rather than the rule, so this may be a corner case. I would like 
to see sk98lin back in the kernel; for a while I can build my own 
kernels and patch it in, but until other drivers are drop-in, I probably 
won't change.


Separate but related: why keep skge and sky2? Are we going through this 
again in a year? Is the benefit worth the effort?


Hope some of this is helpful.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread Adrian Bunk
On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
> Adrian Bunk wrote:
>> On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>>> There are several different problems in this thread:
>>> 1. The removal of old sk98lin driver caused some users to be forced to
>>>    use skge. These users have uncovered issues with the dual port fiber
>>>    based versions of the board. Short term: The sk98lin driver should be
>>>    restored to previous state, and the PCI table should be used to limit
>>>    the usage to only fiber systems.
>>>    If Adrian doesn't do it, I'll do it when I return from Germany.
>>> ...
>>
>> No problem with this, but since it was Jeff's patch it should better be
>> him who reverts it (and he's anyway one step nearer to Linus).
>>
>> But the underlying general problem still remains:
>>
>> How can we get people to test and report bugs with the new drivers before
>> removing the old driver?
>
> Sorry for a long answer, I'm trying to provide insight on two recent cases.
>
> Thinking back to several drivers, when e100 was new I tried it because I
> had problems with eepro100 in the area of multiple cards, multiple cables
> on a single card, and jumbo packets. For a while I used both, until e100
> worked where I need it. So I initially tried it because it had features I
> needed, and then dropped to older driver just to avoid having to decide.
>
> With sk98lin, the driver worked flawlessly with all (3-4) systems, so I had
> no reason to try any other. When removing sk98lin was first proposed, I
> tried skge, first measurements showed it was 5-8% slower, NOT what I want,
> so I went back. For me there was no reliability issue, but I never tried it
> in a system with more than one NIC on the driver. Would "it's a little
> slower" be a valid bug report? Or would I have gotten "works fine for me"
> from people not beating it over Gbit?
...

If you get less throughput, that is a regression, and it should be 
reported and fixed.

I doubt anybody would have told you otherwise.

Is this bug still present as of 2.6.23-rc6?

>> That's a question especially for the people who now had problems after
>> sk98lin was removed.
>
> So if you want people to try a new driver, I think it really has to have
> some benefits to the users, in terms of performance, reliability, or
> features. Cleaner design doesn't motivate, and it does raise the question
> of why the old driver wasn't just cleaned up. I've been doing software for
> decades, I appreciate why, but users in general just want to use their
> system. Which raises the question of why to delete drivers which work for
> many or even most users?

As I already explained, there is a long term advantage for all users if 
there is only one driver in the kernel. Therefore all users should 
switch away from obsolete drivers to the replacement drivers, and the 
obsolete driver will be removed at some point in time. The only question 
is how to do it.

> Testing a new kernel is no longer a drop in a boot
> operation if modprobe.conf must be edited to get the network up, and the
> typical user isn't going to write that shell script to try one or the other
> driver.

The typical user will let his distribution handle this.

And MODULE_ALIAS can also handle this.

> Honestly, new drivers which offer little benefit to most users are the
> exception rather than the rule, so this may be a corner case. I would like to
> see sk98lin back in the kernel, for a while I can build my own kernels and
> patch it in, but until other drivers are drop-in, I probably won't change.

Expecting a new driver to offer benefits that cause most users to switch 
isn't realistic.

You mention e100 as an example - well, I'm using this driver in my 
computer, but I doubt anything would be worse for me if I used the 
obsolete eepro100 driver instead, since I'm not using any of the fancy 
e100 features you mentioned as advantages.

There is a long term advantage for all users if there is only one driver 
in the kernel. Therefore all users should switch away from obsolete 
drivers to the replacement drivers, and the obsolete driver will be 
removed at some point in time. The only question is how to do it.

> Separate but related: why keep skge and sky2? Are we going through this
> again in a year? Is the benefit worth the effort?
...

skge and sky2 support distinct hardware.

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread Willy Tarreau
On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote:
> On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
>> So if you want people to try a new driver, I think it really has to have
>> some benefits to the users, in terms of performance, reliability, or
>> features. Cleaner design doesn't motivate, and it does raise the question
>> of why the old driver wasn't just cleaned up. I've been doing software for
>> decades, I appreciate why, but users in general just want to use their
>> system. Which raises the question of why to delete drivers which work for
>> many or even most users?
>
> As I already explained, there is a long term advantage for all users if
> there is only one driver in the kernel.

Not only that. You have to place the switch in its historical context.
Stephen, please correct me if I'm wrong, but sk98lin has been working
erratically for a very long time. That is not 100% the driver's fault,
because it has had to work around a lot of chip bugs. The fact that this
driver supports *all* chips in the family makes it harder to identify
whether problems are caused by the hardware or by the driver, because it
is bloated with tons of if/else.

I've personally encountered random data corruption on the receive path
with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes
it would require one terabyte of data, sometimes just a few hundred
megs. On other hardware (skge now), UDP would simply stop being sent
and some TCP traffic was necessary to restart UDP! One guy at Marvell
once asked me for more information, but it was not easy to provide
much more, given the randomness of the problems!

Stephen has done an excellent (and thankless) job at restarting from
scratch, and the idea to separate the two chips was a good one IMHO.
The problem is that he might have thought that most of the bugs were
in the driver, while most of them are in the hardware, and this requires
a lot of workarounds, which do not always work the same for everybody
(I remember having tried to disable flow control with sk98lin because
it helped with sky2).

In parallel, sk98lin has improved on the vendor's site. v8 exhibited
all the problems I explained above, but v10 has fixed a lot of them,
making the new sk98lin more reliable. Meanwhile, sky2 and skge have
gained wider acceptance and testing. The nastiest hardware bugs will
slowly surface; a good number of driver bugs have been detected too
(and that's expected from any new driver).

It is possible that after 2 or 3 patches, a lot of the remaining
problems will suddenly vanish. But it's also possible that the driver
will still not work for 1% of people for 1 or 2 years because of some
obscure hardware combinations which trigger some obscure hardware bugs.

> Therefore all users should
> switch away from obsolete drivers to the replacement drivers, and the
> obsolete driver will be removed at some point in time. The only question
> is how to do it.

Desktop users generally have no problem experimenting with multiple kernels
or drivers. They can report feedback too, but generally, they're not very
good at downloading alternative drivers and patching their kernel with those.

Server users cannot experiment for a long time. After 2 or 3 losses of
service, they *have* to provide a definitive solution. For some of them,
when sky2 fails, that solution may very well be to switch over to sk98lin.
Downloading from the vendor's site and patching is not a problem for those
users, but it makes updating the kernel for security fixes troublesome, so
the old driver must be shipped with the kernel.

However, I remember something which might constitute a solution. In 2.4,
there's a small bug in the kbuild process on alpha. One question is always
asked during make oldconfig. Its saved value is ignored because of the way
it is computed. I don't know if we could do this with 2.6 kbuild. It would
then be nice to always reset sk98lin to unset if it was set to Y or M,
so that at each build, the user has to explicitly state he wants it. That is
annoying enough to make people give the other one a try once in a while,
without causing too much trouble to people who really have no other choice
right now.

What we need with this driver is people being fed up with it, not them
being unable to use it as a last resort. Also, given that it has improved
over the last years (probably due to competition pressure from sky2/skge),
users will understand even less why there is such an incentive to remove it.

Another trick for obsolete drivers would be to simply remove them from
the usual build system, but keep them available for an explicit build.
E.g.: "make modules" would not build them, but "make obsolete-modules" would.

>> Testing a new kernel is no longer a drop in a boot
>> operation if modprobe.conf must be edited to get the network up, and the
>> typical user isn't going to write that shell script to try one or the other
>> driver.
>
> The typical user will let his distribution handle this.
>
> And MODULE_ALIAS can also

Re: sk98lin for 2.6.23-rc1

2007-07-27 Thread Stephen Hemminger
If anyone still has hang issues with sky2, please send me the hardware
information (lspci, dmesg output), and capture the debugfs state after hang.
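
The report requested above could be gathered with a small script. What follows is only a sketch under stated assumptions: the debugfs directory /debug/sky2 is taken from the output quoted later in this thread and may be mounted elsewhere on other systems, and the function names and the default interface name are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Assumed debugfs location; the thread shows /debug/sky2/<ifname>, but the
# mount point differs between systems.
DEBUGFS_ROOT = Path("/debug/sky2")


def run_capture(cmd):
    """Run a command and return its output, or a note explaining why not."""
    if shutil.which(cmd[0]) is None:
        return f"[{cmd[0]} not found]\n"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[{' '.join(cmd)} failed: {exc}]\n"
    return result.stdout + result.stderr


def read_debugfs(ifname, root=DEBUGFS_ROOT):
    """Return the sky2 debugfs state for one interface, if readable."""
    try:
        return (Path(root) / ifname).read_text()
    except OSError as exc:
        return f"[cannot read debugfs state: {exc}]\n"


def build_report(ifname, root=DEBUGFS_ROOT):
    """Concatenate lspci, dmesg and debugfs output into one text report."""
    sections = [
        ("lspci -vv", run_capture(["lspci", "-vv"])),
        ("dmesg", run_capture(["dmesg"])),
        (f"debugfs state for {ifname}", read_debugfs(ifname, root)),
    ]
    return "\n".join(f"==== {title} ====\n{body}" for title, body in sections)


if __name__ == "__main__":
    # The interface name is a placeholder; pass the real one as an argument.
    print(build_report(sys.argv[1] if len(sys.argv) > 1 else "eth0"))
```

Run it as root right after the hang, before reloading the driver, so that the debugfs state still reflects the stuck interface.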

At present the known open skge, sky2 issues are:

1) Skge doesn't work with dual port fiber
2) Sky2 (and vendor sk98lin) don't work on some MSI motherboards
due to PCI issues
3) Skge and sky2 trigger problems with motherboards that don't
do 4GB DMA correctly. Not a driver but a PCI quirk problem.
4) Sky2 does polling for lost irq even when interface not up
5) Sky2 polling rate for lost irq may be greater than TCP timeout so
connection may be lost even after link recovers
6) Sky2 Yukon extreme support is still experimental and fragile,
but this hardware isn't in the wild yet
7) Sky2 Yukon EC-U not powering up PHY correctly on some revisions
8) Sky2 suspend to ram not working right on some chips

Hardware nuisances that are being masked:
a) hardware can't DMA to unaligned address
b) polling for lost irq is a hack, but until the root cause
   is found it will have to stay.


Re: sk98lin for 2.6.23-rc1

2007-07-27 Thread Stephen Hemminger
On Fri, 27 Jul 2007 08:58:45 -0400
Jeff Garzik [EMAIL PROTECTED] wrote:

> Stephen Hemminger wrote:
>> If anyone still has hang issues with sky2, please send me the hardware
>> information (lspci, dmesg output), and capture the debugfs state after hang.
>>
>> At present the known open skge, sky2 issues are:
>
> Did you add Chris Stromsoe's skge-hangs problem to the list?
>
>   Jeff

That is #1 on the list; it is the dual port fiber problem.
It seems to be specific to dual port, since I can't reproduce it on my only
skge fiber board (single port) connected to e1000.


Re: sk98lin for 2.6.23-rc1

2007-07-27 Thread Jeff Garzik

Stephen Hemminger wrote:

> If anyone still has hang issues with sky2, please send me the hardware
> information (lspci, dmesg output), and capture the debugfs state after hang.
>
> At present the known open skge, sky2 issues are:


Did you add Chris Stromsoe's skge-hangs problem to the list?

Jeff





Re: sk98lin for 2.6.23-rc1

2007-07-27 Thread Daniel J Blueman
> If anyone still has hang issues with sky2, please send me the hardware
> information (lspci, dmesg output), and capture the debugfs state after hang.
>
> At present the known open skge, sky2 issues are:
[snip]

There is still this active issue, which I believe is orthogonal to
the interrupt hang. I just hit it again. The IRQ-hang recovery logic
doesn't seem to kick in, so it's perhaps another issue:

# ifconfig lan0
lan0  Link encap:Ethernet  HWaddr 00:03:2D:05:9C:27
  inet addr:192.168.0.250  Bcast:192.168.0.255  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:13304841 errors:1 dropped:1 overruns:0 frame:2
  TX packets:7493765 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:232720755 (221.9 MiB)  TX bytes:3964088142 (3.6 GiB)
  Interrupt:16

# ethtool -S lan0
NIC statistics:
 tx_bytes: 44342166229
 rx_bytes: 28807483162
 tx_broadcast: 20071
 rx_broadcast: 3309
 tx_multicast: 0
 rx_multicast: 31
 tx_unicast: 45095221
 rx_unicast: 43331790
 tx_mac_pause: 0
 rx_mac_pause: 278
 collisions: 0
 late_collision: 0
 aborted: 0
 single_collisions: 0
 multi_collisions: 0
 rx_short: 0
 rx_runt: 0
 rx_64_byte_packets: 1815009
 rx_65_to_127_byte_packets: 18626585
 rx_128_to_255_byte_packets: 788459
 rx_256_to_511_byte_packets: 4880174
 rx_512_to_1023_byte_packets: 1016831
 rx_1024_to_1518_byte_packets: 16208350
 rx_1518_to_max_byte_packets: 0
 rx_too_long: 0
 rx_fifo_overflow: 0
 rx_jabber: 0
 rx_fcs_error: 0
 tx_64_byte_packets: 1632797
 tx_65_to_127_byte_packets: 11211285
 tx_128_to_255_byte_packets: 1628446
 tx_256_to_511_byte_packets: 2042646
 tx_512_to_1023_byte_packets: 637342
 tx_1024_to_1518_byte_packets: 27962776
 tx_1519_to_max_byte_packets: 0
 tx_fifo_underrun: 0
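
One quick sanity check on counters like these is to compare the per-size packet buckets with the unicast, broadcast and multicast totals. The sketch below does that for the values above. The helper names are hypothetical, and the reading that the receive buckets also count pause frames is only an inference from this one sample, not something the thread or the driver confirms.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value.strip())
    return stats


def bucket_total(stats, direction):
    """Sum the per-size packet counters for 'rx' or 'tx'."""
    return sum(v for k, v in stats.items()
               if k.startswith(direction + "_") and k.endswith("_byte_packets"))


def frame_total(stats, direction):
    """Sum unicast, broadcast and multicast frames for 'rx' or 'tx'."""
    return sum(stats[f"{direction}_{kind}"]
               for kind in ("unicast", "broadcast", "multicast"))


# Counters copied from the report above (only the ones used here).
SAMPLE = """\
NIC statistics:
 tx_broadcast: 20071
 rx_broadcast: 3309
 tx_multicast: 0
 rx_multicast: 31
 tx_unicast: 45095221
 rx_unicast: 43331790
 tx_mac_pause: 0
 rx_mac_pause: 278
 rx_64_byte_packets: 1815009
 rx_65_to_127_byte_packets: 18626585
 rx_128_to_255_byte_packets: 788459
 rx_256_to_511_byte_packets: 4880174
 rx_512_to_1023_byte_packets: 1016831
 rx_1024_to_1518_byte_packets: 16208350
 rx_1518_to_max_byte_packets: 0
 tx_64_byte_packets: 1632797
 tx_65_to_127_byte_packets: 11211285
 tx_128_to_255_byte_packets: 1628446
 tx_256_to_511_byte_packets: 2042646
 tx_512_to_1023_byte_packets: 637342
 tx_1024_to_1518_byte_packets: 27962776
 tx_1519_to_max_byte_packets: 0
"""

stats = parse_ethtool_stats(SAMPLE)
for direction in ("rx", "tx"):
    surplus = bucket_total(stats, direction) - frame_total(stats, direction)
    pause = stats[f"{direction}_mac_pause"]
    print(f"{direction}: buckets - frames = {surplus}, mac_pause = {pause}")
# prints:
# rx: buckets - frames = 278, mac_pause = 278
# tx: buckets - frames = 0, mac_pause = 0
```

In this sample the counters are self-consistent in both directions, so the MAC statistics themselves give no hint of lost or miscounted frames.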

# dmesg
[snip]
sky2 :01:00.0: v1.16 addr 0xdfbfc000 irq 16 Yukon-EC (0xb6) rev 1
sky2 eth1: addr 00:03:2d:05:9c:27
sky2 lan0: enabling interface
sky2 lan0: ram buffer 48K
[snip]
sky2 lan0: rx error, status 0xad78ad78 length 0
lan0: hw csum failure.
[stack-trace snipped]
lan0: hw csum failure.
...

# cat /debug/sky2/lan0
IRQ src=0 mask=c01d control=0
Status ring (empty)
Tx ring pending=424...424 report=424 done=424

Rx ring hw get=252 put=405 last=1023

# cat /debug/sky2/lan0
IRQ src=0 mask=c01d control=0
Status ring (empty)
Tx ring pending=432...432 report=432 done=432

Rx ring hw get=252 put=415 last=1023

# cat /debug/sky2/lan0
IRQ src=0 mask=c01d control=0
Status ring (empty)
Tx ring pending=451...451 report=451 done=451

Rx ring hw get=316 put=440 last=1023
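
To make the movement between the three snapshots easier to read, the ring counters can be diffed. This is a sketch only: the field meanings are guessed from their names, the Rx ring size of 1024 is assumed from last=1023, the Tx difference assumes no wraparound between samples, and the helper names are hypothetical.

```python
import re

# The Tx and Rx ring lines of the three snapshots above, in order.
SNAPSHOTS = [
    "Tx ring pending=424...424 report=424 done=424\n"
    "Rx ring hw get=252 put=405 last=1023",
    "Tx ring pending=432...432 report=432 done=432\n"
    "Rx ring hw get=252 put=415 last=1023",
    "Tx ring pending=451...451 report=451 done=451\n"
    "Rx ring hw get=316 put=440 last=1023",
]

FIELD = re.compile(r"(\w+)=(\d+)")


def parse_snapshot(text):
    """Extract numeric ring fields, keyed as 'tx_done', 'rx_get', and so on."""
    fields = {}
    for line in text.splitlines():
        if line.startswith(("Tx ring", "Rx ring")):
            prefix = line[:2].lower()
            for name, value in FIELD.findall(line):
                fields[f"{prefix}_{name}"] = int(value)
    return fields


def ring_delta(old, new, size=None):
    """Distance from old to new; modular if a ring size is given."""
    return (new - old) % size if size else new - old


parsed = [parse_snapshot(s) for s in SNAPSHOTS]
for prev, cur in zip(parsed, parsed[1:]):
    rx_size = cur["rx_last"] + 1  # assumption: 'last' is the highest index
    print("tx_done +%d, rx_get +%d, rx_put +%d" % (
        ring_delta(prev["tx_done"], cur["tx_done"]),
        ring_delta(prev["rx_get"], cur["rx_get"], rx_size),
        ring_delta(prev["rx_put"], cur["rx_put"], rx_size),
    ))
# prints:
# tx_done +8, rx_get +0, rx_put +10
# tx_done +19, rx_get +64, rx_put +25
```

Under those assumptions, transmit completions keep advancing in both intervals, while the hardware Rx "get" index stands still in the first interval and moves in the second.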
-- 
Daniel J Blueman