Re: sky2 0.11 instability

Carl-Daniel Hailfinger Sat, 21 Jan 2006 07:11:42 -0800

Carl-Daniel Hailfinger wrote:
> Hi,
> 
> Carl-Daniel Hailfinger wrote:
> 
>>after sending 259 GB and receiving 25 GB over my SysKonnect SK-9E21
>>card (sky2 says it is a "Yukon-EC (0xb6) rev 1"), the card appears
>>dead. Machine is an Athlon64 3200+ on an Asus A8N-SLI Deluxe board.
>>
>>sky2 v0.11 addr 0xc9000000 irq 74 Yukon-EC (0xb6) rev 1
>>sky2 eth3: addr 00:00:5a:70:30:fb
>>[...]
>>sky2 eth3: enabling interface
>>[...]
>>sky2 eth3: phy interrupt status 0x1c40 0x7d0c
>>sky2 eth3: Link is up at 100 Mbps, full duplex, flow control both
>>[...]
>>NETDEV WATCHDOG: eth3: transmit timed out
>>sky2 eth3: tx timeout
>>NETDEV WATCHDOG: eth3: transmit timed out
>>sky2 eth3: tx timeout
>>
>>
>>switch:~ # ifconfig eth3
>>eth3       Link encap:Ethernet  HWaddr 00:00:5A:70:30:FB
>>          inet6 addr: fe80::200:5aff:fe70:30fb/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:130530358 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:209647800 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:25980735946 (24777.1 Mb)  TX bytes:259787058579 (247752.2 Mb)
>>          Interrupt:74
>>
>>switch:~ # cat /proc/interrupts
>>           CPU0
>>  0:   11213627    IO-APIC-edge  timer
>>  1:      24783    IO-APIC-edge  i8042
>>  8:          0    IO-APIC-edge  rtc
>>  9:          0   IO-APIC-level  acpi
>> 15:     401558    IO-APIC-edge  ide1
>> 50:  249384881   IO-APIC-level  eth0
>> 58:  179123938   IO-APIC-level  sky2
>> 66:          3   IO-APIC-level  sky2, ohci1394
>> 74:   98956955   IO-APIC-level  sky2
>> 82:      19952   IO-APIC-level  sky2
>>217:       1865   IO-APIC-level  libata, NVidia CK804
>>225:     263052   IO-APIC-level  libata, ehci_hcd:usb1
>>NMI:      11098
>>LOC:   11214113
>>ERR:          0
>>MIS:          0
>>
>>Not only will the card not transmit anymore, it also doesn't
>>receive any packet at all. "ethtool -r eth3" doesn't change
>>anything, taking the interface down and up again also doesn't
>>help. The interrupt count of interrupt 74 stays constant after
>>failing.
>>
>>modprobe -r sky2; modprobe sky2
>>fixes the problem for me, so maybe resetting the card on TX
>>timeouts will help.
>>
>>The same problem appeared much earlier for another card which
>>shared interrupt 58 with an onboard card driven by skge. After
>>disabling the skge driver and rebooting, that card has been
>>stable so far.
>>
>>The card is connected to a 100 MBit switch.
>>
>>These problems didn't appear with sk98lin v8.14.3.3 (that
>>driver did survive about 10 TB of traffic before I rebooted).
>>
>>Register dumps are available on request (too big for this
>>list).
>>
>>I will now try sky2 0.13 and report back.
> 
> 
> And it hit the other interface after 200 MB transferred...
> NETDEV WATCHDOG: bridgeext0: transmit timed out
> sky2 bridgeext0: tx timeout
> NETDEV WATCHDOG: bridgeext0: transmit timed out
> sky2 transmit interrupt missed? recovered
> 
> Although the driver claims to recover, it doesn't recover at all.
> What debug level would be advisable? It is now running with
> "modprobe sky2 debug=2", but I can't see more than the messages
> above.
> 
> I have now added a hard reset routine to the tx timeout
> path and hope it won't kill my machine.


Apologies for mangled whitespace, this is just a rough cut'n'paste.
--- linux-2.6.15/drivers/net/sky2.c.orig        2006-01-21 16:00:15.000000000 +0100
+++ linux-2.6.15/drivers/net/sky2.c     2006-01-21 14:08:28.000000000 +0100
@@ -1565,6 +1565,7 @@ static int sky2_autoneg_done(struct sky2
        return 0;
 }

+static int sky2_reset(struct sky2_hw *hw);
 /*
  * Interrupt from PHY are handled outside of interrupt context
  * because accessing phy registers requires spin wait which might
@@ -1639,6 +1640,7 @@ static void sky2_tx_timeout(struct net_d
        if (netif_msg_timer(sky2))
                printk(KERN_ERR PFX "%s: tx timeout\n", dev->name);

+       if (0) {
        sky2_write32(hw, Q_ADDR(txq, Q_CSR), BMU_STOP);
        sky2_write32(hw, Y2_QADDR(txq, PREF_UNIT_CTRL), PREF_UNIT_RST_SET);

@@ -1646,6 +1648,12 @@ static void sky2_tx_timeout(struct net_d

        sky2_qset(hw, txq);
        sky2_prefetch_init(hw, txq, sky2->tx_le_map, TX_RING_SIZE - 1);
+       } else {
+       printk(KERN_ERR PFX "%s: recovering the HARD way...\n", dev->name);
+       sky2_down(dev);
+       sky2_reset(hw);
+       sky2_up(dev);
+       }
 }


And every time the kernel throws this message, I run the following
script:

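#!/bin/bash
# Take the interface name from the most recent "HARD way" recovery message.
deadinterface=$(dmesg | grep HARD | tail -1 | sed "s/.*sky2 //;s/:.*//")
# Do nothing if no such message has been logged yet.
[ -n "$deadinterface" ] || exit 0
ip link set "$deadinterface" down
ip link set "$deadinterface" up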

After that, everything continues to work until the next tx timeout
happens, and then the script again saves the day.
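
For anyone who wants to try the same workaround, here is a rough sketch of
the script split into two small functions, so the log parsing can be checked
without touching a real interface. The names extract_iface and bounce_iface,
the IP override variable and the "run" argument are made up for this
illustration; the only thing taken from the patch above is the message text
"recovering the HARD way".

```shell
#!/bin/bash
# Hypothetical helper: print the interface name found in one kernel log
# line of the form "sky2 eth3: recovering the HARD way...".
# Prints nothing if the line is not such a message.
extract_iface() {
    printf '%s\n' "$1" |
        sed -n 's/.*sky2 \([^: ]*\): recovering the HARD way.*/\1/p'
}

# Hypothetical helper: take an interface down and bring it up again.
# Setting IP=echo turns this into a dry run that only prints the commands.
bounce_iface() {
    local ip_cmd=${IP:-ip}
    "$ip_cmd" link set "$1" down && "$ip_cmd" link set "$1" up
}

# Only act on the live kernel log when called with the argument "run".
if [ "${1:-}" = "run" ]; then
    iface=$(extract_iface "$(dmesg | grep 'recovering the HARD way' | tail -n 1)")
    if [ -n "$iface" ]; then
        bounce_iface "$iface"
    fi
fi
```

Matching on the full message text instead of just "HARD" should make it less
likely that an unrelated log line is picked up, and the empty-name check
avoids calling ip without an interface.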

More results about the circumstances of this bug: It seems that
it will only trigger under LOW load. As long as I keep the interface
busy, it will have no problems at all.
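
To put numbers on the "low load" observation, something like the following
could log the transmit packet rate so that it can be compared with the
timestamps of the tx timeouts. This is only a sketch: rate_pps, tx_packets,
the STATS_ROOT override and the "log" argument are names invented here, and
it assumes the per-interface counters are readable under
/sys/class/net/<interface>/statistics/ on the running kernel.

```shell
#!/bin/bash
# Hypothetical helper: packets per second between two counter samples.
# $1 = earlier count, $2 = later count, $3 = seconds between samples (> 0).
# Uses integer arithmetic, so the result is truncated.
rate_pps() {
    echo $(( ($2 - $1) / $3 ))
}

# Hypothetical helper: read the TX packet counter of interface $1.
# STATS_ROOT can point somewhere else for testing (assumption: the
# counters normally live under /sys/class/net).
tx_packets() {
    cat "${STATS_ROOT:-/sys/class/net}/$1/statistics/tx_packets"
}

# When called as "<script> log eth3 [interval]", print one line with
# timestamp, interface and packet rate per interval until interrupted.
if [ "${1:-}" = "log" ] && [ -n "${2:-}" ]; then
    iface=$2
    interval=${3:-10}
    prev=$(tx_packets "$iface")
    while sleep "$interval"; do
        cur=$(tx_packets "$iface")
        echo "$(date '+%F %T') $iface $(rate_pps "$prev" "$cur" "$interval") pps"
        prev=$cur
    done
fi
```

Counter wraparound is not handled, so a negative rate right after a wrap
should simply be ignored.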


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/
