On Thu, 14 Dec 2006 12:47:05 -0800
Alex Romosan <[EMAIL PROTECTED]> wrote:

> under heavy network load the sky2 driver (compiled in the kernel)
> locks up and the only way i can get the network back is to reboot the
> machine (bringing the network down and back up again doesn't help).
> this happens on an amd64 machine (athlon 3500+ processor) and the card
> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
> Ethernet Controller (rev 15) (from lspci). this is what i see in the
> syslog:
> 
> kernel: sky2 eth0: rx error, status 0x414a414a length 0
> kernel: eth0: hw csum failure.
> kernel: 
> kernel: Call Trace:
> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
> kernel:  [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9
> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
> kernel:  [<ffffffff80208710>] default_idle+0x0/0x3a
> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
> kernel:  <EOI>  [<ffffffff80208736>] default_idle+0x26/0x3a
> kernel:  [<ffffffff8020878c>] cpu_idle+0x42/0x75
> kernel:  [<ffffffff805df675>] start_kernel+0x1ce/0x1d3
> kernel:  [<ffffffff805df140>] _sinittext+0x140/0x144
> kernel: 
> kernel: eth0: hw csum failure.
> kernel: 
> kernel: Call Trace:
> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
> kernel:  [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5
> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
> kernel:  <EOI>  [<ffffffff802a8402>] inode2sd+0x104/0x117
> kernel:  [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe
> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
> kernel:  [<ffffffff80284778>] ll_rw_block+0x89/0x9e
> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
> kernel:  [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d
> kernel:  [<ffffffff80284053>] __find_get_block+0x197/0x1a5
> kernel:  [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52
> kernel:  [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284
> kernel:  [<ffffffff80237700>] kthread+0xed/0xfd
> kernel:  [<ffffffff802be990>] do_journal_end+0x34b/0xbdd
> kernel:  [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76
> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
> kernel:  [<ffffffff802809b1>] __mark_inode_dirty+0x29/0x197
> kernel:  [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f
> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
> kernel:  [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4
> kernel:  [<ffffffff80271b3c>] __pollwait+0x0/0xe0
> kernel:  [<ffffffff8022a006>] current_fs_time+0x35/0x3b
> kernel:  [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec
> kernel:  [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9
> kernel:  [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210
> kernel:  [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1
> kernel:  [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1
> kernel:  [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107
> kernel:  [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e
> kernel:  [<ffffffff80229d16>] getnstimeofday+0x10/0x28
> kernel:  [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc
> kernel:  [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b
> kernel:  [<ffffffff80265a2c>] sys_writev+0x45/0x93
> kernel:  [<ffffffff802096de>] system_call+0x7e/0x83
> 
> and so on. some times i don't get this trace but instead i get:
> 
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
> kernel: sky2 status report lost?
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
> kernel: sky2 hardware hung? flushing
> 
> but the end result is the same, the network card stops responding and
> i have to reboot the machine. i can reproduce this on a consistent
> basis so if there are any patches, i can try them out and see if they
> fix the problem.
> 
> this is probably not a regression per se as i saw it as well with
> 2.6.19 and 2.6.19-rc6. i am not sure if it was there previous to
> 2.6.19-rc6. suggestions, patches welcome. thanks.

Please report these problems to netdev@vger.kernel.org; I rarely go
looking in LKML.

These are the things needed to debug a sky2-related problem.

1) What is the exact kernel version in use?  This is important because
   problems get fixed, but it can be a long while until the fix trickles
   down to the vendor kernels.

2) What is the chip version?  The driver prints this out on boot up in
   the console log (dmesg | grep sky2).
   This matters because each chip version has different
   bugs to deal with.

3) Does it work with the vendor driver?
   The vendor driver does a number of things differently than the sky2 driver
   and can mask problems, but if it doesn't work either, that is a useful
   data point.  If you want to know why the sky2 driver was written instead
   of just using the vendor driver, look at the code. The sk98lin driver
   is huge, includes features that are unsupportable and broken, and has
   locking mistakes.  But sk98lin also has a watchdog that masks bugs, and
   trying it may provide useful insight.

4) What is the IRQ routing?
   There are two issues here. First, the driver will never work with
   edge-triggered IRQs. Second, some motherboards have a broken BIOS or
   chipset that doesn't do MSI properly. A couple of module parameters
   are available to help:
      disable_msi=1             avoids using MSI
      idle_timeout=10           polls for lost IRQs every N ms (here 10)

5) What are the messages in the console log when the problem happens?

6) Are you running any of the following: bonding, vlans, bridging,
   netfilter, traffic control?

7) Please get a current version of ethtool from:
   git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
   and run ethtool register dump after a problem occurs:
      ethtool -d eth0

8) Are you using a dual-port board?  There were issues on the PCI-X
   version that required hacks; the PCI-Express version may have the
   same problem.  Basically, checksum offload wouldn't work and receive
   DMAs would arrive out of order.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>