On Thu, 14 Dec 2006 12:47:05 -0800 Alex Romosan <[EMAIL PROTECTED]> wrote:
> under heavy network load the sky2 driver (compiled in the kernel) > locks up and the only way i can get the network back is to reboot the > machine (bringing the network down and back up again doesn't help). > this happens on an amd64 machine (athlon 3500+ processor) and the card > in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit > Ethernet Controller (rev 15) (from lspci). this is what i see in the > syslog: > > kernel: sky2 eth0: rx error, status 0x414a414a length 0 > kernel: eth0: hw csum failure. > kernel: > kernel: Call Trace: > kernel: <IRQ> [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66 > kernel: [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea > kernel: [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20 > kernel: [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4 > kernel: [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b > kernel: [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab > kernel: [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e > kernel: [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c > kernel: [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9 > kernel: [<ffffffff8044a796>] net_rx_action+0x61/0xf0 > kernel: [<ffffffff8022a35f>] __do_softirq+0x40/0x8a > kernel: [<ffffffff8020a3cc>] call_softirq+0x1c/0x28 > kernel: [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d > kernel: [<ffffffff8022a313>] irq_exit+0x36/0x42 > kernel: [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e > kernel: [<ffffffff80208710>] default_idle+0x0/0x3a > kernel: [<ffffffff80209bf1>] ret_from_intr+0x0/0xa > kernel: <EOI> [<ffffffff80208736>] default_idle+0x26/0x3a > kernel: [<ffffffff8020878c>] cpu_idle+0x42/0x75 > kernel: [<ffffffff805df675>] start_kernel+0x1ce/0x1d3 > kernel: [<ffffffff805df140>] _sinittext+0x140/0x144 > kernel: > kernel: eth0: hw csum failure. > kernel: > kernel: Call Trace: > kernel: <IRQ> [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66 > kernel: [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea > kernel: [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20 > kernel: [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4 > kernel: [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b > kernel: [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab > kernel: [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e > kernel: [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c > kernel: [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5 > kernel: [<ffffffff8044a796>] net_rx_action+0x61/0xf0 > kernel: [<ffffffff8022a35f>] __do_softirq+0x40/0x8a > kernel: [<ffffffff8020a3cc>] call_softirq+0x1c/0x28 > kernel: [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d > kernel: [<ffffffff8022a313>] irq_exit+0x36/0x42 > kernel: [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e > kernel: [<ffffffff80209bf1>] ret_from_intr+0x0/0xa > kernel: <EOI> [<ffffffff802a8402>] inode2sd+0x104/0x117 > kernel: [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe > kernel: [<ffffffff802b8475>] search_by_key+0x183/0xbfe > kernel: [<ffffffff80284778>] ll_rw_block+0x89/0x9e > kernel: [<ffffffff802b8475>] search_by_key+0x183/0xbfe > kernel: [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d > kernel: [<ffffffff80284053>] __find_get_block+0x197/0x1a5 > kernel: [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52 > kernel: [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284 > kernel: [<ffffffff80237700>] kthread+0xed/0xfd > kernel: [<ffffffff802be990>] do_journal_end+0x34b/0xbdd > kernel: [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76 > kernel: [<ffffffff80284c19>] block_prepare_write+0x1a/0x24 > kernel: [<ffffffff802809b1>] __mark_inode_dirty+0x29/0x197 > kernel: [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f > kernel: [<ffffffff80284c19>] block_prepare_write+0x1a/0x24 > kernel: [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4 > kernel: [<ffffffff80271b3c>] __pollwait+0x0/0xe0 > kernel: [<ffffffff8022a006>] current_fs_time+0x35/0x3b > kernel: [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec > kernel: [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9 > kernel: [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210 > kernel: [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1 > kernel: [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1 > kernel: [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107 > kernel: [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e > kernel: [<ffffffff80229d16>] getnstimeofday+0x10/0x28 > kernel: [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc > kernel: [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b > kernel: [<ffffffff80265a2c>] sys_writev+0x45/0x93 > kernel: [<ffffffff802096de>] system_call+0x7e/0x83 > > and so on. some times i don't get this trace but instead i get: > > kernel: sky2 eth0: tx timeout > kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181 > kernel: sky2 status report lost? > kernel: NETDEV WATCHDOG: eth0: transmit timed out > kernel: sky2 eth0: tx timeout > kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181 > kernel: sky2 hardware hung? flushing > > but the end result is the same, the network card stops responding and > i have to reboot the machine. i can reproduce this on a consistent > basis so if there are any patches, i can try them out and see if they > fix the problem. > > this is probably not a regression per se as i saw it as well with > 2.6.19 and 2.6.19-rc6. i am not sure if it was there previous to > 2.6.19-rc6. suggestions, patches welcome. thanks. Pleas report these problems to netdev@vger.kernel.org, I rarely go looking in LKML. These are the things you need to debug a sky2 related problem. 1) What is exact kernel version in use? This is important because problems get fixed but it can be a long while until the fix bubbles down to the vendor kernels. 2) What is the chip version? The driver prints this out on boot up in the console log. (dmesg | grep sky2) This matters because each chip version has different bugs to deal with. 3) Does it work with the vendor driver? The vendor driver does a number of things differently than the sky2 driver and can mask problems, but if it doesn't work as well that is a useful data point. If you want to know why the sky2 driver was written instead of just using the vendor driver, look at the code. The sk98lin driver is huge, includes features that are unsupportable and broken, and locking mistakes. But the sk98lin also has a watchdog that masks off bugs and may provide useful insight. 4) What is the IRQ routing? There are two issues here, first the driver will never work with edge trigger IRQ's, some motherboards also have busted BIOS and chipsets that don't do MSI properly. A couple of module parameters are available to help: disable_msi=1 avoids using MSI idle_timeout=10 polls for lost IRQ's every N ms (10) 5) What are the messages in the console log when problem happens? 6) Are you running any of the following: bonding, vlans, bridging, netfilter, traffic control? 7) Please get a current version of ethtool from: git://git.kernel.org/pub/scm/network/ethtool/ethtool.git and run ethtool register dump after a problem occurs: ethtool -d eth0 8) Are you using a dual port board. There were issues on the PCI-X version that required hacks, the PCI-express version may have the same problem. Basically, checksum offload wouldn't work and receive DMA's would arrive out of order. -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html