Re: [E1000-devel] sf.net bug ID 2934941: "Detected Tx Unit Hang" on quad port copper 82576

Покотиленко Костик Tue, 09 Feb 2010 13:35:32 -0800

On Tue, 09 Feb 2010 10:46:46 +0200, "Покотиленко Костик" wrote:


> On Mon, 08/02/2010 at 14:03 -0800, Duyck, Alexander H wrote:
>> Покотиленко Костик wrote:
>> > On Fri, 29 Jan 2010 01:29:05 +0200, "Покотиленко Костик" wrote:
>> >
>> >> On Thu, 28/01/2010 at 14:32 -0800, Alexander Duyck wrote:
>> >>> On Wed, 2010-01-27 at 04:14 -0800, Покотиленко Костик wrote:
>> >>>> Using serial console I've figured out:
>> >>>>
>> >>>> - system working fine except for the NIC
>> >>>> - ifconfig shows only RX dropped increasing on eth1 (client side);
>> >>>> other counters stalled.
>> >>>> - ethtool -t eth0:
>> >>>>
>> >>>> The test result is FAIL
>> >>>> The test extra info:
>> >>>> Register test  (offline)         0
>> >>>> Eeprom test    (offline)         0
>> >>>> Interrupt test (offline)         0
>> >>>> Loopback test  (offline)         13
>> >>>> Link test   (on/offline)         0
>> >>>>
>> >>>> - ethtool -t eth1
>> >>>>
>> >>>> The test result is FAIL
>> >>>> The test extra info:
>> >>>> Register test  (offline)         0
>> >>>> Eeprom test    (offline)         0
>> >>>> Interrupt test (offline)         0
>> >>>> Loopback test  (offline)         13
>> >>>> Link test   (on/offline)         0
>> >>>>
>> >>>> - After doing:
>> >>>>
>> >>>> ifdown -a; rmmod igb; rmmod dca; modprobe igb; ifup -a
>> >>>>
>> >>>> both ethtool commands (The test result is FAIL) and ifconfig show
>> >>>> the same result.
>> >>>>
>> >>>> So it seems like the NIC hardware hangs.
>> >>>
>> >>> The next time this occurs could you go through and run the ethtool
>> >>> test on all of the network ports?  I'm wondering if it is only
>> >>> eth0/1 that are blocked or if eth3/4 are stopped as well.
>> >>
>> >> Sure.
>> >
>> > Last time we changed some BIOS options to:
>> >
>> > Execute Disable Bit: Disabled
>> > ACPI 1.0 Support: Enabled (When Disabled it's 3.0(??))
>> >
>> > After which the system worked for almost 9 days with 2.6.30. Then the
>> > same problem.
>> >
>> > Forgot to do ethtool test for all ports :/

Well, it happened again, ethtool -t "Loopback test" failed for all 4 ports.
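For anyone repeating this on all ports, here is a rough sketch of how the per-port self-test output could be filtered down to the failing sub-tests. The helper name failed_tests is made up for illustration, and the parsing assumes the output layout quoted earlier in this thread (test name, mode in parentheses, numeric result in the last column).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ethtool -t` output on stdin and print the name
# of every sub-test whose numeric result (last column) is non-zero.
# Assumes the layout quoted in this thread, e.g.
#   "Loopback test  (offline)         13"
failed_tests() {
    awk '/ test / && $NF ~ /^[0-9]+$/ && $NF != 0 { print $1 }'
}

# Sample taken from the eth0 output quoted earlier in the thread.
printf '%s\n' \
    'The test result is FAIL' \
    'The test extra info:' \
    'Register test  (offline)         0' \
    'Eeprom test    (offline)         0' \
    'Interrupt test (offline)         0' \
    'Loopback test  (offline)         13' \
    'Link test   (on/offline)         0' | failed_tests
```

On a live system the same filter could be applied per port, for example `for dev in eth0 eth1 eth2 eth3; do echo "$dev: $(ethtool -t "$dev" offline | failed_tests)"; done`. The interface names there are assumptions, and the offline test interrupts traffic on the port while it runs.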

>> Based on the results it seems like what is failing is the hardware's
>> ability to handle DMA transactions.  Ideally if possible it would be
>> best if you could do an lspci -t dump of the system and work your way
>> up until you find at which point in the tree we have the failure.  The
>> ethtool -t test seems to show the failure as a loopback test so we
>> should be able to at least test this up to the PCIe bridge on the
>> adapter.
>
> lspci -tv attached.

lspci -tv during failure doesn't differ.

Also, it seems that more load makes it happen sooner. Average load here is
55Mbit/s (total throughput across 2 ports), maximum is ~150Mbit/s.

> During the last 2 days the system rebooted twice shortly after the problem
> occurred, so no ethtool tests yet.
>
> BTW, I have many "UDP: bad checksum" messages before the issue occurs
> like this:
>
> Feb  8 18:49:16 lan-r kernel: [99067.458074] UDP: bad checksum. From 95.169.150.116:48810 to 89.28.200.210:1126 ulen 181
> Feb  8 18:49:24 lan-r kernel: [99074.976709] __ratelimit: 29 callbacks suppressed
>
> Also today there was:
>
> Feb  9 09:57:33 lan-r kernel: [53517.383722] igb 0000:03:00.1: Detected Tx Unit Hang
> Feb  9 09:57:33 lan-r kernel: [53517.383725]   Tx Queue             <0>
> Feb  9 09:57:33 lan-r kernel: [53517.383729]   TDH                  <aa>
> Feb  9 09:57:33 lan-r kernel: [53517.383730]   TDT                  <e8>
> Feb  9 09:57:33 lan-r kernel: [53517.383730]   next_to_use          <e8>
> Feb  9 09:57:33 lan-r kernel: [53517.383731]   next_to_clean        <aa>
> Feb  9 09:57:33 lan-r kernel: [53517.383732] buffer_info[next_to_clean]
> Feb  9 09:57:33 lan-r kernel: [53517.383732]   time_stamp           <cb1921>
> Feb  9 09:57:33 lan-r kernel: [53517.383733]   next_to_watch        <ab>
> Feb  9 09:57:33 lan-r kernel: [53517.383734]   jiffies              <cb1c48>
> Feb  9 09:57:33 lan-r kernel: [53517.383734]   desc.status          <158000>
>
> But the system is still alive.
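
For reference, the hex fields in that dump can be turned into something more readable with plain shell arithmetic. This is only a sketch: the function names are invented for illustration, the ring size of 256 is an assumption (the real value can be checked with ethtool -g), and HZ for this kernel is not stated anywhere in the thread.

```shell
#!/usr/bin/env bash
# Values copied from the "Detected Tx Unit Hang" dump quoted above.
tdh=aa; tdt=e8
time_stamp=cb1921; jiffies=cb1c48

# Descriptors queued between head (TDH) and tail (TDT), allowing for
# ring wrap-around. $3 is the ring size; 256 below is an ASSUMPTION.
pending_descriptors() {
    echo $(( (0x$2 - 0x$1 + $3) % $3 ))
}

# Age of the oldest uncleaned buffer, in jiffies (not seconds).
stalled_jiffies() {
    echo $(( 0x$2 - 0x$1 ))
}

echo "pending descriptors: $(pending_descriptors "$tdh" "$tdt" 256)"
echo "stalled for: $(stalled_jiffies "$time_stamp" "$jiffies") jiffies"
```

With the values above this gives 62 pending descriptors and 807 jiffies. Converting jiffies to seconds needs the kernel's HZ: roughly 3.2 s if HZ=250, or 0.8 s if HZ=1000.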
>
>> Also if ACPI is having an effect on the issue one other thing you
>> might try changing in the BIOS would be to disable all CPU C-states.
>> The system will consume more power as a result, but the CPU also ends
>> up usually being much more responsive as a result, and we have seen in
>> the past that this can sometimes resolve performance issues.
>
> I'll turn those off:
>
> CPU C State=1               ;Options: 1=Enabled: 0=Disabled
> C1E=1                       ;Options: 1=Enabled: 0=Disabled

Turned off "CPU C State" and "Spread spectrum", C1E turned off automatically.

> Full current BIOS config attached.
>
> --
> Покотиленко Костик <[email protected]>
>



_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
