Re: [E1000-devel] Detected Tx Unit Hang

Gary Smith Thu, 12 Mar 2009 15:13:17 -0700

This probably means the same bug exists in Windows as well. That is where we 
hit the problem first, and then converted it to a NAS server.

Gary

-----Original Message-----
From: "Brandeburg, Jesse" <jesse.brandeb...@intel.com>

Date: Thu, 12 Mar 2009 15:01:27 
To: Gary W. Smith<g...@primeexalia.com>
Cc: e1000-devel@lists.sourceforge.net<e1000-devel@lists.sourceforge.net>
Subject: RE: [E1000-devel] Detected Tx Unit Hang

so, the 4GB patch will only cause a slight increase in cpu utilization.  There 
are no other side effects, and you *DO NOT* have to run the TxDescriptorStep 
workaround.

I think I might just push the change to not allow 64 bit addressing to these 32 
bit adapters, into e1000.

Glad to hear things are working better,
  Jesse

________________________________
From: Gary W. Smith [mailto:g...@primeexalia.com]
Sent: Thursday, March 12, 2009 2:45 PM
To: Gary W. Smith; Brandeburg, Jesse
Cc: e1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang

That was a bad example...

I was copying to/from the same instance from a machine running under VMware.

I now have a physical machine copying the 50gb of files from the bad machine to 
another machine and everything is still going smooth, but much faster this time.

This is the dstat from the bad machine.  The limiter is the disk (which is 
about 40 MB/sec).  We were hitting the error before with only 10 MB.  So this 
is definitely a positive thing.

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5  88   0   1   5|  25M    0 | 443k   26M|   0     0 |5119  1560
  0   8  85   0   2   6|  32M    0 | 570k   33M|   0     0 |6311  1724
  0   2  97   0   0   2|6662k    0 | 135k 7626k|   0     0 |1992   355
  0   0 100   0   0   0|   0     0 | 194B  240B|   0     0 |1007    16
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1024    40
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1006    16
  0   0 100   0   0   0|   0     0 | 226B  240B|   0     0 |1024    40
  0   0 100   0   0   0|   0     0 | 318B  240B|   0     0 |1008    16
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1023    38
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1006    18
  0   2  97   0   1   2|8960k    0 | 140k 8492k|   0     0 |2357   639
  0   8  85   0   2   7|  33M    0 | 599k   35M|   0     0 |5949  2036
  0  10  78   0   2  10|  43M    0 | 793k   46M|   0     0 |6993  2176
  0   9  79   0   2  10|  42M    0 | 751k   44M|   0     0 |6810  2213
  0   9  82   0   2   8|  37M    0 | 661k   39M|   0     0 |6998  1863
  0   7  86   0   1   5|  28M    0 | 521k   30M|   0     0 |5874  1933
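
To compare the net/total send column against the roughly 40 MB/sec disk limit mentioned above, the dstat rows can be converted to plain byte counts. This is only a sketch: the helper names `to_bytes` and `net_send` are made up for illustration, and it assumes dstat's k/M/G suffixes are powers of 1024 and that the columns are separated by `|` as in the output shown. Three of the rows above are used as sample input.

```python
# Assumption: dstat size suffixes are binary multiples (k = 1024 bytes).
UNITS = {"B": 1, "k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(field):
    """Convert a dstat size field such as '26M', '7626k', '240B' or '0'."""
    field = field.strip()
    if field[-1] in UNITS:
        return int(field[:-1]) * UNITS[field[-1]]
    return int(field)


def net_send(row):
    """Return the net/total 'send' value of one dstat row, in bytes."""
    net_column = row.split("|")[2]      # cpu | disk | net | paging | system
    _recv, send = net_column.split()
    return to_bytes(send)


ROWS = [
    "  0   5  88   0   1   5|  25M    0 | 443k   26M|   0     0 |5119  1560",
    "  0  10  78   0   2  10|  43M    0 | 793k   46M|   0     0 |6993  2176",
    "  0   0 100   0   0   0|   0     0 | 194B  240B|   0     0 |1007    16",
]

peak = max(net_send(row) for row in ROWS)
print(peak // 1024 ** 2)  # peak send rate in MiB per sample: 46
```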

________________________________
From: Gary W. Smith [mailto:g...@primeexalia.com]
Sent: Thu 3/12/2009 2:32 PM
To: Brandeburg, Jesse
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Detected Tx Unit Hang

Jesse,

Looks better.  Transferring 50GB to/from the server and I'm not getting the 
errors in the log now.   Very large pings (ping vcsoaknas01 -t -l 30000 -w 
7000) are occasionally timing out BUT I haven't lost connectivity to the SSH 
session as of yet and the file transfer is still going.  dstat is also running 
consistently (no random TX hangs like before).
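
For context on the large pings: the flags look like Windows ping options, where -l 30000 requests a 30000-byte payload. On a link with a standard 1500-byte MTU (an assumption; the thread does not state the MTU) each echo request is split into many IP fragments, and losing any one of them loses the whole ping, which may be part of why these time out more readily than the SSH session does. A rough sketch of the arithmetic, with `fragment_count` as a hypothetical helper name:

```python
import math

MTU = 1500        # assumption: standard Ethernet MTU, not stated in the thread
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header


def fragment_count(payload, mtu=MTU):
    """Number of IPv4 fragments needed for one ICMP echo of `payload` bytes."""
    # Fragment data must be a multiple of 8 bytes (except in the last one).
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((payload + ICMP_HEADER) / per_fragment)


print(fragment_count(30000))  # 21
print(fragment_count(32))     # 1
```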

dstat:

  0   3  92   0   1   4|4224k   13M|7338k 4642k|   0     0 |  10k   12k
  0   1  98   0   0   1| 936k 3872k|1951k 1040k|   0     0 |3649  3403
  0   1  96   0   0   2|1496k 8879k|4638k 1700k|   0     0 |7378  8853
  0   4  91   0   1   4|  13M 3678k|2382k   14M|   0     0 |9188  7267
  0   3  93   0   1   4|4352k   15M|7864k 4877k|   0     0 |  11k   13k
  0   2  95   0   1   3| 384k   14M|7389k  516k|   0     0 |9990    12k
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   2  95   0   0   2|2816k 8104k|4327k 3098k|   0     0 |7075  7510
  0   2  93   0   0   4|5696k 9120k|4918k 6176k|   0     0 |8478  8300
  0   2  95   0   0   3|3968k 6720k|3610k 4306k|   0     0 |6425  6107
  0   2  95   0   0   3|4736k 7616k|4081k 5135k|   0     0 |7242  6974
  0   2  95   0   1   3|4224k 6816k|3687k 4582k|   0     0 |6589  6344
  0   2  95   0   0   3|4096k 7016k|3748k 4445k|   0     0 |6546  6311
  0   1  96   0   0   2|3136k 5288k|2852k 3402k|   0     0 |5251  4936

We have 50GB on an iscsi share (or 500GB) that we are copying to/from over the 
wire for this test.  During the writing of this email we have already copied 
about 1.3gb without any problem as of yet.

So my next question is regarding the 4GB patch.  Does this have any negative 
impact that I need to be aware of?

Gary

________________________________

From: Brandeburg, Jesse [mailto:jesse.brandeb...@intel.com]
Sent: Thu 3/12/2009 1:59 PM
To: Gary W. Smith
Cc: e1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang

re-added the list for tracking...

I think I see the issue, you have more than 4GB ram, and it appears that your 
system doesn't handle dual address cycles correctly, or our adapter doesn't 
work quite right for some reason.

Force the OS to never allow addresses > 4GB to our hardware using this patch:
https://sourceforge.net/tracker2/download.php?group_id=42302&atid=447449&file_id=283326&aid=2007017

It's the e1000_disable_dac.patch file.
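
As a rough illustration of what "addresses > 4GB" means here (this is not the contents of e1000_disable_dac.patch): a buffer whose bus address fits in 32 bits can be reached with a single PCI address cycle, while anything above that boundary needs a dual address cycle (DAC). The helper name `needs_dac` and the sample addresses are made up for the example.

```python
DMA_32BIT_MASK = (1 << 32) - 1  # highest address reachable without DAC
GIB = 1 << 30


def needs_dac(bus_addr, length):
    """True if any byte of the buffer lies above the 32-bit boundary."""
    last_byte = bus_addr + length - 1
    return last_byte > DMA_32BIT_MASK


print(needs_dac(3 * GIB, 2048))         # False: entirely below 4GB
print(needs_dac(5 * GIB, 2048))         # True: entirely above 4GB
print(needs_dac(4 * GIB - 1024, 2048))  # True: straddles the boundary
```

With DAC disabled, the OS has to hand the adapter only buffers for which this check is false, which on a machine with more than 4GB of RAM can mean extra copying. That would be consistent with the slight CPU increase mentioned elsewhere in the thread.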

________________________________

From: Gary W. Smith [mailto:g...@primeexalia.com]
Sent: Thursday, March 12, 2009 12:55 PM
To: Brandeburg, Jesse
Subject: RE: [E1000-devel] Detected Tx Unit Hang

Jesse,

Included is the messages log with the debug patch.  It only took a couple 
seconds to get it to trigger the problem even with the modprobe.conf changes.

options e1000 TxDescriptorStep=4,4
alias eth0 e1000
alias eth1 e1000

Anyway, I did update the BIOS about a month back to try to see if that would 
resolve the problem but it did not.  It does have the latest.  We saw a similar 
problem under Windows 2003 with SP1+ but ruled it as being part of the TCP 
offload/DoS patch bug they had and I didn't think much of it (as it affected 
several other servers).  The problem under Windows existed whether or not we 
used the onboard NIC.  In fact, we used a separate Broadcom 1Gb adapter 
(thinking it was the TCP offload) and it didn't resolve it either.

I'm really hoping that this isn't a hardware issue (as it's not a warrantied 
box) but if it is then we will just have to deal with that separately.

Thanks for all of the help,

Gary

________________________________

From: Brandeburg, Jesse [mailto:jesse.brandeb...@intel.com]
Sent: Thu 3/12/2009 9:33 AM
To: Gary W. Smith
Cc: e1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang

sorry, go to the home page http://sourceforge.net/projects/e1000
click Tracker
click patches
click tx hang debug code (all releases) - 1460945
download the e1000_806_dump.patch, it should apply with fuzz to your e1000 
driver directory with the command

download file.patch...
patch -d e1000-8.0.* -p1 < file.patch

here is the download link
https://sourceforge.net/tracker2/download.php?group_id=42302&atid=447451&file_id=298629&aid=1460945

________________________________

From: Gary W. Smith [mailto:g...@primeexalia.com]
Sent: Thursday, March 12, 2009 9:16 AM
To: Brandeburg, Jesse
Cc: e1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang

Excuse my ignorance, but which patches? ;).  There's a lot of stuff on the 
download page.  I assume you are talking about the I/OAT driver & kernel patch 
but I want to make sure before doing it.

>
> Mar 11 18:50:01 vcsoaknas01 kernel: e1000: eth0: e1000_clean_tx_irq:
> Detected Tx Unit Hang
> Mar 11 18:50:01 vcsoaknas01 kernel:   Tx Queue             <0>
> Mar 11 18:50:01 vcsoaknas01 kernel:   TDH                  <f7>
> Mar 11 18:50:01 vcsoaknas01 kernel:   TDT                  <f7>
> Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_use          <f7>
> Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_clean        <24>
> Mar 11 18:50:01 vcsoaknas01 kernel: buffer_info[next_to_clean]
> Mar 11 18:50:01 vcsoaknas01 kernel:   time_stamp           <1004de0b1>
> Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_watch        <24>
> Mar 11 18:50:01 vcsoaknas01 kernel:   jiffies              <1004dec18>
> Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_watch.status <0>

this really indicates that the adapter is finishing all the work, but that
the descriptor write-back indicating the work was completed is not making
it back to main memory.  We have seen this a lot with AMD systems, in
particular ones with VIA chipsets.  There is a bad bug in those machines
when an IO device and the processor both write to the same cache line.
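
The reading of the dump above can be checked with a little arithmetic. The sketch below parses the hex fields from the quoted log and assumes a 256-entry Tx ring and a 1000 Hz tick, neither of which is stated in the thread; `parse_dump` is a name made up for the example.

```python
import re

DUMP = """\
TDH                  <f7>
TDT                  <f7>
next_to_use          <f7>
next_to_clean        <24>
time_stamp           <1004de0b1>
jiffies              <1004dec18>
"""

RING_SIZE = 256  # assumption: Tx ring size, not stated in the thread
HZ = 1000        # assumption: kernel tick rate, not stated in the thread


def parse_dump(text):
    """Return {field: int} for every 'name <hex>' pair in the dump."""
    pairs = re.findall(r"(\w+)\s+<([0-9a-f]+)>", text)
    return {name: int(value, 16) for name, value in pairs}


fields = parse_dump(DUMP)

# Head == tail: the hardware has consumed everything it was given.
hw_idle = fields["TDH"] == fields["TDT"]
# Descriptors the driver has queued but not yet seen marked as done.
unreclaimed = (fields["next_to_use"] - fields["next_to_clean"]) % RING_SIZE
# How long the oldest unreclaimed descriptor has been waiting, in ticks.
stalled_ticks = fields["jiffies"] - fields["time_stamp"]

print(hw_idle)        # True
print(unreclaimed)    # 211
print(stalled_ticks)  # 2919 (about 2.9 seconds if HZ is 1000)
```

So the hardware queue is empty (TDH equals TDT) while 211 descriptors still look unfinished to the driver, and next_to_watch.status <0> shows the done status never arrived in memory, which matches the explanation above.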

also, if the above workaround doesn't help we'll want you to install the
dump patch from the patches section of e1000.sourceforge.net and send us
the output when you get a tx hang.

hope this helps,
 Jesse


