Re: [E1000-devel] Detected Tx Unit Hang Issue

2011-01-22 Thread Stephen Palmateer
I just found the Network Adapter Driver for PCI-E Gigabit Network Connections 
under Linux*, version 1.2.20.
Intel's Readme suggests that this will fix the driver-generated interrupts.

Since our e1000e driver is only version 1.0.2 I'm going to winscp the tarball 
provided by Intel to the machine and follow Intel's instructions for 
installation.

Intel's website, 
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=YDwnldID=15817
suggests that this version of the driver is valid for the Intel® 82571EB 
Gigabit Ethernet Controllers we're working with.

I'll update this email thread when I'm finished.

thanks again,
Stephen Palmateer

- Original Message -
From: Stephen Palmateer stephen.palmat...@netsweeper.com
To: E1000-devel@lists.sourceforge.net
Cc: ali a...@yemen.net.ye, Assem Alwadee assem1...@gmail.com, Jeremy 
Erb jeremy@netsweeper.com, Tamer Abu-Elsaad 
tamer.abu-els...@netsweeper.com
Sent: Saturday, January 22, 2011 4:44:01 PM
Subject: Detected Tx Unit Hang Issue

Hello All,

I would like to report a problem with the e1000e driver on a CentOS 5.4 machine 
with a custom kernel.

Experiencing interface timeouts/failure on a regular basis, rendering the 
management interface useless.

Seeing the following error repeatedly in dmesg and stdout:

:04:00.0: eth0: Detected Tx Unit Hang:
  TDH                  143
  TDT                  12e
  next_to_use          12e
  next_to_clean        142
buffer_info[next_to_clean]:
  time_stamp           100de9410
  next_to_watch        144
  jiffies              100de952f
  next_to_watch.status 0
:04:00.0: eth0: Detected Tx Unit Hang:
  TDH                  143
  TDT                  12e
  next_to_use          12e
  next_to_clean        142
buffer_info[next_to_clean]:
  time_stamp           100de9410
  next_to_watch        144
  jiffies              100de95f7
  next_to_watch.status 0
:04:00.0: eth0: Detected Tx Unit Hang:
  TDH                  143
  TDT                  12e
  next_to_use          12e
  next_to_clean        142
buffer_info[next_to_clean]:
  time_stamp           100de9410
  next_to_watch        144
  jiffies              100de96bf
  next_to_watch.status 0
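
The three reports above share the same time_stamp while jiffies keeps growing, which suggests the same descriptor stays stuck between reports. A minimal Python sketch of that arithmetic follows. It assumes every value in the dump is hexadecimal; the helper names parse_hang_dump and stalled_jiffies are hypothetical and not part of the driver. Converting jiffies to seconds would additionally need the kernel's HZ value, which this thread does not state.

```python
import re

# Sample copied from the first report above; all values are assumed to be hex.
SAMPLE = """\
:04:00.0: eth0: Detected Tx Unit Hang:
  TDH                  143
  TDT                  12e
  next_to_use          12e
  next_to_clean        142
buffer_info[next_to_clean]:
  time_stamp           100de9410
  next_to_watch        144
  jiffies              100de952f
  next_to_watch.status 0
"""

# A field line is "name value"; optional <> around the value is tolerated.
FIELD = re.compile(r"^\s*([A-Za-z_.]+)\s+<?([0-9a-fA-F]+)>?\s*$")


def parse_hang_dump(text):
    """Return {field name: int}; lines that are not 'name value' are skipped."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def stalled_jiffies(fields):
    """Jiffies elapsed since the stuck buffer was time-stamped."""
    return fields["jiffies"] - fields["time_stamp"]


fields = parse_hang_dump(SAMPLE)
print(stalled_jiffies(fields))  # 287

# jiffies values from the second and third reports, same time_stamp.
later_jiffies = [0x100de95f7, 0x100de96bf]
print([j - fields["time_stamp"] for j in later_jiffies])  # [487, 687]
```

The age grows by 200 jiffies from one report to the next while the ring indices stay unchanged.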

The weird part is eth2 has far more traffic on it and is not seeing any issue.

I'll try to provide as much info as I can below.

[admin@filter1 ~]$ uname -a
Linux filter1.yemen.net.ye 2.6.18-164.15.1.el5.netsw #1 SMP Mon Apr 26 15:01:04 
EDT 2010 i686 i686 i386 GNU/Linux

[root@filter1 ~]# ethtool -i eth0
driver: e1000e
version: 1.0.2-k2
firmware-version: 5.10-2
bus-info: :05:00.0

[root@filter1 ~]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off

[root@filter1 ~]# lspci -vv | grep net
04:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Sun Microsystems Computer Corp. x4 PCI-Express Quad Gigabit 
Ethernet UTP Low Profile Adapter
04:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Sun Microsystems Computer Corp. x4 PCI-Express Quad Gigabit 
Ethernet UTP Low Profile Adapter
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Sun Microsystems Computer Corp. x4 PCI-Express Quad Gigabit 
Ethernet UTP Low Profile Adapter
05:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Sun Microsystems Computer Corp. x4 PCI-Express Quad Gigabit 
Ethernet UTP Low Profile Adapter
07:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
Connection (rev 02)
07:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
Connection (rev 02)

[root@filter1 ~]# modinfo e1000e
filename:       /lib/modules/2.6.18-164.15.1.el5.netsw/kernel/drivers/net/e1000e/e1000e.ko
version:        1.0.2-k2
license:        GPL
description:    Intel(R) PRO/1000 Network Driver
author:         Intel Corporation, linux.n...@intel.com
srcversion:     D6678FCB5D0D64FDE5CC3DF
alias:          pci:v8086d10F0sv*sd*bc*sc*i*
alias:          pci:v8086d10EFsv*sd*bc*sc*i*
alias:          pci:v8086d10EBsv*sd*bc*sc*i*
alias:          pci:v8086d10EAsv*sd*bc*sc*i*
alias:          pci:v8086d10DFsv*sd*bc*sc*i*
alias:          pci:v8086d10DEsv*sd*bc*sc*i*
alias:          pci:v8086d10CEsv*sd*bc*sc*i*
alias:          pci:v8086d10CDsv*sd*bc*sc*i*
alias:          pci:v8086d10CCsv*sd*bc*sc*i*
alias:          pci:v8086d10CBsv*sd*bc*sc*i*
alias:          pci:v8086d10F5sv*sd*bc*sc*i*
alias:          pci:v8086d10BFsv*sd*bc*sc*i*
alias:          pci:v8086d10E5sv*sd*bc*sc*i*
alias:  

Re: [E1000-devel] Detected Tx Unit Hang e1000e versions: 0.5.18.3-NAPI and 0.3.3.3-k6 with kernel 2.6.28.9

2009-04-20 Thread Brandeburg, Jesse
On Sun, 19 Apr 2009, Andrey Luzgin wrote:
 We have recurring problems on several servers with different versions of
 the e1000e driver with kernel 2.6.28.9 (we need this version because of
 tproxy). All servers are Intel® Server Systems SR1560SF with
 one additional NIC, an 82572EI Gigabit Ethernet Controller. ioatdma is enabled.
 
 This is last log from server with e1000e version: 0.5.18.3-NAPI
 
 Apr 19 21:03:47 R2PX1 [188890.816082] :06:00.0: eth1: Detected Tx Unit Hang:
 Apr 19 21:03:47 R2PX1 [188890.816083]   TDH                  a39
 Apr 19 21:03:47 R2PX1 [188890.816084]   TDT                  a25
 Apr 19 21:03:47 R2PX1 [188890.816085]   next_to_use          a25
 Apr 19 21:03:47 R2PX1 [188890.816086]   next_to_clean        a38
 Apr 19 21:03:47 R2PX1 [188890.816086] buffer_info[next_to_clean]:
 Apr 19 21:03:47 R2PX1 [188890.816087]   time_stamp           102cf691d
 Apr 19 21:03:47 R2PX1 [188890.816088]   next_to_watch        a3b
 Apr 19 21:03:47 R2PX1 [188890.816088]   jiffies              102cf6ab8
 Apr 19 21:03:47 R2PX1 [188890.816089]   next_to_watch.status 0
 Apr 19 21:03:49 R2PX1 [188892.816132] :06:00.0: eth1: Detected Tx Unit Hang:

so is it the 82572EI that is having problems? or the ESB2 ports (LOM)?

what kind of traffic are you running?  And why do you have the 
TxDescriptor count set so high?  I'm wondering if you're running with the 
(ill advised) setting that someone once posted to a debian mailing list 
long ago.

Please include dmesg from boot through the network coming up.  Also please 
attach the ethtool -e ethX eeprom dump from any ports that are having tx 
hangs.  Also, please post the BIOS and BMC firmware versions.

If you have modified the RxAbsIntDelay or RxIntDelay parameters at load, 
then you've likely run into a hardware erratum that can be avoided by not 
modifying those parameters.

 Apr 19 16:47:37 R2PX3 [272540.768103] :06:00.1: eth2: Detected Tx
 Unit Hang:
 Apr 19 16:49:31 R2PX3 [272654.768142] :06:00.1: eth2: Detected Tx
 Unit Hang:

ugh, seems like your data pattern makes the hang repeat every two minutes.  
Well, that's good in that it is at least reproducible.
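
The "every two minutes" estimate can be checked against the two quoted log lines. The following is a small Python sketch under stated assumptions: hang_interval_seconds is a hypothetical helper, both timestamps are assumed to fall in the same year, and %b parsing assumes an English locale.

```python
from datetime import datetime


def hang_interval_seconds(first, second, fmt="%b %d %H:%M:%S"):
    """Seconds between two syslog-style timestamps (the year is not logged)."""
    t1 = datetime.strptime(first, fmt)
    t2 = datetime.strptime(second, fmt)
    return (t2 - t1).total_seconds()


# Timestamps of the two eth2 hang reports quoted above.
print(hang_interval_seconds("Apr 19 16:47:37", "Apr 19 16:49:31"))  # 114.0
```

The kernel timestamps in brackets (272540.768103 and 272654.768142) give the same gap of roughly 114 seconds.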

Can you try going back to the default driver settings and see if that 
makes any difference?


___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] Detected Tx Unit Hang

2009-03-12 Thread Brandeburg, Jesse
On Wed, 11 Mar 2009, Gary W. Smith wrote:
 I asked this last week but didn't get a response.  I have a supermicro

apologies for the slow response.

 server with a dual intel nic that uses the e1000 driver.  I'm using
 CentOS 5.2 and when I do anything network intensive I lose connectivity
 for a few seconds.  Then we get this in the log.  I downloaded, compiled
 and installed the latest e1000 driver.  I see that the driver is in the
 proper location (based on timestamp).

thank you for downloading the latest driver.  It is probably 8.0.9?

please load the driver with the module parameters TxDescriptorStep=4,4

you can modify /etc/modprobe.conf and add
options e1000 TxDescriptorStep=4,4
(if you only have two ports)

or just load the driver with
modprobe e1000 TxDescriptorStep=4,4
and then use ethtool to increase the number of tx descriptors.
ethtool -G eth0 tx 1024
this workaround only uses one in every four descriptors.
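
To see why the ring is enlarged together with the workaround, here is a rough Python sketch of the arithmetic. It only assumes what the message states, namely that a step of 4 leaves one descriptor in every four in use; usable_descriptors is a hypothetical helper, and the ring size of 256 is just an illustrative smaller value.

```python
def usable_descriptors(ring_size, step):
    """Descriptors actually used when only one in every `step` is filled."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return ring_size // step


print(usable_descriptors(1024, 4))  # 256
print(usable_descriptors(256, 4))   # 64
```

With the ring raised to 1024 entries, a step of 4 still leaves 256 working descriptors.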

 How can I fix this problem on this server.   I have tried to manually
 disable the tso and other entries but this doesn't seem to help.  I've
 also tried setting it down to 100/full to no avail.  It appears to be a
 TX, not RX issue.  I say this because I run dstat in the background, and
 when it hangs and then comes back it quickly dumps a full screen of
 dstat entries (which should appear one per second), so I'm assuming that
 TCP is buffering the packets.

please attach the full lspci -vvv for your system, make sure that you have 
the latest bios update, and that the system's bios settings are set to the 
defaults, and particularly any settings having to do with write 
combining or PCI transaction combining are disabled.


 Things I've tried.
 
 /sbin/ethtool -K eth0 tso off
 /sbin/ethtool -K eth0 rx off
 /sbin/ethtool -K eth0 tx off
 /sbin/ethtool -K eth0 sg off
 
 
 Mar 11 18:50:01 vcsoaknas01 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
 Mar 11 18:50:01 vcsoaknas01 kernel:   Tx Queue             0
 Mar 11 18:50:01 vcsoaknas01 kernel:   TDH                  f7
 Mar 11 18:50:01 vcsoaknas01 kernel:   TDT                  f7
 Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_use          f7
 Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_clean        24
 Mar 11 18:50:01 vcsoaknas01 kernel: buffer_info[next_to_clean]
 Mar 11 18:50:01 vcsoaknas01 kernel:   time_stamp           1004de0b1
 Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_watch        24
 Mar 11 18:50:01 vcsoaknas01 kernel:   jiffies              1004dec18
 Mar 11 18:50:01 vcsoaknas01 kernel:   next_to_watch.status 0

this really indicates that the adapter is finishing all the work but that 
the descriptor is not making it back to main memory indicating the work 
was completed.  We have seen this a lot with AMD systems, in particular 
ones with VIA chipsets.  There is a bad bug in those machines when an IO 
device and the processor both write to the same cache line.
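
The reading above can be expressed as a small check on the dump fields. This Python sketch is only an illustration: the function names are hypothetical, the ring size of 256 is an assumption (the dump does not state it), and the rule "TDH equals TDT while the driver still has uncleaned descriptors with status 0" is a simplification of the diagnosis described in this message, not driver code.

```python
def pending_descriptors(next_to_use, next_to_clean, ring_size):
    """Descriptors queued by the driver but not yet cleaned (ring wraps)."""
    return (next_to_use - next_to_clean) % ring_size


def looks_like_lost_writeback(tdh, tdt, next_to_use, next_to_clean, status):
    """Hardware head has caught up with the tail, yet the driver still
    sees uncleaned descriptors whose status was never written back."""
    return tdh == tdt and next_to_use != next_to_clean and status == 0


# Values from the vcsoaknas01 log above (hexadecimal in the dump).
tdh, tdt = 0xF7, 0xF7
next_to_use, next_to_clean = 0xF7, 0x24
status = 0

print(pending_descriptors(next_to_use, next_to_clean, ring_size=256))  # 211
print(looks_like_lost_writeback(tdh, tdt, next_to_use, next_to_clean, status))  # True
```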

also, if the above workaround doesn't help we'll want you to install the 
dump patch from the patches section of e1000.sourceforge.net and send us 
the output when you get a tx hang.

hope this helps, 
 Jesse



Re: [E1000-devel] Detected Tx Unit Hang

2009-03-12 Thread Gary W. Smith
Excuse my ignorance, but which patches? ;).  There's a lot of stuff on the 
download page.  I assume you are talking about the I/OAT driver  kernel patch 
but I want to make sure before doing it.

 



Re: [E1000-devel] Detected Tx Unit Hang

2009-03-12 Thread Gary Smith
Thanks. I'll get this in sometime this afternoon.

Hopefully we can have some information from the server tonight. 

Gary
Sent via BlackBerry by ATT

-Original Message-
From: Brandeburg, Jesse jesse.brandeb...@intel.com

Date: Thu, 12 Mar 2009 09:33:21 
To: Gary W. Smithg...@primeexalia.com
Cc: e1000-devel@lists.sourceforge.nete1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang


sorry, go to the home page http://sourceforge.net/projects/e1000
click Tracker
click patches
click tx hang debug code (all releases) - 1460945
download the e1000_806_dump.patch, it should apply with fuzz to your e1000 
driver directory with the command

download file.patch...
patch -d e1000-8.0.* -p1 < file.patch

here is the download link
https://sourceforge.net/tracker2/download.php?group_id=42302atid=447451file_id=298629aid=1460945





Re: [E1000-devel] Detected Tx Unit Hang

2009-03-12 Thread Brandeburg, Jesse
so, the 4GB patch will only cause a slight increase in cpu utilization.  There 
are no other side effects, and you *DO NOT* have to run the TxDescriptorStep 
workaround.

I think I might just push the change to not allow 64 bit addressing to these 32 
bit adapters, into e1000.

Glad to hear things are working better,
  Jesse


From: Gary W. Smith [mailto:g...@primeexalia.com]
Sent: Thursday, March 12, 2009 2:45 PM
To: Gary W. Smith; Brandeburg, Jesse
Cc: e1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang

That was a bad example...

I was copying to/from the same instance from a machine running under vmware.

I now have a physical machine copying the 50gb of files from the bad machine to 
another machine and everything is still going smooth, but much faster this time.

This is the dstat from the bad machine.  The limiter is the disk (which is 
about 40mb/sec).  We were hitting the error before with only 10mb.  So this 
is definitely a positive thing.

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5  88   0   1   5|  25M    0 | 443k   26M|   0     0 |5119  1560
  0   8  85   0   2   6|  32M    0 | 570k   33M|   0     0 |6311  1724
  0   2  97   0   0   2|6662k    0 | 135k 7626k|   0     0 |1992   355
  0   0 100   0   0   0|   0     0 | 194B  240B|   0     0 |1007    16
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1024    40
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1006    16
  0   0 100   0   0   0|   0     0 | 226B  240B|   0     0 |1024    40
  0   0 100   0   0   0|   0     0 | 318B  240B|   0     0 |1008    16
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1023    38
  0   0 100   0   0   0|   0     0 | 134B  240B|   0     0 |1006    18
  0   2  97   0   1   2|8960k    0 | 140k 8492k|   0     0 |2357   639
  0   8  85   0   2   7|  33M    0 | 599k   35M|   0     0 |5949  2036
  0  10  78   0   2  10|  43M    0 | 793k   46M|   0     0 |6993  2176
  0   9  79   0   2  10|  42M    0 | 751k   44M|   0     0 |6810  2213
  0   9  82   0   2   8|  37M    0 | 661k   39M|   0     0 |6998  1863
  0   7  86   0   1   5|  28M    0 | 521k   30M|   0     0 |5874  1933


From: Gary W. Smith [mailto:g...@primeexalia.com]
Sent: Thu 3/12/2009 2:32 PM
To: Brandeburg, Jesse
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Detected Tx Unit Hang


Jesse,

Looks better.  Transferring 50GB to/from the server and I'm not getting the 
errors in the log now.   Very large pings (ping vcsoaknas01 -t -l 3 -w 
7000) are occasionally timing out BUT I haven't lost connectivity to the SSH 
session as of yet and the file transfer is still going.  dstat is also running 
consistently (no random TX hangs like before).

dstat:

  0   3  92   0   1   4|4224k   13M|7338k 4642k|   0     0 |  10k   12k
  0   1  98   0   0   1| 936k 3872k|1951k 1040k|   0     0 |3649  3403
  0   1  96   0   0   2|1496k 8879k|4638k 1700k|   0     0 |7378  8853
  0   4  91   0   1   4|  13M 3678k|2382k   14M|   0     0 |9188  7267
  0   3  93   0   1   4|4352k   15M|7864k 4877k|   0     0 |  11k   13k
  0   2  95   0   1   3| 384k   14M|7389k  516k|   0     0 |9990   12k
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   2  95   0   0   2|2816k 8104k|4327k 3098k|   0     0 |7075  7510
  0   2  93   0   0   4|5696k 9120k|4918k 6176k|   0     0 |8478  8300
  0   2  95   0   0   3|3968k 6720k|3610k 4306k|   0     0 |6425  6107
  0   2  95   0   0   3|4736k 7616k|4081k 5135k|   0     0 |7242  6974
  0   2  95   0   1   3|4224k 6816k|3687k 4582k|   0     0 |6589  6344
  0   2  95   0   0   3|4096k 7016k|3748k 4445k|   0     0 |6546  6311
  0   1  96   0   0   2|3136k 5288k|2852k 3402k|   0     0 |5251  4936


We have 50GB on an iscsi share (or 500GB) that we are copying to/from over the 
wire for this test.  During the writing of this email we have already copied 
about 1.3gb without any problem as of yet.

So my next question is regarding the 4GB patch.  Does this have any negative 
impact that I need to be aware of?

Gary



From: Brandeburg, Jesse [mailto:jesse.brandeb...@intel.com]
Sent: Thu 3/12/2009 1:59 PM
To: Gary W. Smith
Cc: e1000-devel@lists.sourceforge.net
Subject: RE: [E1000-devel] Detected Tx Unit Hang


re-added the list for tracking...

I think I see the issue, you have more than 4GB ram, and it appears that your 
system doesn't handle dual address cycles correctly, or our adapter doesn't 
work quite right for some reason.

Force the OS to never allow addresses > 4GB to our hardware using this patch:
https://sourceforge.net/tracker2/download.php?group_id=42302atid=447449file_id=283326aid=2007017

it's the e1000_disable_dac.patch file
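
As an illustration of the condition being avoided, the Python sketch below tests whether a DMA buffer would reach above the 4 GB boundary and therefore need a 64-bit (dual address cycle) transfer. It is not the patch itself: needs_dual_address_cycle and the sample addresses are hypothetical, chosen only to show the boundary case.

```python
DMA_32BIT_LIMIT = 1 << 32  # first address that no longer fits in 32 bits


def needs_dual_address_cycle(bus_addr, length=1):
    """True if any byte of the buffer lies at or above the 4 GB boundary."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return bus_addr + length > DMA_32BIT_LIMIT


# Hypothetical buffer that ends exactly at the 4 GB boundary.
print(needs_dual_address_cycle(0xFFFFF000, 0x1000))   # False
# Same start, one byte longer, so it crosses the boundary.
print(needs_dual_address_cycle(0xFFFFF000, 0x1001))   # True
# Hypothetical buffer that starts above 4 GB.
print(needs_dual_address_cycle(0x100000000))          # True
```

On a machine with 4 GB of RAM or less, no buffer would satisfy this test, which fits the observation that the problem shows up on a system with more than 4 GB.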

Re: [E1000-devel] Detected Tx Unit Hang

2009-03-12 Thread Gary Smith
This probably means the same bug exists in Windows as well. Windows is where we 
hit the problem first, before we converted the machine to a NAS server. 

Gary
Sent via BlackBerry by ATT
