Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-25 Thread Michael Wang
On 10/25/2011 12:26 AM, Flavio Leitner wrote:
 On Mon, 24 Oct 2011 16:26:28 +0800
 Michael Wang <wang...@linux.vnet.ibm.com> wrote:

 On 10/21/2011 10:03 PM, Flavio Leitner wrote:
 On Fri, 21 Oct 2011 14:15:12 +0800
 Michael Wang <wang...@linux.vnet.ibm.com> wrote:

 On 10/19/2011 08:16 PM, Flavio Leitner wrote:
 On Wed, 19 Oct 2011 12:49:48 +0800
 wangyun <wang...@linux.vnet.ibm.com> wrote:

 Hi, Flavio

 I am new to the community and currently work on the e1000e driver.
 I found something strange in this issue; please see below.

 Thanks,
 Michael Wang

 On 10/18/2011 10:42 PM, Flavio Leitner wrote:
 On Mon, 17 Oct 2011 11:48:22 -0700
 Jesse Brandeburg <jesse.brandeb...@intel.com> wrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitner <f...@redhat.com> wrote:

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.

 Yeah, makes sense. I haven't heard of any new events since I removed
 the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
 same hardware and I haven't reproduced the issue in-house yet with another
 82571EB. See below about interface statistics from sar.
 Currently, if FLAG2_DMA_BURST is set, the device will pre-fetch tx
 descriptors only when:

 1. the number of descriptors cached by the device is lower than 32.
 2. the host has prepared at least one descriptor.

 I don't think this will cause the issue, but the flag also sets the
 device to write back processed descriptors only when their number
 reaches 5 (or 4).

 So maybe the device fetched and processed a descriptor, but because the
 count had not reached 5 it did not write it back, even though the packet
 had actually been transmitted.

 That could explain the issue and the fact that sometimes the hang
 info printed shows empty ring (write-back happened in the middle).

 But this will happen only when transmission suddenly stops for one
 second or more, and I don't know whether that matches the real traffic
 situation or not.

 At least for one customer the interface had almost no traffic.
 I will go over all the data again checking if this happens every time.


 Maybe I am wrong about this, but I think it may be the only reason
 for this issue.

 I am seeing this based on the debugging output:

 This is the full output with debugging patch applied:
 Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
 Oct 11 02:03:52 kernel:   TDH                  <25>
 Oct 11 02:03:52 kernel:   TDT                  <26>
 Oct 11 02:03:52 kernel:   next_to_use          <26>
 Oct 11 02:03:52 kernel:   next_to_clean        <25>
 Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
 Oct 11 02:03:52 kernel:   time_stamp           <100b2aa22>
 Oct 11 02:03:52 kernel:   next_to_watch        <25>
 Oct 11 02:03:52 kernel:   jiffies              <100b2ab25>
 Oct 11 02:03:52 kernel:   next_to_watch.status <0>
 Oct 11 02:03:52 kernel:   stored_i = <25>
 Oct 11 02:03:52 kernel:   stored_first = <25>
 Oct 11 02:03:52 kernel:   stamp = <100b2aa22>
 Oct 11 02:03:52 kernel:   factor = <fa>
 Oct 11 02:03:52 kernel:   last_clean = <100b2aa1a>
 Oct 11 02:03:52 kernel:   last_tx = <100b2aa22>
 Oct 11 02:03:52 kernel:   count = <0/100>
 Notice above that buffer_info time_stamp is the same as in
 last_tx (last time the xmit function was called), also that
 last_clean (last time the clean function was called) is before
 that.  Therefore, the system sent just one descriptor in about
 1 second confirming your idea.
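
The arithmetic behind that reading can be checked with a small C sketch. It assumes that every value in the report is a hexadecimal jiffies count and that `factor` is the hang-detection timeout expressed in jiffies; the report itself does not say so. The macro and function names are made up for illustration and are not part of the driver or of the debugging patch.

```c
#include <stdint.h>

/* Values copied from the hang report above, read as hexadecimal jiffies. */
#define REPORT_JIFFIES    0x100b2ab25ULL
#define REPORT_TIME_STAMP 0x100b2aa22ULL
#define REPORT_LAST_CLEAN 0x100b2aa1aULL
#define REPORT_LAST_TX    0x100b2aa22ULL
/* Assumption: "factor" is the hang timeout in jiffies. */
#define REPORT_FACTOR     0xfaULL

/* Elapsed jiffies between two samples; assumes no counter wrap in between. */
static uint64_t jiffies_delta(uint64_t later, uint64_t earlier)
{
	return later - earlier;
}

/* Jiffies the descriptor had been pending when the hang was reported. */
static uint64_t pending_age(void)
{
	return jiffies_delta(REPORT_JIFFIES, REPORT_TIME_STAMP);
}

/* Jiffies between the last clean call and the last xmit call. */
static uint64_t clean_to_xmit_gap(void)
{
	return jiffies_delta(REPORT_LAST_TX, REPORT_LAST_CLEAN);
}

/* Nonzero when the pending age exceeds the assumed timeout. */
static int exceeds_timeout(void)
{
	return pending_age() > REPORT_FACTOR;
}
```

Under these assumptions the descriptor had been pending for 0x103 jiffies against a timeout of 0xfa, and the last clean ran 8 jiffies before the last xmit. Converting jiffies to seconds needs the kernel's HZ value, which the report does not show.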


 So have you tried Red Hat 6? Does the problem still exist there?

 Actually, I received a few other reports that look like the same
 issue but with 6.2.  As far as I can tell, hardware that was working
 just fine started to show it after the kernel upgrade (coincidentally
 5.7 and 6.2 introduce FLAG2_DMA_BURST).  However, I haven't heard
 anything back since I provided the instrumented kernel to confirm
 it for you.  I will follow up as soon as I hear something.

 Assuming that your idea is true, the hang detection is broken because
 it's possible to have a descriptor apparently stuck that is just missing
 the write-back. So, is it possible to set a timer to write-back? If yes,
 it could expire and run before the hang detection period expires. Or
 perhaps force the write-back to happen before hang detection execution.


According to the code ew32(TIDV, adapter->tx_int_delay);, I think
such a timer is already set, but I don't know whether
tx_int_delay is at its default value, which is 8 (in units of 1.024 μs).

TIDV means that when the timer expires, the write-back is flushed
unconditionally.

The default value is far less than 1 second, so it cannot have caused
this issue.
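
As a rough check of the scale involved, here is a minimal C sketch that converts a TIDV tick count into nanoseconds, taking the 1.024 μs unit and the default of 8 from the text above. The function names are hypothetical, and the sketch says nothing about how the hardware actually schedules the flush.

```c
#include <stdint.h>

/* One TIDV tick is 1.024 microseconds, i.e. 1024 nanoseconds (from the text above). */
#define TIDV_TICK_NS 1024ULL

/* Delay in nanoseconds for a given TIDV tick count. */
static uint64_t tidv_delay_ns(uint32_t ticks)
{
	return (uint64_t)ticks * TIDV_TICK_NS;
}

/* Number of whole delay periods that fit into one second; ticks must be nonzero. */
static uint64_t tidv_periods_per_second(uint32_t ticks)
{
	return 1000000000ULL / tidv_delay_ns(ticks);
}
```

With the default of 8 ticks the delay is 8192 ns, so more than a hundred thousand such periods fit into the one-second window discussed in this thread.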

 Customer has a test system reproducing this with 5.7, we can test
 patches there if you like. Just let me know.

 thank you!
 fbl

Maybe you can just search for the macro
E1000_TXDCTL_DMA_BURST_ENABLE
in drivers/net/e1000e/e1000.h and change it to:

#define E1000_TXDCTL_DMA_BURST_ENABLE \
(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
E1000_TXDCTL_COUNT_DESC | \
(0 << 16) | /* wthresh must be +1 more than desired */\
(1 << 8) | 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-25 Thread Jesse Brandeburg
On Mon, 24 Oct 2011 23:29:34 -0700
Michael Wang <wang...@linux.vnet.ibm.com> wrote:
 Maybe you can just search for the macro
 E1000_TXDCTL_DMA_BURST_ENABLE
 in drivers/net/e1000e/e1000.h and change it to:
 
 #define E1000_TXDCTL_DMA_BURST_ENABLE \
 (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
 E1000_TXDCTL_COUNT_DESC | \
 (0 << 16) | /* wthresh must be +1 more than desired */\
 (1 << 8) | /* hthresh */ \
 0x1f) /* pthresh */
 
 This will do the write-back even when only one descriptor has been
 processed. If that solves the problem, we can think about a proper solution.

I can already tell you that this will fix the problem, but wthresh=1 is
more like the hardware default after reset I think.  Doing this will
prevent the bursting behavior that got us the performance improvement
this patch was made for, which is bad.
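
To make the threshold fields in the quoted macro concrete, here is a small C sketch that packs and unpacks a TXDCTL value. The shift positions (16 for wthresh, 8 for hthresh, 0 for pthresh) come from the macro itself; the two flag values and the 6-bit field width are assumptions that should be checked against the driver headers, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed flag values; verify against the e1000e headers before relying on them. */
#define E1000_TXDCTL_GRAN       0x01000000u /* descriptor granularity */
#define E1000_TXDCTL_COUNT_DESC 0x00400000u /* count descriptors still to process */

/* Assumed width of each threshold field (6 bits). */
#define TXDCTL_FIELD_MASK 0x3fu

/* Hypothetical helper: build a burst-enable value from the three thresholds. */
static uint32_t txdctl_burst_value(uint32_t wthresh, uint32_t hthresh,
				   uint32_t pthresh)
{
	return E1000_TXDCTL_GRAN | E1000_TXDCTL_COUNT_DESC |
	       ((wthresh & TXDCTL_FIELD_MASK) << 16) |
	       ((hthresh & TXDCTL_FIELD_MASK) << 8) |
	       (pthresh & TXDCTL_FIELD_MASK);
}

/* Extract the write-back threshold (shift 16). */
static uint32_t txdctl_wthresh(uint32_t value)
{
	return (value >> 16) & TXDCTL_FIELD_MASK;
}

/* Extract the host threshold (shift 8). */
static uint32_t txdctl_hthresh(uint32_t value)
{
	return (value >> 8) & TXDCTL_FIELD_MASK;
}

/* Extract the prefetch threshold (shift 0). */
static uint32_t txdctl_pthresh(uint32_t value)
{
	return value & TXDCTL_FIELD_MASK;
}
```

With these assumed flag values, the proposed macro (wthresh 0, hthresh 1, pthresh 0x1f) evaluates to 0x0140011f, and the wthresh = 5 setting discussed in this thread evaluates to 0x0145011f. Only the wthresh field differs between the two.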

That is why we are looking at a solution that likely involves two
flush writes via the flush partial descriptors bits.  Just do the bit
31 set in TIDV and RDTR twice in a row and then make sure it is write
flushed.
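
A rough user-space sketch of that sequence is shown here. It is not driver code: the register file is a mock array, and every name in it (`mock_write_reg`, `flush_partial_descriptors`, and so on) is hypothetical. The only facts taken from the paragraph above are that bit 31 of TIDV and RDTR is the flush partial descriptors bit and that the writes are issued twice in a row.

```c
#include <stdint.h>

/* Bit 31 of TIDV and RDTR: flush partial descriptors (per the text above). */
#define FPD_BIT (1u << 31)

/* Mock register file standing in for the device; a real driver writes MMIO. */
enum mock_reg { MOCK_TIDV, MOCK_RDTR, MOCK_REG_COUNT };

static uint32_t mock_regs[MOCK_REG_COUNT];
static unsigned int mock_fpd_writes; /* number of writes that carried FPD_BIT */

/* Store a value in the mock register and count flush requests. */
static void mock_write_reg(enum mock_reg reg, uint32_t val)
{
	mock_regs[reg] = val;
	if (val & FPD_BIT)
		mock_fpd_writes++;
}

/* Hypothetical helper: request a partial-descriptor flush twice in a row,
 * keeping the configured interrupt delays in the low bits. */
static void flush_partial_descriptors(uint32_t tx_int_delay,
				      uint32_t rx_int_delay)
{
	int pass;

	for (pass = 0; pass < 2; pass++) {
		mock_write_reg(MOCK_TIDV, tx_int_delay | FPD_BIT);
		mock_write_reg(MOCK_RDTR, rx_int_delay | FPD_BIT);
	}
	/* A real driver would still need to make sure these posted writes
	 * reach the device, which the mock does not model. */
}
```

Calling `flush_partial_descriptors(8, 0)` leaves both mock registers with bit 31 set and records four flush writes. Whether two passes are enough on real hardware is exactly what would need testing.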

If you wish to implement that and give it a try that would be useful
information.  We haven't had time yet to get a full repro going.


--
The demand for IT networking professionals continues to grow, and the
demand for specialized networking skills is growing even more rapidly.
Take a complimentary Learning@Cisco Self-Assessment and learn 
about Cisco certifications, training, and career opportunities. 
http://p.sf.net/sfu/cisco-dev2dev
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-25 Thread Michael Wang
On 10/25/2011 11:57 PM, Jesse Brandeburg wrote:
 On Mon, 24 Oct 2011 23:29:34 -0700
 Michael Wang <wang...@linux.vnet.ibm.com> wrote:
 Maybe you can just search for the macro
 E1000_TXDCTL_DMA_BURST_ENABLE
 in drivers/net/e1000e/e1000.h and change it to:

 #define E1000_TXDCTL_DMA_BURST_ENABLE \
 (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
 E1000_TXDCTL_COUNT_DESC | \
 (0 << 16) | /* wthresh must be +1 more than desired */\
 (1 << 8) | /* hthresh */ \
 0x1f) /* pthresh */

 This will do the write-back even when only one descriptor has been
 processed. If that solves the problem, we can think about a proper solution.
 I can already tell you that this will fix the problem, but wthresh=1 is
 more like the hardware default after reset I think.  Doing this will
 prevent the bursting behavior that got us the performance improvement
 this patch was made for, which is bad.

Hi, Jesse

I am confused about the code ew32(TIDV, adapter->tx_int_delay);
I think this causes a forced write-back flush every 8*1.024 μs by
default.

If that works, I don't know why wthresh = 5 would cause this issue, because
even when there are not enough descriptors (more than 4), the write-back
will still be done every 8*1.024 μs.

 That is why we are looking at a solution that likely involves two
 flush writes via the flush partial descriptors bits.  Just do the bit
 31 set in TIDV and RDTR twice in a row and then make sure it is write
 flushed.

 If you wish to implement that and give it a try that would be useful
 information.  We haven't had time yet to get a full repro going.

Despite my confusion, I will still try to do this work, but I really
don't know whether this issue is caused by wthresh or not.

Thanks & Best regards
Michael Wang


--
The demand for IT networking professionals continues to grow, and the
demand for specialized networking skills is growing even more rapidly.
Take a complimentary Learning@Cisco Self-Assessment and learn 
about Cisco certifications, training, and career opportunities. 
http://p.sf.net/sfu/cisco-dev2dev
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel#174; Ethernet, visit 
http://communities.intel.com/community/wired