Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Michael Wang Mon, 24 Oct 2011 23:55:18 -0700

On 10/25/2011 12:26 AM, Flavio Leitner wrote:
> On Mon, 24 Oct 2011 16:26:28 +0800
> Michael Wang<wang...@linux.vnet.ibm.com>  wrote:
>
>> On 10/21/2011 10:03 PM, Flavio Leitner wrote:
>>> On Fri, 21 Oct 2011 14:15:12 +0800
>>> Michael Wang<wang...@linux.vnet.ibm.com>   wrote:
>>>
>>>> On 10/19/2011 08:16 PM, Flavio Leitner wrote:
>>>>> On Wed, 19 Oct 2011 12:49:48 +0800
>>>>> wangyun<wang...@linux.vnet.ibm.com>    wrote:
>>>>>
>>>>>> Hi, Flavio
>>>>>>
>>>>>> I am new to the community and currently work on the e1000e driver.
>>>>>> I found something strange in this issue, please check below.
>>>>>>
>>>>>> Thanks,
>>>>>> Michael Wang
>>>>>>
>>>>>> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
>>>>>>> On Mon, 17 Oct 2011 11:48:22 -0700
>>>>>>> Jesse Brandeburg<jesse.brandeb...@intel.com>     wrote:
>>>>>>>
>>>>>>>> On Fri, 14 Oct 2011 10:04:26 -0700
>>>>>>>> Flavio Leitner<f...@redhat.com>     wrote:
>>>>>>>>
>>>>>>>> TDH is probably not moving due to the writeback threshold settings in
>>>>>>>> TXDCTL.  netperf UDP_RR test is likely a good way to test this.
>>>>>>>>
>>>>>>> Yeah, makes sense. I haven't heard about new events after removing
>>>>>>> the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the
>>>>>>> exact same hardware and I haven't reproduced the issue in-house yet
>>>>>>> with another 82571EB. See below about interface statistics from sar.
>> Currently, if FLAG2_DMA_BURST is set, the device will pre-fetch
>> tx descriptors only when:
>>
>> 1. the number of descriptors cached by the device is lower than 32.
>> 2. the host has prepared at least one descriptor.
>>
>> I don't think this will cause the issue, but the flag also sets
>> the device to write back processed descriptors only when their
>> number reaches 5 (or 4).
>>
>> So maybe the device fetched and processed a descriptor, but because
>> the count had not reached 5 it did not write it back, even though the
>> packet had actually already been transmitted.
>>
> That could explain the issue and the fact that sometimes the hang
> info printed shows empty ring (write-back happened in the middle).
>
>> But this will happen only when transmission suddenly stops for one
>> second or more; I don't know whether that is the real traffic
>> situation or not.
>>
> At least for one customer the interface had almost no traffic.
> I will go over all the data again checking if this happens every time.
>
>
>> Maybe I am wrong about this, but I also think it may be the only
>> thing that can cause this issue.
>>
> I am seeing this based on the debugging output:
>
>>>>> This is the full output with debugging patch applied:
>>>>> Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
>>>>> Oct 11 02:03:52 kernel:   TDH<25>
>>>>> Oct 11 02:03:52 kernel:   TDT<26>
>>>>> Oct 11 02:03:52 kernel:   next_to_use<26>
>>>>> Oct 11 02:03:52 kernel:   next_to_clean<25>
>>>>> Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
>>>>> Oct 11 02:03:52 kernel:   time_stamp<100b2aa22>
>>>>> Oct 11 02:03:52 kernel:   next_to_watch<25>
>>>>> Oct 11 02:03:52 kernel:   jiffies<100b2ab25>
>>>>> Oct 11 02:03:52 kernel:   next_to_watch.status<0>
>>>>> Oct 11 02:03:52 kernel:   stored_i =<25>
>>>>> Oct 11 02:03:52 kernel:   stored_first =<25>
>>>>> Oct 11 02:03:52 kernel:   stamp =<100b2aa22>
>>>>> Oct 11 02:03:52 kernel:   factor =<fa>
>>>>> Oct 11 02:03:52 kernel:   last_clean =<100b2aa1a>
>>>>> Oct 11 02:03:52 kernel:   last_tx =<100b2aa22>
>>>>> Oct 11 02:03:52 kernel:   count =<0>/<100>
> Notice above that buffer_info time_stamp is the same as in
> last_tx (last time the xmit function was called), also that
> last_clean (last time the clean function was called) is before
> that.  Therefore, the system sent just one descriptor in about
> 1 second confirming your idea.
>
>
>> So have you tried Red Hat 6? Does this problem still exist
>> there?
>>
> Actually, I received a few other reports that look like the same
> issue but with 6.2.  As far as I can tell, hardware that was working
> just fine started to show it after the kernel upgrade (coincidentally
> 5.7 and 6.2 introduce FLAG2_DMA_BURST).  However, I haven't heard
> anything back since I provided the instrumented kernel, so I cannot
> confirm it to you yet.  I will follow up as soon as I hear something.
>
> Assuming that your idea is true, the hang detection is broken because
> it's possible to have a descriptor apparently stuck that is just missing
> the write-back. So, is it possible to set a timer to write-back? If yes,
> it could expire and run before the hang detection period expires. Or
> perhaps force the write-back to happen before hang detection execution.
>


According to the code "ew32(TIDV, adapter->tx_int_delay);", I think
such a timer is already set, but I don't know whether tx_int_delay
is still at its default value, which is 8 (in units of 1.024 μs).

TIDV means that when the timer expires, the device is forced to
flush the write-back.

The default value is far less than 1 second, so the timer by itself
cannot cause this issue.
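
To put rough numbers on this, here is a small stand-alone sketch. It is
not driver code: tidv_to_ns and jiffies_delta are helper names made up
for this mail. The jiffies values are taken from the debug output quoted
above, and I am assuming that the "factor" value 0xfa (250) printed
there is the hang threshold in jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* TIDV counts in units of 1.024 us, i.e. 1024 ns per unit. */
#define TIDV_UNIT_NS 1024ULL

/* Hypothetical helper (not a driver function): TIDV value -> nanoseconds. */
static uint64_t tidv_to_ns(uint32_t tidv)
{
    return (uint64_t)tidv * TIDV_UNIT_NS;
}

/* Hypothetical helper: jiffies elapsed between two values from the log. */
static uint64_t jiffies_delta(uint64_t now, uint64_t then)
{
    assert(now >= then); /* wrap-around is ignored; fine for this log */
    return now - then;
}
```

tidv_to_ns(8) gives 8192 ns (about 8.2 μs), while
jiffies_delta(0x100b2ab25, 0x100b2aa22) gives 259, which is past the
assumed threshold of 250. So the descriptor sat there far longer than
the TIDV delay could explain if the timer were running with its default.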

> Customer has a test system reproducing this with 5.7, we can test
> patches there if you like. Just let me know.
>
> thank you!
> fbl
>
Maybe you can just search for the macro
"E1000_TXDCTL_DMA_BURST_ENABLE"
in "drivers/net/e1000e/e1000.h" and change it to:

#define E1000_TXDCTL_DMA_BURST_ENABLE \
    (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
     E1000_TXDCTL_COUNT_DESC | \
     (0 << 16) | /* wthresh: 0 so each descriptor is written back */ \
     (1 << 8) | /* hthresh */ \
     0x1f) /* pthresh */

With this change the device should do the write-back even when only
one descriptor has been processed. If that solves the problem, we can
think about a proper solution.
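
For anyone who wants to double check which bits the edit touches, here
is a small stand-alone sketch. It is not driver code: the SKETCH_* names
and the helpers are made up, and the bit values for GRAN and COUNT_DESC
as well as the 6-bit field width are my assumptions and should be
checked against the e1000e headers. Only the shifts (16 for wthresh,
8 for hthresh, low bits for pthresh) come from the macro above. I also
assume the unmodified macro uses a wthresh of 5.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED values, verify against the e1000e headers before relying on them. */
#define SKETCH_TXDCTL_GRAN        0x01000000u /* descriptor granularity */
#define SKETCH_TXDCTL_COUNT_DESC  0x00400000u
#define SKETCH_WTHRESH_SHIFT      16          /* from the macro in this mail */
#define SKETCH_HTHRESH_SHIFT      8           /* from the macro in this mail */
#define SKETCH_FIELD_MASK         0x3fu       /* assumed 6-bit threshold fields */

/* Build the burst-enable value for a given write-back threshold. */
static uint32_t sketch_txdctl_burst(uint32_t wthresh)
{
    assert(wthresh <= SKETCH_FIELD_MASK);
    return SKETCH_TXDCTL_GRAN |
           SKETCH_TXDCTL_COUNT_DESC |
           (wthresh << SKETCH_WTHRESH_SHIFT) |
           (1u << SKETCH_HTHRESH_SHIFT) |  /* hthresh */
           0x1fu;                          /* pthresh */
}

/* Extract the individual threshold fields again. */
static uint32_t sketch_wthresh(uint32_t v)
{
    return (v >> SKETCH_WTHRESH_SHIFT) & SKETCH_FIELD_MASK;
}

static uint32_t sketch_hthresh(uint32_t v)
{
    return (v >> SKETCH_HTHRESH_SHIFT) & SKETCH_FIELD_MASK;
}

static uint32_t sketch_pthresh(uint32_t v)
{
    return v & SKETCH_FIELD_MASK;
}
```

sketch_txdctl_burst(5) and sketch_txdctl_burst(0) differ only in the
wthresh field; hthresh stays 1 and pthresh stays 0x1f, so the pre-fetch
behaviour described earlier should be unaffected by the test change.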

Thanks,
Michael Wang

