Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Flavio Leitner Mon, 24 Oct 2011 09:28:40 -0700

On Mon, 24 Oct 2011 16:26:28 +0800
Michael Wang <[email protected]> wrote:


> On 10/21/2011 10:03 PM, Flavio Leitner wrote:
> > On Fri, 21 Oct 2011 14:15:12 +0800
> > Michael Wang<[email protected]>  wrote:
> >
> >> On 10/19/2011 08:16 PM, Flavio Leitner wrote:
> >>> On Wed, 19 Oct 2011 12:49:48 +0800
> >>> wangyun<[email protected]>   wrote:
> >>>
> >>>> Hi, Flavio
> >>>>
> >>>> I am new to join the community, work on e1000e driver currently,
> >>>> And I found a thing strange in this issue, please check below.
> >>>>
> >>>> Thanks,
> >>>> Michael Wang
> >>>>
> >>>> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
> >>>>> On Mon, 17 Oct 2011 11:48:22 -0700
> >>>>> Jesse Brandeburg<[email protected]>    wrote:
> >>>>>
> >>>>>> On Fri, 14 Oct 2011 10:04:26 -0700
> >>>>>> Flavio Leitner<[email protected]>    wrote:
> >>>>>>
> >>>>>> TDH is probably not moving due to the writeback threshold settings in
> >>>>>> TXDCTL.  netperf UDP_RR test is likely a good way to test this.
> >>>>>>
> >>>>> Yeah, makes sense. I haven't heard about new events since removing
> >>>>> the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the
> >>>>> exact same hardware and I haven't reproduced the issue in-house yet
> >>>>> with another 82571EB. See below about interface statistics from sar.
> 
> Currently, if FLAG2_DMA_BURST is set, the device will pre-fetch TX
> descriptors only when:
> 
> 1. The number of descriptors cached by the device is lower than 32.
> 2. The host has prepared at least one descriptor.
> 
> I don't think this will cause that issue, but the flag also sets the
> device to write back processed descriptors only when their number
> reaches 5 (or 4).
> 
> So maybe the device gets a descriptor and processes it, but because the
> count has not reached 5 it doesn't write it back, even though the frame
> has already been transmitted.
>

That could explain the issue and the fact that the hang info printed
sometimes shows an empty ring (the write-back happened in the middle).
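
To make the theory concrete, here is a minimal sketch of the batching
behaviour described above. It is a toy model, not e1000e code: the struct,
the function names, the ring size and the "write back everything once the
threshold is reached" rule are all assumptions made for illustration. The
threshold of 5 is the value quoted above.

```c
#include <stdbool.h>

#define RING_SIZE    256
#define WB_THRESHOLD 5    /* assumption: the "5 (or 4)" mentioned above */

/* Hypothetical model, not the e1000e data structures. */
struct tx_model {
    bool done[RING_SIZE];    /* completion status as the host sees it */
    unsigned int tail;       /* next slot the host will fill */
    unsigned int sent;       /* next slot the device will transmit */
    unsigned int reported;   /* first transmitted slot not yet written back */
};

/* Host side: queue one descriptor for transmission. */
static void model_xmit(struct tx_model *m)
{
    m->done[m->tail] = false;
    m->tail = (m->tail + 1) % RING_SIZE;
}

/* Number of descriptors already transmitted but not yet written back. */
static unsigned int model_unreported(const struct tx_model *m)
{
    return (m->sent + RING_SIZE - m->reported) % RING_SIZE;
}

/*
 * Device side: transmit everything queued, then write back the
 * completions only if enough of them have accumulated.
 */
static void model_device_run(struct tx_model *m)
{
    m->sent = m->tail;
    if (model_unreported(m) < WB_THRESHOLD)
        return;              /* frames are on the wire, host sees nothing */
    while (m->reported != m->sent) {
        m->done[m->reported] = true;
        m->reported = (m->reported + 1) % RING_SIZE;
    }
}
```

With a single model_xmit() followed by model_device_run(), the frame is
sent but done[0] stays false, which is what a stuck descriptor looks like
from the host side. Four more transmits push the batch over the threshold
and the completions for all five become visible at once.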

> 
> But this will happen only when transmission suddenly stops for one
> second or more; I don't know whether this is the real traffic situation
> or not.
> 

At least for one customer the interface had almost no traffic.
I will go over all the data again checking if this happens every time.


> And maybe I am wrong about this, but I also think this may be the only
> thing that could cause this issue.
> 

I am seeing this based on the debugging output:

> >>> This is the full output with debugging patch applied:
> >>> Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
> >>> Oct 11 02:03:52 kernel:   TDH<25>
> >>> Oct 11 02:03:52 kernel:   TDT<26>
> >>> Oct 11 02:03:52 kernel:   next_to_use<26>
> >>> Oct 11 02:03:52 kernel:   next_to_clean<25>
> >>> Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
> >>> Oct 11 02:03:52 kernel:   time_stamp<100b2aa22>
> >>> Oct 11 02:03:52 kernel:   next_to_watch<25>
> >>> Oct 11 02:03:52 kernel:   jiffies<100b2ab25>
> >>> Oct 11 02:03:52 kernel:   next_to_watch.status<0>
> >>> Oct 11 02:03:52 kernel:   stored_i =<25>
> >>> Oct 11 02:03:52 kernel:   stored_first =<25>
> >>> Oct 11 02:03:52 kernel:   stamp =<100b2aa22>
> >>> Oct 11 02:03:52 kernel:   factor =<fa>
> >>> Oct 11 02:03:52 kernel:   last_clean =<100b2aa1a>
> >>> Oct 11 02:03:52 kernel:   last_tx =<100b2aa22>
> >>> Oct 11 02:03:52 kernel:   count =<0>/<100>

Notice above that the buffer_info time_stamp is the same as last_tx
(the last time the xmit function was called), and that last_clean (the
last time the clean function was called) is earlier than both.
Therefore, the system sent just one descriptor in about 1 second,
which confirms your idea.
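
For reference, the arithmetic on the values printed above can be checked
with a small sketch. The helper names are made up for this example, and
treating the printed factor (0xfa) as the hang timeout in jiffies is an
assumption on my side; the log itself does not say what unit it is in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is later than b", in the spirit of time_after(). */
static bool stamp_after(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0;
}

/* How long a descriptor has been waiting, in jiffies. */
static uint64_t age_in_jiffies(uint64_t now, uint64_t stamp)
{
    return now - stamp;
}

/*
 * Hypothetical hang test: the descriptor is not marked done and its
 * time stamp is older than the timeout (assumed to be in jiffies).
 */
static bool looks_hung(uint64_t now, uint64_t stamp, uint64_t timeout,
                       bool done)
{
    return !done && stamp_after(now, stamp + timeout);
}
```

Plugging in the log values: age_in_jiffies(0x100b2ab25, 0x100b2aa22) is
0x103 (259) jiffies, just past the 0xfa (250) printed as factor, so
looks_hung() returns true with status 0. The gap between last_clean and
last_tx is only 0x100b2aa22 - 0x100b2aa1a = 8 jiffies.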


> So have you tried Red Hat 6? Does this problem still exist
> there?
> 

Actually, I received a few other reports that look like the same
issue but with 6.2.  As far as I can tell, hardware that was working
just fine started to show it after the kernel upgrade (coincidentally,
5.7 and 6.2 both introduce FLAG2_DMA_BURST).  However, I haven't heard
anything back since I provided the instrumented kernel, so I can't
confirm it to you yet.  I will follow up as soon as I hear something.

Assuming that your idea is true, the hang detection is broken because
it's possible to have a descriptor that looks stuck but is just missing
the write-back. So, is it possible to set a timer to force the
write-back? If yes, it could expire and run before the hang detection
period expires. Or perhaps force the write-back to happen before the
hang detection runs.
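
A rough sketch of the second option, only to show the ordering I have in
mind. Nothing here is driver code: the struct, the function names and the
flush_first switch are hypothetical, and how to actually force the
write-back on the hardware is the open question.

```c
#include <stdbool.h>

/* Hypothetical device state, not e1000e structures. */
struct tx_state {
    unsigned int sent;          /* descriptors the device has transmitted */
    unsigned int written_back;  /* completions visible to the host */
    unsigned int wb_threshold;  /* batch size for write-back */
};

/* Normal path: completions become visible only in batches. */
static void device_maybe_write_back(struct tx_state *s)
{
    if (s->sent - s->written_back >= s->wb_threshold)
        s->written_back = s->sent;
}

/* Assumed capability: make the device report everything it has sent. */
static void force_write_back(struct tx_state *s)
{
    s->written_back = s->sent;
}

/*
 * Hang check over 'queued' descriptors handed to the device. With
 * flush_first set, pending completions are forced out before the
 * decision, so only work the device never did counts as a hang.
 */
static bool hang_check(struct tx_state *s, unsigned int queued,
                       bool timed_out, bool flush_first)
{
    if (flush_first)
        force_write_back(s);
    return timed_out && s->written_back < queued;
}
```

With sent = 1, written_back = 0 and a threshold of 5, the check without
the flush reports a hang for a frame that already went out, while the
check with the flush does not. A device that really stopped (sent = 0) is
still reported either way.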

The customer has a test system reproducing this with 5.7, so we can
test patches there if you like. Just let me know.

thank you!
fbl
