On Wed, 19 Oct 2011 12:49:48 +0800
wangyun <wang...@linux.vnet.ibm.com> wrote:

> Hi, Flavio
> 
> I am new to the community and currently working on the e1000e driver,
> and I found something strange in this issue; please see below.
> 
> Thanks,
> Michael Wang
> 
> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
> > On Mon, 17 Oct 2011 11:48:22 -0700
> > Jesse Brandeburg<jesse.brandeb...@intel.com>  wrote:
> >
> >> On Fri, 14 Oct 2011 10:04:26 -0700
> >> Flavio Leitner<f...@redhat.com>  wrote:
> >>
> >>> Hi,
> >>>
> >>> I got a few reports so far that 82571EB models are having the
> >>> "Detected Hardware Unit Hang" issue after upgrading the kernel.
> >>>
> >>> Further debugging with an instrumented kernel revealed that the
> >>> socket buffer time stamp matches with the last time e1000_xmit_frame()
> >>> was called. Also that the time stamp of e1000_clean_tx_irq() last run
> >>> is prior to the one in socket buffer.
> >>>
> >>> However, ~1 second later, an interrupt is fired and the old entry
> >>> is found. Sometimes, the scheduled print_hang_task dumps the
> >>> information _after_ the old entry is sent (shows empty ring),
> >>> indicating that the HW TX unit isn't really stuck and apparently
> >>> just missed the signal to initiate the transmission.
> >>>
> >>> Order of events:
> >>>   (1) skb is pushed down
> >>>   (2) e1000_xmit_frame() is called
> >>>   (3) ring is filled with one entry
> >>>   (4) TDT is updated
> >>>   (5) nothing happens for a little more than 1 second
> >>>   (6) interrupt is fired
> >>>   (7) e1000_clean_tx_irq() is called
> >>>   (8) finds the entry not ready with an old time stamp,
> >>>       schedules print_hang_task and stops the TX queue.
> >>>   (9) print_hang_task runs, dumps the info, but the old entry has now been sent
> >>> (10) apparently the TX queue is back.
> >> Flavio, thanks for the detailed info, please be sure to supply us the
> >> bugzilla number.
> >>
> > It was buried at the end of the first email:
> > https://bugzilla.redhat.com/show_bug.cgi?id=746272
> >
> >> TDH is probably not moving due to the writeback threshold settings in
> >> TXDCTL.  netperf UDP_RR test is likely a good way to test this.
> >>
> > Yeah, makes sense. I haven't heard about any new events since removing
> > the FLAG2_DMA_BURST flag.  Unfortunately, I don't have access to the exact
> > same hardware and I haven't reproduced the issue in-house yet with another
> > 82571EB. See below about interface statistics from sar.
> >
> >
> >> I don't think the sequence is quite what you said.  We are going to
> >> work with the hardware team to get a sequence that works right, and we
> >> should have a fix for you soon.
> > Yeah, the sequence might not be exact, but it gives us a good idea of
> > what could be happening.
> >
> > There are two events, one right after the other:
> >
> > Oct  9 05:45:23  kernel:   TDH<48>
> > Oct  9 05:45:23  kernel:   TDT<49>
> > Oct  9 05:45:23  kernel:   next_to_use<49>
> > Oct  9 05:45:23  kernel:   next_to_clean<48>
> > Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
> > Oct  9 05:45:23  kernel:   time_stamp<102338ca6>
> > Oct  9 05:45:23  kernel:   next_to_watch<48>
> > Oct  9 05:45:23  kernel:   jiffies<102338dc1>
> > Oct  9 05:45:23  kernel:   next_to_watch.status<0>
> > Oct  9 05:45:23  kernel: MAC Status<80383>
> > Oct  9 05:45:23  kernel: PHY Status<792d>
> > Oct  9 05:45:23  kernel: PHY 1000BASE-T Status<3800>
> > Oct  9 05:45:23  kernel: PHY Extended Status<3000>
> > Oct  9 05:45:23  kernel: PCI Status<10>
> > Oct  9 05:51:54  kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
> > Oct  9 05:51:54  kernel:   TDH<55>
> > Oct  9 05:51:54  kernel:   TDT<56>
> > Oct  9 05:51:54  kernel:   next_to_use<56>
> > Oct  9 05:51:54  kernel:   next_to_clean<55>
> > Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
> > Oct  9 05:51:54  kernel:   time_stamp<102350986>
> > Oct  9 05:51:54  kernel:   next_to_watch<55>
> > Oct  9 05:51:54  kernel:   jiffies<102350b07>
> > Oct  9 05:51:54  kernel:   next_to_watch.status<0>
> > Oct  9 05:51:54  kernel: MAC Status<80383>
> > Oct  9 05:51:54  kernel: PHY Status<792d>
> > Oct  9 05:51:54  kernel: PHY 1000BASE-T Status<3800>
> > Oct  9 05:51:54  kernel: PHY Extended Status<3000>
> > Oct  9 05:51:54  kernel: PCI Status<10>
> >
> I see the check for a hang is:
> 
> time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
>            (adapter->tx_timeout_factor * HZ))
> 
> which means a hang is reported when the current jiffies minus the buffer's
> time stamp exceeds (adapter->tx_timeout_factor * HZ).
> 
> And I see tx_timeout_factor will be at least 1, so on x86 the
> (jiffies - time_stamp) difference should be over 1000, but here it looks
> like only around 300.
> 
> Could you please check the HZ value on your platform?
> 

Sure: adapter->tx_timeout_factor * HZ = 0xfa (250 decimal).
That data came from a customer running kernel-xen, so HZ is 250.
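
To make the numbers concrete, here is a quick user-space sketch (not driver
code) of the check you quoted, plugged with the values from the first of the
two dumps below. With HZ=250 the threshold is 250 jiffies, and the observed
delta is 0x103 = 259 jiffies (~1036 ms), so the check trips a little over a
second after the descriptor was queued:

/* Quick user-space sketch (not driver code) of the hang check, using the
 * values from the dump below: HZ=250 (kernel-xen), tx_timeout_factor=1. */
#include <stdio.h>

#define HZ 250ULL
/* same idea as the kernel's time_after(): true if a is after b */
#define time_after(a, b) ((long long)((b) - (a)) < 0)

int main(void)
{
        unsigned long long tx_timeout_factor = 1;        /* minimum value */
        unsigned long long time_stamp = 0x100b2aa22ULL;  /* buffer_info[i].time_stamp */
        unsigned long long jiffies = 0x100b2ab25ULL;     /* at e1000_clean_tx_irq() time */
        unsigned long long delta = jiffies - time_stamp; /* 0x103 = 259 */

        if (time_after(jiffies, time_stamp + tx_timeout_factor * HZ))
                printf("hang reported: delta=%llu jiffies (~%llu ms), threshold=%llu\n",
                       delta, delta * 1000 / HZ, tx_timeout_factor * HZ);
        return 0;
}

That lines up with the "~1 second later" gap in the original report.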

Here is the debugging patch used:
http://people.redhat.com/~fleitner/linux-kernel-test.patch

The idea was to capture all the relevant values at the time
of the problem. (print_hang_task is scheduled and sometimes shows
time_stamp=0 and TDH=TDT because the packet has already been sent
by the time it runs.)
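
The extra fields in the first dump below (stored_i, stored_first, stamp,
factor, last_clean, last_tx, count, next_to_use cached) are snapshots taken
at xmit/clean time and dumped next to the standard hang message; roughly
along these lines (a simplified sketch of the idea only, the names are
paraphrased from the output, not copied from the patch at the URL above):

/*
 * Simplified sketch of the debug state: snapshot a few values in
 * e1000_xmit_frame() and e1000_clean_tx_irq(), then dump them when
 * the hang check trips.  Field names paraphrased from the output.
 */
#include <linux/kernel.h>

struct tx_hang_debug {
        unsigned int  stored_i;      /* ring index the hang check looked at */
        unsigned int  stored_first;  /* first descriptor index of that skb */
        unsigned long stamp;         /* buffer_info[i].time_stamp snapshot */
        unsigned long factor;        /* adapter->tx_timeout_factor * HZ */
        unsigned long last_clean;    /* jiffies at the last e1000_clean_tx_irq() run */
        unsigned long last_tx;       /* jiffies at the last e1000_xmit_frame() call */
        unsigned int  count;         /* descriptors cleaned in this pass */
        unsigned int  count_limit;   /* clean budget, the second number in count=<x>/<y> */
        unsigned int  next_to_use;   /* next_to_use cached at xmit time */
};

static void dump_tx_hang_debug(const struct tx_hang_debug *d)
{
        printk(KERN_INFO "  stored_i = <%x>\n", d->stored_i);
        printk(KERN_INFO "  stored_first = <%x>\n", d->stored_first);
        printk(KERN_INFO "  stamp = <%lx>\n", d->stamp);
        printk(KERN_INFO "  factor = <%lx>\n", d->factor);
        printk(KERN_INFO "  last_clean = <%lx>\n", d->last_clean);
        printk(KERN_INFO "  last_tx = <%lx>\n", d->last_tx);
        printk(KERN_INFO "  count = <%x>/<%x>\n", d->count, d->count_limit);
        printk(KERN_INFO "  next_to_use cached = <%x>\n", d->next_to_use);
}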

This is the full output with debugging patch applied:
Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel:   TDH                  <25>
Oct 11 02:03:52 kernel:   TDT                  <26>
Oct 11 02:03:52 kernel:   next_to_use          <26>
Oct 11 02:03:52 kernel:   next_to_clean        <25>
Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
Oct 11 02:03:52 kernel:   time_stamp           <100b2aa22>
Oct 11 02:03:52 kernel:   next_to_watch        <25>
Oct 11 02:03:52 kernel:   jiffies              <100b2ab25>
Oct 11 02:03:52 kernel:   next_to_watch.status <0>
Oct 11 02:03:52 kernel:   stored_i = <25>
Oct 11 02:03:52 kernel:   stored_first = <25>
Oct 11 02:03:52 kernel:   stamp = <100b2aa22>
Oct 11 02:03:52 kernel:   factor = <fa>
Oct 11 02:03:52 kernel:   last_clean = <100b2aa1a>
Oct 11 02:03:52 kernel:   last_tx = <100b2aa22>
Oct 11 02:03:52 kernel:   count = <0>/<100>
Oct 11 02:03:52 kernel:   next_to_use cached = <26>
Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel:   TDH                  <25>
Oct 11 02:03:52 kernel:   TDT                  <26>
Oct 11 02:03:52 kernel:   next_to_use          <26>
Oct 11 02:03:52 kernel:   next_to_clean        <25>
Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
Oct 11 02:03:52 kernel:   time_stamp           <100b2aa22>
Oct 11 02:03:52 kernel:   next_to_watch        <25>
Oct 11 02:03:52 kernel:   jiffies              <100b2ab26>
Oct 11 02:03:52 kernel:   next_to_watch.status <0>
Oct 11 02:03:52 kernel: MAC Status             <80383>
Oct 11 02:03:52 kernel: PHY Status             <792d>
Oct 11 02:03:52 kernel: PHY 1000BASE-T Status  <3800>
Oct 11 02:03:52 kernel: PHY Extended Status    <3000>
Oct 11 02:03:52 kernel: PCI Status             <10>

I hope that helps.  Let me know if you need anything else.
thanks,
fbl
