On Wed, 19 Oct 2011 12:49:48 +0800 wangyun <wang...@linux.vnet.ibm.com> wrote:
> Hi, Flavio
>
> I am new to the community and currently work on the e1000e driver,
> and I found something strange in this issue; please check below.
>
> Thanks,
> Michael Wang
>
> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
> > On Mon, 17 Oct 2011 11:48:22 -0700
> > Jesse Brandeburg <jesse.brandeb...@intel.com> wrote:
> >
> >> On Fri, 14 Oct 2011 10:04:26 -0700
> >> Flavio Leitner <f...@redhat.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I got a few reports so far that 82571EB models are having the
> >>> "Detected Hardware Unit Hang" issue after upgrading the kernel.
> >>>
> >>> Further debugging with an instrumented kernel revealed that the
> >>> socket buffer time stamp matches the last time e1000_xmit_frame()
> >>> was called, and that the time stamp of the last e1000_clean_tx_irq()
> >>> run is prior to the one in the socket buffer.
> >>>
> >>> However, ~1 second later, an interrupt is fired and the old entry
> >>> is found. Sometimes the scheduled print_hang_task dumps the
> >>> information _after_ the old entry is sent (it shows an empty ring),
> >>> indicating that the HW TX unit isn't really stuck and apparently
> >>> just missed the signal to initiate the transmission.
> >>>
> >>> Order of events:
> >>> (1) skb is pushed down
> >>> (2) e1000_xmit_frame() is called
> >>> (3) ring is filled with one entry
> >>> (4) TDT is updated
> >>> (5) nothing happens for a little more than 1 second
> >>> (6) interrupt is fired
> >>> (7) e1000_clean_tx_irq() is called
> >>> (8) finds the entry not ready with an old time stamp,
> >>>     schedules print_hang_task and stops the TX queue
> >>> (9) print_hang_task runs, dumps the info, but the old entry is now sent
> >>> (10) apparently the TX queue is back
> >>
> >> Flavio, thanks for the detailed info; please be sure to supply us the
> >> bugzilla number.
> >>
> > It was buried at the end of the first email:
> > https://bugzilla.redhat.com/show_bug.cgi?id=746272
> >
> >> TDH is probably not moving due to the writeback threshold settings in
> >> TXDCTL. A netperf UDP_RR test is likely a good way to test this.
> >>
> > Yeah, makes sense. I haven't heard of new events since removing the
> > FLAG2_DMA_BURST flag. Unfortunately, I don't have access to the exact
> > same hardware and I haven't reproduced the issue in-house yet with
> > another 82571EB. See below about interface statistics from sar.
> >
> >> I don't think the sequence is quite what you said. We are going to
> >> work with the hardware team to get a sequence that works right, and we
> >> should have a fix for you soon.
> >
> > Yeah, the sequence might not be exact, but it gives us a good idea of
> > what could be happening.
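For anyone reading along without the driver source handy, steps (1)-(4)
above boil down to roughly the sketch below. This is simplified and
illustrative only (the field names mirror struct e1000_ring, but it is
not the literal e1000e code). The two points that matter for this bug:
the buffer is stamped with jiffies, which the hang check later compares
against, and the TDT tail write is the signal for the hardware to start
fetching descriptors:

  /* Illustrative sketch of steps (1)-(4); not the literal e1000e code. */
  static void example_tx_queue(struct e1000_ring *tx_ring,
                               unsigned int count)
  {
          unsigned int i = tx_ring->next_to_use;

          /* ... TX descriptors for the skb are filled in here ... */

          /* Stamp checked later by the hang test in e1000_clean_tx_irq() */
          tx_ring->buffer_info[i].time_stamp = jiffies;

          i = (i + count) % tx_ring->count;
          tx_ring->next_to_use = i;

          wmb();                    /* descriptors visible before tail bump */
          writel(i, tx_ring->tail); /* step (4): TDT update; per Jesse's
                                     * theory, with DMA burst enabled the
                                     * TXDCTL thresholds influence when the
                                     * HW actually acts on it */
  }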
> >
> > There are two events right after another:
> >
> > Oct 9 05:45:23 kernel: TDH <48>
> > Oct 9 05:45:23 kernel: TDT <49>
> > Oct 9 05:45:23 kernel: next_to_use <49>
> > Oct 9 05:45:23 kernel: next_to_clean <48>
> > Oct 9 05:45:23 kernel: buffer_info[next_to_clean]:
> > Oct 9 05:45:23 kernel: time_stamp <102338ca6>
> > Oct 9 05:45:23 kernel: next_to_watch <48>
> > Oct 9 05:45:23 kernel: jiffies <102338dc1>
> > Oct 9 05:45:23 kernel: next_to_watch.status <0>
> > Oct 9 05:45:23 kernel: MAC Status <80383>
> > Oct 9 05:45:23 kernel: PHY Status <792d>
> > Oct 9 05:45:23 kernel: PHY 1000BASE-T Status <3800>
> > Oct 9 05:45:23 kernel: PHY Extended Status <3000>
> > Oct 9 05:45:23 kernel: PCI Status <10>
> > Oct 9 05:51:54 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
> > Oct 9 05:51:54 kernel: TDH <55>
> > Oct 9 05:51:54 kernel: TDT <56>
> > Oct 9 05:51:54 kernel: next_to_use <56>
> > Oct 9 05:51:54 kernel: next_to_clean <55>
> > Oct 9 05:51:54 kernel: buffer_info[next_to_clean]:
> > Oct 9 05:51:54 kernel: time_stamp <102350986>
> > Oct 9 05:51:54 kernel: next_to_watch <55>
> > Oct 9 05:51:54 kernel: jiffies <102350b07>
> > Oct 9 05:51:54 kernel: next_to_watch.status <0>
> > Oct 9 05:51:54 kernel: MAC Status <80383>
> > Oct 9 05:51:54 kernel: PHY Status <792d>
> > Oct 9 05:51:54 kernel: PHY 1000BASE-T Status <3800>
> > Oct 9 05:51:54 kernel: PHY Extended Status <3000>
> > Oct 9 05:51:54 kernel: PCI Status <10>
>
> I see the judgement of the hang is:
>
>   time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
>              (adapter->tx_timeout_factor * HZ))
>
> which means a hang is reported when the current jiffies minus the
> buffer's time stamp exceeds (adapter->tx_timeout_factor * HZ).
>
> And I see that tx_timeout_factor will be at least 1, so on x86
> (jiffies - time_stamp) should be over 1000, but here it looks like
> only around 300.
>
> Could you please check the HZ number of your platform?

Sure: adapter->tx_timeout_factor * HZ = 0xfa (250 decimal). That data
came from a customer using kernel-xen, so HZ is 250.

Here is the debugging patch used:
http://people.redhat.com/~fleitner/linux-kernel-test.patch

The idea was to capture all the relevant values at the time of the
problem.
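To make the arithmetic concrete, here is the check applied to the first
event above as a minimal standalone program (a sketch: the constants are
taken from the log, and it assumes tx_timeout_factor = 1 with HZ = 250):

  #include <stdio.h>

  /* Worked example of the hang check using the 05:45:23 values above.
   * With HZ = 250 and tx_timeout_factor = 1, the threshold is 250
   * jiffies (one second), not the 1000 it would be with HZ = 1000. */
  int main(void)
  {
          unsigned long long time_stamp = 0x102338ca6; /* buffer_info[i].time_stamp */
          unsigned long long now        = 0x102338dc1; /* jiffies when checked      */
          unsigned long long threshold  = 1 * 250;     /* tx_timeout_factor * HZ    */

          /* 0x102338dc1 - 0x102338ca6 = 0x11b = 283 jiffies: the "around
           * 300" delta noted above, and more than 250, so a hang fires. */
          printf("delta=%llu jiffies, hang=%d\n", now - time_stamp,
                 (now - time_stamp) > threshold);
          return 0;
  }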
(The print_hang_task is scheduled, and sometimes it shows time_stamp=0
and TDH=TDT because the packet has already been sent.)

This is the full output with the debugging patch applied:

Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel: TDH <25>
Oct 11 02:03:52 kernel: TDT <26>
Oct 11 02:03:52 kernel: next_to_use <26>
Oct 11 02:03:52 kernel: next_to_clean <25>
Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
Oct 11 02:03:52 kernel: time_stamp <100b2aa22>
Oct 11 02:03:52 kernel: next_to_watch <25>
Oct 11 02:03:52 kernel: jiffies <100b2ab25>
Oct 11 02:03:52 kernel: next_to_watch.status <0>
Oct 11 02:03:52 kernel: stored_i = <25>
Oct 11 02:03:52 kernel: stored_first = <25>
Oct 11 02:03:52 kernel: stamp = <100b2aa22>
Oct 11 02:03:52 kernel: factor = <fa>
Oct 11 02:03:52 kernel: last_clean = <100b2aa1a>
Oct 11 02:03:52 kernel: last_tx = <100b2aa22>
Oct 11 02:03:52 kernel: count = <0>/<100>
Oct 11 02:03:52 kernel: next_to_use cached = <26>
Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel: TDH <25>
Oct 11 02:03:52 kernel: TDT <26>
Oct 11 02:03:52 kernel: next_to_use <26>
Oct 11 02:03:52 kernel: next_to_clean <25>
Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
Oct 11 02:03:52 kernel: time_stamp <100b2aa22>
Oct 11 02:03:52 kernel: next_to_watch <25>
Oct 11 02:03:52 kernel: jiffies <100b2ab26>
Oct 11 02:03:52 kernel: next_to_watch.status <0>
Oct 11 02:03:52 kernel: MAC Status <80383>
Oct 11 02:03:52 kernel: PHY Status <792d>
Oct 11 02:03:52 kernel: PHY 1000BASE-T Status <3800>
Oct 11 02:03:52 kernel: PHY Extended Status <3000>
Oct 11 02:03:52 kernel: PCI Status <10>

I hope that helps. Let me know if you need anything else.

thanks,
fbl
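P.S. Reading the dump above against the check quoted earlier: last_clean
<100b2aa1a> precedes last_tx <100b2aa22>, matching the ordering described
at the top of the thread (the last clean pass ran before the last
transmit), and jiffies <100b2ab25> minus time_stamp <100b2aa22> is
0x103 = 259 jiffies, just over the 250-jiffy (one second at HZ=250)
threshold. TDH <25> vs. TDT <26> shows a single outstanding descriptor
at the moment the report fires.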