On Mon, 24 Oct 2011 16:26:28 +0800 Michael Wang <wang...@linux.vnet.ibm.com> wrote:
> On 10/21/2011 10:03 PM, Flavio Leitner wrote:
> > On Fri, 21 Oct 2011 14:15:12 +0800
> > Michael Wang<wang...@linux.vnet.ibm.com> wrote:
> >
> >> On 10/19/2011 08:16 PM, Flavio Leitner wrote:
> >>> On Wed, 19 Oct 2011 12:49:48 +0800
> >>> wangyun<wang...@linux.vnet.ibm.com> wrote:
> >>>
> >>>> Hi, Flavio
> >>>>
> >>>> I am new to the community and currently work on the e1000e driver.
> >>>> I found something strange in this issue, please check below.
> >>>>
> >>>> Thanks,
> >>>> Michael Wang
> >>>>
> >>>> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
> >>>>> On Mon, 17 Oct 2011 11:48:22 -0700
> >>>>> Jesse Brandeburg<jesse.brandeb...@intel.com> wrote:
> >>>>>
> >>>>>> On Fri, 14 Oct 2011 10:04:26 -0700
> >>>>>> Flavio Leitner<f...@redhat.com> wrote:
> >>>>>>
> >>>>>> TDH is probably not moving due to the writeback threshold settings in
> >>>>>> TXDCTL. netperf UDP_RR test is likely a good way to test this.
> >>>>>>
> >>>>> Yeah, makes sense. I haven't heard about new events after I removed
> >>>>> the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the
> >>>>> exact same hardware and I haven't reproduced the issue in-house yet
> >>>>> with another 82571EB. See below about interface statistics from sar.
>
> Currently, if FLAG2_DMA_BURST is set, the device will pre-fetch the
> tx descriptors only when:
>
> 1. The number of descriptors cached by the device is lower than 32.
> 2. The host has prepared at least one descriptor.
>
> I don't think this will cause that issue, but another thing it does is
> set the device to write back processed descriptors only when their
> number reaches 5 (or 4).
>
> So maybe the device gets a descriptor and processes it, but the
> count has not reached 5, so it doesn't write it back, although the
> frame has actually already been transmitted.
>

That could explain the issue, and also why the hang info printed
sometimes shows an empty ring (the write-back happened in the middle).
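To make the write-back explanation above concrete, here is a minimal
Python sketch of a device that sends each descriptor right away but
only reports completion once a threshold of descriptors has
accumulated. It is a toy model, not driver code: TxRingModel,
looks_hung and the threshold of 5 are hypothetical names and values
chosen for illustration, and treating the logged "factor" value 0xfa
as the timeout in jiffies is an assumption. The timestamps are the
ones from the debugging output quoted further down.

```python
from dataclasses import dataclass, field


@dataclass
class TxRingModel:
    """Toy model (not driver code) of batched TX descriptor write-back."""

    wthresh: int = 5  # assumed write-back threshold ("5 (or 4)" in the thread)
    pending: list = field(default_factory=list)  # sent, status not written back
    done: set = field(default_factory=set)  # descriptors the host sees as done

    def transmit(self, desc: int) -> None:
        """Send one descriptor; write back only when the batch is full."""
        self.pending.append(desc)
        if len(self.pending) >= self.wthresh:
            self.flush()

    def flush(self) -> None:
        """Write back every pending descriptor (e.g. forced by a timer)."""
        self.done.update(self.pending)
        self.pending.clear()

    def looks_hung(self, desc: int, queued_at: int, now: int, timeout: int) -> bool:
        """Host-side check: descriptor not marked done and older than timeout."""
        return desc not in self.done and (now - queued_at) > timeout


# Values copied from the debugging output in this thread.
TIME_STAMP = 0x100B2AA22
JIFFIES = 0x100B2AB25
FACTOR = 0xFA  # assumption: interpreted here as the timeout in jiffies

ring = TxRingModel()
ring.transmit(25)  # a single descriptor on an almost idle interface
false_alarm = ring.looks_hung(25, TIME_STAMP, JIFFIES, FACTOR)

ring.flush()  # forcing the write-back before the check clears the alarm
after_flush = ring.looks_hung(25, TIME_STAMP, JIFFIES, FACTOR)
```

In this model false_alarm is true even though the frame was sent, and
after_flush is false, which is the behaviour a forced write-back ahead
of the hang check would aim for.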
>
> But this will happen only when transmission suddenly stops for one
> second or more. I don't know whether this matches the real traffic
> situation or not.
>

At least for one customer the interface had almost no traffic. I will
go over all the data again to check whether this happens every time.

> I may be wrong about this, but I think this may be the only
> reason for this issue.
>

I am seeing this based on the debugging output:

> >>> This is the full output with debugging patch applied:
> >>> Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit Hang:
> >>> Oct 11 02:03:52 kernel: TDH<25>
> >>> Oct 11 02:03:52 kernel: TDT<26>
> >>> Oct 11 02:03:52 kernel: next_to_use<26>
> >>> Oct 11 02:03:52 kernel: next_to_clean<25>
> >>> Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
> >>> Oct 11 02:03:52 kernel: time_stamp<100b2aa22>
> >>> Oct 11 02:03:52 kernel: next_to_watch<25>
> >>> Oct 11 02:03:52 kernel: jiffies<100b2ab25>
> >>> Oct 11 02:03:52 kernel: next_to_watch.status<0>
> >>> Oct 11 02:03:52 kernel: stored_i =<25>
> >>> Oct 11 02:03:52 kernel: stored_first =<25>
> >>> Oct 11 02:03:52 kernel: stamp =<100b2aa22>
> >>> Oct 11 02:03:52 kernel: factor =<fa>
> >>> Oct 11 02:03:52 kernel: last_clean =<100b2aa1a>
> >>> Oct 11 02:03:52 kernel: last_tx =<100b2aa22>
> >>> Oct 11 02:03:52 kernel: count =<0>/<100>

Notice above that the buffer_info time_stamp is the same as last_tx
(the last time the xmit function was called), and that last_clean (the
last time the clean function was called) is before that. Therefore,
the system sent just one descriptor in about 1 second, confirming your
idea.

> So have you tried Red Hat 6? Does this problem still
> exist there?
>

Actually, I received a few other reports that look like the same issue
but with 6.2. As far as I can tell, hardware that was working just fine
started to show it after the kernel upgrade (coincidentally, 5.7 and
6.2 both introduce FLAG2_DMA_BURST).
However, I haven't heard anything back since I provided the
instrumented kernel to confirm this. I will follow up as soon as I hear
something.

Assuming that your idea is true, the hang detection is broken because
it's possible to have a descriptor that appears stuck but is just
missing the write-back. So, is it possible to set a timer to force the
write-back? If yes, it could expire and run before the hang detection
period expires. Or perhaps the write-back could be forced to happen
before the hang detection runs.

The customer has a test system reproducing this with 5.7, so we can
test patches there if you like. Just let me know.

thank you!
fbl

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired