On 10/25/2011 12:26 AM, Flavio Leitner wrote:
On Mon, 24 Oct 2011 16:26:28 +0800
Michael Wang <wang...@linux.vnet.ibm.com> wrote:
On 10/21/2011 10:03 PM, Flavio Leitner wrote:
On Fri, 21 Oct 2011 14:15:12 +0800
Michael Wang <wang...@linux.vnet.ibm.com> wrote:
On 10/19/2011 08:16 PM, Flavio Leitner wrote:
On Wed, 19 Oct 2011 12:49:48 +0800
wangyun <wang...@linux.vnet.ibm.com> wrote:
Hi Flavio,
I am new to the community and currently working on the e1000e driver.
I found something strange in this issue; please see below.
Thanks,
Michael Wang
On 10/18/2011 10:42 PM, Flavio Leitner wrote:
On Mon, 17 Oct 2011 11:48:22 -0700
Jesse Brandeburg <jesse.brandeb...@intel.com> wrote:
On Fri, 14 Oct 2011 10:04:26 -0700
Flavio Leitner <f...@redhat.com> wrote:
TDH is probably not moving due to the writeback threshold settings in
TXDCTL. netperf UDP_RR test is likely a good way to test this.
Yeah, makes sense. I haven't heard of any new events since the
FLAG2_DMA_BURST flag was removed. Unfortunately, I don't have access to
the exact same hardware, and I haven't reproduced the issue in-house yet
with another 82571EB. See below for interface statistics from sar.
Currently, if FLAG2_DMA_BURST is set, the device will pre-fetch tx
descriptors only when:
1. the number of descriptors cached by the device is lower than 32.
2. the host has prepared at least one descriptor.
I don't think this causes the issue, but the flag also sets the device
to write back processed descriptors only when their number reaches
5 (or 4).
So maybe the device fetches a descriptor and processes it, but because
the count has not reached 5 it does not write it back, even though the
packet has already been transmitted.
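To make the suspected behaviour concrete, here is a minimal sketch in C. It is a toy model, not driver or hardware code: the struct and function names are hypothetical, and it assumes the device simply counts completed descriptors and writes them all back once the count reaches the threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a TX ring with a write-back threshold. */
struct tx_model {
    unsigned int wthresh;      /* write back once this many are completed */
    unsigned int done_pending; /* transmitted, but not yet written back */
    unsigned int written_back; /* completions the host can actually see */
};

/* The device transmits one descriptor.
 * Returns true if this completion triggered a write-back burst. */
static bool tx_model_transmit(struct tx_model *m)
{
    assert(m->wthresh > 0);
    m->done_pending++;
    if (m->done_pending >= m->wthresh) {
        m->written_back += m->done_pending;
        m->done_pending = 0;
        return true;
    }
    return false;
}

/* Host view: a descriptor is on the wire but its status is still 0,
 * so a hang check that only looks at the status would misfire. */
static bool tx_model_looks_hung(const struct tx_model *m)
{
    return m->done_pending > 0;
}
```

With wthresh set to 5, a single transmit followed by a quiet period leaves done_pending at 1 and written_back at 0, which is the state the hang check would see.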
That could explain the issue and the fact that sometimes the hang
info printed shows empty ring (write-back happened in the middle).
But this will happen only when transmission suddenly stops for one
second or more. I don't know whether this matches the real traffic
situation or not.
At least for one customer the interface had almost no traffic.
I will go over all the data again checking if this happens every time.
Maybe I am wrong about this, but I think it may be the only
thing that can cause this issue.
I am seeing this based on the debugging output:
This is the full output with debugging patch applied:
Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel: TDH <25>
Oct 11 02:03:52 kernel: TDT <26>
Oct 11 02:03:52 kernel: next_to_use <26>
Oct 11 02:03:52 kernel: next_to_clean <25>
Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
Oct 11 02:03:52 kernel: time_stamp <100b2aa22>
Oct 11 02:03:52 kernel: next_to_watch <25>
Oct 11 02:03:52 kernel: jiffies <100b2ab25>
Oct 11 02:03:52 kernel: next_to_watch.status <0>
Oct 11 02:03:52 kernel: stored_i = 25
Oct 11 02:03:52 kernel: stored_first = 25
Oct 11 02:03:52 kernel: stamp = 100b2aa22
Oct 11 02:03:52 kernel: factor = fa
Oct 11 02:03:52 kernel: last_clean = 100b2aa1a
Oct 11 02:03:52 kernel: last_tx = 100b2aa22
Oct 11 02:03:52 kernel: count = 0/100
Notice above that buffer_info time_stamp is the same as in
last_tx (last time the xmit function was called), also that
last_clean (last time the clean function was called) is before
that. Therefore, the system sent just one descriptor in about
1 second confirming your idea.
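The arithmetic behind that reading of the log can be checked with a small C sketch. It assumes the debug field "factor" (0xfa) is the hang threshold expressed in jiffies, which the log itself does not state; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Jiffies elapsed between two samples. Unsigned subtraction stays
 * correct even if the counter wraps between the two readings. */
static uint64_t jiffies_elapsed(uint64_t now, uint64_t then)
{
    return now - then;
}

/* ASSUMPTION: threshold is given in jiffies (the log's "factor"). */
static int looks_hung(uint64_t now, uint64_t stamp, uint64_t threshold)
{
    return jiffies_elapsed(now, stamp) > threshold;
}
```

With the values from the log, jiffies minus time_stamp is 0x100b2ab25 - 0x100b2aa22 = 0x103 (259), which exceeds 0xfa (250), while last_tx minus last_clean is only 8 jiffies. So the descriptor was queued shortly after the last clean and nothing visible happened to it until the hang check fired.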
Have you tried Red Hat 6? Does the problem still exist
there?
Actually, I received a few other reports that look like the same
issue but with 6.2. As far as I can tell, hardware that was working
just fine started to show it after the kernel upgrade (coincidentally,
5.7 and 6.2 both introduce FLAG2_DMA_BURST). However, I haven't heard
anything back since I provided the instrumented kernel, so I cannot
confirm this to you yet. I will follow up as soon as I hear something.
Assuming that your idea is true, the hang detection is broken because
it's possible to have a descriptor apparently stuck that is just missing
the write-back. So, is it possible to set a timer to write-back? If yes,
it could expire and run before the hang detection period expires. Or
perhaps force the write-back to happen before hang detection execution.
According to the code ew32(TIDV, adapter->tx_int_delay);, I think
such a timer has already been set, but I don't know whether
tx_int_delay is at its default value, which is 8 (in units of 1.024 μs).
TIDV means that when the timer expires, the write-back is forcibly
flushed.
The default value is far less than 1 second, so it cannot have caused
this issue.
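As a quick check of the scale involved, the following C sketch converts a TIDV tick count to nanoseconds, using the 1.024 μs unit quoted above. The macro and function names are hypothetical, and the sketch does not read any register.

```c
#include <assert.h>
#include <stdint.h>

/* One TIDV tick is 1.024 us, i.e. 1024 ns, per the unit quoted above. */
#define TIDV_UNIT_NS 1024u

/* Convert a TIDV tick count to nanoseconds (hypothetical helper). */
static uint64_t tidv_to_ns(uint32_t ticks)
{
    return (uint64_t)ticks * TIDV_UNIT_NS;
}
```

For the default of 8 ticks this gives 8192 ns, about 8.2 μs, which is roughly five orders of magnitude below the one-second hang window.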
Customer has a test system reproducing this with 5.7, we can test
patches there if you like. Just let me know.
thank you!
fbl
Maybe you can just search for the macro
E1000_TXDCTL_DMA_BURST_ENABLE
in drivers/net/e1000e/e1000.h and change it to:
#define E1000_TXDCTL_DMA_BURST_ENABLE \
	(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
	E1000_TXDCTL_COUNT_DESC | \
	(0 << 16) | /* wthresh must be +1 more than desired */ \
	(1 << 8) |