Re: [E1000-devel] [Intel-wired-lan] i40e card Tx resets

2016-03-18 Thread zhuyj
On 03/18/2016 03:28 AM, Jesse Brandeburg wrote:
> On Thu, 17 Mar 2016 14:56:14 -0400
> Sowmini Varadhan  wrote:
>
>> On (03/17/16 10:20), zhuyj wrote:
>>> 1. modprobe NET_PKTGEN
>>>
>>> 2. download the tar file and uncompress to any directory.
>>> This tar file is from kernel. It is in samples/pktgen/
>>>
>>> 3. cd pktgen
>>>
>>> 4. pktgen_sample02_multiqueue.sh -i ethx -s size -t cpu_number
>> Indeed, I see the same thing as you, and it was very easy to
>> reproduce. It was very interesting that the problem can happen with
>> as few as 3 threads, at which point I see the TX hang at exactly
>> -s 12305
> Okay, sorry I hadn't jumped into this thread yet.
>
> I can uniquivically tell you that what Sowmini saw with the MDD with
> stack based RDS-STRESS testing is *NOT* the same as what you're seeing
> while using pktgen with invalid huge skb->data buffers.
>
> We can ask on netdev if the driver should defend against this kind of
> input to hard_start_xmit (transmit routine), but the driver doesn't
> check the maximum length of the skb to see if it is invalid, because
> the stack can never build (only pktgen can) these invalid SKBs.
>
> The issue is that pktgen builds skb->data with a contiguous buffer of
> whatever size transmit requested, (regardless of MTU) and then sends it
> straight to the transmit routine, no segmentation flags, no MSS set.
>
> This causes the driver to build a transmit descriptor with an invalid
> length, which the hardware then "ASSERTS" on by issuing an MDD
> interrupt and freezing the bad acting queue.
>
>> I see:
>> i40e :82:00.0: TX driver issue detected, PF reset issued
>> i40e :82:00.0 eth2: VSI_seid 390, Hung TX queue 0, tx_pending: 492, 
>> NTC:0x140, HWB: 0x140, NTU: 0x12c, TAIL: 0x12c
>>
>> I think the common factor in both our test cases is that we have some
>> kernel thread that can efficiently send packets without any context
>> switches.
> You've found a red herring (mistakenly connected two separate events)
> so I think you can stop going down this path (pktgen).
>
>> Has anyone here seen this before? I'll see if I can find some cycles
>> to figure this out, if not, maybe its worth bringing up on netdev,
>> to see if others have seen this, and to draw some patterns.
> we don't need to bring it up on netdev.  We have a way to troubleshoot
> MDDs that I can send to you, if you want to do the work.  Otherwise we
> need to have some time to reproduce here.
>
>>> If size is set to a big number, the similar defect will occur.
>>> Adjust this size to a appropriate number, my defect will not occur.
>>>
>>> In the test, I found some types igb nic, such as i210, will work
>>> well no matter the size is a big number.
>>> some nic, such as 82580, it will not work well if the size is too big.
> This is mostly a combination of driver implementation and how the
> hardware handles a descriptor that is too large.  The driver *could*
> check to make sure the skb->data is never too large, but in that same
> vein, we *could* fix pktgen to never send a frame greater than MTU down
> to the driver.
Do you mean this is not a bug in nic?
And it is unnecessary to fix it?

But if a test tool makes tests like pktgen, how to handle it?

We just suggests not to make such tests?

Best Regards!
Zhu Yanjun
>
>>> As such, I think my problem results from the hardware and the big
>>> size triggers this problem.
>>>
>>> I hope this can help us all.
> Unfortunately Zhu's problem with pktgen is not a reproducer of
> Sowmini's problem.
>
> In the case of pktgen, it is a "don't do that, because it hurts" kind of
> bug. In the case of rds-stress, we need to reproduce it here and figure
> out what hardware constraint the driver is violating during set up of
> the transmit.
>
>


--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785231=/4140
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] [Intel-wired-lan] i40e card Tx resets

2016-03-18 Thread Sowmini Varadhan
On (03/17/16 12:28), Jesse Brandeburg wrote:
> We can ask on netdev if the driver should defend against this kind of
> input to hard_start_xmit (transmit routine), but the driver doesn't
> check the maximum length of the skb to see if it is invalid, because
> the stack can never build (only pktgen can) these invalid SKBs.
> 
> The issue is that pktgen builds skb->data with a contiguous buffer of
> whatever size transmit requested, (regardless of MTU) and then sends it
> straight to the transmit routine, no segmentation flags, no MSS set.

I see. And after you mentioned it, I checked with ixgbe, sure 
enough, that also results in a tx-hang for the pktgen test case
(whereas there were no issues with the (rds-stress , ixgbe) test.

I would surmise that pktgen is a bit of an outlier, more interesting
to focus on those cases that use the regular stack.

I dont know if dpdk can create the same issues as pktgen?

> we don't need to bring it up on netdev.  We have a way to troubleshoot
> MDDs that I can send to you, if you want to do the work.  Otherwise we
> need to have some time to reproduce here.

yes, I can do the work, since I already have this nicely set up.
Just need some hings on how to trouble-shoot the mdd.

--Sowmini


--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785231=/4140
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] [Intel-wired-lan] i40e card Tx resets

2016-03-18 Thread Jesse Brandeburg
On Thu, 17 Mar 2016 14:56:14 -0400
Sowmini Varadhan  wrote:

> On (03/17/16 10:20), zhuyj wrote:
> > 1. modprobe NET_PKTGEN
> > 
> > 2. download the tar file and uncompress to any directory.
> > This tar file is from kernel. It is in samples/pktgen/
> > 
> > 3. cd pktgen
> > 
> > 4. pktgen_sample02_multiqueue.sh -i ethx -s size -t cpu_number
> 
> Indeed, I see the same thing as you, and it was very easy to 
> reproduce. It was very interesting that the problem can happen with
> as few as 3 threads, at which point I see the TX hang at exactly
> -s 12305 

Okay, sorry I hadn't jumped into this thread yet.

I can uniquivically tell you that what Sowmini saw with the MDD with
stack based RDS-STRESS testing is *NOT* the same as what you're seeing
while using pktgen with invalid huge skb->data buffers.

We can ask on netdev if the driver should defend against this kind of
input to hard_start_xmit (transmit routine), but the driver doesn't
check the maximum length of the skb to see if it is invalid, because
the stack can never build (only pktgen can) these invalid SKBs.

The issue is that pktgen builds skb->data with a contiguous buffer of
whatever size transmit requested, (regardless of MTU) and then sends it
straight to the transmit routine, no segmentation flags, no MSS set.

This causes the driver to build a transmit descriptor with an invalid
length, which the hardware then "ASSERTS" on by issuing an MDD
interrupt and freezing the bad acting queue.

> I see:
> i40e :82:00.0: TX driver issue detected, PF reset issued
> i40e :82:00.0 eth2: VSI_seid 390, Hung TX queue 0, tx_pending: 492, 
> NTC:0x140, HWB: 0x140, NTU: 0x12c, TAIL: 0x12c
> 
> I think the common factor in both our test cases is that we have some
> kernel thread that can efficiently send packets without any context
> switches. 

You've found a red herring (mistakenly connected two separate events)
so I think you can stop going down this path (pktgen).

> Has anyone here seen this before? I'll see if I can find some cycles
> to figure this out, if not, maybe its worth bringing up on netdev,
> to see if others have seen this, and to draw some patterns.

we don't need to bring it up on netdev.  We have a way to troubleshoot
MDDs that I can send to you, if you want to do the work.  Otherwise we
need to have some time to reproduce here.

> > If size is set to a big number, the similar defect will occur.
> > Adjust this size to a appropriate number, my defect will not occur.
> > 
> > In the test, I found some types igb nic, such as i210, will work
> > well no matter the size is a big number.
> > some nic, such as 82580, it will not work well if the size is too big.

This is mostly a combination of driver implementation and how the
hardware handles a descriptor that is too large.  The driver *could*
check to make sure the skb->data is never too large, but in that same
vein, we *could* fix pktgen to never send a frame greater than MTU down
to the driver.

> > 
> > As such, I think my problem results from the hardware and the big
> > size triggers this problem.
> > 
> > I hope this can help us all.

Unfortunately Zhu's problem with pktgen is not a reproducer of
Sowmini's problem.

In the case of pktgen, it is a "don't do that, because it hurts" kind of
bug. In the case of rds-stress, we need to reproduce it here and figure
out what hardware constraint the driver is violating during set up of
the transmit.


--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785231=/4140
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel Ethernet, visit 
http://communities.intel.com/community/wired