Re: [E1000-devel] [Intel-wired-lan] i40e card Tx resets

2016-03-18 Thread zhuyj
On 03/18/2016 03:28 AM, Jesse Brandeburg wrote:
> On Thu, 17 Mar 2016 14:56:14 -0400
> Sowmini Varadhan  wrote:
>
>> On (03/17/16 10:20), zhuyj wrote:
>>> 1. modprobe pktgen (the module built by CONFIG_NET_PKTGEN)
>>>
>>> 2. download the tar file and uncompress to any directory.
>>> This tar file is from kernel. It is in samples/pktgen/
>>>
>>> 3. cd pktgen
>>>
>>> 4. pktgen_sample02_multiqueue.sh -i ethx -s size -t cpu_number
>> Indeed, I see the same thing as you, and it was very easy to
>> reproduce. It was very interesting that the problem can happen with
>> as few as 3 threads, at which point I see the TX hang at exactly
>> -s 12305
> Okay, sorry I hadn't jumped into this thread yet.
>
> I can unequivocally tell you that what Sowmini saw with the MDD with
> stack based RDS-STRESS testing is *NOT* the same as what you're seeing
> while using pktgen with invalid huge skb->data buffers.
>
> We can ask on netdev if the driver should defend against this kind of
> input to hard_start_xmit (transmit routine), but the driver doesn't
> check the maximum length of the skb to see if it is invalid, because
> the stack can never build (only pktgen can) these invalid SKBs.
>
> The issue is that pktgen builds skb->data with a contiguous buffer of
> whatever size transmit requested, (regardless of MTU) and then sends it
> straight to the transmit routine, no segmentation flags, no MSS set.
>
> This causes the driver to build a transmit descriptor with an invalid
> length, which the hardware then "ASSERTS" on by issuing an MDD
> interrupt and freezing the bad acting queue.
>
>> I see:
>> i40e :82:00.0: TX driver issue detected, PF reset issued
>> i40e :82:00.0 eth2: VSI_seid 390, Hung TX queue 0, tx_pending: 492, 
>> NTC:0x140, HWB: 0x140, NTU: 0x12c, TAIL: 0x12c
>>
>> I think the common factor in both our test cases is that we have some
>> kernel thread that can efficiently send packets without any context
>> switches.
> You've found a red herring (mistakenly connected two separate events)
> so I think you can stop going down this path (pktgen).
>
>> Has anyone here seen this before? I'll see if I can find some cycles
>> to figure this out, if not, maybe it's worth bringing up on netdev,
>> to see if others have seen this, and to draw some patterns.
> We don't need to bring it up on netdev.  We have a way to troubleshoot
> MDDs that I can send to you, if you want to do the work.  Otherwise we
> need to have some time to reproduce here.
>
>>> If size is set to a large number, a similar defect occurs.
>>> With the size adjusted to an appropriate number, my defect does not occur.
>>>
>>> In my tests, some igb NICs, such as the i210, work well no matter
>>> how large the size is.
>>> Other NICs, such as the 82580, do not work well if the size is too large.
> This is mostly a combination of driver implementation and how the
> hardware handles a descriptor that is too large.  The driver *could*
> check to make sure the skb->data is never too large, but in that same
> vein, we *could* fix pktgen to never send a frame greater than MTU down
> to the driver.
Do you mean this is not a bug in the NIC,
and that it does not need to be fixed?

But if a test tool generates traffic the way pktgen does, how should it be handled?

Do we just suggest not running such tests?

Best Regards!
Zhu Yanjun
>
>>> As such, I think my problem results from the hardware and the big
>>> size triggers this problem.
>>>
>>> I hope this can help us all.
> Unfortunately Zhu's problem with pktgen is not a reproducer of
> Sowmini's problem.
>
> In the case of pktgen, it is a "don't do that, because it hurts" kind of
> bug. In the case of rds-stress, we need to reproduce it here and figure
> out what hardware constraint the driver is violating during set up of
> the transmit.
>
>


--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785231=/4140
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] [Intel-wired-lan] i40e card Tx resets

2016-03-18 Thread Sowmini Varadhan
On (03/17/16 12:28), Jesse Brandeburg wrote:
> We can ask on netdev if the driver should defend against this kind of
> input to hard_start_xmit (transmit routine), but the driver doesn't
> check the maximum length of the skb to see if it is invalid, because
> the stack can never build (only pktgen can) these invalid SKBs.
> 
> The issue is that pktgen builds skb->data with a contiguous buffer of
> whatever size transmit requested, (regardless of MTU) and then sends it
> straight to the transmit routine, no segmentation flags, no MSS set.

I see. And after you mentioned it, I checked with ixgbe, sure 
enough, that also results in a tx-hang for the pktgen test case
(whereas there were no issues with the (rds-stress, ixgbe) test).

I would surmise that pktgen is a bit of an outlier, more interesting
to focus on those cases that use the regular stack.

I don't know if dpdk can create the same issues as pktgen?

> We don't need to bring it up on netdev.  We have a way to troubleshoot
> MDDs that I can send to you, if you want to do the work.  Otherwise we
> need to have some time to reproduce here.

yes, I can do the work, since I already have this nicely set up.
Just need some hints on how to troubleshoot the MDD.

--Sowmini




Re: [E1000-devel] [Intel-wired-lan] Latest dev-queue pull has i40e VFs reporting "Device still in reset"

2016-03-18 Thread Williams, Mitch A
Thanks, Alex. I'll look into it.
-Mitch

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Alexander Duyck
> Sent: Wednesday, March 16, 2016 10:21 AM
> To: e1000-devel@lists.sourceforge.net; intel-wired-lan  l...@lists.osuosl.org>
> Subject: [Intel-wired-lan] Latest dev-queue pull has i40e VFs reporting
> "Device still in reset"
> 
> So my system with the latest pull of the dev-queue branch is stuck
> reporting "Device is still in reset (-16), retrying" for one or more
> VFs after I reload the drivers.  I've been trying to bisect the issue
> but haven't been having much luck.
> 
> Just wanted to see if anyone in Intel was aware of the issue,
> otherwise I will probably re-run my bisection with a full system reset
> between patches as I suspect I may be having issues reproducing it due
> to stray data being left in from earlier driver loads.
> 
> - Alex
> ___
> Intel-wired-lan mailing list
> intel-wired-...@lists.osuosl.org
> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan



[E1000-devel] i40e: source for nvmupdate64e? (to clear module qualification)

2016-03-18 Thread Wesley W. Terpstra
Hi! I have an Intel X710-DA4 and want to use it with a few different SFPs.
Unfortunately, out-of-the-box this seems to be disabled.

From poking around, it seems that the X710 firmware rejects SFPs not listed
in the NVM qualified database whenever the "Enable Module Qualification"
bit is set (see Section 6.3.23.9 in the Intel Ethernet Controller XL710
Datasheet). From what I've read in the datasheet and the i40e Linux driver,
it seems I just need to set this bit to 0 and I should be good to go.

I am looking for the source code for nvmupdate64e, as this userspace tool
seems to write the NVM through the ethtool EEPROM interface (the driver's
ethtool_ops.set_eeprom callback), which is what I'd like to do. Where can I
find the source code for this? The i40e_type.h header in the Linux driver
says very little about the correct use of i40e_nvmupd_cmd. I don't want to
screw up the CRC, so I would prefer to simply modify the existing NVM
update tool.

Anyone know where the source code is?


Re: [E1000-devel] [Intel-wired-lan] i40e card Tx resets

2016-03-18 Thread Jesse Brandeburg
On Thu, 17 Mar 2016 14:56:14 -0400
Sowmini Varadhan  wrote:

> On (03/17/16 10:20), zhuyj wrote:
> > 1. modprobe pktgen (the module built by CONFIG_NET_PKTGEN)
> > 
> > 2. download the tar file and uncompress to any directory.
> > This tar file is from kernel. It is in samples/pktgen/
> > 
> > 3. cd pktgen
> > 
> > 4. pktgen_sample02_multiqueue.sh -i ethx -s size -t cpu_number
> 
> Indeed, I see the same thing as you, and it was very easy to 
> reproduce. It was very interesting that the problem can happen with
> as few as 3 threads, at which point I see the TX hang at exactly
> -s 12305 

Okay, sorry I hadn't jumped into this thread yet.

I can unequivocally tell you that what Sowmini saw with the MDD with
stack based RDS-STRESS testing is *NOT* the same as what you're seeing
while using pktgen with invalid huge skb->data buffers.

We can ask on netdev if the driver should defend against this kind of
input to hard_start_xmit (transmit routine), but the driver doesn't
check the maximum length of the skb to see if it is invalid, because
the stack can never build (only pktgen can) these invalid SKBs.

The issue is that pktgen builds skb->data with a contiguous buffer of
whatever size transmit requested, (regardless of MTU) and then sends it
straight to the transmit routine, no segmentation flags, no MSS set.

This causes the driver to build a transmit descriptor with an invalid
length, which the hardware then "ASSERTS" on by issuing an MDD
interrupt and freezing the bad acting queue.

> I see:
> i40e :82:00.0: TX driver issue detected, PF reset issued
> i40e :82:00.0 eth2: VSI_seid 390, Hung TX queue 0, tx_pending: 492, 
> NTC:0x140, HWB: 0x140, NTU: 0x12c, TAIL: 0x12c
> 
> I think the common factor in both our test cases is that we have some
> kernel thread that can efficiently send packets without any context
> switches. 

You've found a red herring (mistakenly connected two separate events)
so I think you can stop going down this path (pktgen).
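
A side note on reading the hung-queue log quoted above: the reported tx_pending value can be reproduced from the ring indices in the same line. The sketch below is only an illustration in user-space Python, not driver code. It assumes a ring of 512 descriptors (not stated anywhere in this thread, but consistent with the logged numbers) and reads NTC/NTU as the ring's next-to-clean and next-to-use indices and HWB as the head write-back value; the function names are made up for the example.

```python
import re

# Assumption: 512 descriptors per ring. The thread does not state the ring
# size; 512 is used because it is consistent with the logged tx_pending.
ASSUMED_RING_SIZE = 512

# Matches fields such as "NTC:0x140" or "HWB: 0x140" in the log line.
LOG_FIELD = re.compile(r"(NTC|HWB|NTU|TAIL):\s*(0x[0-9a-fA-F]+)")


def parse_ring_indices(line):
    """Return the hex ring indices found in a log line as integers."""
    return {name: int(value, 16) for name, value in LOG_FIELD.findall(line)}


def tx_pending(head, tail, ring_size=ASSUMED_RING_SIZE):
    """Descriptors queued between head and tail, allowing for wrap-around."""
    return (tail - head) % ring_size


if __name__ == "__main__":
    line = "NTC:0x140, HWB: 0x140, NTU: 0x12c, TAIL: 0x12c"
    idx = parse_ring_indices(line)
    print(tx_pending(idx["HWB"], idx["TAIL"]))  # prints 492
```

With head at 0x140 (320) and tail at 0x12c (300), the tail has wrapped past the end of the ring, so 300 + 512 - 320 = 492 descriptors are outstanding, which matches the tx_pending in the log.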

> Has anyone here seen this before? I'll see if I can find some cycles
> to figure this out, if not, maybe it's worth bringing up on netdev,
> to see if others have seen this, and to draw some patterns.

We don't need to bring it up on netdev.  We have a way to troubleshoot
MDDs that I can send to you, if you want to do the work.  Otherwise we
need to have some time to reproduce here.

> > If size is set to a large number, a similar defect occurs.
> > With the size adjusted to an appropriate number, my defect does not occur.
> > 
> > In my tests, some igb NICs, such as the i210, work well no matter
> > how large the size is.
> > Other NICs, such as the 82580, do not work well if the size is too large.

This is mostly a combination of driver implementation and how the
hardware handles a descriptor that is too large.  The driver *could*
check to make sure the skb->data is never too large, but in that same
vein, we *could* fix pktgen to never send a frame greater than MTU down
to the driver.
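
To make the two options concrete, here is a rough user-space model in Python of what either check might look like. It is a sketch under stated assumptions, not i40e or pktgen code: FakeSkb and all function names are hypothetical, and the limit used is simply MTU plus a 14-byte Ethernet header and one 4-byte VLAN tag, not any hardware-specific descriptor limit.

```python
from dataclasses import dataclass

ETH_HLEN = 14   # Ethernet header length in bytes
VLAN_HLEN = 4   # one 802.1Q tag


@dataclass
class FakeSkb:
    """Hypothetical stand-in for the few sk_buff fields that matter here."""
    length: int        # total bytes handed to the transmit routine
    gso_size: int = 0  # MSS; 0 means no segmentation was requested


def max_unsegmented_len(mtu):
    """Largest frame expected for a given MTU when nothing will segment it."""
    return mtu + ETH_HLEN + VLAN_HLEN


def driver_should_drop(skb, mtu):
    """Driver-side option: reject an oversized frame that carries no MSS."""
    return skb.gso_size == 0 and skb.length > max_unsegmented_len(mtu)


def clamp_pktgen_size(requested, mtu):
    """Generator-side option: never build a frame larger than the MTU allows."""
    return min(requested, max_unsegmented_len(mtu))


if __name__ == "__main__":
    # The size at which the hang was reported earlier in the thread.
    print(driver_should_drop(FakeSkb(length=12305), mtu=1500))
    print(clamp_pktgen_size(12305, mtu=1500))
```

Either check keeps the oversized buffer away from the descriptor setup; which side should own it is the open question raised above.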

> > 
> > As such, I think my problem results from the hardware and the big
> > size triggers this problem.
> > 
> > I hope this can help us all.

Unfortunately Zhu's problem with pktgen is not a reproducer of
Sowmini's problem.

In the case of pktgen, it is a "don't do that, because it hurts" kind of
bug. In the case of rds-stress, we need to reproduce it here and figure
out what hardware constraint the driver is violating during set up of
the transmit.

