[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-21 Thread Jay Rolette
On Thu, Jul 21, 2016 at 3:32 PM, Thomas Monjalon 
wrote:

> 2016-07-21 13:20, Jay Rolette:
> > On Thu, Jul 21, 2016 at 10:33 AM, Ferruh Yigit 
> > wrote:
> > > KNI ethtool is functional and maintained, and it may have users!
> > >
> > > Why just remove it, especially without providing an alternative?
> > > Is it a good time to discuss KCP again?
> >
> > Yes, my product uses it.
>
> Your product uses what? KCP? KNI? KNI ethtool?
>

Sorry, that wasn't very clear. It uses KNI + ifconfig to configure the
device/interface in Linux. I'm assuming the "ethtool" bits under discussion
are the same things that make ifconfig work with KNI to the limited extent
it does.

> > Seems like we are back to the same discussion we
> > had a few months ago about the KNI situation...
> >
> > It shouldn't be removed unless there is a replacement, ideally one that
> > works with the normal Linux tools like every other network device.
>
> This ethtool module works only for igb and ixgbe!
> There is already no replacement for other drivers.
> Who works on a replacement?
>

Ferruh submitted KCP previously, but you guys didn't like the fact that it
was a kernel module. IIRC, one of the gains from that was simplified
maintenance because you didn't need driver-specific support for KNI.
Assuming he's still willing to beat it into shape, we have something that
is already most of the way there.

If people are going to continue to block it because it is a kernel module,
then IMO, it's better to leave the existing support for igb / ixgbe in place
instead of stepping backwards to zero support for ethtool.

> > While the code wasn't ready at the time, it was a definite improvement
> > over what we have with KNI today.
>


[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-21 Thread Jay Rolette
On Thu, Jul 21, 2016 at 10:33 AM, Ferruh Yigit 
wrote:

> On 7/20/2016 5:07 PM, Thomas Monjalon wrote:
> > The out-of-tree kernel code must be avoided.
> > Moreover there is no good reason to keep this legacy feature
> > which is only partially supported.
> >
> > As described earlier in this plan:
> >   http://dpdk.org/ml/archives/dev/2016-July/043606.html
> > it will help to keep PCI ids in PMD code.
> >
> > Signed-off-by: Thomas Monjalon 
> > ---
> >  doc/guides/rel_notes/deprecation.rst | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> > index f502f86..9cadf6a 100644
> > --- a/doc/guides/rel_notes/deprecation.rst
> > +++ b/doc/guides/rel_notes/deprecation.rst
> > @@ -41,3 +41,10 @@ Deprecation Notices
> >  * The mempool functions for single/multi producer/consumer are
> deprecated and
> >will be removed in 16.11.
> >It is replaced by rte_mempool_generic_get/put functions.
> > +
> > +* The ethtool support will be removed from KNI in 16.11.
> > +  It is implemented only for igb and ixgbe.
> > +  It is really hard to maintain because it requires some out-of-tree
> kernel
> > +  code to be duplicated in this kernel module.
> > +  Removing this partial support will help to restrict the PCI id
> definitions
> > +  to the PMD code.
> >
>
> KNI ethtool is functional and maintained, and it may have users!
>
> Why just remove it, especially without providing an alternative?
> Is it a good time to discuss KCP again?
>

Yes, my product uses it. Seems like we are back to the same discussion we
had a few months ago about the KNI situation...

It shouldn't be removed unless there is a replacement, ideally one that
works with the normal Linux tools like every other network device. While
the code wasn't ready at the time, it was a definite improvement over what
we have with KNI today.

Jay


[dpdk-dev] Help: How to read packet statistics from device registers via dpdk PMD?

2016-07-07 Thread Jay Rolette
On Thu, Jul 7, 2016 at 12:52 AM, Bill Bonaparte 
wrote:

> I am so happy to get your reply.
> My DPDK version is 2.1, and the OS is CentOS 7.
> the following is the output from "dpdk_nic_bind.py --status":
>
> [root at APV35 ~]# dpdk_nic_bind.py --status
>
> Network devices using DPDK-compatible driver
> 
> :04:00.0 'VMXNET3 Ethernet Controller' drv=igb_uio unused=
> :0b:00.0 'VMXNET3 Ethernet Controller' drv=igb_uio unused=
> :13:00.0 'VMXNET3 Ethernet Controller' drv=igb_uio unused=
> :1b:00.0 'VMXNET3 Ethernet Controller' drv=igb_uio unused=
>
> Network devices using kernel driver
> ===
> :03:00.0 'VMXNET3 Ethernet Controller' if=ens160 drv=vmxnet3
> unused=igb_uio *Active*
>
> Other network devices
> =
> 
>

That's a different virtual NIC than what I'm running in my VMs, but given
your app isn't working directly on the hardware, I doubt that's the issue.
In case it helps along the way, here's what I see in my VM:

$ dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver

:02:02.0 '82545EM Gigabit Ethernet Controller (Copper)' drv=igb_uio
unused=
:02:03.0 '82545EM Gigabit Ethernet Controller (Copper)' drv=igb_uio
unused=
:02:04.0 '82545EM Gigabit Ethernet Controller (Copper)' drv=igb_uio
unused=
:02:05.0 '82545EM Gigabit Ethernet Controller (Copper)' drv=igb_uio
unused=

Network devices using kernel driver
===
:02:01.0 '82545EM Gigabit Ethernet Controller (Copper)' if=eth0
drv=e1000 unused=igb_uio *Active*

Other network devices
=



> I tried it on a physical machine; it still does not work. The OS is
> CentOS 7, too.
> [root at AN ~]# dpdk_nic_bind.py --status
>
> Network devices using DPDK-compatible driver
> 
> :01:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection' drv=igb_uio
> unused=
> :01:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection' drv=igb_uio
> unused=
> :03:00.0 'I350 Gigabit Backplane Connection' drv=igb_uio unused=
> :03:00.1 'I350 Gigabit Backplane Connection' drv=igb_uio unused=
> :03:00.2 'I350 Gigabit Backplane Connection' drv=igb_uio unused=
> :03:00.3 'I350 Gigabit Backplane Connection' drv=igb_uio unused=
> :07:00.0 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :07:00.1 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :07:00.2 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :07:00.3 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :09:00.0 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :09:00.1 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :09:00.2 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :09:00.3 'I350 Gigabit Network Connection' drv=igb_uio unused=
> :0c:00.0 'Device 0011' drv=igb_uio unused=
> :0f:00.1 'I350 Gigabit Network Connection' drv=igb_uio unused=
>
> Network devices using kernel driver
> ===
> :0f:00.0 'I350 Gigabit Network Connection' if=enp15s0f0 drv=igb
> unused=igb_uio *Active*
>
> Other network devices
> =
> 
>

Since it doesn't work on physical hardware either and your devices are
successfully bound to DPDK, maybe it's a problem in your app. Have you tried
running any of the sample apps that use rte_eth_stats_get() to see if it
works there?

Jay


> On Tue, Jul 5, 2016 at 8:03 PM, Jay Rolette  wrote:
>
>>
>> On Tue, Jul 5, 2016 at 2:35 AM, Bill Bonaparte 
>> wrote:
>>
>>> Hi:
>>> I am a newbie; I have tried my best to find an answer to my question on
>>> the web, but failed, so I've come here to ask for your help. Below is my
>>> question:
>>>
>>> I found that DPDK provides an API, rte_eth_stats_get, to read packet
>>> statistics for the interface, including total input/output
>>> unicast/multicast/broadcast packets/bytes, but the API does not work on
>>> the VMXNET3 interface (a virtual interface in a VMware machine).
>>>
>>
>> Probably something in your app or environment rather than in the API
>> itself. We run rte_eth_stats_get() against interfaces in VMware Fusion,
>> VirtualBox and real hardware and they all work generally.
>>
>> Need some info before anyone can help you much:
>>
>> * What version of DPDK are you running?
>> * What OS are you running on?
>> * Output from "dpdk_nic_bind.py --status"?
>>
>> Jay
>>
>
>


[dpdk-dev] Help: How to read packet statistics from device registers via dpdk PMD?

2016-07-05 Thread Jay Rolette
On Tue, Jul 5, 2016 at 2:35 AM, Bill Bonaparte 
wrote:

> Hi:
> I am a newbie; I have tried my best to find an answer to my question on
> the web, but failed, so I've come here to ask for your help. Below is my
> question:
>
> I found that DPDK provides an API, rte_eth_stats_get, to read packet
> statistics for the interface, including total input/output
> unicast/multicast/broadcast packets/bytes, but the API does not work on
> the VMXNET3 interface (a virtual interface in a VMware machine).
>

Probably something in your app or environment rather than in the API
itself. We run rte_eth_stats_get() against interfaces in VMware Fusion,
VirtualBox and real hardware and they all work generally.

Need some info before anyone can help you much:

* What version of DPDK are you running?
* What OS are you running on?
* Output from "dpdk_nic_bind.py --status"?

Jay


[dpdk-dev] random pkt generator PMD

2016-06-15 Thread Jay Rolette
On Wed, Jun 15, 2016 at 7:11 AM, Yerden Zhumabekov 
wrote:
>
>
> On 15.06.2016 17:50, Jay Rolette wrote:
>
>> On Wed, Jun 15, 2016 at 4:43 AM, Yerden Zhumabekov 
>> wrote:
>>
>> Hello everybody,
>>>
>>> DPDK already got a number of PMDs for various eth devices, it even has
>>> PMD
>>> emulations for backends such as pcap, sw rings etc.
>>>
>>> I've been thinking about the idea of having PMD which would generate
>>> mbufs
>>> on the fly in some randomized fashion. This would serve goals like, for
>>> example:
>>>
>>> 1) running tests for applications with network processing capabilities
>>> without additional software packet generators;
>>> 2) making performance measurements with no hw interference;
>>> 3) ability to run without root privileges, --no-pci, --no-huge, for CI
>>> build, so on.
>>>
>>> Maybe there's no such need, and these goals may be achieved by other
>>> means
>>> and this idea is flawed? Any thoughts?
>>>
>>> Are you thinking of something along the lines of what BreakingPoint (now
>> part of Ixia) does, but as an open source software tool?
>>
>>
> More dreaming than thinking, though :) Live flow generation, malware,
> attack simulation, etc. is way out of scope for PMD dev, I guess.
>

Having a DPDK-based open-source BreakingPoint app would be a _fantastic_
tool for the security community, but yes, it doesn't really make sense to
put any of that logic in the PMD itself.

Were you more after the capabilities from that sort of tool or the
experience of writing a PMD?


[dpdk-dev] random pkt generator PMD

2016-06-15 Thread Jay Rolette
On Wed, Jun 15, 2016 at 4:43 AM, Yerden Zhumabekov 
wrote:

> Hello everybody,
>
> DPDK already got a number of PMDs for various eth devices, it even has PMD
> emulations for backends such as pcap, sw rings etc.
>
> I've been thinking about the idea of having PMD which would generate mbufs
> on the fly in some randomized fashion. This would serve goals like, for
> example:
>
> 1) running tests for applications with network processing capabilities
> without additional software packet generators;
> 2) making performance measurements with no hw interference;
> 3) ability to run without root privileges, --no-pci, --no-huge, for CI
> build, so on.
>
> Maybe there's no such need, and these goals may be achieved by other means
> and this idea is flawed? Any thoughts?
>

Are you thinking of something along the lines of what BreakingPoint (now
part of Ixia) does, but as an open source software tool?


[dpdk-dev] [RFC] Yet another option for DPDK options

2016-06-01 Thread Jay Rolette
On Wed, Jun 1, 2016 at 10:00 AM, Wiles, Keith  wrote:

> Started from the link below, but did not want to highjack the thread.
> http://dpdk.org/ml/archives/dev/2016-June/040021.html
>
> I was thinking about this problem from a user perspective, and command line
> options are very difficult to manage, especially when you have a large
> number of options as we have in dpdk. I see all of these options as a type
> of database of information for the DPDK and the application, because the
> application command line options are also getting very complex.
>
> I have been looking at a number of different options here and the
> direction I was thinking was using a file for the options and
> configurations with the data in a clean format. It could have been an INI
> file or JSON or XML, but they all seem to have some problems I do not like.
> The INI file is too flat and I wanted a hierarchy in the data; the JSON
> data is similar, and XML is just hard to read. I wanted to be able to
> manage multiple applications and possibly the system the DPDK/app runs on.
> The problem with the above formats is they are just data and not easy to
> make decisions about the system and applications at runtime.
>

INI format is simplest for users to read, but if you really need hierarchy,
JSON will do that just fine. Not sure what you mean by "JSON data is
similar"...
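For what it's worth, hierarchy is easy to express in JSON. Here is a hypothetical sketch of what a nested DPDK config could look like; the section and key names are invented for illustration, not a proposed schema:

```json
{
  "dpdk": {
    "eal": {
      "lcores": "0-3",
      "huge": false,
      "pci_whitelist": ["0000:02:01.0", "0000:02:02.0"]
    },
    "pktgen": {
      "map": ["1.0", "2.1"],
      "more_options": 1
    }
  }
}
```

An app-specific section (like "pktgen" above) can sit next to the EAL section, which covers the "manage multiple applications in one file" goal without needing a scripting language.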


> If the "database" of information could be queried by the EAL, drivers and
> application then we do not need to try and create a complex command line.
> It would be nice to execute a DPDK applications like this:
>
> ./some_dpdk_app --config-file dpdk-config-filename
>

+1 much nicer than the mess that is EAL command line args today.


> The dpdk-config-filename could contain a lot of information and be able to
> startup multiple different applications. The dpdk-config-file could also
> include other config files to complete the configuration. The format of the
> data in the config file needs to be readable, but allow the user to put in
> new options, needs to be hierarchical in nature and have some simple
> functions to execute if required.
>
> The solution I was thinking is the file information is really just a
> fragment of a scripting language, which the DPDK application contains this
> scripting language interpreter. I was looking at using Lua lua.org as the
> scripting language interpreter it is small and easy to understand. Python
> and others are very big and require a lot of resources and Lua requires
> very few system resources. Also I did not want to have to write a parser
> (lex/yacc). The other nice feature of Lua is you can create a sandbox for
> the code to run in and limit the type of system resources and APIs that can
> be accessed by the application and configuration. Lua can be trimmed down
> to a fairly small size and builds on just about any system or we can just
> install Lua on the system without changes from a rpm or deb.
>

There are JSON and INI file parser libraries for pretty much any language
you care to use. That shouldn't be a factor in choosing file format.

The argument about "Python and others are very big and require a lot of
resources" doesn't end up mattering much since it is already required by a
couple of the DPDK tools (in particular, dpdk_nic_bind.py).


> I use Lua in pktgen at this time and the interface between "C" and Lua is
> very simple and easy. Currently I include Lua in Pktgen, but I could have
> just used a system library.
>
> The data in the config file can be data statements along with some limited
> code to make some data changes at run time without having to modify the
> real application. Here is a simple config file I wrote: Some of the options
> do not make sense to be in the file at the same time, but wanted to see all
> of the options. The mk_lcore_list() and mk_coremap() are just Lua functions
> we can preload to help convert the simple strings into real data, in this
> case tables of information. The application could be something like pktgen
> = { map = { ... }, more_options = 1, }; this allows the same file to
> possibly contain many application configurations. Needs a bit more work.
>
> dpdk_default = {
>


> }
>
> The EAL, driver, application, etc. would query an API to access the data
> and the application can change its options quickly without modifying the
> code.
>
> Anyway comments are welcome.
>
> Regards,
> Keith
>

I like the concept overall. I'd suggest separating out the Lua thing. Lua's
fine for scripting, but nothing here really requires it or saves a lot of
development work.

Jay


[dpdk-dev] Couple of PMD questions

2016-04-20 Thread Jay Rolette
On Wed, Apr 20, 2016 at 9:05 AM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

> On Wed, Apr 20, 2016 at 07:22:57AM -0500, Jay Rolette wrote:
> > On Wed, Apr 20, 2016 at 4:10 AM, Bruce Richardson <
> > bruce.richardson at intel.com> wrote:
> >
> > > On Tue, Apr 19, 2016 at 03:16:37PM -0500, Jay Rolette wrote:
> > > > In ixgbe_dev_rx_init(), there is this bit of code:
> > > >
> > > >   /*
> > > >* Configure the RX buffer size in the BSIZEPACKET field of
> > > >* the SRRCTL register of the queue.
> > > >* The value is in 1 KB resolution. Valid values can be from
> > > >* 1 KB to 16 KB.
> > > >*/
> > > >   buf_size = (uint16_t)(rte_pktmbuf_data_room_size(rxq->mb_pool)
> -
> > > >   RTE_PKTMBUF_HEADROOM);
> > > >   srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
> > > >  IXGBE_SRRCTL_BSIZEPKT_MASK);
> > > >
> > > >   IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
> > > >
> > > >   buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> > > >  IXGBE_SRRCTL_BSIZEPKT_SHIFT);
> > > >
> > > >   /* It adds dual VLAN length for supporting dual VLAN */
> > > >   if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
> > > >   2 * IXGBE_VLAN_TAG_SIZE > buf_size)
> > > >   dev->data->scattered_rx = 1;
> > > >
> > > >
> > > > Couple of questions I have about it:
> > > >
> > > > 1) If the caller configured the MBUF memory pool data room size to be
> > > > something other than a multiple of 1K (+ RTE_PKTMBUF_HEADROOM), then
> the
> > > > driver ends up silently programming the NIC to use a smaller max RX
> size
> > > > than the caller specified.
> > > >
> > > > Should the driver error out in that case instead of only "sort of"
> > > working?
> > > > If not, it seems like it should be logging an error message.
> > >
> > > Well, it does log at the end of the whole rx init process what RX
> function
> > > is
> > > being used, so there is that.
> > > However, looking at it now, given that there is an explicit setting in
> the
> > > config
> > > to request scattered mode, I think you may be right and that we should
> > > error out
> > > here if scattered mode is needed and not set. It could help prevent
> > > unexpected
> > > problems with these drivers.
> > >
> >
> > +1. A hard fail at init if I've misconfigured things is much preferred to
> > something that mostly works for typical test cases and only fails on
> > corner/limit testing.
> >
> >
> > > > 2) Second question is about the "/* It adds dual VLAN length for
> > > supporting
> > > > dual VLAN */" bit...
> > > >
> > > > As I understand it, dev_conf.rxmode.max_rx_pkt_len is supposed to be
> set
> > > to
> > > > the max frame size you support (although it says it's only used if
> jumbo
> > > > frame is enabled). That same value is generally used when
> calculating the
> > > > size that mbuf elements should be created at.
> > > >
> > > > The description for the data_room_size parameter of
> > > > rte_pktmbuf_pool_create():
> > > >
> > > > "Size of data buffer in each mbuf, including RTE_PKTMBUF_HEADROOM."
> > > >
> > > >
> > > > If I support a max frame size of 9216 bytes (exactly a 1K multiple to
> > > make
> > > > the NIC happy), then max_rx_pkt_len is going to be 9216 and
> > > data_room_size
> > > > will be 9216 + RTE_PKTMBUF_HEADROOM.
> > > >
> > > > ixgbe_dev_rx_init() will normalize that back to 9216, which
> > > will
> > > > fail the dual VLAN length check. The really nasty part about that is
> it
> > > has
> > > > a side-effect of enabling scattered RX regardless of the fact that I
> > > didn't
> > > > enable scattered RX in dev_conf.rxmode.
> > > >
> > > > The code in the e1000 PMD is similar, so nothing unique to ixgbe.
> > > >
> > > > Is that check correct? It seems wrong to be adding space for q-in-q
> on

[dpdk-dev] Couple of PMD questions

2016-04-20 Thread Jay Rolette
On Wed, Apr 20, 2016 at 4:35 AM, Andriy Berestovskyy 
wrote:

> Hi Jay,
>
> On Tue, Apr 19, 2016 at 10:16 PM, Jay Rolette  wrote:
> > Should the driver error out in that case instead of only "sort of"
> working?
>
> +1, we hit the same issue. Error or log message would help.
>
> > If I support a max frame size of 9216 bytes (exactly a 1K multiple to
> make
> > the NIC happy), then max_rx_pkt_len is going to be 9216 and
> data_room_size
> > will be 9216 + RTE_PKTMBUF_HEADROOM.
>
> Try to set max_rx_pkt_len <= 9K - 8 and mempool element size to 9K +
> headroom + size of structures.
>

Thanks, Andriy. That makes sense given how the code + NIC are working.


> > Is that check correct?
>
> Datasheet says:
> The MFS does not include the 4 bytes of the VLAN header. Packets with
> VLAN header can be as large as MFS + 4. When double VLAN is enabled,
> the device adds 8 to the MFS for any packets.
>

Gotcha. Hopefully we can get the docs and example code up to where app
developers don't need to crawl through driver source and NIC datasheets.
That or have rte_pktmbuf_pool_create() take care of that sort of detail
automatically. :-)

Jay


Regards,
> Andriy
>


[dpdk-dev] Couple of PMD questions

2016-04-20 Thread Jay Rolette
On Wed, Apr 20, 2016 at 4:10 AM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

> On Tue, Apr 19, 2016 at 03:16:37PM -0500, Jay Rolette wrote:
> > In ixgbe_dev_rx_init(), there is this bit of code:
> >
> >   /*
> >* Configure the RX buffer size in the BSIZEPACKET field of
> >* the SRRCTL register of the queue.
> >* The value is in 1 KB resolution. Valid values can be from
> >* 1 KB to 16 KB.
> >*/
> >   buf_size = (uint16_t)(rte_pktmbuf_data_room_size(rxq->mb_pool) -
> >   RTE_PKTMBUF_HEADROOM);
> >   srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
> >  IXGBE_SRRCTL_BSIZEPKT_MASK);
> >
> >   IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
> >
> >   buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> >  IXGBE_SRRCTL_BSIZEPKT_SHIFT);
> >
> >   /* It adds dual VLAN length for supporting dual VLAN */
> >   if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
> >   2 * IXGBE_VLAN_TAG_SIZE > buf_size)
> >   dev->data->scattered_rx = 1;
> >
> >
> > Couple of questions I have about it:
> >
> > 1) If the caller configured the MBUF memory pool data room size to be
> > something other than a multiple of 1K (+ RTE_PKTMBUF_HEADROOM), then the
> > driver ends up silently programming the NIC to use a smaller max RX size
> > than the caller specified.
> >
> > Should the driver error out in that case instead of only "sort of"
> working?
> > If not, it seems like it should be logging an error message.
>
> Well, it does log at the end of the whole rx init process what RX function
> is
> being used, so there is that.
> However, looking at it now, given that there is an explicit setting in the
> config
> to request scattered mode, I think you may be right and that we should
> error out
> here if scattered mode is needed and not set. It could help prevent
> unexpected
> problems with these drivers.
>

+1. A hard fail at init if I've misconfigured things is much preferred to
something that mostly works for typical test cases and only fails on
corner/limit testing.


> > 2) Second question is about the "/* It adds dual VLAN length for
> supporting
> > dual VLAN */" bit...
> >
> > As I understand it, dev_conf.rxmode.max_rx_pkt_len is supposed to be set
> to
> > the max frame size you support (although it says it's only used if jumbo
> > frame is enabled). That same value is generally used when calculating the
> > size that mbuf elements should be created at.
> >
> > The description for the data_room_size parameter of
> > rte_pktmbuf_pool_create():
> >
> > "Size of data buffer in each mbuf, including RTE_PKTMBUF_HEADROOM."
> >
> >
> > If I support a max frame size of 9216 bytes (exactly a 1K multiple to
> make
> > the NIC happy), then max_rx_pkt_len is going to be 9216 and
> data_room_size
> > will be 9216 + RTE_PKTMBUF_HEADROOM.
> >
> > ixgbe_dev_rx_init() will normalize that back to 9216, which
> will
> > fail the dual VLAN length check. The really nasty part about that is it
> has
> > a side-effect of enabling scattered RX regardless of the fact that I
> didn't
> > enable scattered RX in dev_conf.rxmode.
> >
> > The code in the e1000 PMD is similar, so nothing unique to ixgbe.
> >
> > Is that check correct? It seems wrong to be adding space for q-in-q on
> top
> > of your specified max frame size...
>
> The issue here is what the correct behaviour needs to be. If we have the
> user
> specify the maximum frame size including all vlan tags, then we hit the
> problem
> where we have to subtract the VLAN tag sizes when writing the value to the
> NIC.
> In that case, we will hit a problem where we get a e.g. 9210 byte frame -
> to
> reuse your example - without any VLAN tags, which will be rejected by the
> hardware as being oversized. If we don't do the subtraction, and we get the
> same 9210 byte packet with 2 VLAN tags on it, the hardware will accept it
> and
> then split it across multiple descriptors because the actual DMA size is
> 9218 bytes.
>

As an app developer, I didn't realize the max frame size didn't include
VLAN tags. I expected max frame size to be the size of the ethernet frame
on the wire, which I would expect to include space used by any VLAN or MPLS
tags.

Is there anything in the docs or example apps about that? I did some
digging as I was debugging th

[dpdk-dev] Couple of PMD questions

2016-04-19 Thread Jay Rolette
In ixgbe_dev_rx_init(), there is this bit of code:

/*
 * Configure the RX buffer size in the BSIZEPACKET field of
 * the SRRCTL register of the queue.
 * The value is in 1 KB resolution. Valid values can be from
 * 1 KB to 16 KB.
 */
buf_size = (uint16_t)(rte_pktmbuf_data_room_size(rxq->mb_pool) -
RTE_PKTMBUF_HEADROOM);
srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
   IXGBE_SRRCTL_BSIZEPKT_MASK);

IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);

buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
   IXGBE_SRRCTL_BSIZEPKT_SHIFT);

/* It adds dual VLAN length for supporting dual VLAN */
if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
2 * IXGBE_VLAN_TAG_SIZE > buf_size)
dev->data->scattered_rx = 1;


Couple of questions I have about it:

1) If the caller configured the MBUF memory pool data room size to be
something other than a multiple of 1K (+ RTE_PKTMBUF_HEADROOM), then the
driver ends up silently programming the NIC to use a smaller max RX size
than the caller specified.

Should the driver error out in that case instead of only "sort of" working?
If not, it seems like it should be logging an error message.

2) Second question is about the "/* It adds dual VLAN length for supporting
dual VLAN */" bit...

As I understand it, dev_conf.rxmode.max_rx_pkt_len is supposed to be set to
the max frame size you support (although it says it's only used if jumbo
frame is enabled). That same value is generally used when calculating the
size that mbuf elements should be created at.

The description for the data_room_size parameter of
rte_pktmbuf_pool_create():

"Size of data buffer in each mbuf, including RTE_PKTMBUF_HEADROOM."


If I support a max frame size of 9216 bytes (exactly a 1K multiple to make
the NIC happy), then max_rx_pkt_len is going to be 9216 and data_room_size
will be 9216 + RTE_PKTMBUF_HEADROOM.

ixgbe_dev_rx_init() will normalize that back to 9216, which will
fail the dual VLAN length check. The really nasty part about that is it has
a side-effect of enabling scattered RX regardless of the fact that I didn't
enable scattered RX in dev_conf.rxmode.

The code in the e1000 PMD is similar, so nothing unique to ixgbe.

Is that check correct? It seems wrong to be adding space for q-in-q on top
of your specified max frame size...

Thanks,
Jay


[dpdk-dev] Potential deadlock in KNI RX path

2016-04-07 Thread Jay Rolette
On Thu, Apr 7, 2016 at 9:33 AM, Ferruh Yigit  wrote:

> On 4/6/2016 9:22 PM, Jay Rolette wrote:
> > Over a year ago, Neil pointed out that calling netif_rx() from
> > kni_net_rx_normal() was a bug and could cause lockups. Here's the
> comment:
> >
> > http://dpdk.org/ml/archives/dev/2015-March/015783.html
> >
> > Looking at the current code base, it is still calling netif_rx() instead
> of
> > netif_rx_ni() as suggested.
> >
> > Did that fall through the cracks or is there disagreement about whether
> it
> > was the right thing to do?
> >
> > Thanks,
> > Jay
> >
> Hi Jay,
>
> Using netif_rx_ni() looks like correct thing to do. I will send a patch
> for it.
>
> But we poll on this function, this means doing lots of
> preempt_disable/enable(); which may have a negative effect on
> performance. Although I did very brief test and did not observed any
> performance degradation.
>
> It can be great if somebody out already using KNI do some performance
> analysis with the patch.
>
> Another observation, for multiple kthread mode, the ksoftirq threads CPU
> consumption removed when switched to netif_rx_ni(), I remember in mail
> list somebody complained about ksoftirq CPU usage increase for this case.
>

Thanks Ferruh. We'll make that change in our local repo and run it through
its paces.

Jay


[dpdk-dev] Potential deadlock in KNI RX path

2016-04-06 Thread Jay Rolette
Over a year ago, Neil pointed out that calling netif_rx() from
kni_net_rx_normal() was a bug and could cause lockups. Here's the comment:

http://dpdk.org/ml/archives/dev/2015-March/015783.html

Looking at the current code base, it is still calling netif_rx() instead of
netif_rx_ni() as suggested.

Did that fall through the cracks or is there disagreement about whether it
was the right thing to do?

Thanks,
Jay


[dpdk-dev] Kernel panic in KNI

2016-04-06 Thread Jay Rolette
I had a system lockup hard a couple of days ago and all we were able to get
was a photo of the LCD monitor with most of the kernel panic on it. No way
to scroll back the buffer and nothing in the logs after we rebooted. Not
surprising with a kernel panic due to an exception during interrupt
processing. We have a serial console attached in case we are able to get it
to happen again, but it's not easy to reproduce (hours of runtime for this
instance).

Ran the photo through OCR software to get a text version of the dump, so
possible I missed some fixups in this:

[39178.433262] RDX: 00ba RSI: 881fd2f350ee RDI:
a12520669126180a
[39178.464020] RBP: 880433966970 R08: a12520669126180a R09:
881fd2f35000
[39178.495091] R10:  R11: 881fd2f88000 R12:
883fdla75ee8
[39178.526594] R13: 00ba R14: 7fdad5a66780 R15:
883715ab6780
[39178.559011] FS:  77fea740() GS:88lfffc0()
knlGS:
[39178.592005] CS:  0010 DS:  ES:  CR0: 80050033
[39178.623931] CR2: 77ea2000 CR3: 001fd156f000 CR4:
001407f0
[39178.656187] Stack:
[39178.689025] c067c7ef 00ba 00ba
881fd2f88000
[39178.722682] 4000 8B3fd0bbd09c 883fdla75ee8
8804339bb9c8
[39178.756525] 81658456 881fcd2ec40c c0680700
880436bad800
[39178.790577] Call Trace:
[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni]
[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0
[39178.893426] [] sch_direct_xmit+0xee/0xic0
[39178.927435] [l __dev_queue_xmit+0x200/0x4d0
[39178.961684] [l dev_queue_xmit+0x10/0x20
[39178.996194] [] neigh_connected_output+0x67/0x100
[39179.031098] [] ip_finish_output+0xid8/0x850
[39179.066709] [l ip_output+0x58/0x90
[39179.101551] [] ip_local_out_sk+0x30/0x40
[39179.136823] [] ip_queue_xmit+0xl3f/0x3d0
[39179.171742] [] tcp_transmit_skb+0x47c/0x900
[39179.206854] [l tcp_write_xmit+0x110/0xcb0
[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0
[39179.277632] [] tcp_push+0xec/0x120
[39179.311768] [] tcp_sendmsg+0xb9/0xce0
[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0
[39179.385586] [] inet_sendmsg+0x64/0x60
[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30
[39179.4586581 [] sock_sendmsg+0x86/0xc0
[39179.493220] [] ? __inet_stream_connect+0xa5/0x320
[39179.528033] [] ? __fdget+0x13/0x20
[39179.561214] [] SYSC_sendto+0x121/0x1c0
[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150
[39179.626893] [] ? read_tsc+0x9/0x20
[39179.658654] [] ? ktime_get_ts+0x48/0xe0
[39179.689944] [] SyS_sendto+0xe/0x10
[39179.719575] [] system_call_fastpath+0x1a/0x1f
[39179.748760] Code: 43 58 48 2b 43 50 88 43 4e 5b 5d c3 66 0f 1f 84 00 00
00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90
 90 48 89 f8 48 89 d1 f3 a4 c3 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 20 4c
8b % 4c 86
[39179.808690] RIP  [] memcpy+0x6/0x110
[39179.837238]  RSP 
[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
[39179.992896] Kernel Offset: 0x0 from 0x8100 (relocation
range: 0x8000-0xbfff)
[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt

It blew up when kni_net_tx() called memcpy() to copy data from the skb to
an mbuf.

Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
needed. Plenty of experience doing RTOS and bare metal development, but not
a Linux kernel expert.

What context does kni_net_tx() run in? On the receive path, my
understanding is that KNI always runs in process context on a kthread. I've
been assuming that the transmit path was also in process context (albeit on
the app's process), so the "Fatal exception in interrupt" is throwing me.

Does kni_net_tx() ever run in interrupt (or soft-interrupt) context?

Thanks,
Jay


[dpdk-dev] DPDK namespace

2016-04-06 Thread Jay Rolette
On Wed, Apr 6, 2016 at 12:26 AM, Yuanhan Liu 
wrote:

> On Tue, Apr 05, 2016 at 05:31:22PM +0300, Arnon Warshavsky wrote:
> > Googling rte functions or error codes usually takes you to dpdk dev email
> > archive so I don't think it is that much difficult to figure out where
> rte
> > comes from.
> > Other than that , except for my own refactoring pains when replacing a
> dpdk
> > version, I do not see a major reason why not.
> > If Going for dpdk_ prefix, I agree with the quick death approach.
>
> +1: it's a bit weird to keep both, especially for a long while, that
> every time we turn a rte_ prefix to dpdk_ prefix, we break applications.
> Instead of breaking applications many times, I'd prefer to break once.
> Therefore, applications could do a simple global rte_ -> dpdk_ substitute:
> it doesn't sound that painful then.
>
> And here are few more comments:
>
> - we should add rte_/dpdk_ prefix to all public structures as well.
>
>   I'm thinking we are doing well here. I'm just aware that vhost lib
>   does a bad job, which is something I proposed to fix in next release.
>
> - If we do the whole change once, I'd suggest to do it ASAP when this
>   release is over.
>
>   It should be a HUGE change that touches a lot of code, if we do it
>   later, at a stage that a lot of patches for new features have been
>   made or sent out, all of them need rebase. That'd be painful.
>

This last point that yliu brings up is the one that worries me. DPDK
already has a serious problem with the patch backlog. It can take months
for even small bug fixes to get merged.

From the app side, it's not the end of the world if there is a one time,
single shot change to the prefix. However, I'm not sure it's worth it
because of the impact on patches that have been submitted. As others have
mentioned, google can sort the rte_* references just fine.

Jay


[dpdk-dev] DPDK and HW offloads

2016-03-22 Thread Jay Rolette
On Tue, Mar 22, 2016 at 5:19 AM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

> On Tue, Mar 22, 2016 at 05:50:28AM +, Qiu, Michael wrote:
> >
> > Why not to implement one simple API with variable arguments, just like
> > syscall ioctl() does. And drivers implement it's specific hardware
> > features with a feature bit param, and other needed variable arguments.
> >
> > Thanks,
> > Michael
>
> I very much dislike that idea.
> * It makes the code much harder to read as you have to closely examine all
> the
>   parameters to work out what a function call is actually meant to do.
> * It makes it much harder to see that you have an implicit dependency on a
>   specific device. Having to include a driver specific header file e.g.
> i40e.h,
>   and call a function named e.g. i40e_do_magic_stuff(), makes it pretty
> explicit
>   that you have a dependency on i40e-based hardware
> * It prevents the compiler from doing type-checking on parameters and
> informing
>   you of little inconsistencies.
>
> For all these reasons, I prefer the device-specific functions option.
> However,
> at the same time, we also need to ensure we have a reasonable set of
> generic
> APIs so that the cases where users are forced to drop down to the
> lower-level
> device-specific primitives are reduced.
>

+1

Jay


[dpdk-dev] [PATCH v5 0/4] Use common Linux tools to control DPDK ports

2016-03-14 Thread Jay Rolette
Is there some technical reason or is it just the push-back you are getting
from some of the maintainers?

I chimed in on one of the other threads already, but I'm extremely
disappointed that usability and serviceability improvements to existing
DPDK capabilities (KNI) are getting blocked like this.

For companies building network appliances based on DPDK, having a kernel
module that isn't in the tree just isn't that big of a deal. Long term
goals for getting this upstream are great, but why not take advantage of
incremental improvements in the meantime?

Jay

On Mon, Mar 14, 2016 at 10:31 AM, Ferruh Yigit 
wrote:

> On 3/9/2016 11:41 AM, Ferruh Yigit wrote:
> > This patch sent to keep record of latest status of the work.
> >
> >
> > This work is to make DPDK ports more visible and to enable using common
> > Linux tools to configure DPDK ports.
> >
> > Patch is based on KNI but contains only control functionality of it,
> > also this patch does not include any Linux kernel network driver as
> > part of it.
> >
> > Basically with the help of a kernel module (KCP), virtual Linux network
> > interfaces named as "dpdk$" are created per DPDK port, control messages
> > sent to these virtual interfaces are forwarded to DPDK, and response
> > sent back to Linux application.
> >
> > Virtual interfaces created when DPDK application started and destroyed
> > automatically when DPDK application terminated.
> >
> > Communication between kernel-space and DPDK done using netlink socket.
> >
> > In long term this patch intends to replace the KNI and KNI will be
> > depreciated.
> >
>
> Self-NACK: Will work on netdev to upstream this.
>
>


[dpdk-dev] [PATCH 1/3] kcp: add kernel control path kernel module

2016-03-02 Thread Jay Rolette
On Tue, Mar 1, 2016 at 8:02 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Mon, 29 Feb 2016 08:33:25 -0600
> Jay Rolette  wrote:
>
> > On Mon, Feb 29, 2016 at 5:06 AM, Thomas Monjalon <
> thomas.monjalon at 6wind.com>
> > wrote:
> >
> > > Hi,
> > > I totally agree with Avi's comments.
> > > This topic is really important for the future of DPDK.
> > > So I think we must give some time to continue the discussion
> > > and have netdev involved in the choices done.
> > > As a consequence, these series should not be merged in the release
> 16.04.
> > > Thanks for continuing the work.
> > >
> >
> > I know you guys are very interested in getting rid of the out-of-tree
> > drivers, but please do not block incremental improvements to DPDK in the
> > meantime. Ferruh's patch improves the usability of KNI. Don't throw out
> > good and useful enhancements just because it isn't where you want to be
> in
> > the end.
> >
> > I'd like to see these be merged.
> >
> > Jay
>
> The code is really not ready. I am okay with cooperative development
> but the current code needs to go into a staging type tree.
> No compatibility, no ABI guarantees, more of an RFC.
> Don't want vendors building products with it then screaming when it
> gets rebuilt/reworked/scrapped.
>

That's fair. To be clear, it wasn't my intent for code that wasn't baked
yet to be merged.

The main point of my comment was that I think it is important not to halt
incremental improvements to existing capabilities (KNI in this case) just
because there are philosophical or directional changes that the community
would like to make longer-term.

Bird in the hand vs. two in the bush...

Jay


[dpdk-dev] [PATCH 1/3] kcp: add kernel control path kernel module

2016-02-29 Thread Jay Rolette
On Mon, Feb 29, 2016 at 5:06 AM, Thomas Monjalon 
wrote:

> Hi,
> I totally agree with Avi's comments.
> This topic is really important for the future of DPDK.
> So I think we must give some time to continue the discussion
> and have netdev involved in the choices done.
> As a consequence, these series should not be merged in the release 16.04.
> Thanks for continuing the work.
>

I know you guys are very interested in getting rid of the out-of-tree
drivers, but please do not block incremental improvements to DPDK in the
meantime. Ferruh's patch improves the usability of KNI. Don't throw out
good and useful enhancements just because it isn't where you want to be in
the end.

I'd like to see these be merged.

Jay


[dpdk-dev] [PATCH 2/3] rte_ctrl_if: add control interface library

2016-01-28 Thread Jay Rolette
On Thu, Jan 28, 2016 at 7:15 AM, Ferruh Yigit 
wrote:

> On Thu, Jan 28, 2016 at 11:14:47AM +, Remy Horton wrote:
> > On 27/01/2016 16:24, Ferruh Yigit wrote:
> >
> > > +   default:
> > > +   ret = -95 /* EOPNOTSUPP */;
> > > +   break;
> >
> > Is this intentional? -EOPNOTSUPP is -122 (-95 is -ENOTSOCK)..
> >
> Return value is not significant, callee just checks for negative value,
> I can remove comment to prevent confusion.
>

No, please fix the return value. Return values are significant when you are
trying to debug or understand the intent of the code.

Jay


[dpdk-dev] Future Direction for rte_eth_stats_get()

2016-01-22 Thread Jay Rolette
On Fri, Jan 22, 2016 at 9:22 AM, Van Haaren, Harry <
harry.van.haaren at intel.com> wrote:

> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > Subject: Re: [dpdk-dev] Future Direction for rte_eth_stats_get()
>
> > + Harry
>
> Hey all,
>
> > xstats are driver agnostic and have a well-defined naming scheme.
>
> Indeed, described here:
>
> http://dpdk.org/doc/guides/prog_guide/poll_mode_drv.html#extended-statistics-api
>
> All of the rte_eth_stats statistics are presented in xstats consistently
> (independent of which PMD is running), as they are implemented in
> rte_ethdev.c:
> http://dpdk.org/browse/dpdk/tree/lib/librte_ether/rte_ethdev.c#n1500
>
>
> From: David Harton:
> > For example, what if there was a kind of "stats registry" composed of ID
> and name.  It
> > would work similar to xtats except instead of advertising strings only
> the "get" API would
> > return ID/count pairs.  If the application wishes to use the DPDK
> provided string they can
> > call another API to get the stat string via the ID.  These IDs would be
> well-defined
> > clearly explaining what the count represent.  This way the strings for
> counts will be
> > uniform across drivers and also make it clear to the users what the
> counts truly represent
> > and the application could obtain stats from any driver in a driver
> agnostic manner.
>
> What you (David Harton) describe about a well-defined name, and clearly
> explaining the value
> it is representing is what the goal was for xstats: a human readable
> string, which can be
> easily parsed and a value retrieved.
>
> The "rules" of encoding a statistic as an xstats string are pretty simple,
> and if any PMD
> breaks the rules, it should be considered broken and fixed.
>
> Consistency is key, in this case consistency in the PMD xstats
> implementations to make it
> useful for clients of the API.
>
> The big advantage xstats has over adding items to rte_eth_stats is that it
> won't break ABI.
>
> Do you see a fundamental problem with parsing the strings to retrieve
> values?
>

When you have an appliance with a large number of ports and a large number
of stats that you need to collect, having to parse those strings constantly
is very slow compared to having fixed struct members or at least known ID
values.

Strings are also error prone. While they may be consistent at this point,
it is remarkably easy for them to get out of sync along the way. Using an
ID instead avoids that neatly.

Jay


[dpdk-dev] [RFC 0/3] Use common Linux tools to control DPDK ports

2016-01-18 Thread Jay Rolette
On Mon, Jan 18, 2016 at 5:12 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Fri, 15 Jan 2016 16:18:01 +
> Ferruh Yigit  wrote:
>
> > This work is to make DPDK ports more visible and to enable using common
> > Linux tools to configure DPDK ports.
> >
> > Patch is based on KNI but contains only control functionality of it,
> > also this patch does not include any Linux kernel network driver as
> > part of it.
>
> I actually would like KNI to die and be replaced by something generic.
> Right now with KNI it is driver and hardware specific. It is almost as if
> there
> are three drivers for ixgbe, the Linux driver, the DPDK driver, and the
> KNI driver.
>

Any ideas about what that would look like? Having the ability to send
traffic to/from DPDK-owned ports from control plane applications that live
outside of (and are ignorant of) DPDK is a platform requirement for our
product.

I'm assuming that isn't uncommon, but that could just be the nature of the
types of products I've built over the years.

That said, I'd love there to be something that performs better and plays
nicer with the system than KNI.

Jay


[dpdk-dev] releases scheduling

2015-12-15 Thread Jay Rolette
+100 for the LTS

One of the bigger challenges for products using DPDK is that there is so
much change going in each release with very limited testing. Trying to even
remotely keep up is too risky. We end up back-porting various fixes and
enhancements to whatever version we are on (1.6rc2 currently).

Jay

On Tue, Dec 15, 2015 at 8:42 AM, Wiles, Keith  wrote:

> On 12/15/15, 7:37 AM, "dev on behalf of O'Driscoll, Tim" <
> dev-bounces at dpdk.org on behalf of tim.odriscoll at intel.com> wrote:
>
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> >> Sent: Sunday, December 13, 2015 7:23 PM
> >> To: dev at dpdk.org
> >> Subject: [dpdk-dev] releases scheduling
> >>
> >> Hi all,
> >>
> >> We need to define the deadlines for the next releases.
> >> During 2015, we were doing a release every 4 months.
> >> If we keep the same pace, the next releases would be:
> >>  2.3: end of March
> >>  2.4: end of July
> >>  2.5: end of November
> >>
> >> However, things move fast and it may be a bit long to wait 4 months for
> >> a feature. That's why I suggest to progressively shorten release terms:
> >>  2.3: end of March
> >>  2.4: mid July
> >>  2.5: end of October
> >> and continue with a release every 3 months:
> >>  2.6: end of January
> >>  2.7: end of April
> >>  2.8: end of July
> >> This planning would preserve some of the major holiday periods
> >> (February, May, August, December).
> >>
> >> The first period, for the first submission of a feature, was 2 months
> >> long.
> >> Then we had 2 other months to discuss, merge and fix.
> >> We should shorten only the first period.
> >>
> >> Anyway, the next deadlines should be unchanged:
> >>  - January 31: end of first submission phase
> >>  - March 31: release 2.3
> >>
> >> Opinions are welcome.
> >
> >I think moving to quarterly releases is a good idea. Your proposal to do
> this in a gradual way, so that we don't change the 2.3 dates, also makes
> sense.
> >
> >Should we consider changing the release numbering at the same time? It's
> difficult to keep track of when each 2.x release is due, and we don't have
> any criteria in place for moving to 3.x in future. Many people like the
> Ubuntu numbering scheme of Year.Month. Should we consider adopting that
> convention? If we move in future to a model where we have long-term support
> releases (something that was touched on in Dublin), then we could append an
> LTS designation to the release number.
>
> +1 for the Ubuntu number and the LTS
> >
> >
> >Tim
> >
>
>
> Regards,
> Keith
>
>
>
>
>


[dpdk-dev] [RFC] location for DPDK related white papers

2015-10-23 Thread Jay Rolette
On Fri, Oct 23, 2015 at 5:28 AM, Mcnamara, John 
wrote:

> Hi,
>
> We have had a few people wishing to submit DPDK related white papers.
> These tend to focus on particular aspects of DPDK and don't fit into
> any of the existing documents.
>
> Where should these be stored/made available? Some options:
>
> * In the repo, in RST format in a doc/guides/white_paper directory.
> * On dpdk.org in PDF format.
> * Elsewhere.
>
> Any comments, suggestions?
>

PDFs on dpdk.org


[dpdk-dev] [PATCH] rte_alarm: modify it to make it not to be affected by discontinuous jumps in the system time

2015-10-14 Thread Jay Rolette
Back when this was first submitted in June, I mentioned that
CLOCK_MONOTONIC_RAW was ~10x slower than CLOCK_MONOTONIC:

http://dpdk.org/ml/archives/dev/2015-June/018687.html

It's not completely free from NTP frequency adjustments, but it won't have
any discontinuities.

That's what we've been using in our tree since then...

Jay


On Tue, Oct 13, 2015 at 7:33 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Fri,  5 Jun 2015 10:46:36 +0800
> Wen-Chi Yang  wrote:
>
> > Due to eal_alarm_callback() and rte_eal_alarm_set() use gettimeofday()
> > to get the current time, and gettimeofday() is affected by jumps.
> >
> > For example, set up an rte_alarm which will be triggered next second (
> > current time + 1 second) by rte_eal_alarm_set(). And the callback
> > function of this rte_alarm sets up another rte_alarm which will be
> > triggered next second (current time + 2 second).
> > Once we change the system time when the callback function is triggered,
> > it is possible that rte_alarm functionalities work out of expectation.
> >
> > Replace gettimeofday() with clock_gettime(CLOCK_MONOTONIC_RAW, )
> > could avoid this phenomenon.
> >
> > Signed-off-by: Wen-Chi Yang 
>
> Agreed, this should be applied.
> Does BSD version have same problem?
>
> Acked-by: Stephen Hemminger 
>
>


[dpdk-dev] [PATCH] ethdev: distinguish between drop and error stats

2015-10-08 Thread Jay Rolette
On Thu, Oct 8, 2015 at 9:37 AM, Tahhan, Maryam 
wrote:

> Hi Jay
>
> Yeah, I will do this to make the distinction clear for the counters I?m
> modifying in my patch. But please note that these aren?t counters that are
> defined in the NIC datasheets, these are the high level DPDK general
> counters for ethdevs. Their descriptions exist in the code and the
> generated DPDK docs.
>

Yeah, I know. The problem is that the generated docs don't make it clear
what the counters really are and you end up having to trace through the
code and, frequently, go look at NIC datasheets.

I just went through this a few months ago when I was integrating DPDK port
stats into our stat tracking system.

I appreciate you making the extra effort to make this clearer for app
developers :-)

Jay


> *From:* Jay Rolette [mailto:rolette at infiniteio.com]
> *Sent:* Friday, October 2, 2015 2:25 PM
> *To:* Tahhan, Maryam
> *Cc:* DPDK
> *Subject:* Re: [dpdk-dev] [PATCH] ethdev: distinguish between drop and
> error stats
>
>
>
> Can you improve the comments on these counters? If you didn't happen to
> follow this thread, there's no way to reasonably figure out what the
> difference is from looking at the code without chasing it all the way down
> and cross-referencing the NIC datasheet.
>
>
>
> Thanks,
>
> Jay
>
>
>
> On Fri, Oct 2, 2015 at 7:47 AM, Maryam Tahhan 
> wrote:
>
> Make a distinction between dropped packets and error statistics to allow
> a higher level fault management entity to interact with DPDK and take
> appropriate measures when errors are detected. It will also provide
> valuable information for any applications that collects/extracts DPDK
> stats, such applications include Open vSwitch.
> After this patch the distinction is:
> ierrors = Total number of packets dropped by hardware (malformed
> packets, ...) Where the # of drops can ONLY be <=  the packets received
> (without overlap between registers).
> Rx_pkt_errors = Total number of erroneous received packets. Where the #
> of errors can be >= the packets received (without overlap between
> registers), this is because there may be multiple errors associated with
> a packet.
>
> Signed-off-by: Maryam Tahhan 
> ---
>  lib/librte_ether/rte_ethdev.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> index 8a8c82b..53dd55d 100644
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -200,8 +200,9 @@ struct rte_eth_stats {
> /**< Deprecated; Total of RX packets with CRC error. */
> uint64_t ibadlen;
> /**< Deprecated; Total of RX packets with bad length. */
> -   uint64_t ierrors;   /**< Total number of erroneous received
> packets. */
> +   uint64_t ierrors;   /**< Total number of dropped received packets.
> */
> uint64_t oerrors;   /**< Total number of failed transmitted
> packets. */
> +   uint64_t ipkterrors;   /**< Total number of erroneous received
> packets. */
> uint64_t imcasts;
> /**< Deprecated; Total number of multicast received packets. */
> uint64_t rx_nombuf; /**< Total number of RX mbuf allocation
> failures. */
> --
> 2.4.3
>
>
>


[dpdk-dev] [PATCH] ethdev: distinguish between drop and error stats

2015-10-02 Thread Jay Rolette
Can you improve the comments on these counters? If you didn't happen to
follow this thread, there's no way to reasonably figure out what the
difference is from looking at the code without chasing it all the way down
and cross-referencing the NIC datasheet.

Thanks,
Jay

On Fri, Oct 2, 2015 at 7:47 AM, Maryam Tahhan 
wrote:

> Make a distinction between dropped packets and error statistics to allow
> a higher level fault management entity to interact with DPDK and take
> appropriate measures when errors are detected. It will also provide
> valuable information for any applications that collects/extracts DPDK
> stats, such applications include Open vSwitch.
> After this patch the distinction is:
> ierrors = Total number of packets dropped by hardware (malformed
> packets, ...) Where the # of drops can ONLY be <=  the packets received
> (without overlap between registers).
> Rx_pkt_errors = Total number of erroneous received packets. Where the #
> of errors can be >= the packets received (without overlap between
> registers), this is because there may be multiple errors associated with
> a packet.
>
> Signed-off-by: Maryam Tahhan 
> ---
>  lib/librte_ether/rte_ethdev.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> index 8a8c82b..53dd55d 100644
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -200,8 +200,9 @@ struct rte_eth_stats {
> /**< Deprecated; Total of RX packets with CRC error. */
> uint64_t ibadlen;
> /**< Deprecated; Total of RX packets with bad length. */
> -   uint64_t ierrors;   /**< Total number of erroneous received
> packets. */
> +   uint64_t ierrors;   /**< Total number of dropped received packets.
> */
> uint64_t oerrors;   /**< Total number of failed transmitted
> packets. */
> +   uint64_t ipkterrors;   /**< Total number of erroneous received
> packets. */
> uint64_t imcasts;
> /**< Deprecated; Total number of multicast received packets. */
> uint64_t rx_nombuf; /**< Total number of RX mbuf allocation
> failures. */
> --
> 2.4.3
>
>


[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2015-09-10 Thread Jay Rolette
Thanks for the feedback, Sergio. Responses inline below, but unfortunately
I don't have time to submit a new patch right now. I'm at the tail-end of
our release cycle. Last year when I originally submitted the patch, it
would have been easy to update it and resubmit.

Jay

On Wed, Sep 9, 2015 at 5:35 AM, Gonzalez Monroy, Sergio <
sergio.gonzalez.monroy at intel.com> wrote:

> Following conversation in
> http://dpdk.org/ml/archives/dev/2015-September/023230.html :
>
> On 17/12/2014 13:31, rolette at infiniteio.com (Jay Rolette) wrote:
>
>> Signed-off-by: Jay Rolette 
>> ---
>>
> Update commit title/description, maybe something like:
>   eal/linux: use qsort for sorting hugepages
>   Replace O(n^2) sort in sort_by_physaddr() with qsort() from standard
> library


Ok. Pretty minor, but easy enough.


>   lib/librte_eal/linuxapp/eal/eal_memory.c | 59
>> +++-
>>   1 file changed, 20 insertions(+), 39 deletions(-)
>>
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index bae2507..3656515 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -670,6 +670,25 @@ error:
>> return -1;
>>   }
>>   +static int
>> +cmp_physaddr(const void *a, const void *b)
>> +{
>> +#ifndef RTE_ARCH_PPC_64
>> +   const struct hugepage_file *p1 = (const struct hugepage_file *)a;
>> +   const struct hugepage_file *p2 = (const struct hugepage_file *)b;
>> +#else
>> +   // PowerPC needs memory sorted in reverse order from x86
>> +   const struct hugepage_file *p1 = (const struct hugepage_file *)b;
>> +   const struct hugepage_file *p2 = (const struct hugepage_file *)a;
>> +#endif
>> +   if (p1->physaddr < p2->physaddr)
>> +   return -1;
>> +   else if (p1->physaddr > p2->physaddr)
>> +   return 1;
>> +   else
>> +   return 0;
>> +}
>> +
>>
> There were a couple of comments from Thomas about the comments style and
> #ifdef:
> - Comment style should be modified as per
> http://dpdk.org/doc/guides/contributing/coding_style.html#c-comment-style


Yep, although that came along well after I submitted the patch

- Regarding the #ifdef, I agree with Jay that it is the current approach in
> the file.
>
>   /*
>>* Sort the hugepg_tbl by physical address (lower addresses first on
>> x86,
>>* higher address first on powerpc). We use a slow algorithm, but we
>> won't
>> @@ -678,45 +697,7 @@ error:
>>   static int
>>   sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
>> *hpi)
>>   {
>> -   unsigned i, j;
>> -   int compare_idx;
>> -   uint64_t compare_addr;
>> -   struct hugepage_file tmp;
>> -
>> -   for (i = 0; i < hpi->num_pages[0]; i++) {
>> -   compare_addr = 0;
>> -   compare_idx = -1;
>> -
>> -   /*
>> -* browse all entries starting at 'i', and find the
>> -* entry with the smallest addr
>> -*/
>> -   for (j=i; j< hpi->num_pages[0]; j++) {
>> -
>> -   if (compare_addr == 0 ||
>> -#ifdef RTE_ARCH_PPC_64
>> -   hugepg_tbl[j].physaddr > compare_addr) {
>> -#else
>> -   hugepg_tbl[j].physaddr < compare_addr) {
>> -#endif
>> -   compare_addr = hugepg_tbl[j].physaddr;
>> -   compare_idx = j;
>> -   }
>> -   }
>> -
>> -   /* should not happen */
>> -   if (compare_idx == -1) {
>> -   RTE_LOG(ERR, EAL, "%s(): error in physaddr
>> sorting\n", __func__);
>> -   return -1;
>> -   }
>> -
>> -   /* swap the 2 entries in the table */
>> -   memcpy(&tmp, &hugepg_tbl[compare_idx],
>> -   sizeof(struct hugepage_file));
>> -   memcpy(&hugepg_tbl[compare_idx], &hugepg_tbl[i],
>> -   sizeof(struct hugepage_file));
>> -   memcpy(&hugepg_tbl[i], &tmp, sizeof(struct
>> hugepage_file));
>> -   }
>> +   qsort(hugepg_tbl, hpi->num_pages[0], sizeof(struct
>> hugepage_file), cmp_physaddr);
>> return 0;
>>   }
>>
> Why not just remove sort_by_physaddr() and call qsort() directly?


That would be fine. I hadn't bothered to check whether sort_by_physaddr()
was called anywhere else, so left it there. It's not, so no real value to
keeping it in this case.


>
> Sergio
>


[dpdk-dev] [PATCH v1] change hugepage sorting to avoid overlapping memcpy

2015-09-08 Thread Jay Rolette
Most of the code in sort_by_physaddr() should be replaced by a call to
qsort() instead. Less code and gets rid of an O(n^2) sort. It's only init
code, but given how long EAL init takes, every bit helps.

I submitted a patch for this close to a year ago:
http://dpdk.org/dev/patchwork/patch/2061/

Jay

On Tue, Sep 8, 2015 at 7:45 AM, Gonzalez Monroy, Sergio <
sergio.gonzalez.monroy at intel.com> wrote:

> Hi Ralf,
>
> Just a few comments/suggestions:
>
> Add 'eal/linux:'  to the commit title, ie:
>   "eal/linux: change hugepage sorting to avoid overlapping memcpy"
>
> On 04/09/2015 11:14, Ralf Hoffmann wrote:
>
>> with only one hugepage or already sorted hugepage addresses, the sort
>> function called memcpy with same src and dst pointer. Debugging with
>> valgrind will issue a warning about overlapping area. This patch changes
>> the bubble sort to avoid this behavior. Also, the function cannot fail
>> any longer.
>>
>> Signed-off-by: Ralf Hoffmann 
>> ---
>>   lib/librte_eal/linuxapp/eal/eal_memory.c | 27
>> +--
>>   1 file changed, 13 insertions(+), 14 deletions(-)
>>
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index ac2745e..6d01f61 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -699,25 +699,25 @@ error:
>>* higher address first on powerpc). We use a slow algorithm, but we
>> won't
>>* have millions of pages, and this is only done at init time.
>>*/
>> -static int
>> +static void
>>   sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
>> *hpi)
>>   {
>> unsigned i, j;
>> -   int compare_idx;
>> +   unsigned compare_idx;
>> uint64_t compare_addr;
>> struct hugepage_file tmp;
>> for (i = 0; i < hpi->num_pages[0]; i++) {
>> -   compare_addr = 0;
>> -   compare_idx = -1;
>> +   compare_addr = hugepg_tbl[i].physaddr;
>> +   compare_idx = i;
>> /*
>> -* browse all entries starting at 'i', and find the
>> +* browse all entries starting at 'i+1', and find the
>>  * entry with the smallest addr
>>  */
>> -   for (j=i; j< hpi->num_pages[0]; j++) {
>> +   for (j=i + 1; j < hpi->num_pages[0]; j++) {
>>
> Although there are many style/checkpatch issues in current code, we try to
> fix them
> in new patches.
> In that regard, checkpatch complains about above line with:
> ERROR:SPACING: spaces required around that '='
>
>   - if (compare_addr == 0 ||
>> +   if (
>>   #ifdef RTE_ARCH_PPC_64
>> hugepg_tbl[j].physaddr > compare_addr) {
>>   #else
>> @@ -728,10 +728,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl,
>> struct hugepage_info *hpi)
>> }
>> }
>>   - /* should not happen */
>> -   if (compare_idx == -1) {
>> -   RTE_LOG(ERR, EAL, "%s(): error in physaddr
>> sorting\n", __func__);
>> -   return -1;
>> +   if (compare_idx == i) {
>> +   /* no smaller page found */
>> +   continue;
>> }
>> /* swap the 2 entries in the table */
>> @@ -741,7 +740,8 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl,
>> struct hugepage_info *hpi)
>> sizeof(struct hugepage_file));
>> memcpy(&hugepg_tbl[i], &tmp, sizeof(struct
>> hugepage_file));
>> }
>> -   return 0;
>> +
>> +   return;
>>   }
>>
> I reckon checkpatch is not picking this one because the end-of-function is
> not part of the patch,
> but it is a warning:
> WARNING:RETURN_VOID: void function return statements are not generally
> useful
>
> /*
>> @@ -1164,8 +1164,7 @@ rte_eal_hugepage_init(void)
>> goto fail;
>> }
>>   - if (sort_by_physaddr(&tmp_hp[hp_offset], hpi) < 0)
>> -   goto fail;
>> +   sort_by_physaddr(&tmp_hp[hp_offset], hpi);
>> #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
>> /* remap all hugepages into single file segments */
>>
>>
>>
> Thanks,
> Sergio
>


[dpdk-dev] rte_eal_init() alternative?

2015-09-02 Thread Jay Rolette
On Wed, Sep 2, 2015 at 7:56 AM, Bruce Richardson  wrote:

> On Wed, Sep 02, 2015 at 12:49:40PM +, Montorsi, Francesco wrote:
> > Hi all,
> >
> > Currently it seems that the only way to initialize EAL is using
> rte_eal_init() function, correct?
> >
> > I have the problem that rte_eal_init() will call rte_panic() whenever
> something fails to initialize or in other cases it will call exit().
> > In my application, I would rather like to attempt DPDK initialization.
> If it fails I don't want to exit.
> > Unfortunately I cannot even copy the rte_eal_init() code into my
> application (removing rte_panic and exit calls) since it uses a lot of DPDK
> internal private functions.
> >
> > I think that my requirements (avoid abort/exit calls when init fails) is
> a basic requirement... would you accept a patch that adds an alternative
> rte_eal_init() function that just returns an error code upon failure,
> instead of immediately exiting?
> >
> > Thanks for your hard work!
> >
> > Francesco Montorsi
> >
> I, for one, would welcome such a patch. I think the code is overly quick in
> many places to panic or exit the app, when an error code would be more
> appropriate.
> Feel free to also look at other libraries in DPDK too, if you like :-)
>
> Regards,
> /Bruce
>

+1


[dpdk-dev] KNI example with pcap file

2015-07-28 Thread Jay Rolette
Maybe I just haven't had enough caffeine this morning yet, but why would
you add IP reassembly to KNI? If packets are going through KNI, they
already have IP reassembly via the normal Linux networking stack...

Jay

On Mon, Jul 27, 2015 at 8:55 PM, EaseTheWorld Mr. 
wrote:

> Hello. I'm a DPDK newbie, playing with KNI.
>
> I have a packet analysis application like tshark,
> which receives packets from network interface or pcap file.
>
> My plan with KNI is
> 1) Performance improvement without modifying my program
> 2) Add ip_reassembly, acl to kni example.
>(rte_eth_rx_burst -> ip_reassembly, acl -> rte_kni_tx_burst)
>
> If input is network interface, I think the plan is fine.
> but if input is pcap file, is it possible?
> Should I use vdev or something?
>


[dpdk-dev] [PATCH 1/3] kni: minor opto

2015-06-15 Thread Jay Rolette
On Sun, Jun 14, 2015 at 9:07 PM, Zhang, Helin  wrote:

> Would it be better to modify the similar thing in kni_ioctl_create()?
>

That one doesn't need to use the "safe" version of list_for_each_entry()
either, but it isn't in the packet processing path so the minor performance
improvement doesn't really matter.


>
> - Helin
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jay Rolette
> > Sent: Thursday, June 4, 2015 3:19 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH 1/3] kni: minor opto
> >
> > Don't need the 'safe' version of list_for_each_entry() if you aren't
> deleting from
> > the list as you iterate over it
> >
> > Signed-off-by: Jay Rolette 
> > ---
> >  lib/librte_eal/linuxapp/kni/kni_misc.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_eal/linuxapp/kni/kni_misc.c
> > b/lib/librte_eal/linuxapp/kni/kni_misc.c
> > index 1935d32..312f196 100644
> > --- a/lib/librte_eal/linuxapp/kni/kni_misc.c
> > +++ b/lib/librte_eal/linuxapp/kni/kni_misc.c
> > @@ -213,13 +213,12 @@ static int
> >  kni_thread_single(void *unused)
> >  {
> >   int j;
> > - struct kni_dev *dev, *n;
> > + struct kni_dev *dev;
> >
> >   while (!kthread_should_stop()) {
> >   down_read(&kni_list_lock);
> >   for (j = 0; j < KNI_RX_LOOP_NUM; j++) {
> > - list_for_each_entry_safe(dev, n,
> > - &kni_list_head, list) {
> > + list_for_each_entry(dev, &kni_list_head, list) {
> >  #ifdef RTE_KNI_VHOST
> >   kni_chk_vhost_rx(dev);
> >  #else
> > --
> > 2.3.2 (Apple Git-55)
>
>


[dpdk-dev] More KNI performance

2015-06-05 Thread Jay Rolette
Follow-up to my long email about KNI performance so we don't have
chapter-length quotes in any discussion...

Increasing HZ to 1000 helped, but I'd really like a way to wake the KNI
kernel thread up on demand. I'm hoping someone with more Linux kernel
experience than I have might have some ideas.

Is there some mechanism by which the KNI kernel thread could sleep
periodically but still be awoken on demand from user space?

Basically looking for something to simulate the HW interrupt that network
drivers normally rely on. When KNI packet rates are relatively low, that
1-4 ms latency from the KNI kernel thread sleeping becomes painfully
obvious. If my DPDK app knows that KNI traffic is light, it would be nice
to "kick" the KNI kernel thread to get to work.

So far it looks like adding a syscall is about the only path to let a
user-space app kick a kernel thread. I haven't gotten as far as figuring
out the details to make sure this all works in practice.
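For illustration only: a userspace analogue of the "kick" built on eventfd(2). The names here are made up, and the real thing would need the KNI kernel module to block on an eventfd_ctx (or a wait queue) instead of schedule_timeout() -- this just shows the wake-on-demand shape of the idea.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical sketch: an eventfd gives us a descriptor one side can
 * block on while the other side "kicks" it with a write. This demo
 * exercises only the userspace half of that idea. */
static int make_kick_fd(void) {
	return eventfd(0, 0);
}

static void kick(int fd) {
	uint64_t one = 1;
	ssize_t r = write(fd, &one, sizeof(one)); /* wakes a blocked reader */
	(void)r;
}

static uint64_t wait_for_kick(int fd) {
	uint64_t n = 0;
	ssize_t r = read(fd, &n, sizeof(n)); /* blocks until kicked; returns
	                                      * the accumulated kick count */
	(void)r;
	return n;
}
```

Two kicks before the wait accumulate in the eventfd counter, so a single read drains both -- which is roughly the batching behavior you'd want for a polling thread.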

Any suggestions or guidance here would be welcome.

Thanks,
Jay


[dpdk-dev] KNI performance

2015-06-05 Thread Jay Rolette
On Fri, Jun 5, 2015 at 10:13 AM, Marc Sune  wrote:

>
>
> On 05/06/15 17:06, Jay Rolette wrote:
>
>> The past few days I've been trying to chase down why operations over KNI
>> are so bloody slow. To give you an idea how bad it is, we did a simple
>> test
>> over an NFS mount:
>>
>> # Mount over a non-KNI interface (eth0 on vanilla Ubuntu 14.04 LTS)
>> $ time $(ls -last -R /mnt/sfs2008 > /dev/null)
>> real    11m58.224s
>> user    0m10.758s
>> sys     0m25.050s
>>
>> # Reboot to make sure NFS cache is cleared and mount over a KNI interface
>> $ time $(ls -last -R /mnt/sfs2008 > /dev/null)
>> real    87m36.295s
>> user    0m14.552s
>> sys     0m25.949s
>>
>> Packet captures showed a pretty consistent ~4ms delay. Get a READDIRPLUS
>> reply from NFS server and the TCP stack on the DPDK/KNI system took about
>> 4ms to ACK the reply. It isn't just on ACK packets either. If there was no
>> ACK required, there would be a 4ms delay before the next call was sent
>> (ACCESS, LOOKUP, another READDIR, etc.).
>>
>> This is running on top of a real DPDK app, so there are various queues and
>> ring-buffers in the path between KNI and the wire, so I started there.
>> Long
>> story short, worst case, those could only inject ~120us of latency into
>> the
>> path.
>>
>> Next stop was KNI itself. Ignoring a few minor optos I found, nothing in
>> the code looked like it could account for 4ms of latency. That wasn't
>> quite
>> right though...
>>
>> Here's the code for the processing loop in kni_thread_single():
>>
>>  while (!kthread_should_stop()) {
>>  down_read(&kni_list_lock);
>>  for (j = 0; j < KNI_RX_LOOP_NUM; j++) {
>>  list_for_each_entry(dev, &kni_list_head, list) {
>> #ifdef RTE_KNI_VHOST
>>  kni_chk_vhost_rx(dev);
>> #else
>>  kni_net_rx(dev);
>> #endif
>>  kni_net_poll_resp(dev);
>>  }
>>  }
>>  up_read(&kni_list_lock);
>>  /* reschedule out for a while */
>>  schedule_timeout_interruptible(usecs_to_jiffies( \
>>  KNI_KTHREAD_RESCHEDULE_INTERVAL));
>>
>> Turns out the 4ms delay is due to the schedule_timeout() call. The code
>> specifies a 5us sleep, but the call only guarantees a sleep of *at least*
>> the time specified.
>>
>> The resolution of the sleep is controlled by the timer interrupt rate. If
>> you are using a kernel from one of the usual Linux distros, HZ = 250 on
>> x86. That works out nicely to a 4ms period. The KNI kernel thread was
>> going
>> to sleep and frequently not getting woken up for nearly 4ms.
>>
>> We rebuilt the kernel with HZ = 1000 and things improved considerably:
>>
>> # Mount over a KNI interface, HZ=1000
>> $ time $(ls -last -R /mnt/sfs2008 > /dev/null)
>>
>> real    21m8.478s
>> user    0m13.824s
>> sys     0m18.113s
>>
>> Still not where I'd like to get it, but much, much better.
>>
>> Hopefully my pain is your gain and this helps other KNI users.
>>
>

> Jay,
>
> If you set CONFIG_RTE_KNI_PREEMPT_DEFAULT to 'n' you should see a reduced
> latency and delay since there is no preemption (though sacrifices 1 CPU for
> the kni kmod):
>
> http://patchwork.dpdk.org/dev/patchwork/patch/3304/
>
> However, KNI is still pretty slow. Even considering that there will always
> be at least 1 copy involved, I still think is too slow. I didn't had time
> to look closer yet.
>
> Marc
>

Hi Marc,

Thanks for the pointer to the patch. I did something similar as a test
before we started mucking with rebuilding the kernel. Skipping the call to
put the KNI kernel thread to sleep improved performance and reduced
latency, but oddly enough, it wasn't as fast for the end-app as the HZ=1000
change.

Here's what I got on that test:

# Mount over "no-sleep" KNI
$ time $(ls -last -R /mnt/sfs2008 > /dev/null)

real    37m49.004s
user    0m23.274s
sys     0m9.010s

Jay


[dpdk-dev] KNI performance

2015-06-05 Thread Jay Rolette
The past few days I've been trying to chase down why operations over KNI
are so bloody slow. To give you an idea how bad it is, we did a simple test
over an NFS mount:

# Mount over a non-KNI interface (eth0 on vanilla Ubuntu 14.04 LTS)
$ time $(ls -last -R /mnt/sfs2008 > /dev/null)
real    11m58.224s
user    0m10.758s
sys     0m25.050s

# Reboot to make sure NFS cache is cleared and mount over a KNI interface
$ time $(ls -last -R /mnt/sfs2008 > /dev/null)
real    87m36.295s
user    0m14.552s
sys     0m25.949s

Packet captures showed a pretty consistent ~4ms delay. Get a READDIRPLUS
reply from NFS server and the TCP stack on the DPDK/KNI system took about
4ms to ACK the reply. It isn't just on ACK packets either. If there was no
ACK required, there would be a 4ms delay before the next call was sent
(ACCESS, LOOKUP, another READDIR, etc.).

This is running on top of a real DPDK app, so there are various queues and
ring-buffers in the path between KNI and the wire, so I started there. Long
story short, worst case, those could only inject ~120us of latency into the
path.

Next stop was KNI itself. Ignoring a few minor optos I found, nothing in
the code looked like it could account for 4ms of latency. That wasn't quite
right though...

Here's the code for the processing loop in kni_thread_single():

while (!kthread_should_stop()) {
down_read(&kni_list_lock);
for (j = 0; j < KNI_RX_LOOP_NUM; j++) {
list_for_each_entry(dev, &kni_list_head, list) {
#ifdef RTE_KNI_VHOST
kni_chk_vhost_rx(dev);
#else
kni_net_rx(dev);
#endif
kni_net_poll_resp(dev);
}
}
up_read(&kni_list_lock);
/* reschedule out for a while */
schedule_timeout_interruptible(usecs_to_jiffies( \
KNI_KTHREAD_RESCHEDULE_INTERVAL));

Turns out the 4ms delay is due to the schedule_timeout() call. The code
specifies a 5us sleep, but the call only guarantees a sleep of *at least*
the time specified.

The resolution of the sleep is controlled by the timer interrupt rate. If
you are using a kernel from one of the usual Linux distros, HZ = 250 on
x86. That works out nicely to a 4ms period. The KNI kernel thread was going
to sleep and frequently not getting woken up for nearly 4ms.
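As a rough sketch of that arithmetic (a simplified model of the kernel's tick rounding, not the exact usecs_to_jiffies() implementation): the requested sleep gets rounded up to whole timer ticks, and a tick is 1000000/HZ microseconds, so a 5 us request becomes one full tick.

```c
/* Simplified model: a sleep request is rounded up to whole timer ticks,
 * so the effective minimum sleep is one tick period (1000000/HZ us). */
static unsigned tick_us(unsigned hz) {
	return 1000000u / hz;
}

static unsigned us_to_jiffies(unsigned us, unsigned hz) {
	return (us + tick_us(hz) - 1) / tick_us(hz); /* round up to a tick */
}

static unsigned worst_case_sleep_us(unsigned us, unsigned hz) {
	/* a 5 us request still costs at least one full tick */
	return us_to_jiffies(us, hz) * tick_us(hz);
}
```

With HZ=250 the tick is 4000 us, so the 5 us request sleeps ~4 ms; with HZ=1000 the same request sleeps ~1 ms, which matches the improvement below.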

We rebuilt the kernel with HZ = 1000 and things improved considerably:

# Mount over a KNI interface, HZ=1000
$ time $(ls -last -R /mnt/sfs2008 > /dev/null)

real    21m8.478s
user    0m13.824s
sys     0m18.113s

Still not where I'd like to get it, but much, much better.

Hopefully my pain is your gain and this helps other KNI users.

Jay


[dpdk-dev] [PATCH 3/3] kni: rx loop was using the wrong counter

2015-06-03 Thread Jay Rolette
Loop processing packets dequeued from rx_q was using the number of
packets requested, not how many it actually received.

Variable rename to make code a little more clear

Signed-off-by: Jay Rolette 
---
 lib/librte_eal/linuxapp/kni/kni_net.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/librte_eal/linuxapp/kni/kni_net.c 
b/lib/librte_eal/linuxapp/kni/kni_net.c
index 13ccbb8..d37a6b9 100644
--- a/lib/librte_eal/linuxapp/kni/kni_net.c
+++ b/lib/librte_eal/linuxapp/kni/kni_net.c
@@ -131,7 +131,7 @@ kni_net_rx_normal(struct kni_dev *kni)
 {
unsigned ret;
uint32_t len;
-   unsigned i, num, num_fq;
+   unsigned i, num_rx, num_fq;
struct rte_kni_mbuf *kva;
struct rte_kni_mbuf *va[MBUF_BURST_SZ];
void * data_kva;
@@ -146,15 +146,15 @@ kni_net_rx_normal(struct kni_dev *kni)
}

/* Calculate the number of entries to dequeue from rx_q */
-   num = min(num_fq, (unsigned)MBUF_BURST_SZ);
+   num_rx = min(num_fq, (unsigned)MBUF_BURST_SZ);

/* Burst dequeue from rx_q */
-   ret = kni_fifo_get(kni->rx_q, (void **)va, num);
-   if (ret == 0)
+   num_rx = kni_fifo_get(kni->rx_q, (void **)va, num_rx);
+   if (num_rx == 0)
return;

/* Transfer received packets to netif */
-   for (i = 0; i < num; i++) {
+   for (i = 0; i < num_rx; i++) {
kva = (void *)va[i] - kni->mbuf_va + kni->mbuf_kva;
len = kva->data_len;
data_kva = kva->buf_addr + kva->data_off - kni->mbuf_va
@@ -184,8 +184,8 @@ kni_net_rx_normal(struct kni_dev *kni)
}

/* Burst enqueue mbufs into free_q */
-   ret = kni_fifo_put(kni->free_q, (void **)va, num);
-   if (ret != num)
+   ret = kni_fifo_put(kni->free_q, (void **)va, num_rx);
+   if (ret != num_rx)
/* Failing should not happen */
KNI_ERR("Fail to enqueue entries into free_q\n");
 }
-- 
2.3.2 (Apple Git-55)



[dpdk-dev] [PATCH 2/3] kni: minor opto

2015-06-03 Thread Jay Rolette
No reason to check how many entries are in kni->rx_q prior to
actually pulling them from the fifo. You can't dequeue more than
are there anyway. Max entries to dequeue is either the max batch
size or however much space is available on kni->free_q (lesser of the two)

Signed-off-by: Jay Rolette 
---
 lib/librte_eal/linuxapp/kni/kni_net.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/lib/librte_eal/linuxapp/kni/kni_net.c 
b/lib/librte_eal/linuxapp/kni/kni_net.c
index dd95db5..13ccbb8 100644
--- a/lib/librte_eal/linuxapp/kni/kni_net.c
+++ b/lib/librte_eal/linuxapp/kni/kni_net.c
@@ -131,7 +131,7 @@ kni_net_rx_normal(struct kni_dev *kni)
 {
unsigned ret;
uint32_t len;
-   unsigned i, num, num_rq, num_fq;
+   unsigned i, num, num_fq;
struct rte_kni_mbuf *kva;
struct rte_kni_mbuf *va[MBUF_BURST_SZ];
void * data_kva;
@@ -139,24 +139,19 @@ kni_net_rx_normal(struct kni_dev *kni)
struct sk_buff *skb;
struct net_device *dev = kni->net_dev;

-   /* Get the number of entries in rx_q */
-   num_rq = kni_fifo_count(kni->rx_q);
-
/* Get the number of free entries in free_q */
-   num_fq = kni_fifo_free_count(kni->free_q);
-
-   /* Calculate the number of entries to dequeue in rx_q */
-   num = min(num_rq, num_fq);
-   num = min(num, (unsigned)MBUF_BURST_SZ);
-
-   /* Return if no entry in rx_q and no free entry in free_q */
-   if (num == 0)
+   if ((num_fq = kni_fifo_free_count(kni->free_q)) == 0) {
+   /* No room on the free_q, bail out */
return;
+   }
+
+   /* Calculate the number of entries to dequeue from rx_q */
+   num = min(num_fq, (unsigned)MBUF_BURST_SZ);

/* Burst dequeue from rx_q */
ret = kni_fifo_get(kni->rx_q, (void **)va, num);
if (ret == 0)
-   return; /* Failing should not happen */
+   return;

/* Transfer received packets to netif */
for (i = 0; i < num; i++) {
-- 
2.3.2 (Apple Git-55)



[dpdk-dev] [PATCH 1/3] kni: minor opto

2015-06-03 Thread Jay Rolette
Don't need the 'safe' version of list_for_each_entry() if you aren't deleting 
from the list as you iterate over it

Signed-off-by: Jay Rolette 
---
 lib/librte_eal/linuxapp/kni/kni_misc.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/linuxapp/kni/kni_misc.c 
b/lib/librte_eal/linuxapp/kni/kni_misc.c
index 1935d32..312f196 100644
--- a/lib/librte_eal/linuxapp/kni/kni_misc.c
+++ b/lib/librte_eal/linuxapp/kni/kni_misc.c
@@ -213,13 +213,12 @@ static int
 kni_thread_single(void *unused)
 {
int j;
-   struct kni_dev *dev, *n;
+   struct kni_dev *dev;

while (!kthread_should_stop()) {
down_read(&kni_list_lock);
for (j = 0; j < KNI_RX_LOOP_NUM; j++) {
-   list_for_each_entry_safe(dev, n,
-   &kni_list_head, list) {
+   list_for_each_entry(dev, &kni_list_head, list) {
 #ifdef RTE_KNI_VHOST
kni_chk_vhost_rx(dev);
 #else
-- 
2.3.2 (Apple Git-55)




[dpdk-dev] rte_eal_alarm_set() is affected by discontinuous jumps in the system time

2015-06-03 Thread Jay Rolette
On Wed, Jun 3, 2015 at 7:54 AM, Bruce Richardson  wrote:

> On Wed, Jun 03, 2015 at 02:31:47PM +0800, Selmon Yang wrote:
> > Hi,
> >
> > I found that, in dpdk 2.0, rte_eal_alarm_set() is affected by
> > discontinuous jumps in the system time because eal_alarm_callback()
> > and rte_eal_alarm_set() use gettimeofday() to get the current time.
> >
> > Here is what I encountered.
> > I set up a rte eal alarm as below, and I like it to be triggered every
> second.
> > #define US_PER_S (1000 * 1000)
> > void my_alarm_cb(void *arg)
> > {
> > /* send heartbeat signal out, etc. */
> >
> > rte_eal_alarm_set(1 * US_PER_S, my_alarm_cb, NULL);
> > return;
> > }
> >
> > int main(void)
> > {
> > /* ..., do something */
> > rte_eal_alarm_set(1 * US_PER_S, my_alarm_cb, NULL);
> > /* ... do something else */
> > }
> >
> > It works fine in most of time.
> > However, if I change system time manually, it is possible that rte alarm
> > function works out of my expectation.
> > Suppose that current time is 11:00:00 AM, and eal_alarm_callback()
> > is triggered because I executed
> > rte_eal_alarm_set(1 * US_PER_S, my_alarm_cb, NULL) at 10:59:59 AM.
> > eal_alarm_callback() gets the current time (11:00:00 AM)
> > and calls my_alarm_cb() as below.
> > while ((ap = LIST_FIRST(&alarm_list)) != NULL &&
> >   gettimeofday(&now, NULL) == 0 &&
> >   (ap->time.tv_sec < now.tv_sec ||
> > (ap->time.tv_sec == now.tv_sec &&
> >   ap->time.tv_usec <=
> > now.tv_usec))){
> > ap->executing = 1;
> > ap->executing_id = pthread_self();
> > rte_spinlock_unlock(&alarm_list_lk);
> >
> > ap->cb_fn(ap->cb_arg);
> >
> > rte_spinlock_lock(&alarm_list_lk);
> >
> > LIST_REMOVE(ap, next);
> > rte_free(ap);
> > }
> >
> > In my_alarm_cb(), rte_eal_alarm_set() is called again.
> > rte_eall_alarm_set() gets the current time (11:00:00 AM), plus 1 second,
> > and adds the new alarm entry to alarm_list.
> > /* use current time to calculate absolute time of alarm */
> > gettimeofday(&now, NULL);
> >
> > new_alarm->cb_fn = cb_fn;
> > new_alarm->cb_arg = cb_arg;
> > new_alarm->time.tv_usec = (now.tv_usec + us) % US_PER_S;
> > new_alarm->time.tv_sec = now.tv_sec + ((now.tv_usec + us) /
> US_PER_S);
> >
> > rte_spinlock_lock(&alarm_list_lk);
> > if (!handler_registered) {
> > ret |= rte_intr_callback_register(&intr_handle,
> > eal_alarm_callback, NULL);
> > handler_registered = (ret == 0) ? 1 : 0;
> > }
> >
> > if (LIST_EMPTY(&alarm_list))
> > LIST_INSERT_HEAD(&alarm_list, new_alarm, next);
> > else {
> > LIST_FOREACH(ap, &alarm_list, next) {
> > if (ap->time.tv_sec > new_alarm->time.tv_sec ||
> > (ap->time.tv_sec ==
> > new_alarm->time.tv_sec &&
> >
> > ap->time.tv_usec > new_alarm->time.tv_usec)){
> > LIST_INSERT_BEFORE(ap, new_alarm, next);
> > break;
> > }
> > if (LIST_NEXT(ap, next) == NULL) {
> > LIST_INSERT_AFTER(ap, new_alarm, next);
> > break;
> > }
> > }
> > }
> >
> > After the new alarm entry is added to alarm_list, if current time is
> > set to 8:00:00 AM manually, the current time in eal_alarm_callback()
> > will be updated to 8:00:00 AM, too.
> > Then the new alarm entry will be triggered after 3 hours and 1 second.
> >
> > I think rte alarm should not be affected by discontinuous jumps in
> > the system time.
> > I tried to replace gettimeofday() with
> clock_gettime(CLOCK_MONOTONIC_RAW, ),
> > and it looks work fine.
> > What do you think about this modification?
> > Will you consider to modify rte_alarm functions to be not affected
> > by discontinuous jumps in the system time?
>
> I agree with you that the alarm functionality should not be affected by
> jumps
> in system time. If you have a patch that fixes this bug, it would be great
> if
> you could upstream it here.
>
> Thanks,
> /Bruce
>

I haven't looked through the RTE alarm code, but one thing to consider is
whether you want to use CLOCK_MONOTONIC_RAW or just CLOCK_MONOTONIC. The
RAW version is ~10x slower than the CLOCK_MONOTONIC.

CLOCK_MONOTONIC isn't completely protected from NTP frequency adjustments,
but it won't have discontinuities.

We've found the rte_eal_alarm calls to be surprisingly (ie., we hadn't
bothered to look at the implementation yet) variable and intermittently
slow, so the 10x difference on the call to 

[dpdk-dev] add support for HTM lock elision for x86

2015-06-02 Thread Jay Rolette
On Tue, Jun 2, 2015 at 8:11 AM, Roman Dementiev 
wrote:

>
> This series of patches adds methods that use hardware memory transactions
> (HTM)
> on fast-path for DPDK locks (a.k.a. lock elision). Here the methods are
> implemented
> for x86 using Restricted Transactional Memory instructions (Intel(r)
> Transactional
> Synchronization Extensions). The implementation fall-backs to the normal
> DPDK lock
> if HTM is not available or memory transactions fail.
>

This is very interesting. Do you have any summary you can give us of what
the performance implications are from your test data?


> This is not a replacement for lock usages since not all critical sections
> protected
> by locks are friendly to HTM.
>

Any pointers to material that describes where HTM in its current incarnation
on x86 is appropriate?


>
> Roman Dementiev (3):
>  spinlock: add support for HTM lock elision for x86
>  rwlock: add support for HTM lock elision for x86
>  test scaling of HTM lock elision protecting rte_hash
>
>  app/test/Makefile  |   1 +
>  app/test/test_hash_scaling.c   | 223
> +
>  lib/librte_eal/common/Makefile |   4 +-
>  .../common/include/arch/ppc_64/rte_rwlock.h|  38 
>  .../common/include/arch/ppc_64/rte_spinlock.h  |  41 
>  lib/librte_eal/common/include/arch/x86/rte_rtm.h   |  73 +++
>  .../common/include/arch/x86/rte_rwlock.h   |  82 
>  .../common/include/arch/x86/rte_spinlock.h | 107 ++
>  lib/librte_eal/common/include/generic/rte_rwlock.h | 194
> ++
>  .../common/include/generic/rte_spinlock.h  |  75 +++
>  lib/librte_eal/common/include/rte_rwlock.h | 158 ---
>  11 files changed, 836 insertions(+), 160 deletions(-)
>

Thanks!
Jay


[dpdk-dev] [PATCH v2] pipeline: add statistics for librte_pipeline ports and tables

2015-05-21 Thread Jay Rolette
On Thu, May 21, 2015 at 8:33 AM, Dumitrescu, Cristian <
cristian.dumitrescu at intel.com> wrote:

> > > The problem I see with this approach is that you cannot turn off debug
> > > messages while still having the statistics collected.  We should allow
> > > people to collects the stats without getting the debug messages. How do
> > > we do this?
> >
> > By setting build-time log level to debug, you would enable stats and
> debug
> > messages. Then you can adjust the log messages level with the run-time
> > option --log-level.
>
> This is a really clever trick! I have to say it took me some time to make
> sure I got it right :)
> So when RTE_LOG_LEVEL (build time option) is set to DEBUG (i.e. 8), then
> we get both the stats and the debug messages, but then we can adjust the
> level of debug messages at run-time through the --log-level EAL option.
> I can work with this option, thank you!
>
> There is one more simple proposal that came to my mind late last night:
> how about single config file option RTE_COLLECT_STATS=y/n that should be
> used by all the libs across the board to indicate whether stats should be
> enabled or not?
> This is logically very similar to the solution above, as they both look at
> a single option in the config file, but maybe it is also more intuitive for
> people.
>
> I will go with your choice. Which one do you pick?
>

As Stephen said previously, any real DPDK app needs stats. You can't
support network products operationally without them. What's the point of
making them optional other than trying to juice benchmarks?

Jay


[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-13 Thread Jay Rolette
On Tue, May 12, 2015 at 8:16 PM, Ravi Kerur  wrote:

> On Mon, May 11, 2015 at 3:29 PM, Don Provan  wrote:
>
> > I probably shouldn't stick my nose into this, but I can't help myself.
> >
> > An experienced programmer will tend to ignore the documentation for
> > a routine named "blahblah_memcmp" and just assume it functions like
> > memcmp. Whether or not there's currently a use case in DPDK is
> > completely irrelevant because as soon as there *is* a use case, some
> > poor DPDK developer will try to use rte_memcmp for that and may or
> > may not have a test case that reveals their mistake.
> >
>
> In general I agree with you. However, comparison is a hit(equal) or
> miss(unequal) is generally the case in networking. I haven't seen cases
> where "less than" or "greater than" has mattered.
>

It's useful when you need to make sure packets from both sides of a
conversation go to the same processing queue/thread. Instead of hashing the
5-tuple from the packet as src.ip, dst.ip, src.dport, dst.dport, etc., you
can use lesser.ip, higher.ip, lesser.sport, higher.dport, etc.

Very common when you are doing deep packet inspection.
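A minimal sketch of that canonical ordering (names are illustrative, not from DPDK): sort the two endpoints before building the key, so both directions of a flow produce byte-identical keys and therefore identical hashes.

```c
#include <stdint.h>

/* Direction-independent flow key: order the endpoints so that
 * (src, dst) and (dst, src) build the same key. */
struct flow_key {
	uint32_t lo_ip, hi_ip;
	uint16_t lo_port, hi_port;
};

static struct flow_key make_key(uint32_t sip, uint16_t sport,
                                uint32_t dip, uint16_t dport) {
	struct flow_key k;
	if (sip < dip || (sip == dip && sport < dport)) {
		k.lo_ip = sip; k.lo_port = sport;
		k.hi_ip = dip; k.hi_port = dport;
	} else {
		k.lo_ip = dip; k.lo_port = dport;
		k.hi_ip = sip; k.hi_port = sport;
	}
	return k;
}
```

Hashing the resulting struct then lands both halves of the conversation on the same queue/thread, which is the property you need for stateful inspection.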

Jay


[dpdk-dev] Beyond DPDK 2.0

2015-04-28 Thread Jay Rolette
On Tue, Apr 28, 2015 at 12:26 PM, Neil Horman  wrote:

> On Mon, Apr 27, 2015 at 08:46:01AM -0500, Jay Rolette wrote:
> > On Sat, Apr 25, 2015 at 7:10 AM, Neil Horman 
> wrote:
> >
> > > On Fri, Apr 24, 2015 at 02:55:33PM -0500, Jay Rolette wrote:
> > > > On Fri, Apr 24, 2015 at 1:51 PM, Neil Horman 
> > > wrote:
> > > >
> > > > > So, I hear your arguments, and its understandable that you might
> not
> > > want
> > > > > a GPL
> > > > > licensed product, given that the DPDK is a library (though I'm not
> sure
> > > > > what the
> > > > > aversion to LGPL would be).  Regardless, I think this conversation
> is a
> > > > > bit more
> > > > > about participation than license choice.  While you are correct, in
> > > that
> > > > > the
[dpdk-dev] Beyond DPDK 2.0

2015-04-27 Thread Jay Rolette
On Sat, Apr 25, 2015 at 7:10 AM, Neil Horman  wrote:

> On Fri, Apr 24, 2015 at 02:55:33PM -0500, Jay Rolette wrote:
> > On Fri, Apr 24, 2015 at 1:51 PM, Neil Horman 
> wrote:
> >
> > > So, I hear your arguments, and it's understandable that you might not
> want
> > > a GPL
> > > licensed product, given that the DPDK is a library (though I'm not sure
> > > what the
> > > aversion to LGPL would be).  Regardless, I think this conversation is a
> > > bit more
> > > about participation than license choice.  While you are correct, in
> that
> > > the
> > > first step to support (by which I presume you mean participation in the
> > > community) is use, the goal here is to get people contributing patches
> and
> > > helping increase the usefulness of DPDK.
> >
> >
> > > Given that DPDK is primarily licensed as BSD now, what's preventing
> you, or
> > > what
> > > would encourage you to participate in the community?  I see emails from
> > > infiniteio addresses in the archives asking questions and making
> > > suggestions on
> > > occasion, but no patches.  What would get you (or others in a similar
> > > situation) to submit those?
> > >
> >
> > 36 hours in the day? :)
> >
> > It's not a lot, but we've submitted a couple of small patches. It's
> mostly
> > a matter of opportunity. We submit patches as we come across DPDK bugs or
> > find useful optos.
> >
>


>
> Understand, I'm not trying to single you out here.  I see that you, and
> others
> from infiniteio have had some participation in the project, and that's
> great, and
> appreciated. I'm more focused on why that level of participation is not
> higher
> (not only from you, but from the larger presumed user base in general).  As
> noted at the start of this thread, one of the goals of this discussion is
> to
> find ways to maximize participation in the project, and from you I'm
> hearing
> that:
>
> 1) you use the dpdk because it lowers maintenance overhead
> 2) you don't participate more in the project because your product work
> schedule
> doesn't allow you to do so.
>
> Are those both accurate statements?
>

(1) was just my response to Luke about what would motivate a company to
submit patches if the license didn't require it (GPL vs BSD discussion).
Maintenance overhead had little to do with why we decided to use DPDK.

(2) is certainly true enough, but not really all that big of a factor in
the reasons why our participation isn't at some higher level. I'd say there
are two primary factors:

*New Contributors*
Dropping a major patch on a project where we have no history would be
foolish and frustrating. Every open source project has a vibe to it and its
own way of operating. It normally takes some time to figure that out, but
for major contributions, it generally involves a level of trust.

Major code drops or patches aren't generally well received unless it is
from someone that is known and trusted by the community. Building up that
trust ("street cred") normally involves participating in smaller, less
risky ways. Answering questions where you can, small patches to fix bugs,
possibly reviewing code (although this can be iffy as well), etc.

This facilitates understanding the process, who the players are and what
sort of toes you are likely to step on.

It also gives you time to ramp up on the internals of the code and
philosophy/goals of the project better. With DPDK, performance is obviously
more important than portability. Until recently, very little care was given
to API stability or the impact that has on DPDK apps. Both of those are
very different from the typical approach, and it affects what you might do
when submitting a patch.

If you want to build up the number of folks contributing, expect them to
start small and make sure it actually GOES somewhere.

The first patch I submitted in mid-December had some quick responses and
questions back-and-forth, but then stalled on a couple of undocumented
minor stylistic things (comment style and use of #ifdefs). I never heard
anything back and 4.5 months later, a simple startup opto is still sitting
there.

The second patch I submitted (also mid-December) got no response at all for
4 months. I've replied to the feedback that came eventually, but once
again, no subsequent answers.

Neither of the patches were important, but the process doesn't exactly
inspire a strong desire to contribute more.

*Different Goals*
I see at least two different sets of people on the mailing list. The heavy
hitters generally have a product reason for their high level of involvement
with DPDK. For Intel and 6Wind, the reasons are pretty obvious. Same deal
with NIC vendors. For large c

[dpdk-dev] Beyond DPDK 2.0

2015-04-24 Thread Jay Rolette
On Fri, Apr 24, 2015 at 1:51 PM, Neil Horman  wrote:

> So, I hear your arguments, and it's understandable that you might not want
> a GPL
> licensed product, given that the DPDK is a library (though I'm not sure
> what the
> aversion to LGPL would be).  Regardless, I think this conversation is a
> bit more
> about participation than license choice.  While you are correct, in that
> the
> first step to support (by which I presume you mean participation in the
> community) is use, the goal here is to get people contributing patches and
> helping increase the usefulness of DPDK.


> Given that DPDK is primarily licensed as BSD now, what's preventing you, or
> what
> would encourage you to participate in the community?  I see emails from
> infiniteio addresses in the archives asking questions and making
> suggestions on
> occasion, but no patches.  What would get you (or others in a similar
> situation) to submit those?
>

36 hours in the day? :)

It's not a lot, but we've submitted a couple of small patches. It's mostly
a matter of opportunity. We submit patches as we come across DPDK bugs or
find useful optos.

*Patches*

   - replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard
   library 
   - Fixed spam from kni_allocate_mbufs() when no mbufs are free. If mbufs
   exhausted, 'out of memory' message logged at EXTREMELY high rates. Now logs
   no more than once per 10 mins 

*Reviews*

   - kni: optimizing the rte_kni_rx_burst
   
   - [PATCH RFC] librte_reorder: new reorder library
   
   - [PATCH v2 09/17] i40e: clean log messages
    (several in
   that series, but I figure 1 link is plenty)

*Other*
Not really patches or reviews, but trying to participate in the community:

   - VMware Fusion + DPDK and KNI
   
   - Appropriate DPDK data structures for TCP sockets
   
   - kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
   
   - segmented recv ixgbevf
   

Jay


[dpdk-dev] Beyond DPDK 2.0

2015-04-24 Thread Jay Rolette
On Fri, Apr 24, 2015 at 2:47 AM, Luke Gorrie  wrote:

> 2. How will DPDK users justify contributing to DPDK upstream?
>
> Engineers in network equipment vendors want to contribute to open source,
> but what is the incentive for the companies to support this? This would be
> easy if DPDK were GPL'd (they are compelled) or if everybody were
> dynamically linking with the upstream libdpdk (can't have private patches).
> However, in a world where DPDK is BSD-licensed and statically linked, is it
> not both cheaper and competitively advantageous to keep fixes and
> optimizations in house?
>

The main incentive for most companies to support it is that it reduces
their maintenance load. It makes it easier to not get "stuck" on a
particular version of DPDK and they don't have to waste time constantly
back-porting improvements and bug fixes.

I can tell you that if DPDK were GPL-based, my company wouldn't be using
it. I suspect we wouldn't be the only ones...

Jay


[dpdk-dev] [PATCH] Fixed spam from kni_allocate_mbufs() when no mbufs are free. If mbufs exhausted, 'out of memory' message logged at EXTREMELY high rates. Now logs no more than once per 10 mins

2015-04-14 Thread Jay Rolette
Hi Stephen,

Thanks for the feedback. Comments and questions inline below.

Jay

On Mon, Apr 13, 2015 at 8:09 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Wed, 17 Dec 2014 07:57:02 -0600
> Jay Rolette  wrote:
>
> > Signed-off-by: Jay Rolette 
> > ---
> >  lib/librte_kni/rte_kni.c | 21 -
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
> > index fdb7509..f89319c 100644
> > --- a/lib/librte_kni/rte_kni.c
> > +++ b/lib/librte_kni/rte_kni.c
> > @@ -40,6 +40,7 @@
> >  #include 
> >  #include 
> >
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -61,6 +62,9 @@
> >
> >  #define KNI_MEM_CHECK(cond) do { if (cond) goto kni_fail; } while (0)
> >
> > +// Configure how often we log "out of memory" messages (in seconds)
> > +#define KNI_SPAM_SUPPRESSION_PERIOD 60*10
> > +
> >  /**
> >   * KNI context
> >   */
> > @@ -592,6 +596,10 @@ kni_free_mbufs(struct rte_kni *kni)
> >  static void
> >  kni_allocate_mbufs(struct rte_kni *kni)
> >  {
> > + static uint64_t no_mbufs = 0;
> > + static uint64_t spam_filter = 0;
> > + static uint64_t delayPeriod = 0;
> > +
> >   int i, ret;
> >   struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];
> >
> > @@ -620,7 +628,18 @@ kni_allocate_mbufs(struct rte_kni *kni)
> >   pkts[i] = rte_pktmbuf_alloc(kni->pktmbuf_pool);
> >   if (unlikely(pkts[i] == NULL)) {
> >   /* Out of memory */
> > - RTE_LOG(ERR, KNI, "Out of memory\n");
> > + no_mbufs++;
> > +
> > + // Memory leak or need to tune? Regardless, if we get here once,
> > + // we will get here a *lot*. Don't spam the logs!
> > + now = rte_get_tsc_cycles();
> > + if (!delayPeriod)
> > + delayPeriod = rte_get_tsc_hz() * KNI_SPAM_SUPPRESSION_PERIOD;
> > +
> > + if (!spam_filter || (now - spam_filter) > delayPeriod) {
> > + RTE_LOG(ERR, KNI, "No mbufs available (%llu)\n", (unsigned long long)no_mbufs);
> > + spam_filter = now;
> > + }
> >   break;
> >   }
> >   }
>
> I agree whole completely with the intent of this.
> But just remove the log message completely. It doesn't
> help at all, use a statistic instead.
>

I'm fine with removing the log message completely. Can you point me to
where DPDK keeps stats generally? Stats like this are only useful if they
are accessible from some sort of monitoring process. There aren't any stats
in struct rte_kni right now.

If you want to do ratelimiting, then it is better to create
> a common function (see net_ratelimit() in Linux kernel) to
> have all code use the same method, rather than reinventing private
> code to do it.
>

I'll remove it.


> Minor style complaints:
>   * don't use camelCase, it goes against the style of the rest of the code.
>

ok


>   * don't use C++ style comments.
>

I didn't. I used C99 style comments :)


>   * always use rte_cycles() not TSC for things like this.
>

I don't see rte_cycles() defined anywhere. Did you mean some other function?

Please resubmit removing the log message and adding a statistic.
>


[dpdk-dev] tools brainstorming

2015-04-09 Thread Jay Rolette
On Thu, Apr 9, 2015 at 2:16 PM, Neil Horman  wrote:

> On Thu, Apr 09, 2015 at 11:31:39AM -0500, Jay Rolette wrote:
> > On Wed, Apr 8, 2015 at 5:38 PM, Stephen Hemminger <
> > stephen at networkplumber.org> wrote:
> >
> > > On Wed, 8 Apr 2015 16:29:54 -0600
> > > Jay Rolette  wrote:
> > >
> > > > "C comments" includes //, right? It's been part of the C standard
> for a
> > > long time now...
> > >
> > > Yes but.
> > > I like to use checkpatch and checkpatch enforces kernel style which
> does
> > > not allow // for
> > > comments.
> > >
> >
> > Fork checkpatch and disable that bit? DPDK isn't the kernel, so no
> > requirement to follow all of its rules
> >
>
> Doesn't that beg the question, why?  I understand the DPDK isn't the
> kernel, but
> we're talking about clarity of code, not anything functional to that
> code.
> It seems we would be better served by just taking something that works here
> rather than re-inventing the wheel and digging into the minutiae of what
> type of
> comments should be allowed (unless there is a compelling reason to change
> it
> that supersedes the available tools).  If not checkpatch, then some other
> tool,
> but It seems to me that coding style is one of those things where we can
> bend to
> the tool rather than taking the time to make the tool do exactly whats
> desired,
> at least until someone gets the time to modify it.
>

Fair question.

It depends a bit on how much you want to encourage patch contributions. Is
it worth adding more pain for folks trying to contribute patches for things
like this?

Should we force someone to spend time redoing a patch because of which way
they do their parentheses? What about the number of spaces to indent code? //
vs /* */ comments? None of these matter functionally and they don't affect
maintenance generally.

If someone is modifying existing code, then yeah, they should follow the
prevailing style (indention level, brace alignment, etc.) of the file they
are in. It helps readability, which makes maintenance easier. However, IMO,
mixing // and /* */ for comments doesn't affect the readability of the
source.

I know if I submit a patch and the only feedback is that I should have used
/* */ for comments, I'm extremely unlikely to spend extra time to resubmit the
patch for pedantry.


[dpdk-dev] tools brainstorming

2015-04-09 Thread Jay Rolette
On Wed, Apr 8, 2015 at 5:38 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Wed, 8 Apr 2015 16:29:54 -0600
> Jay Rolette  wrote:
>
> > "C comments" includes //, right? It's been part of the C standard for a
> long time now...
>
> Yes but.
> I like to use checkpatch and checkpatch enforces kernel style which does
> not allow // for
> comments.
>

Fork checkpatch and disable that bit? DPDK isn't the kernel, so no
requirement to follow all of its rules


[dpdk-dev] tools brainstorming

2015-04-08 Thread Jay Rolette
"C comments" includes //, right? It's been part of the C standard for a long 
time now...

Sent from my iPhone

> On Apr 8, 2015, at 8:40 AM, Butler, Siobhan A  
> wrote:
> 
> 
> 
>> -Original Message-
>> From: Neil Horman [mailto:nhorman at tuxdriver.com]
>> Sent: Wednesday, April 8, 2015 2:11 PM
>> To: Butler, Siobhan A
>> Cc: Thomas Monjalon; dev at dpdk.org
>> Subject: Re: [dpdk-dev] tools brainstorming
>> 
>>> On Wed, Apr 08, 2015 at 12:16:10PM +, Butler, Siobhan A wrote:
>>> 
>>> 
 -Original Message-
 From: Neil Horman [mailto:nhorman at tuxdriver.com]
 Sent: Wednesday, April 8, 2015 12:44 PM
 To: Butler, Siobhan A
 Cc: Thomas Monjalon; dev at dpdk.org
 Subject: Re: [dpdk-dev] tools brainstorming
 
> On Wed, Apr 08, 2015 at 10:43:53AM +, Butler, Siobhan A wrote:
> Hi all,
> To add to the tools brainstorming - I propose we use the following
> Coding
 Standards as the basis of guidelines on coding style going forward.
> The style outlined below is in alignment with the current
> convention used
 for the majority of the project.
> Any thoughts/suggestions or feedback welcome.
> Thanks
> Siobhan :)
> 
> 
> 
> 
> Coding Style
> ~~
> 
> Description
> ---
> 
> This document specifies the preferred style for source files in
> the DPDK
 source tree.
> It is based on the Linux Kernel coding guidelines and the FreeBSD
> 7.2 Kernel Developer's Manual (see man style(9)), but was heavily
> modified for
 the needs of the DPDK. Many of the style rules are implicit in the
>> examples.
> Be careful to check the examples before assuming that style is
> silent on an
 issue.
> 
> General Guidelines
> --
> 
> The rules and guidelines given in this document cannot cover every
 situation, so the following general guidelines should be used as a 
 fallback:
> The code style should be consistent within each individual file,
> and within each file in a given directory or module - in the case
> of creating new
 files The primary reason for coding standards is to increase code
 readability and comprehensibility, therefore always use whatever
 option will make the code easiest to read.
> 
> The following more specific recommendations apply to all sections,
> both for
 C and assembly code:
> Line length is recommended to be not more than 80 characters,
> including comments. [Tab stop size should be assumed to be at
> least 4-
 characters wide] Indentation should be to no more than 3 levels deep.
> NOTE The above are recommendations, and not hard limits. However,
> it is
 expected that the recommendations should be followed in all but the
 rarest situations.
> C Comment Style
> 
> Usual Comments
> --
> 
> These comments should be used in normal cases. To document a
> public
 API, a doxygen-like format must be used: refer to Doxygen
>> Documentation.
> /*
>  * VERY important single-line comments look like this.
>  */
> 
> /* Most single-line comments look like this. */
> 
> /*
>  * Multi-line comments look like this.  Make them real sentences. Fill
>  * them so they look like real paragraphs.
>  */
> 
> License Header
> --
> 
> Each file should begin with a special comment tag which will
> contain the
 appropriate copyright and license for the file (Generally BSD License).
> After any copyright header, a blank line should be left before any
> other
 contents, e.g. include statements in a C file.
 
 This can become very confusing.  There already is a LICENSE.GPL and
 LICENSE.LGPL file at the top of the project, allowing others to
 insert their own licenses within their files can make it difficult
 for end user to determine if it is legally safe to use a given project.
 
 I'd suggest instead that contributions be disallowed from including
 license files in their code, relying instead on only a single
 license at the top of the project (which should likely be BSD or LGPL, 
 since
>> we're shipping a library).
 
 IANAL, but it seems to me to be dangerous to do anything else.  For
 example, all the code that is dual licensed in the library (like
 rte_pci_dev_ids.h).  It allows for a BSD or GPL license.  If someone
 builds dpdk as a DSO and distributes it under the latter, every
 application that links against that re-distribution may arguably
 itself become GPL licensed.  While I'm personally fine with that, I
 can see it as being a big deal to some end users.  Unifying the
 license makes the re-distribution license issue more clear for everyone.
 
 Neil
>>> 
>>> 
>>> Input appreciated Neil thank you, would it be best 

[dpdk-dev] [PATCH] eal: decrease the memory init time with many hugepages setup

2015-04-02 Thread Jay Rolette
On Thu, Apr 2, 2015 at 7:55 AM, Thomas Monjalon 
wrote:

> 2015-04-02 19:30, jerry.lilijun at huawei.com:
> > From: Lilijun 
> >
> > In the function map_all_hugepages(), hugepage memory is truly allocated
> by
> > memset(virtaddr, 0, hugepage_sz). Then it costs about 40s to finish the
> > dpdk memory initialization when 4 2M hugepages are setup in host os.
>
> Yes it's something we should try to reduce.
>

I have a patch in my tree that does the same opto, but it is commented out
right now. In our case, 2/3's of the startup time for our entire app was
due to that particular call - memset(virtaddr, 0, hugepage_sz). Just
zeroing 1 byte per huge page reduces that by 30% in my tests.

The only reason I have it commented out is that I didn't have time to make
sure there weren't side-effects for DPDK or my app. For normal shared
memory on Linux, pages are initialized to zero automatically once they are
touched, so the memset isn't required but I wasn't sure whether that
applied to huge pages. Also wasn't sure how hugetlbfs factored into the
equation.

Hopefully someone can chime in on that. Would love to uncomment the opto :)

> In fact we can only write one byte to finish  the allocation.
>
> Isn't it a security hole?
>

Not necessarily. If the kernel pre-zeroes the huge pages via CoW like normal
pages, then definitely not.

Even if the kernel doesn't pre-zero the pages, if DPDK takes care of
properly initializing memory structures on startup as they are carved out
of the huge pages, then it isn't a security hole. However, that approach is
susceptible to bit rot... You can audit the code and make sure everything
is kosher at first, but you have to worry about new code making assumptions
about how memory is initialized.


> This article speaks about "prezeroing optimizations" in Linux kernel:
> http://landley.net/writing/memory-faq.txt


I read through that when I was trying to figure out whether huge pages
were pre-zeroed or not. It doesn't talk about huge pages much beyond why
they are useful for reducing TLB swaps.

Jay


[dpdk-dev] Kernel deadlock due to rte_kni

2015-03-25 Thread Jay Rolette
http://patchwork.dpdk.org/ml/archives/dev/2015-February/013335.html

Jay

On Wed, Mar 25, 2015 at 2:39 PM, Dey, Souvik  wrote:

> Hi All,
> There appears to be an issue with rte_kni.ko which gets the
> kernel into a deadlock. We are trying to run rte_kni.ko with multiple-thread
> support, with the threads pinned to different non-isolated cores. When we
> test with TCP/TLS, the kernel hangs on a race condition. Below is the
> kernel stack trace.
>
> PID: 19942  TASK: 880227a71950  CPU: 3   COMMAND: "CE_2N_Comp_SamP"
> #0 [88043fd87ec0] crash_nmi_callback at 8101d4a8
> -- MORE --  forward: ,  or j  backward: b or k  quit: q
> #1 [88043fd87ed0] notifier_call_chain at 81055b68
> #2 [88043fd87f00] notify_die at 81055be0
> #3 [88043fd87f30] do_nmi at 81009ddd
> #4 [88043fd87f50] nmi at 812ea9d0
> [exception RIP: _raw_spin_lock_bh+25]
> RIP: 812ea2a4  RSP: 880189439c88  RFLAGS: 0293
> RAX: 5b59  RBX: 880291708ec8  RCX: 045a
> RDX: 880189439d90  RSI:   RDI: 880291708ec8
> RBP: 880291708e80   R8: 047fef78   R9: 0001
> R10: 0009  R11: 8126c658  R12: 880423799a40
> R13: 880189439e08  R14: 045a  R15: 0017
> ORIG_RAX:   CS: 0010  SS: 0018
> ---  ---
> #5 [880189439c88] _raw_spin_lock_bh at 812ea2a4
> #6 [880189439c90] lock_sock_nested at 8122e948
> #7 [880189439ca0] tcp_sendmsg at 8126c676
> #8 [880189439d50] sock_aio_write at 8122bb12
> #9 [880189439e00] do_sync_write at 810c61c6
> #10 [880189439f10] vfs_write at 810c68a9
> #11 [880189439f40] sys_write at 810c6dfe
> #12 [880189439f80] system_call_fastpath at 812eab92
> RIP: 7fc7909bc0ed  RSP: 7fc787ffe108  RFLAGS: 0202
> RAX: 0001  RBX: 812eab92  RCX: 7fc7880aa170
> RDX: 045a  RSI: 04d56546  RDI: 002b
> RBP: 04d56546   R8: 047fef78   R9: 0001
> R10: 0009  R11: 0293  R12: 04d56546
> R13: 0483de10  R14: 045a  R15: 0001880008b0
> ORIG_RAX: 0001  CS: 0033  SS: 002b
>
> PID: 3598   TASK: 88043db21310  CPU: 1   COMMAND: "kni_pkt0"
> #0 [88043fc87ec0] crash_nmi_callback at 8101d4a8
> #1 [88043fc87ed0] notifier_call_chain at 81055b68
> #2 [88043fc87f00] notify_die at 81055be0
> #3 [88043fc87f30] do_nmi at 81009ddd
> #4 [88043fc87f50] nmi at 812ea9d0
> [exception RIP: _raw_spin_lock+16]
> RIP: 812ea0b1  RSP: 88043fc83e78  RFLAGS: 0297
> RAX: 5a59  RBX: 880291708e80  RCX: 0001
> RDX: 88043fc83ec0  RSI: 2f82  RDI: 880291708ec8
> RBP: 88043d8f4000   R8: 813a8d20   R9: 0001
> R10: 88043d9d8098  R11: 8101e62a  R12: 81279a3c
> R13: 88042d9b3fd8  R14: 880291708e80  R15: 88043fc83ec0
> ORIG_RAX:   CS: 0010  SS: 0018
> ---  ---
> #5 [88043fc83e78] _raw_spin_lock at 812ea0b1
> #6 [88043fc83e78] tcp_delack_timer at 81279a4e
> #7 [88043fc83e98] run_timer_softirq at 8104642d
> #8 [88043fc83f08] __do_softirq at 81041539
> #9 [88043fc83f48] call_softirq at 812ebd9c
> #10 [88043fc83f60] do_softirq at 8100b037
> #11 [88043fc83f80] irq_exit at 8104185a
> #12 [88043fc83f90] smp_apic_timer_interrupt at 8101eaef
> #13 [88043fc83fb0] apic_timer_interrupt at 812eb553
> ---  ---
> #14 [88042d9b3ae8] apic_timer_interrupt at 812eb553
> [exception RIP: tcp_rcv_established+1732]
> RIP: 812756ab  RSP: 88042d9b3b90  RFLAGS: 0202
> RAX: 0020  RBX: 88042d86f470  RCX: 020a
> RDX: 8801fc163864  RSI: 88032f8b2380  RDI: 880291708e80
> RBP: 88032f8b2380   R8: 88032f8b2380   R9: 81327a60
> R10: 000e  R11: 8112ae8f  R12: 812eb54e
> R13: 8122ea88  R14: 010c  R15: 05a8
> -- MORE --  forward: ,  or j  backward: b or k  quit: q
> ORIG_RAX: ff10  CS: 0010  SS: 0018
> #15 [88042d9b3bd8] tcp_v4_do_rcv at 8127b483
> #16 [88042d9b3c48] tcp_v4_rcv at 8127d89f
> #17 [88042d9b3cb8] ip_local_deliver_finish at 81260cc2
> #18 [88042d9b3cd8] __netif_receive_skb at 81239de6
> #19 [88042d9b3d28] netif_receive_skb at 81239e7f
> #20 [88042d9b3d58] kni_net_rx_normal at a022b06f [rte_kni]
> #21 [88042d9b3ec8] kni_thread_multiple at a022a2df [rte_kni]
> #22 [88042d9b3ee8] kthread at 

[dpdk-dev] soft lockup calling receive burst

2015-03-20 Thread Jay Rolette
On Fri, Mar 20, 2015 at 12:53 PM, Joseph Vossen  wrote:

> hello,
>
> I am working on a dpdk-based app using version 1.6 under RH7.3.  Under
> varying traffic loads, I will intermittently notice a kernel soft lockup
> and the RIP provided by the kernel always points to the same MOV
> instruction in rte_ethdev.h (line #1982).  The stack trace looks like:
>
>
> dev = &rte_eth_devices[port_id];
> return (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id], rx_pkts, nb_pkts);
>
>
> 473176: 0f b7 15 a7 68 38 00    movzwl 0x3868a7(%rip),%edx  # 7f9a24  rte_eth_rx_burst():
> 47317d: 0f b6 31                movzbl (%rcx),%esi
> 473180: 48 89 f0                mov    %rsi,%rax
> 473183: 0f b6 71 01             movzbl 0x1(%rcx),%esi
> 473187: 48 c1 e0 06             shl    $0x6,%rax
> 47318b: 48 8b b8 70 ed 83 00    mov    0x83ed70(%rax),%rdi
> >  473192: 48 8b 0f             mov    (%rdi),%rcx
> 473195: 48 8b 3c f1             mov    (%rcx,%rsi,8),%rdi
> 473199: 4c 89 ee                mov    %r13,%rsi
> 47319c: ff 90 60 ed 83 00       callq  *0x83ed60(%rax)
>
>
> has any one else seen something like this?
>
> thanks


I haven't seen that, but there's a soft-lockup in KNI in DPDK 1.6. Here's
what I posted about that one a few weeks ago:

Found the problem. No patch to submit since it's already fixed in later
versions of DPDK, but thought I'd follow up with the details since I'm sure
we aren't the only ones trying to use bleeding-edge versions of DPDK...

In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
netif_rx(). The source for netif_receive_skb() points out that it should
only be called from soft-irq context, which isn't the case for KNI.

As typical, simple fix once you track it down.

Yao-Po Wang's fix:  commit 41a6ebded53982107c1adfc0652d6cc1375a7db9.

Jay


[dpdk-dev] [PATCH] kni:optimization of rte_kni_rx_burst

2015-02-25 Thread Jay Rolette
On Wed, Feb 25, 2015 at 6:38 AM, Marc Sune  wrote:

>
> On 25/02/15 13:24, Hemant at freescale.com wrote:
>
>> Hi OIivier
>>  Comments inline.
>> Regards,
>> Hemant
>>
>>  -Original Message-
>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Deme
>>> Sent: 25/Feb/2015 5:44 PM
>>> To: dev at dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH] kni:optimization of rte_kni_rx_burst
>>>
>>> Thank you Hemant, I think there might be one issue left with the patch
>>> though.
>>> The alloc_q must initially be filled with mbufs before getting mbufs back
>>> on the
>>> tx_q.
>>>
>>> So the patch should allow rte_kni_rx_burst to check if alloc_q is empty.
>>> If so, it should invoke kni_allocate_mbufs(kni, 0) (to fill the alloc_q
>>> with
>>> MAX_MBUF_BURST_NUM mbufs)
>>>
>>> The patch for rte_kni_rx_burst would then look like:
>>>
>>> @@ -575,7 +575,7 @@ rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf
>>> **mbufs, unsigned num)
>>>
>>>/* If buffers removed, allocate mbufs and then put them into
>>> alloc_q */
>>>if (ret)
>>> -kni_allocate_mbufs(kni);
>>> +  kni_allocate_mbufs(kni, ret);
>>> +  else if (unlikely(kni->alloc_q->write == kni->alloc_q->read))
>>> +  kni_allocate_mbufs(kni, 0);
>>>
>>>  [hemant]  This will introduce a run-time check.
>>
>> I missed to include the other change in the patch.
>>   I am doing it in kni_alloc i.e. initiate the alloc_q with default burst
>> size.
>> kni_allocate_mbufs(ctx, 0);
>>
>> In a way, we are now suggesting to reduce the size of alloc_q to only
>> default burst size.
>>
>
> As an aside comment here, I think that we should allow to tweak the
> userspace <-> kernel queue sizes (rx_q, tx_q, free_q and alloc_q) . Whether
> this should be a build configuration option or a parameter to
> rte_kni_init(), it is not completely clear to me, but I guess
> rte_kni_init() is a better option.
>

rte_kni_init() is definitely a better option. It allows things to be tuned
based on individual system config rather than requiring different builds.


> Having said that, the original mail from Hemant was describing that KNI
> was giving an out-of-memory. This to me indicates that the pool is
> incorrectly dimensioned. Even if KNI will not pre-allocate in the alloc_q,
> or not completely, in the event of high load, you will get this same "out
> of memory".
>
> We can reduce the usage of buffers by the KNI subsystem in kernel space
> and in userspace, but the kernel will always need a small cache of
> pre-allocated buffers (coming from user-space), since the KNI kernel module
> does not know where to grab the packets from (which pool). So my guess is
> that the dimensioning problem experienced by Hemant would be the same, even
> with the proposed changes.
>
>
>> Can we reach a situation where the kernel is adding packets faster in
>> tx_q than the application is able to dequeue?
>>
>
> I think so. We cannot control much how the kernel will schedule the KNI
> thread(s), especially if the # of threads in relation to the cores is
> incorrect (not enough), hence we need at least a reasonable amount of
> buffering to prevent early dropping to those "internal" burst side effects.
>
> Marc


Strongly agree with Marc here. We *really* don't want just a single burst
worth of mbufs available to the kernel in alloc_q. That's just asking for
congestion when there's no need for it.

The original problem reported by Olivier is more of a resource tuning
problem than anything else. The number of mbufs you need in the system has
to take into account internal queue depths.

Jay


[dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

2015-02-17 Thread Jay Rolette
On Mon, Feb 16, 2015 at 7:00 PM, Matthew Hall  wrote:

> On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> > In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> > netif_rx(). The source for netif_receive_skb() points out that it should
> > only be called from soft-irq context, which isn't the case for KNI.
>
> For the uninitiated among us, what was the practical effect of the coding
> error? Waiting forever for a lock which will never be available in IRQ
> context, or causing unintended re-entrancy, or what?
>

Sadly, I'm not really one of the enlightened ones when it comes to the
Linux kernel. VxWorks? sure. Linux kernel? learning as required.

I didn't chase it down to the specific mechanism in this case. Unusual for
me, but this time I took the expedient route of finding a likely
explanation plus Yao's fix on that same code with his explanation of a
deadlock and went with it. It'll be a few more days before we've had enough
run time on it to absolutely confirm (not an easy bug to repro).

If I get hand-wavy about it, my assumption is that the requirement that
netif_receive_skb() be called in soft-irq context means it doesn't expect
to be pre-empted or reentrant. When you call netif_rx() instead, it puts
the skb on the backlog and it gets processed from there. Part of that code
disables interrupts during part of the processing. Not sure what else is
coming in and actually deadlocking things.

Honestly, I don't understand enough details of how everything works in the
Linux network stack yet. I've done tons of work on the network path of
stack-less systems, a bit of work in device drivers, but have only touched
the surface of the internals of Linux network stack. The meat of my product
avoids that like the plague because it is too slow.

Sorry, lots of words but not much light being shed this time...
Jay


[dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

2015-02-16 Thread Jay Rolette
On Tue, Feb 10, 2015 at 7:33 PM, Jay Rolette  wrote:

> Environment:
>   * DPDK 1.6.0r2
>   * Ubuntu 14.04 LTS
>   * kernel: 3.13.0-38-generic
>
> When we start exercising KNI a fair bit (transferring files across it,
> both sending and receiving), I'm starting to see a fair bit of these kernel
> lockups:
>
> kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
>
> Frequently I can't do much other than get a screenshot of the error
> message coming across the console session once we get into this state, so
> debugging what is happening is "interesting"...
>
> I've seen this on multiple hardware platforms (so not box specific) as
> well as virtual machines.
>
> Are there any known issues with KNI that would cause kernel lockups in
> DPDK 1.6? Really hoping someone that knows KNI well can point me in the
> right direction.
>
> KNI in the 1.8 tree is significantly different, so it didn't look
> straight-forward to back-port it, although I do see a few changes that
> might be relevant.
>

Found the problem. No patch to submit since it's already fixed in later
versions of DPDK, but thought I'd follow up with the details since I'm sure
we aren't the only ones trying to use bleeding-edge versions of DPDK...

In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
netif_rx(). The source for netif_receive_skb() points out that it should
only be called from soft-irq context, which isn't the case for KNI.

As typical, simple fix once you track it down.

Yao-Po Wang's fix:  commit 41a6ebded53982107c1adfc0652d6cc1375a7db9.

Cheers,
Jay


[dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

2015-02-16 Thread Jay Rolette
Thanks Alejandro.

I'll look into the kernel dump if there is one. The system is extremely
brittle once this happens. Usually I can't do much other than power-cycle
the box. Anything requiring sudo just locks the terminal up, so little to
look at besides the messages on the console.

Matthew Hall also suggested a few things for me to look into, so I'm
following up on that as well.

Jay

On Wed, Feb 11, 2015 at 10:25 AM, Alejandro Lucero <
alejandro.lucero at netronome.com> wrote:

> Hi Jay,
>
> I saw these errors when I worked in the HPC sector. They usually come with
> a kernel dump for each core in the machine so you can know, after some
> peering at the kernel code, how the soft lockup triggers. When I did that
> it was always an issue with the memory.
>
> So those times that you can still work on the machine after the problem,
> look at the kernel messages. I will be glad to look at it.
>
>
>
> On Wed, Feb 11, 2015 at 1:33 AM, Jay Rolette 
> wrote:
>
> > Environment:
> >   * DPDK 1.6.0r2
> >   * Ubuntu 14.04 LTS
> >   * kernel: 3.13.0-38-generic
> >
> > When we start exercising KNI a fair bit (transferring files across it,
> both
> > sending and receiving), I'm starting to see a fair bit of these kernel
> > lockups:
> >
> > kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
> >
> > Frequently I can't do much other than get a screenshot of the error
> message
> > coming across the console session once we get into this state, so
> debugging
> > what is happening is "interesting"...
> >
> > I've seen this on multiple hardware platforms (so not box specific) as
> well
> > as virtual machines.
> >
> > Are there any known issues with KNI that would cause kernel lockups in
> DPDK
> > 1.6? Really hoping someone that knows KNI well can point me in the right
> > direction.
> >
> > KNI in the 1.8 tree is significantly different, so it didn't look
> > straight-forward to back-port it, although I do see a few changes that
> > might be relevant.
> >
> > Any suggestions, pointers or other general help for tracking this down?
> >
> > Thanks!
> > Jay
> >
>


[dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

2015-02-10 Thread Jay Rolette
Environment:
  * DPDK 1.6.0r2
  * Ubuntu 14.04 LTS
  * kernel: 3.13.0-38-generic

When we start exercising KNI a fair bit (transferring files across it, both
sending and receiving), I'm starting to see a fair bit of these kernel
lockups:

kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

Frequently I can't do much other than get a screenshot of the error message
coming across the console session once we get into this state, so debugging
what is happening is "interesting"...

I've seen this on multiple hardware platforms (so not box specific) as well
as virtual machines.

Are there any known issues with KNI that would cause kernel lockups in DPDK
1.6? Really hoping someone that knows KNI well can point me in the right
direction.

KNI in the 1.8 tree is significantly different, so it didn't look
straight-forward to back-port it, although I do see a few changes that
might be relevant.

Any suggestions, pointers or other general help for tracking this down?

Thanks!
Jay


[dpdk-dev] mmap fails with more than 40000 hugepages

2015-02-05 Thread Jay Rolette
On Thu, Feb 5, 2015 at 7:36 AM, Damjan Marion (damarion)  wrote:
>
>  On 05 Feb 2015, at 14:22, Jay Rolette  wrote:
>
>   Not directly related, but if you have to stick with 2MB hugepages, you
> might want to take a look at a patch I submitted that fixes the O(n^2)
> algorithm used in initializing hugepages.
>
>
>  I tried it hoping that it will change something, no luck?
>

I suggested that patch only to speed up your init time. There's nothing in
it that will help with the crash you are seeing.


[dpdk-dev] mmap fails with more than 40000 hugepages

2015-02-05 Thread Jay Rolette
On Thu, Feb 5, 2015 at 6:00 AM, Damjan Marion (damarion)  wrote:

> Hi,
>
> I have system with 2 NUMA nodes and 256G RAM total. I noticed that DPDK
> crashes in rte_eal_init()
> when number of available hugepages is around 4 or above.
> Everything works fine with lower values (i.e. 3).
>
> I also tried with allocating 4 on node0 and 0 on node1, same crash
> happens.
>
>
> Any idea what might be causing this?
>

Any reason you can't switch to using 1GB hugepages? You'll get better
performance and your init time will be shorter. The systems we run on are
similar (256GB, 2 NUMA nodes) and that works fine for us.

Not directly related, but if you have to stick with 2MB hugepages, you
might want to take a look at a patch I submitted that fixes the O(n^2)
algorithm used in initializing hugepages.

Jay


[dpdk-dev] Process question: reviewing older patches

2015-01-29 Thread Jay Rolette
On Wed, Jan 28, 2015 at 3:23 PM, Neil Horman  wrote:

> On Wed, Jan 28, 2015 at 02:57:58PM -0600, Jay Rolette wrote:
> > Thanks Thomas and Neil. Sadly, no joy. While I generally like gmail for
> my
> > mail, there's not a reasonable way to import the mbox file or to control
> > the message id.
> >
> Sure there is, you just need to select an appropriate MUA.  You can't use
> the
> web interface for this.  Enable imap access to your gmail account, and
> setup an
> MUA like mutt to point to it.  Then the mutt client can open the mbox
> file, or
> you can fill out the in-reply-to: header manually.
>

True. I meant there's not a way to do this in the mail interface I use.
Honestly, while I was happy to do the review since I've been digging into
KNI anyway, I'm not going to spend time setting up an MUA just for this.

Jay


[dpdk-dev] Process question: reviewing older patches

2015-01-28 Thread Jay Rolette
Thanks Thomas and Neil. Sadly, no joy. While I generally like gmail for my
mail, there's not a reasonable way to import the mbox file or to control
the message id.

If someone else wants to resend the message to the list, I can reply to
that. Otherwise, here are the relevant bits from the original patch email:

From patchwork Wed Jul 23 06:45:12 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: [dpdk-dev] kni: optimizing the rte_kni_rx_burst
From: Hemant Agrawal <hem...@freescale.com>
X-Patchwork-Id: 84
Message-Id: <14060979121185-git-send-email-Hemant at freescale.com>
To: 
Date: Wed, 23 Jul 2014 12:15:12 +0530

The current implementation of rte_kni_rx_burst polls the fifo for buffers.
Irrespective of success or failure, it allocates the mbufs and tries to put
them into the alloc_q;
if the buffers are not added to alloc_q, it frees them.
This wastes lots of CPU cycles in allocating and freeing the buffers if
alloc_q is full.

The logic has been changed to:
1. Initially alloc and add buffers (burst size) to alloc_q
2. Add buffers to alloc_q only when you are pulling out the buffers.

Signed-off-by: Hemant Agrawal 

---
 lib/librte_kni/rte_kni.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index 76feef4..01e85f8 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -263,6 +263,9 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
 
 	ctx->in_use = 1;
 
+	/* Allocate mbufs and then put them into alloc_q */
+	kni_allocate_mbufs(ctx);
+
 	return ctx;
 
 fail:
@@ -369,8 +372,9 @@ rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned num)
 {
 	unsigned ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num);
 
-	/* Allocate mbufs and then put them into alloc_q */
-	kni_allocate_mbufs(kni);
+	/* If buffers removed, allocate mbufs and then put them into alloc_q */
+	if (ret)
+		kni_allocate_mbufs(kni);
 
 	return ret;
 }

The patch looks good from a DPDK 1.6r2 viewpoint. We saw the same behavior
in our app and ended up avoiding it higher in the stack (in our code).

Reviewed-by: Jay Rolette 

Jay

On Wed, Jan 28, 2015 at 10:49 AM, Neil Horman  wrote:

> On Wed, Jan 28, 2015 at 09:52:48AM -0600, Jay Rolette wrote:
> > There's a fairly old KNI patch (http://dpdk.org/dev/patchwork/patch/84/)
> > that I reviewed, but I'm not seeing how to submit my "Reviewed-by" when I
> > don't have any of the emails from the patch in my mail client.
> >
> > I can copy the text from the 'mbox' link in Patchwork into an email, but
> > I'm guessing that may not make the patch toolchain happy.
> >
> > What's the right way to do this?
> >
> Just grab the message id from the patchwork site, and list it in the
> envelope
> headers in-reply-to: field when you respond.  You won't have the rest of
> the
> conversation field in the thread, but you will respond properly to the
> thread,
> and patchwork will pick up the ACK
> Neil
>
> > Thanks,
> > Jay
> >
>


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Jay Rolette
On Thu, Jan 22, 2015 at 12:27 PM, Luke Gorrie  wrote:

> On 22 January 2015 at 14:29, Jay Rolette  wrote:
>
>> Microseconds matter. Scaling up to 100GbE, nanoseconds matter.
>>
>
> True. Is there a cut-off point though?
>

There are always engineering trade-offs that have to be made. If I'm
optimizing something today, I'm certainly not starting at something that
takes 1ns for an app that is doing L4-7 processing. It's all about
profiling and figuring out where the bottlenecks are.

For past networking products I've built, there was a lot of traffic that
the software didn't have to do much to. Minimal L2/L3 checks, then forward
the packet. It didn't even have to parse the headers because that was
offloaded on an FPGA. The only way to make those packets faster was to turn
them around in the FPGA and not send them to the CPU at all. That change
improved small packet performance by ~30%. That was on high-end network
processors that are significantly faster than Intel processors for packet
handling.

It seems to be a strange thing when you realize that just getting the
packets into the CPU is expensive, nevermind what you do with them after
that.

Does one nanosecond matter?
>

You just have to be careful when talking about things like a nanosecond.
It sounds really small, but the IPG for a 10G link is only 9.6ns. It's all
relative.

AVX512 will fit a 64-byte packet in one register and move that to or from
> memory with one instruction. L1/L2 cache bandwidth per server is growing on
> a double-exponential curve (both bandwidth per core and cores per CPU). I
> wonder if moving data around in cache will soon be too cheap for us to
> justify worrying about.
>

Adding cores helps with aggregate performance, but doesn't really help with
latency on a single packet. That said, I'll take advantage of anything I
can from the hardware to either let me scale up how much traffic I can
handle or the amount of features I can add at the same performance level!

Jay


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Jay Rolette
On Thu, Jan 22, 2015 at 3:06 AM, Luke Gorrie  wrote:

Here is another thought: when is it time to start thinking of packet copy
> as a cheap unit-time operation?
>

Pretty much never short of changes to memory architecture, IMO. Frankly,
there are never enough cycles for deep packet inspection applications that
need to run at/near line-rate. Don't waste any doing something you can
avoid in the first place.

Microseconds matter. Scaling up to 100GbE, nanoseconds matter.

Jay


[dpdk-dev] [PATCH] Fixed spam from kni_allocate_mbufs() when no mbufs are free. If mbufs exhausted, 'out of memory' message logged at EXTREMELY high rates. Now logs no more than once per 10 mins

2014-12-17 Thread Jay Rolette
Signed-off-by: Jay Rolette 
---
 lib/librte_kni/rte_kni.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index fdb7509..f89319c 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -40,6 +40,7 @@
 #include 
 #include 

+#include 
 #include 
 #include 
 #include 
@@ -61,6 +62,9 @@

 #define KNI_MEM_CHECK(cond) do { if (cond) goto kni_fail; } while (0)

+// Configure how often we log "out of memory" messages (in seconds)
+#define KNI_SPAM_SUPPRESSION_PERIOD 60*10
+
 /**
  * KNI context
  */
@@ -592,6 +596,11 @@ kni_free_mbufs(struct rte_kni *kni)
 static void
 kni_allocate_mbufs(struct rte_kni *kni)
 {
+	static uint64_t no_mbufs = 0;
+	static uint64_t spam_filter = 0;
+	static uint64_t delayPeriod = 0;
+	uint64_t now;
+
 	int i, ret;
 	struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];
 
@@ -620,7 +629,19 @@ kni_allocate_mbufs(struct rte_kni *kni)
 		pkts[i] = rte_pktmbuf_alloc(kni->pktmbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			/* Out of memory */
-			RTE_LOG(ERR, KNI, "Out of memory\n");
+			no_mbufs++;
+
+			// Memory leak or need to tune? Regardless, if we get here once,
+			// we will get here a *lot*. Don't spam the logs!
+			now = rte_get_tsc_cycles();
+			if (!delayPeriod)
+				delayPeriod = rte_get_tsc_hz() * KNI_SPAM_SUPPRESSION_PERIOD;
+
+			if (!spam_filter || (now - spam_filter) > delayPeriod) {
+				RTE_LOG(ERR, KNI, "No mbufs available (%llu)\n",
+					(unsigned long long)no_mbufs);
+				spam_filter = now;
+			}
 			break;
 		}
 	}
-- 
1.9.3 (Apple Git-50)



[dpdk-dev] [PATCH] Fixed spam from kni_allocate_mbufs() when no mbufs are free.

2014-12-17 Thread Jay Rolette
self-NAK now that I know that gmail web client borks up the patches. I'll
re-submit.

Jay

On Fri, Dec 12, 2014 at 7:28 PM, Jay Rolette  wrote:
>
> Fixed spam from kni_allocate_mbufs() when no mbufs are free.
> If mbufs exhausted, 'out of memory' message logged at EXTREMELY high
> rates. Now logs no more than once per 10 mins
>
> Signed-off-by: Jay Rolette 
> ---
>  lib/librte_kni/rte_kni.c | 21 -
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
> index fdb7509..f89319c 100644
> --- a/lib/librte_kni/rte_kni.c
> +++ b/lib/librte_kni/rte_kni.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -61,6 +62,9 @@
>
>  #define KNI_MEM_CHECK(cond) do { if (cond) goto kni_fail; } while (0)
>
> +// Configure how often we log "out of memory" messages (in seconds)
> +#define KNI_SPAM_SUPPRESSION_PERIOD 60*10
> +
>  /**
>   * KNI context
>   */
> @@ -592,6 +596,10 @@ kni_free_mbufs(struct rte_kni *kni)
>  static void
>  kni_allocate_mbufs(struct rte_kni *kni)
>  {
> + static uint64_t no_mbufs = 0;
> + static uint64_t spam_filter = 0;
> + static uint64_t delayPeriod = 0;
> +
>   int i, ret;
>   struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];
>
> @@ -620,7 +628,18 @@ kni_allocate_mbufs(struct rte_kni *kni)
>   pkts[i] = rte_pktmbuf_alloc(kni->pktmbuf_pool);
>   if (unlikely(pkts[i] == NULL)) {
>   /* Out of memory */
> - RTE_LOG(ERR, KNI, "Out of memory\n");
> + no_mbufs++;
> +
> + // Memory leak or need to tune? Regardless, if we get here once,
> + // we will get here a *lot*. Don't spam the logs!
> + now = rte_get_tsc_cycles();
> + if (!delayPeriod)
> +delayPeriod = rte_get_tsc_hz() * KNI_SPAM_SUPPRESSION_PERIOD;
> +
> + if (!spam_filter || (now - spam_filter) > delayPeriod) {
> + RTE_LOG(ERR, KNI, "No mbufs available (%llu)\n", (unsigned long
> long)no_mbufs);
> + spam_filter = now;
> + }
>   break;
>   }
>   }
> --
>
>


[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2014-12-17 Thread Jay Rolette
Thanks to some help from Matthew Hall, it looks like I have it working now.
I just resent the patch directly from git. Please let me know if it looks
ok now?

Sorry for the hassles. We use Mercurial internally, so while there is a lot
of overlap, sending patches isn't something I have to worry about
day-to-day. Appreciate the help getting things configured!

Jay

On Wed, Dec 17, 2014 at 7:08 AM, Qiu, Michael  wrote:
>
> On 12/17/2014 7:01 PM, Ananyev, Konstantin wrote:
> > Hi Jay,
> >
> > From: Jay Rolette [mailto:rolette at infiniteio.com]
> > Sent: Tuesday, December 16, 2014 7:21 PM
> > To: Ananyev, Konstantin
> > Cc: Dev
> > Subject: Re: [dpdk-dev] [PATCH] replaced O(n^2) sort in
> sort_by_physaddr() with qsort() from standard library
> >
> > Actually, I just relooked at the email I sent and it looks correct
> (properly indented, etc.). Any suggestions for what might be going on?
> >
> > On Tue, Dec 16, 2014 at 1:18 PM, Jay Rolette  wrote:
> > Thanks Konstantin. Yes, I'll resend. Not sure why gmail is removing
> whitespace when I sent in Plain Text mode.
> >
> > Sorry, don't know either.
> > Wonder, did you see this:
> > https://www.kernel.org/pub/software/scm/git/docs/git-format-patch.html
> > There is a section about different mailers, etc.
> > Konstantin
> >
> > Ultimately I'll need to figure out how to properly configure git to send
> these directly instead of handling them more manually. The examples I saw
> assumed you were using a gmail.com email rather than a
> corporate email hosted via google apps.
> Hi Jay,
>
> I was ever use git send-email with my gmail account, it works.
>
> So do you config your git send-email correctly?
>
> Thanks,
> Michael
>
>
> > Jay
> >
> > On Tue, Dec 16, 2014 at 12:39 PM, Ananyev, Konstantin <
> konstantin.ananyev at intel.com>
> wrote:
> >
> > Hi Jay,
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org<mailto:dev-bounces at dpdk.org>]
> On Behalf Of Jay Rolette
> >> Sent: Thursday, December 11, 2014 4:06 PM
> >> To: Dev
> >> Subject: [dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr()
> with qsort() from standard library
> >>
> >> Signed-off-by: Jay Rolette  rolette at infiniteio.com>>
> > The patch itself looks good to me.
> > Though it seems something wrong with formatting - all lines start with
> offset 0.
> > Probably your mail client?
> > Konstantin
> >
> >
> >> ---
> >>  lib/librte_eal/linuxapp/eal/eal_memory.c | 59
> >> +++-
> >>  1 file changed, 20 insertions(+), 39 deletions(-)
> >>
> >> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
> >> b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >> index bae2507..3656515 100644
> >> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> >> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >> @@ -670,6 +670,25 @@ error:
> >>   return -1;
> >>  }
> >>
> >> +static int
> >> +cmp_physaddr(const void *a, const void *b)
> >> +{
> >> +#ifndef RTE_ARCH_PPC_64
> >> + const struct hugepage_file *p1 = (const struct hugepage_file *)a;
> >> + const struct hugepage_file *p2 = (const struct hugepage_file *)b;
> >> +#else
> >> + // PowerPC needs memory sorted in reverse order from x86
> >> + const struct hugepage_file *p1 = (const struct hugepage_file *)b;
> >> + const struct hugepage_file *p2 = (const struct hugepage_file *)a;
> >> +#endif
> >> + if (p1->physaddr < p2->physaddr)
> >> + return -1;
> >> + else if (p1->physaddr > p2->physaddr)
> >> + return 1;
> >> + else
> >> + return 0;
> >> +}
> >> +
> >>  /*
> >>   * Sort the hugepg_tbl by physical address (lower addresses first on
> x86,
> >>   * higher address first on powerpc). We use a slow algorithm, but we
> won't
> >> @@ -678,45 +697,7 @@ error:
> >>  static int
> >>  sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
> >> *hpi)
> >>  {
> >> - unsigned i, j;
> >> - int compare_idx;
> >> - uint64_t compare_addr;
> >> - struct hugepage_file tmp;
> >> -
> >> - for (i = 0; i < hpi->num_pages[0]; i++) {
> >> - compare_addr = 0;
> >> - compare_idx = -1;
> >> -
> >

[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2014-12-17 Thread Jay Rolette
Signed-off-by: Jay Rolette 
---
 lib/librte_eal/linuxapp/eal/eal_memory.c | 59 ++++++++++++++++++++---------------------------------------
 1 file changed, 20 insertions(+), 39 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index bae2507..3656515 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -670,6 +670,25 @@ error:
return -1;
 }

+static int
+cmp_physaddr(const void *a, const void *b)
+{
+#ifndef RTE_ARCH_PPC_64
+   const struct hugepage_file *p1 = (const struct hugepage_file *)a;
+   const struct hugepage_file *p2 = (const struct hugepage_file *)b;
+#else
+   // PowerPC needs memory sorted in reverse order from x86
+   const struct hugepage_file *p1 = (const struct hugepage_file *)b;
+   const struct hugepage_file *p2 = (const struct hugepage_file *)a;
+#endif
+   if (p1->physaddr < p2->physaddr)
+   return -1;
+   else if (p1->physaddr > p2->physaddr)
+   return 1;
+   else
+   return 0;
+}
+
 /*
  * Sort the hugepg_tbl by physical address (lower addresses first on x86,
  * higher address first on powerpc). We use a slow algorithm, but we won't
@@ -678,45 +697,7 @@ error:
 static int
 sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
 {
-   unsigned i, j;
-   int compare_idx;
-   uint64_t compare_addr;
-   struct hugepage_file tmp;
-
-   for (i = 0; i < hpi->num_pages[0]; i++) {
-   compare_addr = 0;
-   compare_idx = -1;
-
-   /*
-* browse all entries starting at 'i', and find the
-* entry with the smallest addr
-*/
-   for (j=i; j< hpi->num_pages[0]; j++) {
-
-   if (compare_addr == 0 ||
-#ifdef RTE_ARCH_PPC_64
-   hugepg_tbl[j].physaddr > compare_addr) {
-#else
-   hugepg_tbl[j].physaddr < compare_addr) {
-#endif
-   compare_addr = hugepg_tbl[j].physaddr;
-   compare_idx = j;
-   }
-   }
-
-		/* should not happen */
-		if (compare_idx == -1) {
-			RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n",
-				__func__);
-			return -1;
-		}
-
-		/* swap the 2 entries in the table */
-		memcpy(&tmp, &hugepg_tbl[compare_idx],
-			sizeof(struct hugepage_file));
-		memcpy(&hugepg_tbl[compare_idx], &hugepg_tbl[i],
-			sizeof(struct hugepage_file));
-		memcpy(&hugepg_tbl[i], &tmp, sizeof(struct hugepage_file));
-	}
+	qsort(hugepg_tbl, hpi->num_pages[0], sizeof(struct hugepage_file),
+		cmp_physaddr);
 	return 0;
 }

-- 
1.9.3 (Apple Git-50)



[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2014-12-16 Thread Jay Rolette
Actually, I just relooked at the email I sent and it looks correct
(properly indented, etc.). Any suggestions for what might be going on?

On Tue, Dec 16, 2014 at 1:18 PM, Jay Rolette  wrote:
>
> Thanks Konstantin. Yes, I'll resend. Not sure why gmail is removing
> whitespace when I sent in Plain Text mode.
>
> Ultimately I'll need to figure out how to properly configure git to send
> these directly instead of handling them more manually. The examples I saw
> assumed you were using a gmail.com email rather than a corporate email
> hosted via google apps.
>
> Jay
>
> On Tue, Dec 16, 2014 at 12:39 PM, Ananyev, Konstantin <
> konstantin.ananyev at intel.com> wrote:
>>
>>
>> Hi Jay,
>>
>> > -Original Message-
>> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jay Rolette
>> > Sent: Thursday, December 11, 2014 4:06 PM
>> > To: Dev
>> > Subject: [dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr()
>> with qsort() from standard library
>> >
>> > Signed-off-by: Jay Rolette 
>>
>> The patch itself looks good to me.
>> Though it seems something wrong with formatting - all lines start with
>> offset 0.
>> Probably your mail client?
>> Konstantin
>>
>>
>> > ---
>> >  lib/librte_eal/linuxapp/eal/eal_memory.c | 59
>> > +++-
>> >  1 file changed, 20 insertions(+), 39 deletions(-)
>> >
>> > diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> > b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> > index bae2507..3656515 100644
>> > --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> > +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> > @@ -670,6 +670,25 @@ error:
>> >   return -1;
>> >  }
>> >
>> > +static int
>> > +cmp_physaddr(const void *a, const void *b)
>> > +{
>> > +#ifndef RTE_ARCH_PPC_64
>> > + const struct hugepage_file *p1 = (const struct hugepage_file *)a;
>> > + const struct hugepage_file *p2 = (const struct hugepage_file *)b;
>> > +#else
>> > + // PowerPC needs memory sorted in reverse order from x86
>> > + const struct hugepage_file *p1 = (const struct hugepage_file *)b;
>> > + const struct hugepage_file *p2 = (const struct hugepage_file *)a;
>> > +#endif
>> > + if (p1->physaddr < p2->physaddr)
>> > + return -1;
>> > + else if (p1->physaddr > p2->physaddr)
>> > + return 1;
>> > + else
>> > + return 0;
>> > +}
>> > +
>> >  /*
>> >   * Sort the hugepg_tbl by physical address (lower addresses first on
>> x86,
>> >   * higher address first on powerpc). We use a slow algorithm, but we
>> won't
>> > @@ -678,45 +697,7 @@ error:
>> >  static int
>> >  sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
>> > *hpi)
>> >  {
>> > - unsigned i, j;
>> > - int compare_idx;
>> > - uint64_t compare_addr;
>> > - struct hugepage_file tmp;
>> > -
>> > - for (i = 0; i < hpi->num_pages[0]; i++) {
>> > - compare_addr = 0;
>> > - compare_idx = -1;
>> > -
>> > - /*
>> > - * browse all entries starting at 'i', and find the
>> > - * entry with the smallest addr
>> > - */
>> > - for (j=i; j< hpi->num_pages[0]; j++) {
>> > -
>> > - if (compare_addr == 0 ||
>> > -#ifdef RTE_ARCH_PPC_64
>> > - hugepg_tbl[j].physaddr > compare_addr) {
>> > -#else
>> > - hugepg_tbl[j].physaddr < compare_addr) {
>> > -#endif
>> > - compare_addr = hugepg_tbl[j].physaddr;
>> > - compare_idx = j;
>> > - }
>> > - }
>> > -
>> > - /* should not happen */
>> > - if (compare_idx == -1) {
>> > - RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", __func__);
>> > - return -1;
>> > - }
>> > -
>> > - /* swap the 2 entries in the table */
>> > - memcpy(&tmp, &hugepg_tbl[compare_idx],
>> > - sizeof(struct hugepage_file));
>> > - memcpy(&hugepg_tbl[compare_idx], &hugepg_tbl[i],
>> > - sizeof(struct hugepage_file));
>> > - memcpy(&hugepg_tbl[i], &tmp, sizeof(struct hugepage_file));
>> > - }
>> > + qsort(hugepg_tbl, hpi->num_pages[0], sizeof(struct hugepage_file),
>> > cmp_physaddr);
>> >   return 0;
>> >  }
>> >
>> > --
>>
>


[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2014-12-16 Thread Jay Rolette
Thanks Konstantin. Yes, I'll resend. Not sure why gmail is removing
whitespace when I sent in Plain Text mode.

Ultimately I'll need to figure out how to properly configure git to send
these directly instead of handling them more manually. The examples I saw
assumed you were using a gmail.com email rather than a corporate email
hosted via google apps.

Jay

On Tue, Dec 16, 2014 at 12:39 PM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:
>
>
> Hi Jay,
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jay Rolette
> > Sent: Thursday, December 11, 2014 4:06 PM
> > To: Dev
> > Subject: [dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr()
> with qsort() from standard library
> >
> > Signed-off-by: Jay Rolette 
>
> The patch itself looks good to me.
> Though it seems something wrong with formatting - all lines start with
> offset 0.
> Probably your mail client?
> Konstantin
>
>
> > ---
> >  lib/librte_eal/linuxapp/eal/eal_memory.c | 59
> > +++-
> >  1 file changed, 20 insertions(+), 39 deletions(-)
> >
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
> > b/lib/librte_eal/linuxapp/eal/eal_memory.c
> > index bae2507..3656515 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> > @@ -670,6 +670,25 @@ error:
> >   return -1;
> >  }
> >
> > +static int
> > +cmp_physaddr(const void *a, const void *b)
> > +{
> > +#ifndef RTE_ARCH_PPC_64
> > + const struct hugepage_file *p1 = (const struct hugepage_file *)a;
> > + const struct hugepage_file *p2 = (const struct hugepage_file *)b;
> > +#else
> > + // PowerPC needs memory sorted in reverse order from x86
> > + const struct hugepage_file *p1 = (const struct hugepage_file *)b;
> > + const struct hugepage_file *p2 = (const struct hugepage_file *)a;
> > +#endif
> > + if (p1->physaddr < p2->physaddr)
> > + return -1;
> > + else if (p1->physaddr > p2->physaddr)
> > + return 1;
> > + else
> > + return 0;
> > +}
> > +
> >  /*
> >   * Sort the hugepg_tbl by physical address (lower addresses first on
> x86,
> >   * higher address first on powerpc). We use a slow algorithm, but we
> won't
> > @@ -678,45 +697,7 @@ error:
> >  static int
> >  sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
> > *hpi)
> >  {
> > - unsigned i, j;
> > - int compare_idx;
> > - uint64_t compare_addr;
> > - struct hugepage_file tmp;
> > -
> > - for (i = 0; i < hpi->num_pages[0]; i++) {
> > - compare_addr = 0;
> > - compare_idx = -1;
> > -
> > - /*
> > - * browse all entries starting at 'i', and find the
> > - * entry with the smallest addr
> > - */
> > - for (j=i; j< hpi->num_pages[0]; j++) {
> > -
> > - if (compare_addr == 0 ||
> > -#ifdef RTE_ARCH_PPC_64
> > - hugepg_tbl[j].physaddr > compare_addr) {
> > -#else
> > - hugepg_tbl[j].physaddr < compare_addr) {
> > -#endif
> > - compare_addr = hugepg_tbl[j].physaddr;
> > - compare_idx = j;
> > - }
> > - }
> > -
> > - /* should not happen */
> > - if (compare_idx == -1) {
> > - RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", __func__);
> > - return -1;
> > - }
> > -
> > - /* swap the 2 entries in the table */
> > - memcpy(&tmp, &hugepg_tbl[compare_idx],
> > - sizeof(struct hugepage_file));
> > - memcpy(&hugepg_tbl[compare_idx], &hugepg_tbl[i],
> > - sizeof(struct hugepage_file));
> > - memcpy(&hugepg_tbl[i], &tmp, sizeof(struct hugepage_file));
> > - }
> > + qsort(hugepg_tbl, hpi->num_pages[0], sizeof(struct hugepage_file),
> > cmp_physaddr);
> >   return 0;
> >  }
> >
> > --
>


[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2014-12-15 Thread Jay Rolette
FWIW, it surprised the heck out of me as well.

Turns out that even though I'm compiling in 64-bit mode, gcc has
sizeof(int) as 4 bytes rather than 8. Not really sure what the scoop is on
that, but the standard leaves that up to the compiler. I'm used to int
always being the "natural size" on the CPU/OS, so for a 64-bit executable
on Intel, I assumed int would be 64-bit.

Being an embedded developer for so many years, I rarely use semi-defined
data types just to avoid these sorts of problems.

On Mon, Dec 15, 2014 at 7:20 AM, Wodkowski, PawelX <
pawelx.wodkowski at intel.com> wrote:
>
> > Because it doesn't work correctly :-)
>
> It should, what I am missing here? :P
>
> Pawel
>
>


[dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr() with qsort() from standard library

2014-12-15 Thread Jay Rolette
Because it doesn't work correctly :-)

On Mon, Dec 15, 2014 at 3:05 AM, Wodkowski, PawelX <
pawelx.wodkowski at intel.com> wrote:
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jay Rolette
> > Sent: Thursday, December 11, 2014 5:06 PM
> > To: Dev
> > Subject: [dpdk-dev] [PATCH] replaced O(n^2) sort in sort_by_physaddr()
> with
> > qsort() from standard library
> >
> > Signed-off-by: Jay Rolette 
> > ---
> >  lib/librte_eal/linuxapp/eal/eal_memory.c | 59
> > +++-
> >  1 file changed, 20 insertions(+), 39 deletions(-)
> >
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
> > b/lib/librte_eal/linuxapp/eal/eal_memory.c
> > index bae2507..3656515 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> > @@ -670,6 +670,25 @@ error:
> >   return -1;
> >  }
> >
> > +static int
> > +cmp_physaddr(const void *a, const void *b)
> > +{
> > +#ifndef RTE_ARCH_PPC_64
> > + const struct hugepage_file *p1 = (const struct hugepage_file *)a;
> > + const struct hugepage_file *p2 = (const struct hugepage_file *)b;
> > +#else
> > + // PowerPC needs memory sorted in reverse order from x86
> > + const struct hugepage_file *p1 = (const struct hugepage_file *)b;
> > + const struct hugepage_file *p2 = (const struct hugepage_file *)a;
> > +#endif
> > + if (p1->physaddr < p2->physaddr)
> > + return -1;
> > + else if (p1->physaddr > p2->physaddr)
> > + return 1;
> > + else
> > + return 0;
> > +}
> > +
>
> Why not simply
>
> return (int)(p1->physaddr - p2->physaddr);
>
>


[dpdk-dev] [PATCH] Fixed spam from kni_allocate_mbufs() when no mbufs are free.

2014-12-12 Thread Jay Rolette
Fixed spam from kni_allocate_mbufs() when no mbufs are free.
If mbufs exhausted, 'out of memory' message logged at EXTREMELY high rates.
Now logs no more than once per 10 mins

Signed-off-by: Jay Rolette 
---
 lib/librte_kni/rte_kni.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index fdb7509..f89319c 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -40,6 +40,7 @@
 #include 
 #include 

+#include <rte_cycles.h>
 #include 
 #include 
 #include 
@@ -61,6 +62,9 @@

 #define KNI_MEM_CHECK(cond) do { if (cond) goto kni_fail; } while (0)

+// Configure how often we log "out of memory" messages (in seconds)
+#define KNI_SPAM_SUPPRESSION_PERIOD 60*10
+
 /**
  * KNI context
  */
@@ -592,6 +596,10 @@ kni_free_mbufs(struct rte_kni *kni)
 static void
 kni_allocate_mbufs(struct rte_kni *kni)
 {
+ static uint64_t no_mbufs = 0;
+ static uint64_t spam_filter = 0;
+ static uint64_t delayPeriod = 0;
+
  int i, ret;
  struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];

@@ -620,7 +628,18 @@ kni_allocate_mbufs(struct rte_kni *kni)
  pkts[i] = rte_pktmbuf_alloc(kni->pktmbuf_pool);
  if (unlikely(pkts[i] == NULL)) {
  /* Out of memory */
- RTE_LOG(ERR, KNI, "Out of memory\n");
+ no_mbufs++;
+
+ // Memory leak or need to tune? Regardless, if we get here once,
+ // we will get here a *lot*. Don't spam the logs!
+ uint64_t now = rte_get_tsc_cycles();
+ if (!delayPeriod)
+delayPeriod = rte_get_tsc_hz() * KNI_SPAM_SUPPRESSION_PERIOD;
+
+ if (!spam_filter || (now - spam_filter) > delayPeriod) {
+ RTE_LOG(ERR, KNI, "No mbufs available (%llu)\n", (unsigned long long)no_mbufs);
+ spam_filter = now;
+ }
  break;
  }
  }
--


[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages

2014-12-10 Thread Jay Rolette
On Wed, Dec 10, 2014 at 12:39 PM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:

> > I just got through replacing that entire function in my repo with a call
> to qsort() from the standard library last night myself. Faster
> > (although probably not material to most deployments) and less code.
>
> If you feel like it is worth it, why not to submit a patch? :)


On Haswell and IvyBridge Xeons, with 128 1G huge pages, it doesn't make a
user-noticeable difference in the time required for
rte_eal_hugepage_init(). The reason I went ahead and checked it in my repo
is because:

a) it eats at my soul to see an O(n^2) case for something where qsort() is
trivial to use
b) we will increase that up to ~232 1G huge pages soon. Likely doesn't
matter at that point either, but since it was already written...

What *does* chew up a lot of time in init is where the huge pages are being
explicitly zeroed in map_all_hugepages().

Removing that memset() makes find_numasocket() blow up, but I was able to
do a quick test where I only memset 1 byte on each page. That cut init time
by 30% (~20 seconds in my test).  Significant, but since I'm not entirely
sure it is safe, I'm not making that change right now.

On Linux, shared memory that isn't file-backed is automatically zeroed
before the app gets it. However, I haven't had a chance to chase down
whether that applies to huge pages or not, much less how hugetlbfs factors
into the equation.
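The "memset 1 byte on each page" experiment can be sketched as follows. This is a hypothetical illustration, not the DPDK implementation: `addr`, `len`, and `page_sz` stand in for the mapping that `map_all_hugepages()` would otherwise zero in full, and whether a single write is sufficient to keep `find_numasocket()` happy is exactly the open question above.

```c
#include <stddef.h>

/* Instead of zeroing the whole mapping, write one byte per page.
 * Each write faults the page in (backing it with physical memory)
 * without touching the remaining bytes of the page. */
static void touch_pages(volatile char *addr, size_t len, size_t page_sz)
{
    size_t off;

    for (off = 0; off < len; off += page_sz)
        addr[off] = 0;
}
```

With 1 GiB hugepages this turns a gigabyte of writes per page into a single write per page, which is consistent with the ~30% init-time reduction reported above.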

Back to the question about the patch, if you guys are interested in it,
I'll have to figure out your patch submission process. Shouldn't be a huge
deal other than the fact that we are on DPDK 1.6 (r2).

Cheers,
Jay


[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages

2014-12-10 Thread Jay Rolette
On Wed, Dec 10, 2014 at 5:08 AM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:

> I wonder why we do need to write our own bubble sort procedure?
> Why we can't use standard qsort() here?
>

Sadly, even bubble sort would be better than the selection sort being used
here. It's guaranteed to be O(n^2) in all cases.

I just got through replacing that entire function in my repo with a call to
qsort() from the standard library last night myself. Faster (although
probably not material to most deployments) and less code.

Jay


[dpdk-dev] failure with gmail accounts

2014-11-03 Thread Jay Rolette
What was the period of the issue? Also does this apply to just gmail.com
email addresses or also to other domains hosted by Google?

On Mon, Nov 3, 2014 at 3:25 AM, Thomas Monjalon 
wrote:

> Hi all,
>
> There was a major failure with gmail accounts.
> Due to bounce errors, the mailing list system has removed these accounts.
> They are now restored but some preferences may be lost.
> Obviously, these users may have missed some emails.
> Please check the archives:
> http://dpdk.org/ml/archives/dev/2014-November/date.html
>
> --
> Thomas
>


[dpdk-dev] Fwd: [dpdk-announce] DPDK Features for Q1 2015

2014-10-23 Thread Jay Rolette
Tim,

Thanks for sharing this. If nothing else, I wanted to at least provide some
feedback on the parts that look useful to me for my applications/product.
Bits that make me interested in the release:



*> 2.0 (Q1 2015) DPDK Features:> Bifurcated Driver: With the Bifurcated
Driver, the kernel will retain direct control of the NIC, and will assign
specific queue pairs to DPDK. Configuration of the NIC is controlled by the
kernel via ethtool.*

Having NIC configuration, port stats, etc. available via the normal Linux
tools is very helpful - particularly on new products just getting started
with DPDK.


*> Packet Reordering: Assign a sequence number to packets on Rx, and then
provide the ability to reorder on Tx to preserve the original order.*

This could be extremely useful but it depends on where it goes. The current
design being discussed seems fundamentally flawed to me. See the thread on
the RFC for details.


*> Packet Distributor (phase 2): Implement the following enhancements to
the Packet Distributor that was originally delivered in the DPDK 1.7
release: performance improvements; the ability for packets from a flow to
be processed by multiple worker cores in parallel and then reordered on Tx
using the Packet Reordering feature; the ability to have multiple
Distributors which share Worker cores.*

TBD on this for me. The 1.0 version of our product is based on DPDK 1.6 and
I haven't had a chance to look at what is happening with Packet Distributor
yet. An area of potential interest at least.


*> Cuckoo Hash: A new hash algorithm was implemented as part of the Cuckoo
Switch project (see http://www.cs.cmu.edu/~dongz/papers/cuckooswitch.pdf
), and shows some
promising performance results. This needs to be modified to make it more
generic, and then incorporated into DPDK.*

More performance == creamy goodness, especially if it is in the plumbing
and doesn't require significant app changes.


*> Interrupt mode for PMD: Allow DPDK process to transition to interrupt
mode when load is low so that other processes can run, or else power can be
saved. This will increase latency/jitter.*

Yes! I don't care about power savings, but I do care about giving a good
product impression in the lab during evals without having to sacrifice
overall system performance when under load. Hybrid drivers that use
interrupts when load is low and poll-mode when loaded are ideal, IMO.

It seems an odd thing, but during lab testing, it is normal for customers
to fire the box up and just start running pings or some other low volume
traffic through the box. If the PMDs are configured to batch in sizes
optimal for best performance under load, the system can look *really* bad
in these initial tests. We go through a fair bit of gymnastics right now to
work around this without just giving up on batching in the PMDs.


*> DPDK Headroom: Provide a mechanism to indicate how much headroom (spare
capacity) exists in a DPDK process.*

Very helpful in the field. Anything that helps customers understand how
much headroom is left on their box before they need to take action is a
huge win. CPU utilization is a bad indicator, especially with a PMD
architecture.

Hope this type of feedback is helpful.

Regards,
Jay


[dpdk-dev] [PATCH RFC] librte_reorder: new reorder library

2014-10-17 Thread Jay Rolette
Thanks for the responses, Reshma.

Can you provide a little more context about the use case that your reorder
library is intended to help with? If I'm understanding your answers
correctly, the functionality seems pretty limited and not something I would
ever end up using, but that may be more about the types of products I build
(deep packet inspection and working at L4-L7 generally, even though running
at or near line-rate).

Please take my comments in the spirit intended... If the design makes sense
for different use cases and I'm not the target audience, that's perfectly
ok and there are probably different trade-offs being made. But if it is
intended to be useful for DPI applications, I'd hate to just be quiet and
end up with something that doesn't get used as much as it might.

I haven't looked at the distributor library, so entirely possible it makes
more sense in that context.

More detailed responses to your previous answers inline.

Regards,
Jay

On Fri, Oct 17, 2014 at 4:44 AM, Pattan, Reshma 
wrote:

>  Hi Jay,
>
>
>
> Please find comments inline.
>
>
>
> Thanks,
>
> Reshma
>
>
>
> *From:* Jay Rolette [mailto:rolette at infiniteio.com]
> *Sent:* Thursday, October 9, 2014 8:02 PM
> *To:* Pattan, Reshma
> *Cc:* dev at dpdk.org
> *Subject:* Re: [dpdk-dev] [PATCH RFC] librte_reorder: new reorder library
>
>
>
> Hi Reshma,
>
>
>
> A few comments and questions about your design...
>
>
>
> 1) How do you envision the reorder library to be used? Based on the
> description, it seems like the expectation is that packet order would be
> maintained at either the interface/port level or maybe at the RX queue
> level. Is that right or am I reading too much between the lines?
>
>
>
> For my purposes (and for network security products I've developed in the
> past), I mostly don't care about box or port-level order. The most
> important thing is to maintain packet order within a flow. Relative order
> from packets in different flows doesn't matter. If there is a way I can
> process packets in parallel and transmit out-of-order transmission *within
> the flow*, that's very useful. Architecturally, it helps avoid hot-spotting
> in my packet processing pipeline and wasting cycles when load-balancing
> isn't perfect (and it never is).
>
> [Reshma]: Generic parallel processing of packets is planned in phase2
> version of distributor based on sequence numbers, but not flow based
>  parallel processing.
>

See question at the top of my email about the intended use-case. For DPI
applications, global (box-wide or per port) reordering isn't normally
required. Maintaining order within flows is the important part. Depending
on your implementation and the guarantees you make, the impact it has on
aggregate system throughput can be significant.


>   2) If the reorder library is "flow aware", then give me flexibility on
> deciding what a flow is. Let me define pseudo-flows even if the protocol
> itself isn't connection oriented (ie., frequently useful to treat UDP
> 5-tuples as a flow). I may want to include tunnels/VLANs/etc. as part of my
> "flow" definition. I may need to include the physical port as part of the
> flow definition.
>
>
>
> Ideally, the library includes the common cases and gives me the option to
> register a callback function for doing whatever sort of "flows" I require
> for my app.
>
> [Reshma]:It is not flow aware. But to reorder packets of particular flow,
> you can handover particular flow to the library and library will give you
> back the reordered data.
>

I think given how a couple of other bits are described, this doesn't end up
helping. More a bit further down.


>   3) Is there a way to apply the reorder library to some packets and not
> others? I might want to use for TCP and UDP, but not care about order for
> other IP traffic (for example).
>
> [Reshma]:No, reorder library will not have intelligence about  traffic
> type (i.e. flow or protocols based).
>
> Applications can do  traffic  splitting into flows or  protocol based and
>  handover to library for reordering
>

Ditto

  4) How are you dealing with internal congestion? If I drop a packet
> somewhere in my processing pipeline, how does the TX side of the reorder
> queue/buffer deal with the missing sequence number? Is there some sort of
> timeout mechanism so that it will only wait for X microseconds for a
> missing sequence number?
>
> [Reshma]: Library just takes care of packets what it has  got. No waiting
> mechanism is used for missing packets.
>
> Reorder processing will skip the dropped packets(i.e. will create a gap in
> reorder buffer) and proceed with allocation of slot to the later packets
> which are available.
>
>

[dpdk-dev] [PATCH RFC] librte_reorder: new reorder library

2014-10-09 Thread Jay Rolette
Hi Reshma,

A few comments and questions about your design...

1) How do you envision the reorder library to be used? Based on the
description, it seems like the expectation is that packet order would be
maintained at either the interface/port level or maybe at the RX queue
level. Is that right or am I reading too much between the lines?

For my purposes (and for network security products I've developed in the
past), I mostly don't care about box or port-level order. The most
important thing is to maintain packet order within a flow. Relative order
from packets in different flows doesn't matter. If there is a way I can
process packets in parallel and transmit out-of-order transmission *within
the flow*, that's very useful. Architecturally, it helps avoid hot-spotting
in my packet processing pipeline and wasting cycles when load-balancing
isn't perfect (and it never is).

2) If the reorder library is "flow aware", then give me flexibility on
deciding what a flow is. Let me define pseudo-flows even if the protocol
itself isn't connection oriented (ie., frequently useful to treat UDP
5-tuples as a flow). I may want to include tunnels/VLANs/etc. as part of my
"flow" definition. I may need to include the physical port as part of the
flow definition.

Ideally, the library includes the common cases and gives me the option to
register a callback function for doing whatever sort of "flows" I require
for my app.

3) Is there a way to apply the reorder library to some packets and not
others? I might want to use for TCP and UDP, but not care about order for
other IP traffic (for example).

4) How are you dealing with internal congestion? If I drop a packet
somewhere in my processing pipeline, how does the TX side of the reorder
queue/buffer deal with the missing sequence number? Is there some sort of
timeout mechanism so that it will only wait for X microseconds for a
missing sequence number?

Need the ability to bound how long packets are held up in the reorder
engine before they are released.

Assuming you address this, the reorder engine will also need to deal with
slow packets that show up after "later" packets were transmitted.

Regards,
Jay


On Tue, Oct 7, 2014 at 4:33 AM, Pattan, Reshma 
wrote:

> Hi All,
>
> I am planning  to implement packet reorder library. Details are as below,
> please go through them and provide the comments.
>
> Requirement:
>To reorder out of ordered packets that are received from
> different cores.
>
> Usage:
> To be used along with distributor library. Next version of distributor are
> planned to distribute incoming packets to all worker cores irrespective of
> the flow type.
> In this case to ensure in order delivery of the packets at output side
> reorder library will used by the tx end.
>
> Assumption:
> All input packets will be marked with sequence number in seqn field of
> mbuf, this will be the reference for reordering at the tx end.
> Sequence number will be of type uint32_t. New sequence number field seqn
> will be added to mbuf structure.
>
> Design:
> a)There will be reorder buffer(circular buffer) structure maintained in
> reorder library to store reordered packets and other details of buffer like
> head to drain the packet from, min sequence number and other details.
>b)Library will provide insert and drain functions to
> reorder and fetch out the reordered packets respectively.
> c)Users of library should pass the packets to insert functions for
> reordering.
>
> Insertion logic:
> Sequence number of current packet will be used to calculate offset in
> reorder buffer and write packet to the location  in the reorder buffer
> corresponding to offset.
>  Offset is calculated as difference of current
> packet  sequence number and sequence number associated with  reorder buffer.
>
> During sequence number wrapping or wrapping over of reorder buffer size,
> before inserting the new packet we should move offset number of packets to
> other buffer called overflow buffer and advance the head of reorder buffer
> by "offset-reorder buffer size" and insert the new packet.
>
> Insert function:
> int rte_reorder_insert(struct rte_reorder_buffer *buffer, struct rte_mbuf
> *mbuf);
> Note: Other insert function is also under plan to insert burst of packets.
>
>Reorder buffer:
> struct rte_reorder_buffer {
> unsigned int size;  /* The size (number of entries) of the
> buffer. */
> unsigned int mask;  /* Mask (size - 1) of the buffer */
> unsigned int head;  /* Current head of buffer */
> uint32_t min_seqn;  /* latest sequence number associated with
> buffer */
> struct rte_mbuf *entries[MAX_REORDER_BUFFER_SIZE]; /* buffer to
> hold reordered mbufs */
> };
>
> d)Users can fetch out the reordered packets by drain function provided by
> library. Users must pass the mbuf array , drain function should fill
> passed mbuff array  with the reordered buffer packets.
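The insertion logic described in the proposal can be sketched as below. This is a simplified illustration of the offset calculation only (the overflow-buffer handling is omitted, and `struct pkt` stands in for `struct rte_mbuf`); it is not the eventual librte_reorder implementation.

```c
#include <stdint.h>

#define MAX_REORDER_BUFFER_SIZE 1024

struct pkt;  /* stand-in for struct rte_mbuf */

/* Mirrors the proposed struct rte_reorder_buffer. */
struct reorder_buffer {
    unsigned int size;   /* number of entries (power of two) */
    unsigned int mask;   /* size - 1 */
    unsigned int head;   /* index the next drain starts from */
    uint32_t min_seqn;   /* sequence number expected at head */
    struct pkt *entries[MAX_REORDER_BUFFER_SIZE];
};

/* Offset is the difference between the packet's sequence number and
 * the buffer's min_seqn; unsigned subtraction makes this wrap-safe.
 * Returns 0 on success, or -1 when the packet lands beyond the buffer
 * and the caller must first move 'offset - size' entries to the
 * overflow buffer and advance the head, per the proposal. */
static int reorder_insert(struct reorder_buffer *b, struct pkt *p,
                          uint32_t seqn)
{
    uint32_t offset = seqn - b->min_seqn;

    if (offset >= b->size)
        return -1;

    b->entries[(b->head + offset) & b->mask] = p;
    return 0;
}
```

A dropped sequence number simply leaves its slot NULL, which matches Reshma's description of gaps being skipped rather than waited on.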

[dpdk-dev] KNI and memzones

2014-09-23 Thread Jay Rolette
Yep, good way to describe it. Not really related to network security
functions but very similar architecture.

On Tue, Sep 23, 2014 at 2:12 PM, Zhou, Danny  wrote:

>  It looks like a typical network middle box usage with IDS/IPS/DPI sort
> of functionalities.  Good enough performance rather than line-rate
> performance should be ok for this case, and multi-threaded KNI(multiple
> software rx/tx queues are established between DPDK and a single vEth netdev
> with multiple kernel threads affinities to several lcores) should fit, with
> linear performance scaling if you can allocate multiple lcores to achieve
> satisfied throughput for relatively big packets.
>
>
>
> Since NIC control is still in DPDK? PMD for this case, bifurcated driver
> does not fit, unless you only use DPDK to rx/tx packets in your box.
>
>
>
> *From:* Jay Rolette [mailto:rolette at infiniteio.com]
> *Sent:* Wednesday, September 24, 2014 2:53 AM
> *To:* Zhou, Danny
> *Cc:* Marc Sune; ; dev-team at bisdn.de
>
> *Subject:* Re: [dpdk-dev] KNI and memzones
>
>
>
> I can't discuss product details openly yet, but I'm happy to have a
> detailed discussion under NDA with Intel. In fact, we had an early NDA
> discussion with Intel about it a few months ago.
>
>
>
> That said, the use case isn't tied so closely to my product that I can't
> describe it in general terms...
>
>
>
> Imagine a box that installs in your network as a transparent
> bump-in-the-wire. Traffic comes in port 1 and is processed by our
> DPDK-based engine, then the packets are forwarded out port 2, where they
> head to their original destination. From a network topology point of view,
> the box is mostly invisible.
>
>
>
> Same process applies for traffic going the other way (RX on port 2,
> special-sauce processing in DPDK app, TX on port 1).
>
>
>
> If you are familiar with network security products, this is very much how
> IPS devices work.
>
>
>
> Where KNI comes into play is for several user-space apps that need to use
> the normal network stack (sockets) to communicate over the _same_ ports
> used on the main data path. We use KNI to create a virtual port with an IP
> address overlaid on the "invisible" data path ports.
>
>
>
> This isn't just for control traffic. It's obviously not line-rate
> processing, but we need to get all the bandwidth we can out of it.
>
>
>
> Let me know if that makes sense or if I need to clarify some things. If
> you'd rather continue this as an NDA discussion, just shoot me an email
> directly.
>
>
>
> Regards,
>
> Jay
>
>
>
>
>
>
>
> On Tue, Sep 23, 2014 at 11:38 AM, Zhou, Danny 
> wrote:
>
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jay Rolette
> > Sent: Tuesday, September 23, 2014 8:39 PM
> > To: Marc Sune
> > Cc: ; dev-team at bisdn.de
> > Subject: Re: [dpdk-dev] KNI and memzones
> >
> > *> p.s. Lately someone involved with DPDK said KNI would be deprecated in
> > future DPDK releases; I haven't read or listen to this before, is this
> > true? What would be the natural replacement then?*
> >
> > KNI is a non-trivial part of the product I'm in the process of building.
> > I'd appreciate someone "in the know" addressing this one please. Are
> there
> > specific roadmap plans relative to KNI that we need to be aware of?
> >
>
> KNI and multi-threaded KNI has several limitation:
> 1) Flow classification and packet distribution are done both software,
> specifically KNI user space library, at the cost of CPU cycles.
> 2) Low performance, skb creation/free and packetscopy between skb and mbuf
> kills performance significantly.
> 3) Dedicate cores in user space and kernel space responsible for rx/tx
> packets between DPDK App and KNI device, it seems to me waste too many core
> resources.
> 4) GPL license jail as KNI sits in kernel.
>
> We actually have a bifurcated driver prototype that meets both high
> performance and upstreamable requirement, which is treated as alternative
> solution of KNI. The idea is to
> leverage NIC' flow director capability to bifurcate data plane packets to
> DPDK and keep control plane packets or whatever packets need to go through
> kernel' TCP/IP stack remains
> being processed in kernel(NIC driver + stack). Basically, kernel NIC
> driver and DPDK co-exists to driver a same NIC device, but manipulate
> different rx/tx queue pairs. Though there is some
> tough consistent NIC control issue which needs to be resolved and
> upstreamed to kernel, which I do not want to expose details at the moment.
>
> IMHO, KNI should NOT be removed 

[dpdk-dev] KNI and memzones

2014-09-23 Thread Jay Rolette
I can't discuss product details openly yet, but I'm happy to have a
detailed discussion under NDA with Intel. In fact, we had an early NDA
discussion with Intel about it a few months ago.

That said, the use case isn't tied so closely to my product that I can't
describe it in general terms...

Imagine a box that installs in your network as a transparent
bump-in-the-wire. Traffic comes in port 1 and is processed by our
DPDK-based engine, then the packets are forwarded out port 2, where they
head to their original destination. From a network topology point of view,
the box is mostly invisible.

Same process applies for traffic going the other way (RX on port 2,
special-sauce processing in DPDK app, TX on port 1).

If you are familiar with network security products, this is very much how
IPS devices work.

Where KNI comes into play is for several user-space apps that need to use
the normal network stack (sockets) to communicate over the _same_ ports
used on the main data path. We use KNI to create a virtual port with an IP
address overlaid on the "invisible" data path ports.

This isn't just for control traffic. It's obviously not line-rate
processing, but we need to get all the bandwidth we can out of it.

Let me know if that makes sense or if I need to clarify some things. If
you'd rather continue this as an NDA discussion, just shoot me an email
directly.

Regards,
Jay



On Tue, Sep 23, 2014 at 11:38 AM, Zhou, Danny  wrote:

>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jay Rolette
> > Sent: Tuesday, September 23, 2014 8:39 PM
> > To: Marc Sune
> > Cc: ; dev-team at bisdn.de
> > Subject: Re: [dpdk-dev] KNI and memzones
> >
> > *> p.s. Lately someone involved with DPDK said KNI would be deprecated in
> > future DPDK releases; I haven't read or listen to this before, is this
> > true? What would be the natural replacement then?*
> >
> > KNI is a non-trivial part of the product I'm in the process of building.
> > I'd appreciate someone "in the know" addressing this one please. Are
> there
> > specific roadmap plans relative to KNI that we need to be aware of?
> >
>
> KNI and multi-threaded KNI has several limitation:
> 1) Flow classification and packet distribution are done both software,
> specifically KNI user space library, at the cost of CPU cycles.
> 2) Low performance, skb creation/free and packetscopy between skb and mbuf
> kills performance significantly.
> 3) Dedicate cores in user space and kernel space responsible for rx/tx
> packets between DPDK App and KNI device, it seems to me waste too many core
> resources.
> 4) GPL license jail as KNI sits in kernel.
>
> We actually have a bifurcated driver prototype that meets both high
> performance and upstreamable requirement, which is treated as alternative
> solution of KNI. The idea is to
> leverage NIC' flow director capability to bifurcate data plane packets to
> DPDK and keep control plane packets or whatever packets need to go through
> kernel' TCP/IP stack remains
> being processed in kernel(NIC driver + stack). Basically, kernel NIC
> driver and DPDK co-exists to driver a same NIC device, but manipulate
> different rx/tx queue pairs. Though there is some
> tough consistent NIC control issue which needs to be resolved and
> upstreamed to kernel, which I do not want to expose details at the moment.
>
> IMHO, KNI should NOT be removed unless there is a really good user space,
> open-source and socket backward-compatible() TCP/IP stack which should not
> become true very soon.
> The bifurcated driver approach could certainly replace KNI for some use
> cases where DPDK does not own the NIC control.
>
> Do you mind share your KNI use case in more details to help determine
> whether bifurcate driver could help with?
>
> > Regards,
> > Jay
> >
> > On Tue, Sep 23, 2014 at 4:27 AM, Marc Sune  wrote:
> >
> > > Hi all,
> > >
> > > So we are having some problems with KNI. In short, we have a DPDK
> > > application that creates KNI interfaces and destroys them during its
> > > lifecycle and connecting them to DOCKER containers. Interfaces may
> > > eventually be even named the same (see below).
> > >
> > > We were wondering why even calling rte_kni_release() the hugepages
> memory
> > > was rapidly being exhausted, and we also realised even after
> destruction,
> > > you cannot use the same name for the interface.
> > >
> > > After close inspection of the rte_kni lib we think the core issue and
> is
> > > mostly a design issue. rte_kni_alloc ends up calling
> kni_memzone_reserve()
> > > that calls at the end rte_memzone_reserve() whic

[dpdk-dev] KNI and memzones

2014-09-23 Thread Jay Rolette
*> p.s. Lately someone involved with DPDK said KNI would be deprecated in
future DPDK releases; I haven't read or listen to this before, is this
true? What would be the natural replacement then?*

KNI is a non-trivial part of the product I'm in the process of building.
I'd appreciate someone "in the know" addressing this one please. Are there
specific roadmap plans relative to KNI that we need to be aware of?

Regards,
Jay

On Tue, Sep 23, 2014 at 4:27 AM, Marc Sune  wrote:

> Hi all,
>
> So we are having some problems with KNI. In short, we have a DPDK
> application that creates KNI interfaces and destroys them during its
> lifecycle and connecting them to DOCKER containers. Interfaces may
> eventually be even named the same (see below).
>
> We were wondering why even calling rte_kni_release() the hugepages memory
> was rapidly being exhausted, and we also realised even after destruction,
> you cannot use the same name for the interface.
>
> After close inspection of the rte_kni lib we think the core issue and is
> mostly a design issue. rte_kni_alloc ends up calling kni_memzone_reserve()
> that calls at the end rte_memzone_reserve() which cannot be unreserved by
> rte_kni_relese() (by design of memzones). The exhaustion is rapid due to
> the number of FIFOs created (6).
>
> If this would be right, we would propose and try to provide a patch as
> follows:
>
> * Create a new rte_kni_init(unsigned int max_knis);
>
> This would preallocate all the FIFO rings(TX, RX, ALLOC, FREE, Request
> and  Response)*max_knis by calling kni_memzone_reserve(), and store them in
> a kni_fifo_pool. This should only be called once by DPDK applications at
> bootstrapping time.
>
> * rte_kni_allocate would just use one of the kni_fifo_pool (one => meaning
> a a set of 6 FIFOs making a single slot)
> * rte_kni_release would return to the pool.
>
> This should solve both issues. We would base the patch on 1.7.2.
>
> Thoughts?
> marc
>
> p.s. Lately someone involved with DPDK said KNI would be deprecated in
> future DPDK releases; I haven't read or heard of this before, is this
> true? What would be the natural replacement then?
>


[dpdk-dev] [PATCH v2 14/17] e1000: clean log messages

2014-09-02 Thread Jay Rolette
cv msg from VF %d", vf);
> return retval;
> }
>
> @@ -432,7 +432,7 @@ igb_rcv_msg_from_vf(struct rte_eth_dev *dev, uint16_t
> vf)
> retval = igb_vf_set_vlan(dev, vf, msgbuf);
> break;
> default:
> -   PMD_INIT_LOG(DEBUG, "Unhandled Msg %8.8x\n",
> +   PMD_INIT_LOG(DEBUG, "Unhandled Msg %8.8x",
>  (unsigned) msgbuf[0]);
> retval = E1000_ERR_MBX;
> break;
> diff --git a/lib/librte_pmd_e1000/igb_rxtx.c
> b/lib/librte_pmd_e1000/igb_rxtx.c
> index 3aa9609..5ca06c9 100644
> --- a/lib/librte_pmd_e1000/igb_rxtx.c
> +++ b/lib/librte_pmd_e1000/igb_rxtx.c
> @@ -396,7 +396,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf
> **tx_pkts,
> tx_last = (uint16_t) (tx_last - txq->nb_tx_desc);
>
> PMD_TX_LOG(DEBUG, "port_id=%u queue_id=%u pktlen=%u"
> -  " tx_first=%u tx_last=%u\n",
> +  " tx_first=%u tx_last=%u",
>(unsigned) txq->port_id,
>(unsigned) txq->queue_id,
>(unsigned) pkt_len,
> @@ -548,7 +548,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf
> **tx_pkts,
> txd->read.cmd_type_len |=
> rte_cpu_to_le_32(E1000_TXD_CMD_EOP |
> E1000_TXD_CMD_RS);
> }
> - end_of_tx:
> +end_of_tx:
> rte_wmb();
>
> /*
> @@ -697,8 +697,8 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf
> **rx_pkts,
>  * to happen by sending specific "back-pressure" flow
> control
>  * frames to its peer(s).
>  */
> -   PMD_RX_LOG(DEBUG, "\nport_id=%u queue_id=%u rx_id=%u "
> -  "staterr=0x%x pkt_len=%u\n",
> +   PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> +  "staterr=0x%x pkt_len=%u",
>(unsigned) rxq->port_id, (unsigned)
> rxq->queue_id,
>(unsigned) rx_id, (unsigned) staterr,
>(unsigned)
> rte_le_to_cpu_16(rxd.wb.upper.length));
> @@ -706,7 +706,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf
> **rx_pkts,
> nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> if (nmb == NULL) {
> PMD_RX_LOG(DEBUG, "RX mbuf alloc failed port_id=%u
> "
> -  "queue_id=%u\n", (unsigned)
> rxq->port_id,
> +  "queue_id=%u", (unsigned) rxq->port_id,
>(unsigned) rxq->queue_id);
>
> rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed++;
> break;
> @@ -794,7 +794,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf
> **rx_pkts,
> nb_hold = (uint16_t) (nb_hold + rxq->nb_rx_hold);
> if (nb_hold > rxq->rx_free_thresh) {
> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
> -  "nb_hold=%u nb_rx=%u\n",
> +  "nb_hold=%u nb_rx=%u",
>(unsigned) rxq->port_id, (unsigned)
> rxq->queue_id,
>(unsigned) rx_id, (unsigned) nb_hold,
>(unsigned) nb_rx);
> @@ -881,8 +881,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct
> rte_mbuf **rx_pkts,
>  * to happen by sending specific "back-pressure" flow
> control
>  * frames to its peer(s).
>  */
> -   PMD_RX_LOG(DEBUG, "\nport_id=%u queue_id=%u rx_id=%u "
> -  "staterr=0x%x data_len=%u\n",
> +   PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> +  "staterr=0x%x data_len=%u",
>(unsigned) rxq->port_id, (unsigned)
> rxq->queue_id,
>(unsigned) rx_id, (unsigned) staterr,
>(unsigned)
> rte_le_to_cpu_16(rxd.wb.upper.length));
> @@ -890,7 +890,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct
> rte_mbuf **rx_pkts,
> nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> if (nmb == NULL) {
>         PMD_R

[dpdk-dev] [PATCH v2 12/17] e1000: use the right debug macro

2014-09-02 Thread Jay Rolette
V_LOG(DEBUG, "SMBI lock released");
> }
> e1000_put_hw_semaphore_generic(hw);
>
> @@ -416,7 +416,8 @@ igb_reset_swfw_lock(struct e1000_hw *hw)
> if (hw->bus.func > E1000_FUNC_1)
> mask <<= 2;
> if (hw->mac.ops.acquire_swfw_sync(hw, mask) < 0) {
> -   DEBUGOUT1("SWFW phy%d lock released",
> hw->bus.func);
> +   PMD_DRV_LOG(DEBUG, "SWFW phy%d lock released",
> +   hw->bus.func);
> }
> hw->mac.ops.release_swfw_sync(hw, mask);
>
> @@ -428,7 +429,7 @@ igb_reset_swfw_lock(struct e1000_hw *hw)
>  */
> mask = E1000_SWFW_EEP_SM;
> if (hw->mac.ops.acquire_swfw_sync(hw, mask) < 0) {
> -   DEBUGOUT("SWFW common locks released");
> +   PMD_DRV_LOG(DEBUG, "SWFW common locks released");
> }
> hw->mac.ops.release_swfw_sync(hw, mask);
> }
> @@ -707,7 +708,7 @@ igb_vmdq_vlan_hw_filter_enable(struct rte_eth_dev *dev)
>  static int
>  rte_igbvf_pmd_init(const char *name __rte_unused, const char *params
> __rte_unused)
>  {
> -   DEBUGFUNC("rte_igbvf_pmd_init");
> +   PMD_INIT_FUNC_TRACE();
>
> rte_eth_driver_register(&rte_igbvf_pmd);
> return (0);
> diff --git a/lib/librte_pmd_e1000/igb_pf.c b/lib/librte_pmd_e1000/igb_pf.c
> index 3d405f0..76033ad 100644
> --- a/lib/librte_pmd_e1000/igb_pf.c
> +++ b/lib/librte_pmd_e1000/igb_pf.c
> @@ -404,7 +404,7 @@ igb_rcv_msg_from_vf(struct rte_eth_dev *dev, uint16_t
> vf)
>
> retval = e1000_read_mbx(hw, msgbuf, mbx_size, vf);
> if (retval) {
> -   RTE_LOG(ERR, PMD, "Error mbx recv msg from VF %d\n", vf);
> +   PMD_INIT_LOG(ERR, "Error mbx recv msg from VF %d\n", vf);
> return retval;
> }
>
> @@ -432,7 +432,8 @@ igb_rcv_msg_from_vf(struct rte_eth_dev *dev, uint16_t
> vf)
> retval = igb_vf_set_vlan(dev, vf, msgbuf);
> break;
> default:
> -   RTE_LOG(DEBUG, PMD, "Unhandled Msg %8.8x\n", (unsigned)
> msgbuf[0]);
> +   PMD_INIT_LOG(DEBUG, "Unhandled Msg %8.8x\n",
> +(unsigned) msgbuf[0]);
> retval = E1000_ERR_MBX;
> break;
> }
> diff --git a/lib/librte_pmd_e1000/igb_rxtx.c
> b/lib/librte_pmd_e1000/igb_rxtx.c
> index 977c4a2..3aa9609 100644
> --- a/lib/librte_pmd_e1000/igb_rxtx.c
> +++ b/lib/librte_pmd_e1000/igb_rxtx.c
> @@ -1210,17 +1210,15 @@ eth_igb_tx_queue_setup(struct rte_eth_dev *dev,
>  * driver.
>  */
> if (tx_conf->tx_free_thresh != 0)
> -   RTE_LOG(WARNING, PMD,
> -   "The tx_free_thresh parameter is not "
> -   "used for the 1G driver.\n");
> +   PMD_INIT_LOG(WARNING, "The tx_free_thresh parameter is not
> "
> +"used for the 1G driver.\n");
> if (tx_conf->tx_rs_thresh != 0)
> -   RTE_LOG(WARNING, PMD,
> -   "The tx_rs_thresh parameter is not "
> -   "used for the 1G driver.\n");
> +   PMD_INIT_LOG(WARNING, "The tx_rs_thresh parameter is not "
> +"used for the 1G driver.\n");
> if (tx_conf->tx_thresh.wthresh == 0)
> -   RTE_LOG(WARNING, PMD,
> -   "To improve 1G driver performance, consider
> setting "
> -   "the TX WTHRESH value to 4, 8, or 16.\n");
> +   PMD_INIT_LOG(WARNING, "To improve 1G driver performance, "
> +"consider setting the TX WTHRESH value to 4,
> 8, "
> +"or 16.\n");
>
> /* Free memory prior to re-allocation if needed */
> if (dev->data->tx_queues[queue_idx] != NULL) {
> --
> 1.7.10.4
>
> Reviewed-by: Jay Rolette 


[dpdk-dev] [PATCH v2 07/17] i40e: use the right debug macro

2014-09-02 Thread Jay Rolette
_pf_host_send_msg_to_vf(struct i40e_pf_vf *vf,
> ret = i40e_aq_send_msg_to_vf(hw, abs_vf_id, opcode, retval,
> msg, msglen, NULL);
> if (ret) {
> -   PMD_DRV_LOG(ERR, "Fail to send message to VF, err %u\n",
> -   hw->aq.asq_last_status);
> -   printf("Fail to send message to VF, err %u\n",
> -   hw->aq.asq_last_status);
> +   PMD_INIT_LOG(ERR, "Fail to send message to VF, err %u\n",
> +hw->aq.asq_last_status);
> }
>
> return ret;
> diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c
> b/lib/librte_pmd_i40e/i40e_rxtx.c
> index f153844..6987200 100644
> --- a/lib/librte_pmd_i40e/i40e_rxtx.c
> +++ b/lib/librte_pmd_i40e/i40e_rxtx.c
> @@ -1788,50 +1788,50 @@ i40e_dev_tx_queue_setup(struct rte_eth_dev *dev,
> tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> if (tx_rs_thresh >= (nb_desc - 2)) {
> -   RTE_LOG(ERR, PMD, "tx_rs_thresh must be less than the "
> -   "number of TX descriptors minus 2. "
> -   "(tx_rs_thresh=%u port=%d queue=%d)\n",
> -   (unsigned int)tx_rs_thresh,
> -   (int)dev->data->port_id,
> -   (int)queue_idx);
> +   PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the "
> +"number of TX descriptors minus 2. "
> +"(tx_rs_thresh=%u port=%d queue=%d)\n",
> +(unsigned int)tx_rs_thresh,
> +(int)dev->data->port_id,
> +(int)queue_idx);
> return I40E_ERR_PARAM;
> }
> if (tx_free_thresh >= (nb_desc - 3)) {
> -   RTE_LOG(ERR, PMD, "tx_rs_thresh must be less than the "
> -   "tx_free_thresh must be less than the "
> -   "number of TX descriptors minus 3. "
> -   "(tx_free_thresh=%u port=%d queue=%d)\n",
> -   (unsigned int)tx_free_thresh,
> -   (int)dev->data->port_id,
> -   (int)queue_idx);
> +   PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the "
> +"tx_free_thresh must be less than the "
> +"number of TX descriptors minus 3. "
> +"(tx_free_thresh=%u port=%d queue=%d)\n",
> +(unsigned int)tx_free_thresh,
> +(int)dev->data->port_id,
> +(int)queue_idx);
> return I40E_ERR_PARAM;
> }
> if (tx_rs_thresh > tx_free_thresh) {
> -   RTE_LOG(ERR, PMD, "tx_rs_thresh must be less than or "
> -       "equal to tx_free_thresh.
> (tx_free_thresh=%u"
> -   " tx_rs_thresh=%u port=%d queue=%d)\n",
> -   (unsigned
> int)tx_free_thresh,
> -   (unsigned int)tx_rs_thresh,
> -   (int)dev->data->port_id,
> -   (int)queue_idx);
> +   PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than or "
> +"equal to tx_free_thresh. (tx_free_thresh=%u"
> +" tx_rs_thresh=%u port=%d queue=%d)\n",
> +(unsigned int)tx_free_thresh,
> +(unsigned int)tx_rs_thresh,
> +(int)dev->data->port_id,
> +(int)queue_idx);
> return I40E_ERR_PARAM;
> }
> if ((nb_desc % tx_rs_thresh) != 0) {
> -   RTE_LOG(ERR, PMD, "tx_rs_thresh must be a divisor of the "
> -   "number of TX descriptors.
> (tx_rs_thresh=%u"
> -   " port=%d queue=%d)\n",
> -   (unsigned int)tx_rs_thresh,
> -   (int)dev->data->port_id,
> -   (int)queue_idx);
> +   PMD_INIT_LOG(ERR, "tx_rs_thresh must be a divisor of the "
> +"number of TX descriptors. (tx_rs_thresh=%u"
> +" port=%d queue=%d)\n",
> +(unsigned int)tx_rs_thresh,
> +(int)dev->data->port_id,
> +(int)queue_idx);
> return I40E_ERR_PARAM;
> }
> if ((tx_rs_thresh > 1) && (tx_conf->tx_thresh.wthresh != 0)) {
> -   RTE_LOG(ERR, PMD, "TX WTHRESH must be set to 0 if "
> -   "tx_rs_thresh is greater than 1. "
> -   "(tx_rs_thresh=%u port=%d queue=%d)\n",
> -   (unsigned int)tx_rs_thresh,
> -   (int)dev->data->port_id,
> -   (int)queue_idx);
> +   PMD_INIT_LOG(ERR, "TX WTHRESH must be set to 0 if "
> +"tx_rs_thresh is greater than 1. "
> +"(tx_rs_thresh=%u port=%d queue=%d)\n",
> +(unsigned int)tx_rs_thresh,
> +(int)dev->data->port_id,
> +(int)queue_idx);
> return I40E_ERR_PARAM;
> }
>
> --
> 1.7.10.4
>
> Reviewed-by: Jay Rolette 


[dpdk-dev] [PATCH v2 01/17] ixgbe: use the right debug macro

2014-09-02 Thread Jay Rolette
On Tue, Sep 2, 2014 at 9:21 AM, Thomas Monjalon 
wrote:

> >> -#defineFUNC_PTR_OR_ERR_RET(func, retval) do { \
> > >> -   if ((func) == NULL) {  \
> > >> -   DEBUGOUT("%s:%d function not supported\n", \
> > >> -   __func__, __LINE__);   \
> > >> -   return (retval);   \
> > >> -   }  \
> > >> +#defineFUNC_PTR_OR_ERR_RET(func, retval) do {  \
> > >> +   if ((func) == NULL) {   \
> > >> +   PMD_DRV_LOG("%s:%d function not supported", \
> > >> +   __func__, __LINE__);\
> > >> +   return retval;\
> > >>
> > > Need to keep the parens around retval in your macro
> >
> > Actually, checkpatch complained about this.
> > So I can keep the parenthesis, but then I don't want Thomas to tell me my
> > patch does not pass checkpatch :-)
>
> You're right, I care about checkpatch :)
> I don't see a case where parens are needed with return. Please give an
> example.


Looking at it again, in this specific case you are correct. It is good
hygiene to always use parens around macro arguments, but this specific case
is ok without. It does add a small bit of danger if retval ends up getting
used elsewhere in the macro, but that's about it. Probably not worth
redoing the patch over that.

Jay


[dpdk-dev] [PATCH v2 01/17] ixgbe: use the right debug macro

2014-09-02 Thread Jay Rolette
Hi David,

A couple of minor comments inline below. Looks good otherwise.

Jay


On Mon, Sep 1, 2014 at 5:24 AM, David Marchand 
wrote:

> - We should not use DEBUGOUT*/DEBUGFUNC macros in non-shared code.
> These macros come as compat wrappers for shared code.
> - We should avoid calling RTE_LOG directly as pmd provides a wrapper for
> logs.
>
> Signed-off-by: David Marchand 
> ---
>  lib/librte_pmd_ixgbe/ixgbe_82599_bypass.c |   14 
>  lib/librte_pmd_ixgbe/ixgbe_bypass.c   |   26 +++---
>  lib/librte_pmd_ixgbe/ixgbe_ethdev.c   |   27 +++
>  lib/librte_pmd_ixgbe/ixgbe_pf.c   |4 +--
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c |   53
> +++--
>  5 files changed, 63 insertions(+), 61 deletions(-)
>
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_82599_bypass.c
> b/lib/librte_pmd_ixgbe/ixgbe_82599_bypass.c
> index 0fc..2623419 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_82599_bypass.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_82599_bypass.c
> @@ -63,7 +63,7 @@ ixgbe_set_fiber_fixed_speed(struct ixgbe_hw *hw,
> ixgbe_link_speed speed)
> rs = IXGBE_SFF_SOFT_RS_SELECT_1G;
> break;
> default:
> -   DEBUGOUT("Invalid fixed module speed\n");
> +   PMD_DRV_LOG("Invalid fixed module speed");
> return;
> }
>
> @@ -72,7 +72,7 @@ ixgbe_set_fiber_fixed_speed(struct ixgbe_hw *hw,
> ixgbe_link_speed speed)
>IXGBE_I2C_EEPROM_DEV_ADDR2,
>&eeprom_data);
> if (status) {
> -   DEBUGOUT("Failed to read Rx Rate Select RS0\n");
> +   PMD_DRV_LOG("Failed to read Rx Rate Select RS0");
> goto out;
> }
>
> @@ -82,7 +82,7 @@ ixgbe_set_fiber_fixed_speed(struct ixgbe_hw *hw,
> ixgbe_link_speed speed)
> IXGBE_I2C_EEPROM_DEV_ADDR2,
> eeprom_data);
> if (status) {
> -   DEBUGOUT("Failed to write Rx Rate Select RS0\n");
> +   PMD_DRV_LOG("Failed to write Rx Rate Select RS0");
> goto out;
> }
>
> @@ -91,7 +91,7 @@ ixgbe_set_fiber_fixed_speed(struct ixgbe_hw *hw,
> ixgbe_link_speed speed)
>IXGBE_I2C_EEPROM_DEV_ADDR2,
>&eeprom_data);
> if (status) {
> -   DEBUGOUT("Failed to read Rx Rate Select RS1\n");
> +   PMD_DRV_LOG("Failed to read Rx Rate Select RS1");
> goto out;
> }
>
> @@ -101,7 +101,7 @@ ixgbe_set_fiber_fixed_speed(struct ixgbe_hw *hw,
> ixgbe_link_speed speed)
> IXGBE_I2C_EEPROM_DEV_ADDR2,
> eeprom_data);
> if (status) {
> -   DEBUGOUT("Failed to write Rx Rate Select RS1\n");
> +   PMD_DRV_LOG("Failed to write Rx Rate Select RS1");
> goto out;
> }
>  out:
> @@ -130,7 +130,7 @@ ixgbe_setup_mac_link_multispeed_fixed_fiber(struct
> ixgbe_hw *hw,
> bool link_up = false;
> bool negotiation;
>
> -   DEBUGFUNC("");
> +   PMD_INIT_FUNC_TRACE();
>
> /* Mask off requested but non-supported speeds */
> status = ixgbe_get_link_capabilities(hw, _speed,
> );
> @@ -261,7 +261,7 @@ ixgbe_bypass_get_media_type(struct ixgbe_hw *hw)
>  {
> enum ixgbe_media_type media_type;
>
> -   DEBUGFUNC("");
> +   PMD_INIT_FUNC_TRACE();
>
> if (hw->device_id == IXGBE_DEV_ID_82599_BYPASS) {
> media_type = ixgbe_media_type_fiber;
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_bypass.c
> b/lib/librte_pmd_ixgbe/ixgbe_bypass.c
> index 1d21dc0..1a980b8 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_bypass.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_bypass.c
> @@ -40,20 +40,20 @@
>  #defineBYPASS_STATUS_OFF_MASK  3
>
>  /* Macros to check for invlaid function pointers. */
> -#defineFUNC_PTR_OR_ERR_RET(func, retval) do { \
> -   if ((func) == NULL) {  \
> -   DEBUGOUT("%s:%d function not supported\n", \
> -   __func__, __LINE__);   \
> -   return (retval);   \
> -   }  \
> +#defineFUNC_PTR_OR_ERR_RET(func, retval) do {  \
> +   if ((func) == NULL) {   \
> +   PMD_DRV_LOG("%s:%d function not supported", \
> +   __func__, __LINE__);\
> +   return retval;\
>
Need to keep the parens around retval in your macro


> +   }   \
>  } while(0)
>
> -#defineFUNC_PTR_OR_RET(func) do { \
> -   if ((func) == NULL) { 

[dpdk-dev] Clang Scan build results

2014-08-27 Thread Jay Rolette
Here's the link you want: https://scan.coverity.com/

Check the FAQ for terms and you'll need to register the product, but I
didn't see anything obvious that should get in the way from being able to
use it for DPDK.

Agree with Coverity not being cheap. I bought it for my last company.
Really liked that Coverity had a much higher signal-to-noise ratio than
other tools at the time. Possibly the gap has closed since then, but you
know it can't be as bad as lint!


On Wed, Aug 27, 2014 at 10:52 AM, Wiles, Roger Keith <
keith.wiles at windriver.com> wrote:

>  Nice, we had to buy one and that was not cheap :-) I groped around on the
> Coverity site and did not find any statement about being free to open
> source, but I may have just missed it.
>
>  I did find that PC-Lint, Coverity, scan-build, etc. all seem to test
> different parts of your code and some are better than others. Some have a
> lot false positives and some report huge number of issues, but it just
> depends on the type and level of scan you want. One thing I found was you
> need to run different tools to find different problems as none of them do
> everything IMO.
>
>  On Aug 27, 2014, at 10:24 AM, Jay Rolette  wrote:
>
>  > We could run something like PC-Lint or Coverity, but they cost money
> :-)
> Pretty sure Coverity is free for open source projects...
>
>  Jay
>
> On Wed, Aug 27, 2014 at 10:13 AM, Wiles, Roger Keith <
> keith.wiles at windriver.com> wrote:
>
>> Hi Everyone,
>>
>> I built dpdk with Clang and used the scan-build analyzer to produce a
>> report. The report is about 13M in size so not very nice to send to the
>> list. I decided to place the report on github.com if you want to see the
>> results.
>>
>> While running scan-build the build would fail, but I forced the build to
>> continue using the scan-build option to get as much of the report as
>> possible. I did not have time yet to understand why the build stopped and I
>> hope to dig into it more later.
>>
>> Running scan-build is pretty simple, so would be a nice test report if
>> you wanted to add it to dpdk.org site and maintain its output for
>> review. It would be nice to run once in a while to weed out any basic
>> problems. We could run something like PC-Lint or Coverity, but they cost
>> money :-)
>>
>> Here is the link to my Pktgen-DPDK repos on github:
>>
>> # git clone git://github.com/Pktgen/dpdk-scan-build-results.git
>>
>> Let me know if you have any questions or suggestions.
>>
>> Thanks
>> ++Keith
>>
>>
>> Keith Wiles, Principal Technologist with CTO office, Wind River mobile
>> 972-213-5533
>>
>>
>
>  Keith Wiles, Principal Technologist with CTO office, Wind River mobile
> 972-213-5533
>
>


[dpdk-dev] Clang Scan build results

2014-08-27 Thread Jay Rolette
> We could run something like PC-Lint or Coverity, but they cost money :-)
Pretty sure Coverity is free for open source projects...

Jay

On Wed, Aug 27, 2014 at 10:13 AM, Wiles, Roger Keith <
keith.wiles at windriver.com> wrote:

> Hi Everyone,
>
> I built dpdk with Clang and used the scan-build analyzer to produce a
> report. The report is about 13M in size so not very nice to send to the
> list. I decided to place the report on github.com if you want to see the
> results.
>
> While running scan-build the build would fail, but I forced the build to
> continue using the scan-build option to get as much of the report as
> possible. I did not have time yet to understand why the build stopped and I
> hope to dig into it more later.
>
> Running scan-build is pretty simple, so would be a nice test report if you
> wanted to add it to dpdk.org site and maintain its output for review. It
> would be nice to run once in a while to weed out any basic problems. We
> could run something like PC-Lint or Coverity, but they cost money :-)
>
> Here is the link to my Pktgen-DPDK repos on github:
>
> # git clone git://github.com/Pktgen/dpdk-scan-build-results.git
>
> Let me know if you have any questions or suggestions.
>
> Thanks
> ++Keith
>
>
> Keith Wiles, Principal Technologist with CTO office, Wind River mobile
> 972-213-5533
>
>


[dpdk-dev] [PATCH 01/11] ixgbe: clean log messages

2014-08-27 Thread Jay Rolette
Hi David,

The updated output is definitely an improvement, but if you go with the
first solution you described (adding the \n in the log macro), it works much
better for products using DPDK. For developer use, I think it ends up being
a wash either way.

At least for my product (embedded network appliance), we want to capture
anything that is logging in syslog. The log macros in DPDK make that very
reasonable to do, but if there are embedded newlines, it ends up screwing
up log parsing when someone wants to process logs later.

We end up having to remove all of those newlines, which makes it harder to
merge new releases as they come out.

I'm assuming most products have similar requirements for logging. That's at
least been the case for the products I've been involved with over the years.

If the PMDs are using PMD_LOG as a replacement macro for debugging
printf's, I can see where this might be a little more painful, but
PMD_LOG is a lot more useful if it easily integrates with central logging
facilities.

Thanks,
Jay


On Tue, Aug 26, 2014 at 9:55 AM, David Marchand 
wrote:

> Hello Jay,
>
> On Tue, Aug 26, 2014 at 4:23 PM, Jay Rolette 
> wrote:
>
>> Why are you adding newlines to log message strings? Shouldn't that be up
>> to whatever the messages end up getting routed to?
>>
>
> Actually, I wanted to have consistent log formats in the PMDs so that the
> log messages displayed at runtime are aligned (and not bouncing with \n
> before or after the log message).
> There were two solutions from my point of view:
> - either always add a \n in the log macro and remove all trailing \n
> - do the opposite
>
> I preferred the latter as it lets users of the macro set the message format
> as they want.
>
>
> Before this change :
>
> Configuring Port 0 (socket 1)
> PMD: ixgbe_dev_tx_queue_setup(): sw_ring=0x7f3e59cd8080
> hw_ring=0x7f3e59d02080 dma_addr=0x727702080
>
> PMD: ixgbe_dev_tx_queue_setup(): Using simple tx code path
>
> PMD: ixgbe_dev_tx_queue_setup(): Vector tx enabled.
>
> PMD: ixgbe_dev_rx_queue_setup(): sw_ring=0x7f3e59cd57c0
> hw_ring=0x7f3e59d12080 dma_addr=0x727712080
>
> PMD: ixgbe_dev_rx_queue_setup(): Rx Burst Bulk Alloc Preconditions are
> satisfied. Rx Burst Bulk Alloc function will be used on port=0, queue=0.
>
> PMD: ixgbe_dev_rx_queue_setup(): Vector rx enabled, please make sure RX
> burst size no less than 32.
>
>
> After this change :
>
> Configuring Port 0 (socket 1)
> PMD: ixgbe_dev_tx_queue_setup(): sw_ring=0x7fd47ecd8080
> hw_ring=0x7fd47ed02080 dma_addr=0x727702080
> PMD: ixgbe_dev_tx_queue_setup(): Using simple tx code path
> PMD: ixgbe_dev_tx_queue_setup(): Vector tx enabled.
> PMD: ixgbe_dev_rx_queue_setup(): sw_ring=0x7fd47ecd57c0
> hw_ring=0x7fd47ed12080 dma_addr=0x727712080
> PMD: ixgbe_dev_rx_queue_setup(): Rx Burst Bulk Alloc Preconditions are
> satisfied. Rx Burst Bulk Alloc function will be used on port=0, queue=0.
> PMD: ixgbe_dev_rx_queue_setup(): Vector rx enabled, please make sure RX
> burst size no less than 32.
> Port 0: 90:E2:BA:29:DF:58
>
>
> --
> David Marchand
>


[dpdk-dev] VMware Fusion + DPDK and KNI

2014-08-18 Thread Jay Rolette
Thought I'd put this out there in case anyone else runs into it.

Using DPDK 1.6 on Ubuntu 14.04 LTS in a hardware appliance. Also using KNI
to share the data ports with an app that needs a normal TCP/IP stack
interface.

We had everything working reasonably well on the hardware, but most of our
developers run their code in a VM on their laptops: OS X (Mavericks) +
VMware Fusion 6 Pro.

On some VMs, we were getting errors trying to configure KNI ports:

$ sudo ifconfig ffEth0 10.111.2.100 netmask 255.255.0.0 up
SIOCSIFFLAGS: Timer expired
SIOCSIFFLAGS: Timer expired

Skipping the "fun" involved with trying to track down the problem, here's
what ended up fixing it.

We had 4 network ports on the VM:

   - eth0 - Management port
   - eth1 - "other" function not related to the problem
   - eth2 & eth3 - inline datapath (bump-in-the-wire), but also KNI mapped
   to ffEth0 & ffEth1 by our DPDK app

If eth2 and eth3 are on the same vmnet, you'll get the "SIOCSIFFLAGS: Timer
expired" errors. Depending on what parameters you try to set, ifconfig may
think some of them have taken effect (they haven't) or it won't (setting
the MTU, etc.).

If you put eth2 and eth3 on separate vmnets, then no issues and you can
configure the KNI ports via ifconfig as expected.

No idea why having the ports on the same vmnet matters, since our app
doesn't care, but I haven't gone spelunking through the KNI source to find
the root cause.

Doubtful this will matter to many (any?), but maybe it'll save someone some
time.

Jay Rolette
infinite io