date:20160721

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-21 Thread Ananyev, Konstantin



> 
> This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
> changes in rte_eth_dev and rte_eth_desc_lim structures.
> 
> As discussed in that thread:
> 
> http://dpdk.org/ml/archives/dev/2015-September/023603.html
> 
> Different NIC models depending on HW offload requested might impose
> different requirements on packets to be TX-ed in terms of:
> 
>  - Max number of fragments per packet allowed
>  - Max number of fragments per TSO segments
>  - The way pseudo-header checksum should be pre-calculated
>  - L3/L4 header fields filling
>  - etc.
> 
> 
> MOTIVATION:
> ---
> 
> 1) Some work cannot (and didn't should) be done in rte_eth_tx_burst.
>However, this work is sometimes required, and now, it's an
>application issue.
> 
> 2) Different hardware may have different requirements for TX offloads,
>other subset can be supported and so on.
> 
> 3) Some parameters (eg. number of segments in ixgbe driver) may hung
>device. These parameters may be vary for different devices.
> 
>For example i40e HW allows 8 fragments per packet, but that is after
>TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit.
> 
> 4) Fields in packet may require different initialization (like eg. will
>require pseudo-header checksum precalculation, sometimes in a
>different way depending on packet type, and so on). Now application
>needs to care about it.
> 
> 5) Using additional API (rte_eth_tx_prep) before rte_eth_tx_burst let to
>prepare packet burst in acceptable form for specific device.
> 
> 6) Some additional checks may be done in debug mode keeping tx_burst
>implementation clean.
> 
> 
> PROPOSAL:
> -
> 
> To help user to deal with all these varieties we propose to:
> 
> 1. Introduce rte_eth_tx_prep() function to do necessary preparations of
>packet burst to be safely transmitted on device for desired HW
>offloads (set/reset checksum field according to the hardware
>requirements) and check HW constraints (number of segments per
>packet, etc).
> 
>While the limitations and requirements may differ for devices, it
>requires to extend rte_eth_dev structure with new function pointer
>"tx_pkt_prep" which can be implemented in the driver to prepare and
>verify packets, in devices specific way, before burst, what should to
>prevent application to send malformed packets.
> 
> 2. Also new fields will be introduced in rte_eth_desc_lim:
>nb_seg_max and nb_mtu_seg_max, providing an information about max
>segments in TSO and non-TSO packets acceptable by device.
> 
>This information is useful for application to not create/limit
>malicious packet.
> 
> 
> APPLICATION (CASE OF USE):
> --
> 
> 1) Application should to initialize burst of packets to send, set
>required tx offload flags and required fields, like l2_len, l3_len,
>l4_len, and tso_segsz
> 
> 2) Application passes burst to the rte_eth_tx_prep to check conditions
>required to send packets through the NIC.
> 
> 3) The result of rte_eth_tx_prep can be used to send valid packets
>and/or restore invalid if function fails.
> 
> eg.
> 
>   for (i = 0; i < nb_pkts; i++) {
> 
>   /* initialize or process packet */
> 
>   bufs[i]->tso_segsz = 800;
>   bufs[i]->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4
>   | PKT_TX_IP_CKSUM;
>   bufs[i]->l2_len = sizeof(struct ether_hdr);
>   bufs[i]->l3_len = sizeof(struct ipv4_hdr);
>   bufs[i]->l4_len = sizeof(struct tcp_hdr);
>   }
> 
>   /* Prepare burst of TX packets */
>   nb_prep = rte_eth_tx_prep(port, 0, bufs, nb_pkts);
> 
>   if (nb_prep < nb_pkts) {
>   printf("tx_prep failed\n");
> 
>   /* drop or restore invalid packets */
> 
>   }
> 
>   /* Send burst of TX packets */
>   nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_prep);
> 
>   /* Free any unsent packets. */
> 
> 
> 
> Signed-off-by: Tomasz Kulasek 
> ---
>  doc/guides/rel_notes/deprecation.rst |7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst 
> b/doc/guides/rel_notes/deprecation.rst
> index f502f86..485aacb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -41,3 +41,10 @@ Deprecation Notices
>  * The mempool functions for single/multi producer/consumer are deprecated and
>will be removed in 16.11.
>It is replaced by rte_mempool_generic_get/put functions.
> +
> +* In 16.11 ABI changes are plained: the ``rte_eth_dev`` structure will be
> +  extended with new function pointer ``tx_pkt_prep`` allowing verification
> +  and processing of packet burst to meet HW specific requirements before
> +  transmit. Also new fields will be added to the ``rte_eth_desc_lim`` 
> structure:
> +  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing information about number 
>

[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-21 Thread Thomas Monjalon

2016-07-21 13:20, Jay Rolette:
> On Thu, Jul 21, 2016 at 10:33 AM, Ferruh Yigit 
> wrote:
> > KNI ethtool is functional and maintained, and it may have users!
> >
> > Why just removing it, specially without providing an alternative?
> > Is is good time to discuss KCP again?
> 
> Yes, my product uses it.

Your product uses what? KCP? KNI? KNI ethtool?

> Seems like we are back to the same discussion we
> had a few months ago about the KNI situation...
> 
> It shouldn't be removed unless there is a replacement, ideally one that
> works with the normal Linux tools like every other network device.

This ethtool module works only for igb and ixgbe!
There is already no replacement for other drivers.
Who works on a replacement?

> While the code wasn't ready at the time, it was a definite improvement over 
> what
> we have with KNI today.

[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 04:43:25PM +0300, Ilya Maximets wrote:
> 
> 
> On 21.07.2016 16:35, Yuanhan Liu wrote:
> > On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
> >> If something abnormal happened to QEMU, 'connect()' can block calling
> >> thread (e.g. main thread of OVS) forever or for a really long time.
> >> This can break whole application or block the reconnection thread.
> >>
> >> Example with OVS:
> >>
> >>ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
> >>(gdb) bt
> >>#0  connect () from /lib64/libpthread.so.0
> >>#1  vhost_user_create_client (vsocket=0xa816e0)
> >>#2  rte_vhost_driver_register
> >>#3  netdev_dpdk_vhost_user_construct
> >>#4  netdev_open (name=0xa664b0 "vhost1")
> >>[...]
> >>#11 main
> >>
> >> Fix that by setting non-blocking mode for client sockets for connection.
> >>
> >> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> >>
> >> Signed-off-by: Ilya Maximets 
> > 
> > Acked-by: Yuanhan Liu 
> > 
> > One help I'd like to ask is that I'd appriciate if you could do the test
> > to make sure that your 2 (latest) patches fix the two issues you reported.
> > 
> > You might have already done that; I just want to make sure.
> 
> I've performed the test with 'ofport_request' script before sending patches.
> And currently test still works. No leaks of descriptors, no hangs,
> no QEMU crashes observed.
> Sometimes network device breaks on QEMU side, but it's QEMU issue. In this
> case I'm receiving following message from DPDK's vhost:
> 
> VHOST_CONFIG: vhost-user client: socket created, fd: 28
> VHOST_CONFIG: failed to connect to /vhost1: Resource temporarily unavailable
> VHOST_CONFIG: /vhost1: reconnecting...
> 
> Before the 'hang' patch there was hang of main thread.
> 
> After QEMU restart all works normally. OVS restart not required.

Good to know and appreciate that!

--yliu

[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
> If something abnormal happened to QEMU, 'connect()' can block calling
> thread (e.g. main thread of OVS) forever or for a really long time.
> This can break whole application or block the reconnection thread.
> 
> Example with OVS:
> 
>   ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>   (gdb) bt
>   #0  connect () from /lib64/libpthread.so.0
>   #1  vhost_user_create_client (vsocket=0xa816e0)
>   #2  rte_vhost_driver_register
>   #3  netdev_dpdk_vhost_user_construct
>   #4  netdev_open (name=0xa664b0 "vhost1")
>   [...]
>   #11 main
> 
> Fix that by setting non-blocking mode for client sockets for connection.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> 
> Signed-off-by: Ilya Maximets 

Acked-by: Yuanhan Liu 

One help I'd like to ask is that I'd appriciate if you could do the test
to make sure that your 2 (latest) patches fix the two issues you reported.

You might have already done that; I just want to make sure.

Thanks.

--yliu

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-21 Thread Adrien Mazarguil

Hi Jerin,

Sorry, looks like I missed your reply. Please see below.

On Mon, Jul 11, 2016 at 04:11:43PM +0530, Jerin Jacob wrote:
> On Tue, Jul 05, 2016 at 08:16:46PM +0200, Adrien Mazarguil wrote:
> 
> Hi Adrien,
> 
> Overall this proposal looks very good. I could easily map to the
> classification hardware engines I am familiar with.

Great, thanks.

> > Priorities
> > ~~
> > 
> > A priority can be assigned to a matching pattern.
> > 
> > The default priority level is 0 and is also the highest. Support for more
> > than a single priority level in hardware is not guaranteed.
> > 
> > If a packet is matched by several filters at a given priority level, the
> > outcome is undefined. It can take any path and can even be duplicated.
> 
> In some cases fatal unrecoverable error too

Right, do you think I need to elaborate regarding unrecoverable errors?

How much unrecoverable by the way? Like not being able to receive any more
packets?

> > Matching pattern items for packet data must be naturally stacked (ordered
> > from lowest to highest protocol layer), as in the following examples:
> > 
> > +--+
> > | TCPv4 as L4  |
> > +===+==+
> > | 0 | Ethernet |
> > +---+--+
> > | 1 | IPv4 |
> > +---+--+
> > | 2 | TCP  |
> > +---+--+
> > 
> > ++
> > | TCPv6 in VXLAN |
> > +===++
> > | 0 | Ethernet   |
> > +---++
> > | 1 | IPv4   |
> > +---++
> > | 2 | UDP|
> > +---++
> > | 3 | VXLAN  |
> > +---++
> > | 4 | Ethernet   |
> > +---++
> > | 5 | IPv6   |
> > +---++
> 
> How about enumerating as "Inner-IPV6" flow type to avoid any confusion. 
> Though spec
> can be same for both IPv6 and Inner-IPV6.

I'm not sure, if we have a more than two encapsulated IPv6 headers, knowing
that one of them is "inner" is not really useful. This is why I choose to
enforce the stack ordering instead, I think it makes more sense.

> > | 6 | TCP|
> > +---++
> > 
> > +-+
> > | TCPv4 as L4 with meta items |
> > +===+=+
> > | 0 | VOID|
> > +---+-+
> > | 1 | Ethernet|
> > +---+-+
> > | 2 | VOID|
> > +---+-+
> > | 3 | IPv4|
> > +---+-+
> > | 4 | TCP |
> > +---+-+
> > | 5 | VOID|
> > +---+-+
> > | 6 | VOID|
> > +---+-+
> > 
> > The above example shows how meta items do not affect packet data matching
> > items, as long as those remain stacked properly. The resulting matching
> > pattern is identical to "TCPv4 as L4".
> > 
> > ++
> > | UDPv6 anywhere |
> > +===++
> > | 0 | IPv6   |
> > +---++
> > | 1 | UDP|
> > +---++
> > 
> > If supported by the PMD, omitting one or several protocol layers at the
> > bottom of the stack as in the above example (missing an Ethernet
> > specification) enables hardware to look anywhere in packets.
> 
> It would be good if the common code can give it as Ethernet, IPV6, UDP
> to PMD(to avoid common code duplication across PMDs)

I left this mostly at PMD's discretion for now. Applications must provide
explicit rules if they need a consistent behavior. PMDs may not support this
at all, I've just documented what applications should expect when attempting
this kind of pattern.

> > It is unspecified whether the payload of supported encapsulations
> > (e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
> > inner, outer or both packets.
> 
> a separate flow type enumeration may fix that problem. like "Inner-IPV6"
> mentioned above.

Not sure about that, for the same reason as above. Which "inner" level would
be matched by such a pattern? Note that it could have started with VXLAN
followed by ETH and then IPv6 if the application cared.

This is basically the ability to remain vague about a rule. I didn't want to
forbid it outright because I'm sure there are possible use cases:

- PMD validation and debugging.

- Rough filtering according to protocols a packet might contain somewhere
  (think of the network admins who cannot stand anything other than packets
  addressed to TCP port 80).

> > +-+
> > | Invalid, missing L3 |
> > +===+=+
> > | 0 | Ethernet|
> > +---+-+
> > | 1 | UDP |
> > +---+-+
> > 
> > The above pattern is invalid due to a missing L3 specification between L2
> > and L4. It is only allowed at the bottom and at the top of the stack.
> > 
> 
> > ``SIGNATURE``
> > ^
> > 
> > Requests hash-based signature dispatching for this rule.
> > 
> > Considering this is a global setting on devices

[dpdk-dev] [PATCH v3] vhost: fix driver unregister for client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 03:55:36PM +0300, Ilya Maximets wrote:
> Currently while calling of 'rte_vhost_driver_unregister()' connection
> to QEMU will not be closed. This leads to inability to register driver
> again and reconnect to same virtual machine.
> 
> This scenario is reproducible with OVS. While executing of the following
> command vhost port will be re-created (will be executed
> 'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
> network will be broken and QEMU possibly will crash:
> 
>   ovs-vsctl set Interface vhost1 ofport_request=15
> 
> Fix this by closing all established connections on driver unregister and
> removing of pending connections from reconnection list.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> 
> Signed-off-by: Ilya Maximets 
> ---
> 
> Version 3:
>   * fixed leak of file descriptors by adding of
> 'close(reconn->fd)' to 'vhost_user_remove_reconnect()'
> 
> Version 2:
>   * style fixes.

Acked-by: Yuanhan Liu 

--yliu

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 03:58:11PM +0300, Ilya Maximets wrote:
> On 21.07.2016 15:58, Yuanhan Liu wrote:
> > On Thu, Jul 21, 2016 at 03:42:54PM +0300, Ilya Maximets wrote:
> >> On 21.07.2016 15:35, Yuanhan Liu wrote:
> >>> On Thu, Jul 21, 2016 at 03:13:14PM +0300, Ilya Maximets wrote:
> >>
> >> What do you think of it?
> >
> > I found that we can't check connection status without select/poll
> > on it. 'getsockopt()' will return 0 with no errors if connection
> > is not still established just like if it was.
> > So, I think, the first version of this patch is the only
> > acceptable solution.
> 
>  Sorry, v2 is acceptable too, because it always calls 'connect()'.
> >>>
> >>> So you have done the test that it works?
> >>
> >> No, it's just theory. I don't know how to test this.
> >>
> >>> I'm more curious to know
> >>> could your above case hit the getsockopt() code path, I mean, the
> >>> path that errno is set to EINPROGRESS or EISCONN?
> >>
> >> As I already told, I don't sure that we're able to get EINPROGRESS
> >> on our AF_UNIX sockets.
> >> In v2 'getsockopt()' check is unnecessary.
> > 
> > We then have no reason to keep it?
> 
> You want me to re-send v2 without 'getsockopt()' check?

Yes, because I'm not sure it will work without select or poll.

--yliu

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 03:42:54PM +0300, Ilya Maximets wrote:
> On 21.07.2016 15:35, Yuanhan Liu wrote:
> > On Thu, Jul 21, 2016 at 03:13:14PM +0300, Ilya Maximets wrote:
> 
>  What do you think of it?
> >>>
> >>> I found that we can't check connection status without select/poll
> >>> on it. 'getsockopt()' will return 0 with no errors if connection
> >>> is not still established just like if it was.
> >>> So, I think, the first version of this patch is the only
> >>> acceptable solution.
> >>
> >> Sorry, v2 is acceptable too, because it always calls 'connect()'.
> > 
> > So you have done the test that it works?
> 
> No, it's just theory. I don't know how to test this.
> 
> > I'm more curious to know
> > could your above case hit the getsockopt() code path, I mean, the
> > path that errno is set to EINPROGRESS or EISCONN?
> 
> As I already told, I don't sure that we're able to get EINPROGRESS
> on our AF_UNIX sockets.
> In v2 'getsockopt()' check is unnecessary.

We then have no reason to keep it?

--yliu

[dpdk-dev] [PATCH] eal: fix check number of bytes from read function

2016-07-21 Thread Jastrzebski, MichalX K

> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Thursday, July 21, 2016 4:36 PM
> To: Jastrzebski, MichalX K 
> Cc: dev at dpdk.org; Richardson, Bruce ;
> Kobylinski, MichalX ; Gonzalez Monroy,
> Sergio ; david.marchand at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH] eal: fix check number of bytes from read
> function
> 
> Hi,
> 
> 2016-07-20 16:24, Michal Jastrzebski:
> > -   if (read(fd, , sizeof(uint64_t)) < 0) {
> > +
> > +   retval = read(fd, , sizeof(uint64_t));
> > +   if (retval < 0) {
> > RTE_LOG(ERR, EAL, "%s(): cannot read /proc/self/pagemap:
> %s\n",
> > __func__, strerror(errno));
> > close(fd);
> > return RTE_BAD_PHYS_ADDR;
> > +   }   else if (retval >= 0 && retval < (int)sizeof(uint64_t)) {
> 

Hi Thomas,

> I have 4 comments about the above line:
That's too much for one line. I should improve next time:)

> - the check retval >= 0 is not needed because implied by else
> - why not checking retval != sizeof(uint64_t) as it is the exact expected
> value?

Yes, it is better solution,

> - (int)sizeof(uint64_t) can be replaced by 8 but it's shorter ;)

I didn't want to change all invokes of read() function here. 
I can use some macro:
#define PFN_MASK_SIZE   8
How do You think?

> - only 1 space is required between } and else
> 
> > +   RTE_LOG(ERR, EAL, "%s(): read %d bytes from
> /proc/self/pagemap "
> > +   "but expected %d: %s\n",
> > +   __func__, retval, (int)sizeof(uint64_t),
> strerror(errno));
> 
> Are you sure errno is meaningful here?

I think it is not. Will send v2.
> 
> > +   close(fd);
> > +   return RTE_BAD_PHYS_ADDR;
> > }

Thanks for a review
Michal.

[dpdk-dev] [PATCH] eal: fix parsing of argument of option --lcores

2016-07-21 Thread bynes adam

On Thu, Jul 21, 2016 at 02:03:38PM +0800, Wei Dai wrote:
Hi Wei,
> The '-' in lcores set overrides cpu set of following
> lcore set in the argument of EAL option --lcores.
> 
> Fixes: 53e54bf81700 ("eal: new option --lcores for cpu assignment")
> 
> Signed-off-by: Wei Dai 
> ---
>  lib/librte_eal/common/eal_common_options.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/lib/librte_eal/common/eal_common_options.c 
> b/lib/librte_eal/common/eal_common_options.c
> index 0a594d7..96eb1a9 100644
> --- a/lib/librte_eal/common/eal_common_options.c
> +++ b/lib/librte_eal/common/eal_common_options.c
> @@ -563,6 +563,7 @@ convert_to_cpuset(rte_cpuset_t *cpusetp,
>   * lcores, cpus could be a single digit/range or a group.
>   * '(' and ')' are necessary if it's a group.
>   * If not supply '@cpus', the value of cpus uses the same as lcores.
> + * The 'a-b' in lcores not within '(' and ')' means a,a+1,...,b-1,b .
this description is not very clear, a-b and (a-b) are both the same meaning.
may be need a table for comparison
a-b@(c-d)
a-b at c-d
(a-b)@c-d
(a-b)@(c-d)
all the above I believe are the same
only the following two cases:
a-b,
(a-b),
so the key point here is the @ and (), not only @
>   * e.g. '1,2@(5-7),(3-5)@(0,2),(0,6),7-8' means start 9 EAL thread as below
>   *   lcore 0 runs on cpuset 0x41 (cpu 0,6)
>   *   lcore 1 runs on cpuset 0x2 (cpu 1)
> @@ -571,6 +572,15 @@ convert_to_cpuset(rte_cpuset_t *cpusetp,
>   *   lcore 6 runs on cpuset 0x41 (cpu 0,6)
>   *   lcore 7 runs on cpuset 0x80 (cpu 7)
>   *   lcore 8 runs on cpuset 0x100 (cpu 8)
> + * e.g. '0-2,(3-5)@(3,4),6@(5,6),7@(5-7)'means start 8 EAL threads as below
> + *   lcore 0 runs on cpuset 0x1 (cpu 0)
> + *   lcore 1 runs on cpuset 0x2 (cpu 1)
> + *   lcore 2 runs on cpuset ox4 (cpu 2)
> + *   lcore 3,4,5 runs on cpuset 0x18 (cpu 3,4)
> + *   lcore 6 runs on cpuset 0x60 (cpu 5,6)
> + *   lcore 7 runs on cpuset 0xe0 (cpu 5,6,7)
> + * The second case is used to test bugfix for lflags not be cleared after use
you can put this sentance and description into the commit log
I don't think you should put bugfix description in comments here.
> + */
>   */
>  static int
>  eal_parse_lcores(const char *lcores)
> @@ -679,6 +689,8 @@ eal_parse_lcores(const char *lcores)
>  sizeof(rte_cpuset_t));
>   }
>  
> + lflags = 0;
> +
>   lcores = end + 1;
>   } while (*end != '\0');
>  
> -- 
> 2.5.5
Adam Bynes

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 03:13:14PM +0300, Ilya Maximets wrote:
> >>
> >> What do you think of it?
> > 
> > I found that we can't check connection status without select/poll
> > on it. 'getsockopt()' will return 0 with no errors if connection
> > is not still established just like if it was.
> > So, I think, the first version of this patch is the only
> > acceptable solution.
> 
> Sorry, v2 is acceptable too, because it always calls 'connect()'.

So you have done the test that it works? I'm more curious to know
could your above case hit the getsockopt() code path, I mean, the
path that errno is set to EINPROGRESS or EISCONN?

--yliu

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 02:14:59PM +0300, Ilya Maximets wrote:
> > Hmm, how about this fixup:
> > --
> > diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
> > b/lib/librte_vhost/vhost_user/vhost-net-user.c
> > index 8626d13..b0f45e6 100644
> > --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> > +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> > @@ -537,18 +537,7 @@ vhost_user_connect_nonblock(int fd, struct sockaddr 
> > *un, size_t sz)
> > errno = EINVAL;
> >  
> > ret = connect(fd, un, sz);
> > -   if (ret == -1 && errno != EINPROGRESS)
> > -   return -1;
> > -   if (ret == 0)
> > -   goto connected;
> > -
> > -   FD_ZERO();
> > -   FD_SET(fd, );
> > -
> > -   ret = select(fd + 1, NULL, , NULL, );
> > -   if (!ret)
> > -   errno = ETIMEDOUT;
> > -   if (ret != 1)
> > +   if (ret < 0 && errno != EISCONN)
> > return -1;
> >  
> > ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
> > @@ -558,7 +547,6 @@ vhost_user_connect_nonblock(int fd, struct sockaddr 
> > *un, size_t sz)
> > return -1;
> > }
> >  
> > -connected:
> > flags = fcntl(fd, F_GETFL, 0);
> > if (flags < 0) {
> > RTE_LOG(ERR, VHOST_CONFIG,
> > --
> > ?
> > 
> > We will not check the EINPROGRESS, but subsequent 'connect()' will return
> > EISCONN if connection already established. getsockopt() is kept just in
> > case. Subsequent 'connect()' will happen on the next iteration of
> > reconnection cycle (1 second sleep).
> 
> I've sent v2 with this changes.

Thanks. But still, it doesn't look clean to me. I was thinking following
might be cleaner?

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c
b/lib/librte_vhost/vhost_user/vhost-net-user.
index f0f92f8..c0ef290 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -532,6 +532,10 @@ vhost_user_client_reconnect(void *arg __rte_unused)
 reconn != NULL; reconn = next) {
next = TAILQ_NEXT(reconn, next);

+   if (reconn->conn_inprogress) {
+   /* do connect check here */
+   }
+
if (connect(reconn->fd, (struct sockaddr 
*)>un,
sizeof(reconn->un)) < 0)
continue;
@@ -605,6 +609,7 @@ vhost_user_create_client(struct vhost_user_socket 
*vsocket)
reconn->un = un;
reconn->fd = fd;
reconn->vsocket = vsocket;
+   reconn->conn_inprogress = errno == EINPROGRESS;
pthread_mutex_lock(_list.mutex);
TAILQ_INSERT_TAIL(_list.head, reconn, next);
pthread_mutex_unlock(_list.mutex);

It's just a rough diff, hopefully it shows my idea clearly. And of
course, we should not call connect() anymore when conn_inprogress
is set.

What do you think of it?

--yliu

[dpdk-dev] [PATCH] timer: fix break list when timer_cb reset running timer

2016-07-21 Thread Hiroyuki Mikita

Hi Robert,

Thank you for reviewing.

In the following case, the skip list is broken.
- Timer A and timer B are configured on the same lcore, in the same
pending list.
- The expire time of timer A is earlier than that of timer B.
- rte_timer_manage() is called on the lcore after the expire time of timer B.
- The callback of timer A resets timer B.

In rte_timer_manage() process,
both timers are expired at the same time, so the running list includes
both timers.
States of both timers are transited from PENDING to RUNNING, then
callbacks are executed sequentially.
The callback of timer A resets timer B which is still RUNNING, so
timer B is added to the pending-list.
In this time, timer B is in both the running list and the pending list.
It means that the running list is chained to the pending list. Both
lists are broken.

Regards,
Hiroyuki

[dpdk-dev] [PATCH 3/3] net/thunderx: add 81xx SoC support

2016-07-21 Thread Jerin Jacob

81xx NIC subsystem differs in new PCI subsystem_device_id and
NICVF_CAP_CQE_RX2 capability.

Signed-off-by: Jerin Jacob 
---
 doc/guides/nics/thunderx.rst | 1 +
 drivers/net/thunderx/base/nicvf_hw.c | 3 +++
 drivers/net/thunderx/base/nicvf_hw.h | 1 +
 drivers/net/thunderx/nicvf_ethdev.c  | 7 +++
 4 files changed, 12 insertions(+)

diff --git a/doc/guides/nics/thunderx.rst b/doc/guides/nics/thunderx.rst
index e38f260..0604d25 100644
--- a/doc/guides/nics/thunderx.rst
+++ b/doc/guides/nics/thunderx.rst
@@ -60,6 +60,7 @@ Features of the ThunderX PMD are:
 Supported ThunderX SoCs
 ---
 - CN88xx
+- CN81xx

 Prerequisites
 -
diff --git a/drivers/net/thunderx/base/nicvf_hw.c 
b/drivers/net/thunderx/base/nicvf_hw.c
index 2b12d9c..4bdd183 100644
--- a/drivers/net/thunderx/base/nicvf_hw.c
+++ b/drivers/net/thunderx/base/nicvf_hw.c
@@ -143,6 +143,9 @@ nicvf_base_init(struct nicvf *nic)
if (nicvf_hw_version(nic) == PCI_SUB_DEVICE_ID_CN88XX_PASS2_NICVF)
nic->hwcap |= NICVF_CAP_TUNNEL_PARSING;

+   if (nicvf_hw_version(nic) == PCI_SUB_DEVICE_ID_CN81XX_NICVF)
+   nic->hwcap |= NICVF_CAP_TUNNEL_PARSING | NICVF_CAP_CQE_RX2;
+
return NICVF_OK;
 }

diff --git a/drivers/net/thunderx/base/nicvf_hw.h 
b/drivers/net/thunderx/base/nicvf_hw.h
index 5629d9c..a6cda82 100644
--- a/drivers/net/thunderx/base/nicvf_hw.h
+++ b/drivers/net/thunderx/base/nicvf_hw.h
@@ -42,6 +42,7 @@
 #definePCI_DEVICE_ID_THUNDERX_NICVF0xA034
 #definePCI_SUB_DEVICE_ID_CN88XX_PASS1_NICVF0xA11E
 #definePCI_SUB_DEVICE_ID_CN88XX_PASS2_NICVF0xA134
+#definePCI_SUB_DEVICE_ID_CN81XX_NICVF  0xA234

 #define NICVF_ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

diff --git a/drivers/net/thunderx/nicvf_ethdev.c 
b/drivers/net/thunderx/nicvf_ethdev.c
index f6faddf..4402f6a 100644
--- a/drivers/net/thunderx/nicvf_ethdev.c
+++ b/drivers/net/thunderx/nicvf_ethdev.c
@@ -1758,6 +1758,13 @@ static const struct rte_pci_id pci_id_nicvf_map[] = {
.subsystem_device_id = PCI_SUB_DEVICE_ID_CN88XX_PASS2_NICVF,
},
{
+   .class_id = RTE_CLASS_ANY_ID,
+   .vendor_id = PCI_VENDOR_ID_CAVIUM,
+   .device_id = PCI_DEVICE_ID_THUNDERX_NICVF,
+   .subsystem_vendor_id = PCI_VENDOR_ID_CAVIUM,
+   .subsystem_device_id = PCI_SUB_DEVICE_ID_CN81XX_NICVF,
+   },
+   {
.vendor_id = 0,
},
 };
-- 
2.5.5

[dpdk-dev] [PATCH 2/3] net/thunderx: introduce cqe_rx2 HW capability flag

2016-07-21 Thread Jerin Jacob

Certain thunderx SoC pass has additional optional word
in Rx descriptor to hold tunneling extension info.
Based on this capability, the location where packet pointer
address stored in Rx descriptor will vary.

Signed-off-by: Jerin Jacob 
---
 drivers/net/thunderx/base/nicvf_hw.h | 5 +++--
 drivers/net/thunderx/nicvf_ethdev.c  | 7 ++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/thunderx/base/nicvf_hw.h 
b/drivers/net/thunderx/base/nicvf_hw.h
index 602a6ff..5629d9c 100644
--- a/drivers/net/thunderx/base/nicvf_hw.h
+++ b/drivers/net/thunderx/base/nicvf_hw.h
@@ -50,8 +50,9 @@
 #define NICVF_GET_TX_STATS(reg) \
nicvf_reg_read(nic, NIC_VNIC_TX_STAT_0_4 | (reg << 3))

-
-#define NICVF_CAP_TUNNEL_PARSING  (1ULL << 0)
+#define NICVF_CAP_TUNNEL_PARSING   (1ULL << 0)
+/* Additional word in Rx descriptor to hold optional tunneling extension info 
*/
+#define NICVF_CAP_CQE_RX2  (1ULL << 1)

 enum nicvf_tns_mode {
NIC_TNS_BYPASS_MODE,
diff --git a/drivers/net/thunderx/nicvf_ethdev.c 
b/drivers/net/thunderx/nicvf_ethdev.c
index 3802d49..f6faddf 100644
--- a/drivers/net/thunderx/nicvf_ethdev.c
+++ b/drivers/net/thunderx/nicvf_ethdev.c
@@ -1142,7 +1142,12 @@ nicvf_dev_rx_queue_setup(struct rte_eth_dev *dev, 
uint16_t qidx,
rxq->cq_status = nicvf_qset_base(nic, qidx) + NIC_QSET_CQ_0_7_STATUS;
rxq->cq_door = nicvf_qset_base(nic, qidx) + NIC_QSET_CQ_0_7_DOOR;
rxq->precharge_cnt = 0;
-   rxq->rbptr_offset = NICVF_CQE_RBPTR_WORD;
+
+   if (nicvf_hw_cap(nic) & NICVF_CAP_CQE_RX2)
+   rxq->rbptr_offset = NICVF_CQE_RX2_RBPTR_WORD;
+   else
+   rxq->rbptr_offset = NICVF_CQE_RBPTR_WORD;
+

/* Alloc completion queue */
if (nicvf_qset_cq_alloc(nic, rxq, rxq->queue_id, nb_desc)) {
-- 
2.5.5

[dpdk-dev] [PATCH 1/3] net/thunderx: remove generic passx references from the driver

2016-07-21 Thread Jerin Jacob

thunderx pmd driver needs to support multiple SoC
variants in ThunderX family.
Remove generic pass references from driver as each SoC
can have same pass number.

Signed-off-by: Jerin Jacob 
---
 drivers/net/thunderx/base/nicvf_hw.c |  2 +-
 drivers/net/thunderx/base/nicvf_hw.h | 12 +---
 drivers/net/thunderx/nicvf_ethdev.c  | 24 
 3 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/net/thunderx/base/nicvf_hw.c 
b/drivers/net/thunderx/base/nicvf_hw.c
index 001b0ed..2b12d9c 100644
--- a/drivers/net/thunderx/base/nicvf_hw.c
+++ b/drivers/net/thunderx/base/nicvf_hw.c
@@ -140,7 +140,7 @@ nicvf_base_init(struct nicvf *nic)
if (nic->subsystem_device_id == 0)
return NICVF_ERR_BASE_INIT;

-   if (nicvf_hw_version(nic) == NICVF_PASS2)
+   if (nicvf_hw_version(nic) == PCI_SUB_DEVICE_ID_CN88XX_PASS2_NICVF)
nic->hwcap |= NICVF_CAP_TUNNEL_PARSING;

return NICVF_OK;
diff --git a/drivers/net/thunderx/base/nicvf_hw.h 
b/drivers/net/thunderx/base/nicvf_hw.h
index 9db1d30..602a6ff 100644
--- a/drivers/net/thunderx/base/nicvf_hw.h
+++ b/drivers/net/thunderx/base/nicvf_hw.h
@@ -37,11 +37,11 @@

 #include "nicvf_hw_defs.h"

-#definePCI_VENDOR_ID_CAVIUM0x177D
-#definePCI_DEVICE_ID_THUNDERX_PASS1_NICVF  0x0011
-#definePCI_DEVICE_ID_THUNDERX_PASS2_NICVF  0xA034
-#definePCI_SUB_DEVICE_ID_THUNDERX_PASS1_NICVF  0xA11E
-#definePCI_SUB_DEVICE_ID_THUNDERX_PASS2_NICVF  0xA134
+#definePCI_VENDOR_ID_CAVIUM0x177D
+#definePCI_DEVICE_ID_THUNDERX_CN88XX_PASS1_NICVF   0x0011
+#definePCI_DEVICE_ID_THUNDERX_NICVF0xA034
+#definePCI_SUB_DEVICE_ID_CN88XX_PASS1_NICVF0xA11E
+#definePCI_SUB_DEVICE_ID_CN88XX_PASS2_NICVF0xA134

 #define NICVF_ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

@@ -50,8 +50,6 @@
 #define NICVF_GET_TX_STATS(reg) \
nicvf_reg_read(nic, NIC_VNIC_TX_STAT_0_4 | (reg << 3))

-#define NICVF_PASS1(PCI_SUB_DEVICE_ID_THUNDERX_PASS1_NICVF)
-#define NICVF_PASS2(PCI_SUB_DEVICE_ID_THUNDERX_PASS2_NICVF)

 #define NICVF_CAP_TUNNEL_PARSING  (1ULL << 0)

diff --git a/drivers/net/thunderx/nicvf_ethdev.c 
b/drivers/net/thunderx/nicvf_ethdev.c
index 4f875c0..3802d49 100644
--- a/drivers/net/thunderx/nicvf_ethdev.c
+++ b/drivers/net/thunderx/nicvf_ethdev.c
@@ -265,7 +265,7 @@ nicvf_dev_supported_ptypes_get(struct rte_eth_dev *dev)
size_t copied;
static uint32_t ptypes[32];
struct nicvf *nic = nicvf_pmd_priv(dev);
-   static const uint32_t ptypes_pass1[] = {
+   static const uint32_t ptypes_common[] = {
RTE_PTYPE_L3_IPV4,
RTE_PTYPE_L3_IPV4_EXT,
RTE_PTYPE_L3_IPV6,
@@ -274,7 +274,7 @@ nicvf_dev_supported_ptypes_get(struct rte_eth_dev *dev)
RTE_PTYPE_L4_UDP,
RTE_PTYPE_L4_FRAG,
};
-   static const uint32_t ptypes_pass2[] = {
+   static const uint32_t ptypes_tunnel[] = {
RTE_PTYPE_TUNNEL_GRE,
RTE_PTYPE_TUNNEL_GENEVE,
RTE_PTYPE_TUNNEL_VXLAN,
@@ -282,12 +282,12 @@ nicvf_dev_supported_ptypes_get(struct rte_eth_dev *dev)
};
static const uint32_t ptypes_end = RTE_PTYPE_UNKNOWN;

-   copied = sizeof(ptypes_pass1);
-   memcpy(ptypes, ptypes_pass1, copied);
-   if (nicvf_hw_version(nic) == NICVF_PASS2) {
-   memcpy((char *)ptypes + copied, ptypes_pass2,
-   sizeof(ptypes_pass2));
-   copied += sizeof(ptypes_pass2);
+   copied = sizeof(ptypes_common);
+   memcpy(ptypes, ptypes_common, copied);
+   if (nicvf_hw_cap(nic) & NICVF_CAP_TUNNEL_PARSING) {
+   memcpy((char *)ptypes + copied, ptypes_tunnel,
+   sizeof(ptypes_tunnel));
+   copied += sizeof(ptypes_tunnel);
}

memcpy((char *)ptypes + copied, _end, sizeof(ptypes_end));
@@ -1741,16 +1741,16 @@ static const struct rte_pci_id pci_id_nicvf_map[] = {
{
.class_id = RTE_CLASS_ANY_ID,
.vendor_id = PCI_VENDOR_ID_CAVIUM,
-   .device_id = PCI_DEVICE_ID_THUNDERX_PASS1_NICVF,
+   .device_id = PCI_DEVICE_ID_THUNDERX_CN88XX_PASS1_NICVF,
.subsystem_vendor_id = PCI_VENDOR_ID_CAVIUM,
-   .subsystem_device_id = PCI_SUB_DEVICE_ID_THUNDERX_PASS1_NICVF,
+   .subsystem_device_id = PCI_SUB_DEVICE_ID_CN88XX_PASS1_NICVF,
},
{
.class_id = RTE_CLASS_ANY_ID,
.vendor_id = PCI_VENDOR_ID_CAVIUM,
-   .device_id = PCI_DEVICE_ID_THUNDERX_PASS2_NICVF,
+   .device_id = PCI_DEVICE_ID_THUNDERX_NICVF,
.subsystem_vendor_id = PCI_VENDOR_ID_CAVIUM,
-   .subsystem_device_id = PCI_SUB_DEVICE_ID_THUNDERX_PASS2_NICVF,
+

[dpdk-dev] [PATCH 0/3] net/thunderx: add 81xx SoC support

2016-07-21 Thread Jerin Jacob

CN81xx is four core version of ThunderX SoC.

Added the support by adding new HW capability flag to select
the difference in runtime.

Jerin Jacob (3):
  net/thunderx: remove generic passx references from the driver
  net/thunderx: introduce cqe_rx2 HW capability flag
  net/thunderx: add 81xx SoC support

 doc/guides/nics/thunderx.rst |  1 +
 drivers/net/thunderx/base/nicvf_hw.c |  5 -
 drivers/net/thunderx/base/nicvf_hw.h | 18 -
 drivers/net/thunderx/nicvf_ethdev.c  | 38 
 4 files changed, 39 insertions(+), 23 deletions(-)

-- 
2.5.5

[dpdk-dev] [PATCH] net/i40e: fix out-of-bounds writes during vector Rx

2016-07-21 Thread bynes adam

On Thu, Jul 21, 2016 at 02:03:38PM +0300, Ilya Maximets wrote:
> From: Sergey Dyasly 
> 
> Rx loop inside _recv_raw_pkts_vec() ignores nb_pkts argument and always
> tries to receive RTE_I40E_VPMD_RX_BURST (32) packets. This is a violation
> of rte_eth_rx_burst() API and can lead to memory corruption (out-of-bounds
> writes to struct rte_mbuf **rx_pkts) if nb_pkts is less than 32.
> 
> Fix this by actually using nb_pkts inside the loop.
> 
> Fixes: 9ed94e5bb04e ("i40e: add vector Rx")
> 
> Signed-off-by: Sergey Dyasly 
> Acked-by: Ilya Maximets 
> ---
>  drivers/net/i40e/i40e_rxtx_vec.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx_vec.c 
> b/drivers/net/i40e/i40e_rxtx_vec.c
> index 05cb415..51fb282 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec.c
> @@ -269,7 +269,7 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>* D. fill info. from desc to mbuf
>*/
>  
> - for (pos = 0, nb_pkts_recd = 0; pos < RTE_I40E_VPMD_RX_BURST;
> + for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
>   pos += RTE_I40E_DESCS_PER_LOOP,
>   rxdp += RTE_I40E_DESCS_PER_LOOP) {
>   __m128i descs[RTE_I40E_DESCS_PER_LOOP];
> -- 
> 2.7.4
 Acked-by: Adam Bynes

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-21 Thread Adrien Mazarguil

Hi Rahul,

Please see below.

On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
> Hi Adrien,
> 
> The proposal looks very good.  It satisfies most of the features
> supported by Chelsio NICs.  We are looking for suggestions on exposing
> more additional features supported by Chelsio NICs via this API.
> 
> Chelsio NICs have two regions in which filters can be placed -
> Maskfull and Maskless regions.  As their names imply, maskfull region
> can accept masks to match a range of values; whereas, maskless region
> don't accept any masks and hence perform a more strict exact-matches.
> Filters without masks can also be placed in maskfull region.  By
> default, maskless region have higher priority over the maskfull region.
> However, the priority between the two regions is configurable.

I understand this configuration affects the entire device. Just to be clear,
assuming some filters are already configured, are they affected by a change
of region priority later?

> Please suggest on how we can let the apps configure in which region
> filters must be placed and set the corresponding priority accordingly
> via this API.

Okay. Applications, like customers, are always right.

With this in mind, PMDs are not allowed to break existing flow rules, and
face two options when applications provide a flow specification that would
break an existing rule:

- Refuse to create it (easiest).

- Find a workaround to apply it anyway (possibly quite complicated).

The reason you have these two regions is performance right? Otherwise I'd
just say put everything in the maskfull region.

PMDs are allowed to rearrange existing rules or change device parameters as
long as existing constraints are satisfied. In my opinion it does not matter
which region has the highest default priority. Applications always want the
best performance so the first created rule should be in the most appropriate
region.

If a subsequent rule requires it to be in the other region but the
application specified the wrong priority for this to work, then the PMD can
either choose to swap region priorities on the device (assuming it does not
affect other rules), or destroy and recreate the original rule in a way that
satisfies all constraints (i.e. moving conflicting rules from the maskless
region to the maskfull one).

Going further, when subsequent rules get destroyed the PMD should ideally
move back maskfull rules back into the maskless region for better
performance.

This is only a suggestion. PMDs have the right to say "no" at some point.

More important in my opinion is to make sure applications can create a given
set of flow rules in any order. If rules a/b/c can be created, then it won't
make sense from an application point of view if c/a/b for some reason cannot
and the PMD maintainers will rightfully get a bug report.

> More comments below.
> 
> On Tuesday, July 07/05/16, 2016 at 20:16:46 +0200, Adrien Mazarguil wrote:
> > Hi All,
> > 
> [...]
> 
> > 
> > ``ETH``
> > ^^^
> > 
> > Matches an Ethernet header.
> > 
> > - ``dst``: destination MAC.
> > - ``src``: source MAC.
> > - ``type``: EtherType.
> > - ``tags``: number of 802.1Q/ad tags defined.
> > - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> > 
> >  - ``tpid``: Tag protocol identifier.
> >  - ``tci``: Tag control information.
> > 
> > ``IPV4``
> > 
> > 
> > Matches an IPv4 header.
> > 
> > - ``src``: source IP address.
> > - ``dst``: destination IP address.
> > - ``tos``: ToS/DSCP field.
> > - ``ttl``: TTL field.
> > - ``proto``: protocol number for the next layer.
> > 
> > ``IPV6``
> > 
> > 
> > Matches an IPv6 header.
> > 
> > - ``src``: source IP address.
> > - ``dst``: destination IP address.
> > - ``tc``: traffic class field.
> > - ``nh``: Next header field (protocol).
> > - ``hop_limit``: hop limit field (TTL).
> > 
> > ``ICMP``
> > 
> > 
> > Matches an ICMP header.
> > 
> > - TBD.
> > 
> > ``UDP``
> > ^^^
> > 
> > Matches a UDP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - ``length``: UDP length.
> > - ``checksum``: UDP checksum.
> > 
> > .. raw:: pdf
> > 
> >PageBreak
> > 
> > ``TCP``
> > ^^^
> > 
> > Matches a TCP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - All other TCP fields and bits.
> > 
> > ``VXLAN``
> > ^
> > 
> > Matches a VXLAN header.
> > 
> > - TBD.
> > 
> 
> In addition to above matches, Chelsio NICs have some additional
> features:
> 
> - Match based on unicast DST-MAC, multicast DST-MAC, broadcast DST-MAC.
>   Also, there is a match criteria available called 'promisc' - which
>   matches packets that are not destined for the interface, but had
>   been received by the hardware due to interface being in promiscuous
>   mode.

There is no unicast/multicast/broadcast distinction in the ETH pattern item,
those are covered by the specified MAC address.

Now about this "promisc" match criteria, it can be added as a

[dpdk-dev] [PATCH 04/12] mbuf: add function to calculate a checksum

2016-07-21 Thread Olivier Matz

Dear Don,

On 07/21/2016 06:26 PM, Don Provan wrote:
>> -Original Message-
>> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
>> Sent: Thursday, July 21, 2016 3:51 AM
>> Subject: Re: [dpdk-dev] [PATCH 04/12] mbuf: add function to calculate a
>> checksum
>>
>> ...
>>> +  Added a new function ``rte_pktmbuf_cksum()`` to process the checksum
>>> + of  data embedded in an mbuf chain.
>>> ...
>>> +#include 
>>
>> As a nit, do we need to introduce a dependency for librte_mbuf on librte_net?
>> Might be better to put this functionality into librte_net?
> 
> That's not a nit at all. This is clearly a net function that has no place in 
> the mbuf code.
> That should be obvious even before we notice this circular dependency.

The function is called rte_pktmbuf_cksum(), and takes a mbuf as a
parameter. You cannot haughtily say "it no place in the mbuf code".

As you can see, librte_net only contains headers files. The initial goal
of librte_net was to contain network headers and nothing more.
See:
http://dpdk.org/browse/dpdk/commit/lib/librte_net?id=af75078fece3615088e561357c1e97603e43a5fe

To me, the question of having a dependency in one direction or another
(librte_net needs librte_mbuf, or librte_mbuf needs librte_net) is an
open debate.

I've asked myself the same question when software packet type parsing,
that needs definitions of network headers. As packet_type is a pure mbuf
notion, my choice was to have this parse in mbuf library, and using
network headers definitions provided by librte_net.

[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-21 Thread Thomas Monjalon

2016-07-21 16:41, Igor Ryzhov:
> On Thu, Jul 21, 2016 at 4:33 PM, Ferruh Yigit 
> wrote:
> > On 7/20/2016 5:07 PM, Thomas Monjalon wrote:
> > > The out-of-tree kernel code must be avoided.
> > > Moreover there is no good reason to keep this legacy feature
> > > which is only partially supported.
> > >
> > > As described earlier in this plan:
> > >   http://dpdk.org/ml/archives/dev/2016-July/043606.html
> > > it will help to keep PCI ids in PMD code.
> > >
> > > Signed-off-by: Thomas Monjalon 
[...]
> > > +
> > > +* The ethtool support will be removed from KNI in 16.11.
> > > +  It is implemented only for igb and ixgbe.
> > > +  It is really hard to maintain because it requires some out-of-tree 
> > > kernel
> > > +  code to be duplicated in this kernel module.
> > > +  Removing this partial support will help to restrict the PCI id 
> > > definitions
> > > +  to the PMD code.
> >
> > KNI ethtool is functional and maintained, and it may have users!
> >
> > Why just removing it, specially without providing an alternative?

Because
1/ It is using the shared PCI ids that we want to move
2/ It has a poor support (igb/ixgbe) and makes users confused
3/ It is a big import of another version of igb/ixgbe drivers

About the point 1, if we decide to keep KNI ethtool, please could you
duplicate the igb/ixgbe PCI ids in KNI?

> > Is is good time to discuss KCP again?
> 
> I think good alternative is rte_ethtool library from ethtool sample
> application.

Yes I think so.

> But I am wondering why this code is only in app, not in lib.

It is an example lib because we were not sure wether we wanted to
support it. But maybe it is time to discuss its status and check
if it can be integrated with other DPDK libs?

[dpdk-dev] [RFC 21/21] net/pcap: remove rte prefix from static functions

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 552a1de..55ff2c0 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -783,7 +783,7 @@ open_tx_iface(const char *key, const char *value, void 
*extra_args)
 }

 static int
-rte_pmd_init_internals(const char *name, const unsigned nb_rx_queues,
+pmd_init_internals(const char *name, const unsigned nb_rx_queues,
const unsigned nb_tx_queues, struct pmd_internals **internals,
struct rte_eth_dev **eth_dev)
 {
@@ -846,7 +846,7 @@ error:
 }

 static int
-rte_eth_from_pcaps_common(const char *name, struct pmd_devargs *rx_queues,
+eth_from_pcaps_common(const char *name, struct pmd_devargs *rx_queues,
const unsigned nb_rx_queues, struct pmd_devargs *tx_queues,
const unsigned nb_tx_queues, struct rte_kvargs *kvlist,
struct pmd_internals **internals, struct rte_eth_dev **eth_dev)
@@ -861,7 +861,7 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
if (tx_queues == NULL && nb_tx_queues > 0)
return -1;

-   if (rte_pmd_init_internals(name, nb_rx_queues, nb_tx_queues, internals,
+   if (pmd_init_internals(name, nb_rx_queues, nb_tx_queues, internals,
eth_dev) < 0)
return -1;

@@ -899,7 +899,7 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
 }

 static int
-rte_eth_from_pcaps(const char *name, struct pmd_devargs *rx_queues,
+eth_from_pcaps(const char *name, struct pmd_devargs *rx_queues,
const unsigned nb_rx_queues, struct pmd_devargs *tx_queues,
const unsigned nb_tx_queues, struct rte_kvargs *kvlist,
int single_iface, unsigned int using_dumpers)
@@ -908,7 +908,7 @@ rte_eth_from_pcaps(const char *name, struct pmd_devargs 
*rx_queues,
struct rte_eth_dev *eth_dev = NULL;
int ret;

-   ret = rte_eth_from_pcaps_common(name, rx_queues, nb_rx_queues,
+   ret = eth_from_pcaps_common(name, rx_queues, nb_rx_queues,
tx_queues, nb_tx_queues, kvlist, , _dev);

if (ret < 0)
@@ -929,7 +929,7 @@ rte_eth_from_pcaps(const char *name, struct pmd_devargs 
*rx_queues,


 static int
-rte_pmd_pcap_devinit(const char *name, const char *params)
+pmd_pcap_devinit(const char *name, const char *params)
 {
unsigned int is_rx_pcap = 0, is_tx_pcap = 0;
struct rte_kvargs *kvlist;
@@ -1018,7 +1018,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
goto free_kvlist;

 create_eth:
-   ret = rte_eth_from_pcaps(name, , pcaps.num_of_queue, ,
+   ret = eth_from_pcaps(name, , pcaps.num_of_queue, ,
dumpers.num_of_queue, kvlist, single_iface, is_tx_pcap);

 free_kvlist:
@@ -1028,7 +1028,7 @@ free_kvlist:
 }

 static int
-rte_pmd_pcap_devuninit(const char *name)
+pmd_pcap_devuninit(const char *name)
 {
struct rte_eth_dev *eth_dev = NULL;

@@ -1053,8 +1053,8 @@ rte_pmd_pcap_devuninit(const char *name)

 static struct rte_driver pmd_pcap_drv = {
.type = PMD_VDEV,
-   .init = rte_pmd_pcap_devinit,
-   .uninit = rte_pmd_pcap_devuninit,
+   .init = pmd_pcap_devinit,
+   .uninit = pmd_pcap_devuninit,
 };

 PMD_REGISTER_DRIVER(pmd_pcap_drv, eth_pcap);
-- 
2.7.4

[dpdk-dev] [RFC 20/21] net/pcap: coding convention updates

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 63 -
 1 file changed, 31 insertions(+), 32 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index f3df372..552a1de 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -49,6 +49,7 @@
 #define RTE_ETH_PCAP_SNAPLEN ETHER_MAX_JUMBO_FRAME_LEN
 #define RTE_ETH_PCAP_PROMISC 1
 #define RTE_ETH_PCAP_TIMEOUT -1
+
 #define ETH_PCAP_RX_PCAP_ARG  "rx_pcap"
 #define ETH_PCAP_TX_PCAP_ARG  "tx_pcap"
 #define ETH_PCAP_RX_IFACE_ARG "rx_iface"
@@ -112,7 +113,9 @@ static const char *valid_arguments[] = {
NULL
 };

-static struct ether_addr eth_addr = { .addr_bytes = { 0, 0, 0, 0x1, 0x2, 0x3 } 
};
+static struct ether_addr eth_addr = {
+   .addr_bytes = { 0, 0, 0, 0x1, 0x2, 0x3 }
+};
 static const char *drivername = "Pcap PMD";
 static struct rte_eth_link pmd_link = {
.link_speed = ETH_SPEED_NUM_10G,
@@ -122,15 +125,12 @@ static struct rte_eth_link pmd_link = {
 };

 static int
-eth_pcap_rx_jumbo(struct rte_mempool *mb_pool,
- struct rte_mbuf *mbuf,
- const u_char *data,
- uint16_t data_len)
+eth_pcap_rx_jumbo(struct rte_mempool *mb_pool, struct rte_mbuf *mbuf,
+   const u_char *data, uint16_t data_len)
 {
-   struct rte_mbuf *m = mbuf;
-
/* Copy the first segment. */
uint16_t len = rte_pktmbuf_tailroom(mbuf);
+   struct rte_mbuf *m = mbuf;

rte_memcpy(rte_pktmbuf_append(mbuf, len), data, len);
data_len -= len;
@@ -170,7 +170,7 @@ eth_pcap_gather_data(unsigned char *data, struct rte_mbuf 
*mbuf)

while (mbuf) {
rte_memcpy(data + data_len, rte_pktmbuf_mtod(mbuf, void *),
-  mbuf->data_len);
+   mbuf->data_len);

data_len += mbuf->data_len;
mbuf = mbuf->next;
@@ -178,9 +178,7 @@ eth_pcap_gather_data(unsigned char *data, struct rte_mbuf 
*mbuf)
 }

 static uint16_t
-eth_pcap_rx(void *queue,
-   struct rte_mbuf **bufs,
-   uint16_t nb_pkts)
+eth_pcap_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
unsigned i;
struct pcap_pkthdr header;
@@ -233,6 +231,7 @@ eth_pcap_rx(void *queue,
}
pcap_q->rx_stat.pkts += num_rx;
pcap_q->rx_stat.bytes += rx_bytes;
+
return num_rx;
 }

@@ -251,9 +250,7 @@ calculate_timestamp(struct timeval *ts) {
  * Callback to handle writing packets to a pcap file.
  */
 static uint16_t
-eth_pcap_tx_dumper(void *queue,
-   struct rte_mbuf **bufs,
-   uint16_t nb_pkts)
+eth_pcap_tx_dumper(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
unsigned i;
struct rte_mbuf *mbuf;
@@ -306,6 +303,7 @@ eth_pcap_tx_dumper(void *queue,
dumper_q->tx_stat.pkts += num_tx;
dumper_q->tx_stat.bytes += tx_bytes;
dumper_q->tx_stat.err_pkts += nb_pkts - num_tx;
+
return num_tx;
 }

@@ -313,9 +311,7 @@ eth_pcap_tx_dumper(void *queue,
  * Callback to handle sending packets through a real NIC.
  */
 static uint16_t
-eth_pcap_tx(void *queue,
-   struct rte_mbuf **bufs,
-   uint16_t nb_pkts)
+eth_pcap_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
unsigned i;
int ret;
@@ -332,14 +328,13 @@ eth_pcap_tx(void *queue,

if (likely(mbuf->nb_segs == 1)) {
ret = pcap_sendpacket(tx_queue->pcap,
- rte_pktmbuf_mtod(mbuf, u_char *),
- mbuf->pkt_len);
+   rte_pktmbuf_mtod(mbuf, u_char *),
+   mbuf->pkt_len);
} else {
if (mbuf->pkt_len <= ETHER_MAX_JUMBO_FRAME_LEN) {
eth_pcap_gather_data(tx_pcap_data, mbuf);
ret = pcap_sendpacket(tx_queue->pcap,
- tx_pcap_data,
- mbuf->pkt_len);
+   tx_pcap_data, mbuf->pkt_len);
} else {
RTE_LOG(ERR, PMD,
"Dropping PCAP packet. "
@@ -362,6 +357,7 @@ eth_pcap_tx(void *queue,
tx_queue->tx_stat.pkts += num_tx;
tx_queue->tx_stat.bytes += tx_bytes;
tx_queue->tx_stat.err_pkts += nb_pkts - num_tx;
+
return num_tx;
 }

@@ -457,9 +453,7 @@ eth_dev_start(struct rte_eth_dev *dev)
if (!tx->dumper && strcmp(tx->type, ETH_PCAP_TX_PCAP_ARG) == 0) 
{
if (open_single_tx_pcap(tx->name, >dumper) < 0)
return -1;
-   }
-
-   else if (!tx->pcap &&

[dpdk-dev] [RFC 19/21] net/pcap: fix missing tx iface assignment

2016-07-21 Thread Ferruh Yigit

Missing pcap assignment may cause pcap opened again, and previous one
not closed at all.

Fixes: 1e38a7c66923 ("pcap: fix storage of name and type in queues")

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index ece7ff0..f3df372 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -881,6 +881,7 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
struct devargs_queue *queue = _queues->queue[i];

tx->dumper = queue->dumper;
+   tx->pcap = queue->pcap;
snprintf(tx->name, sizeof(tx->name), "%s", queue->name);
snprintf(tx->type, sizeof(tx->type), "%s", queue->type);
}
-- 
2.7.4

[dpdk-dev] [RFC 18/21] net/pcap: simplify function

2016-07-21 Thread Ferruh Yigit

simplify function rte_eth_from_pcaps_common by using interim variables.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 83b05e2..ece7ff0 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -868,22 +868,21 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
return -1;

for (i = 0; i < nb_rx_queues; i++) {
-   (*internals)->rx_queue[i].pcap = rx_queues->queue[i].pcap;
-   snprintf((*internals)->rx_queue[i].name,
-   sizeof((*internals)->rx_queue[i].name), "%s",
-   rx_queues->queue[i].name);
-   snprintf((*internals)->rx_queue[i].type,
-   sizeof((*internals)->rx_queue[i].type), "%s",
-   rx_queues->queue[i].type);
+   struct pcap_rx_queue *rx = &(*internals)->rx_queue[i];
+   struct devargs_queue *queue = _queues->queue[i];
+
+   rx->pcap = queue->pcap;
+   snprintf(rx->name, sizeof(rx->name), "%s", queue->name);
+   snprintf(rx->type, sizeof(rx->type), "%s", queue->type);
}
+
for (i = 0; i < nb_tx_queues; i++) {
-   (*internals)->tx_queue[i].dumper = tx_queues->queue[i].dumper;
-   snprintf((*internals)->tx_queue[i].name,
-   sizeof((*internals)->tx_queue[i].name), "%s",
-   tx_queues->queue[i].name);
-   snprintf((*internals)->tx_queue[i].type,
-   sizeof((*internals)->tx_queue[i].type), "%s",
-   tx_queues->queue[i].type);
+   struct pcap_tx_queue *tx = &(*internals)->tx_queue[i];
+   struct devargs_queue *queue = _queues->queue[i];
+
+   tx->dumper = queue->dumper;
+   snprintf(tx->name, sizeof(tx->name), "%s", queue->name);
+   snprintf(tx->type, sizeof(tx->type), "%s", queue->type);
}

for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
-- 
2.7.4

[dpdk-dev] [RFC 17/21] net/pcap: remove redundant assignment

2016-07-21 Thread Ferruh Yigit

data->name assigned twice.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 9cb9861..83b05e2 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -825,8 +825,6 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
data->nb_tx_queues = (uint16_t)nb_tx_queues;
data->dev_link = pmd_link;
data->mac_addrs = _addr;
-   strncpy(data->name,
-   (*eth_dev)->data->name, strlen((*eth_dev)->data->name));

/*
 * NOTE: we'll replace the data element, of originally allocated
-- 
2.7.4

[dpdk-dev] [RFC 16/21] net/pcap: remove unnecessary check

2016-07-21 Thread Ferruh Yigit

Both fields are fields of same type of struct, one's size can't be bigger
than others.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index fb170db..9cb9861 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -812,10 +812,6 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
if (*eth_dev == NULL)
goto error;

-   /* check length of device name */
-   if ((strlen((*eth_dev)->data->name) + 1) > sizeof(data->name))
-   goto error;
-
/* now put it all together
 * - store queue data in internals,
 * - store numa_node info in eth_dev
-- 
2.7.4

[dpdk-dev] [RFC 15/21] net/pcap: update how single iface handled

2016-07-21 Thread Ferruh Yigit

Remove hardcoded single interface values, make it more obvious.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 8e011ea..fb170db 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -967,16 +967,14 @@ rte_pmd_pcap_devinit(const char *name, const char *params)

if (ret < 0)
goto free_kvlist;
-   dumpers.queue[0].pcap = pcaps.queue[0].pcap;
-   dumpers.queue[0].name = pcaps.queue[0].name;
-   dumpers.queue[0].type = pcaps.queue[0].type;

-   single_iface = 1;
+   dumpers.queue[0] = pcaps.queue[0];

-   ret = rte_eth_from_pcaps(name, , 1, , 1,
-   kvlist, single_iface, is_tx_pcap);
+   single_iface = 1;
+   pcaps.num_of_queue = 1;
+   dumpers.num_of_queue = 1;

-   goto free_kvlist;
+   goto create_eth;
}

/*
@@ -1027,6 +1025,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
if (ret < 0)
goto free_kvlist;

+create_eth:
ret = rte_eth_from_pcaps(name, , pcaps.num_of_queue, ,
dumpers.num_of_queue, kvlist, single_iface, is_tx_pcap);

-- 
2.7.4

[dpdk-dev] [RFC 14/21] net/pcap: reorder functions

2016-07-21 Thread Ferruh Yigit

Reorder functions to be able to remove function declarations in .c file.
Function definitions not modified.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 129 
 1 file changed, 64 insertions(+), 65 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index a142e38..8e011ea 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -112,10 +112,6 @@ static const char *valid_arguments[] = {
NULL
 };

-static int open_single_tx_pcap(const char *pcap_filename, pcap_dumper_t 
**dumper);
-static int open_single_rx_pcap(const char *pcap_filename, pcap_t **pcap);
-static int open_single_iface(const char *iface, pcap_t **pcap);
-
 static struct ether_addr eth_addr = { .addr_bytes = { 0, 0, 0, 0x1, 0x2, 0x3 } 
};
 static const char *drivername = "Pcap PMD";
 static struct rte_eth_link pmd_link = {
@@ -369,6 +365,70 @@ eth_pcap_tx(void *queue,
return num_tx;
 }

+/*
+ * pcap_open_live wrapper function
+ */
+static inline int
+open_iface_live(const char *iface, pcap_t **pcap) {
+   *pcap = pcap_open_live(iface, RTE_ETH_PCAP_SNAPLEN,
+   RTE_ETH_PCAP_PROMISC, RTE_ETH_PCAP_TIMEOUT, errbuf);
+
+   if (*pcap == NULL) {
+   RTE_LOG(ERR, PMD, "Couldn't open %s: %s\n", iface, errbuf);
+   return -1;
+   }
+
+   return 0;
+}
+
+static int
+open_single_iface(const char *iface, pcap_t **pcap)
+{
+   if (open_iface_live(iface, pcap) < 0) {
+   RTE_LOG(ERR, PMD, "Couldn't open interface %s\n", iface);
+   return -1;
+   }
+
+   return 0;
+}
+
+static int
+open_single_tx_pcap(const char *pcap_filename, pcap_dumper_t **dumper)
+{
+   pcap_t *tx_pcap;
+
+   /*
+* We need to create a dummy empty pcap_t to use it
+* with pcap_dump_open(). We create big enough an Ethernet
+* pcap holder.
+*/
+   if ((tx_pcap = pcap_open_dead(DLT_EN10MB, RTE_ETH_PCAP_SNAPSHOT_LEN))
+   == NULL) {
+   RTE_LOG(ERR, PMD, "Couldn't create dead pcap\n");
+   return -1;
+   }
+
+   /* The dumper is created using the previous pcap_t reference */
+   if ((*dumper = pcap_dump_open(tx_pcap, pcap_filename)) == NULL) {
+   RTE_LOG(ERR, PMD, "Couldn't open %s for writing.\n",
+   pcap_filename);
+   return -1;
+   }
+
+   return 0;
+}
+
+static int
+open_single_rx_pcap(const char *pcap_filename, pcap_t **pcap)
+{
+   if ((*pcap = pcap_open_offline(pcap_filename, errbuf)) == NULL) {
+   RTE_LOG(ERR, PMD, "Couldn't open %s: %s\n", pcap_filename, 
errbuf);
+   return -1;
+   }
+
+   return 0;
+}
+
 static int
 eth_dev_start(struct rte_eth_dev *dev)
 {
@@ -636,16 +696,6 @@ open_rx_pcap(const char *key, const char *value, void 
*extra_args)
return 0;
 }

-static int
-open_single_rx_pcap(const char *pcap_filename, pcap_t **pcap)
-{
-   if ((*pcap = pcap_open_offline(pcap_filename, errbuf)) == NULL) {
-   RTE_LOG(ERR, PMD, "Couldn't open %s: %s\n", pcap_filename, 
errbuf);
-   return -1;
-   }
-   return 0;
-}
-
 /*
  * Opens a pcap file for writing and stores a reference to it
  * for use it later on.
@@ -670,46 +720,6 @@ open_tx_pcap(const char *key, const char *value, void 
*extra_args)
return 0;
 }

-static int
-open_single_tx_pcap(const char *pcap_filename, pcap_dumper_t **dumper)
-{
-   pcap_t *tx_pcap;
-   /*
-* We need to create a dummy empty pcap_t to use it
-* with pcap_dump_open(). We create big enough an Ethernet
-* pcap holder.
-*/
-
-   if ((tx_pcap = pcap_open_dead(DLT_EN10MB, RTE_ETH_PCAP_SNAPSHOT_LEN))
-   == NULL) {
-   RTE_LOG(ERR, PMD, "Couldn't create dead pcap\n");
-   return -1;
-   }
-
-   /* The dumper is created using the previous pcap_t reference */
-   if ((*dumper = pcap_dump_open(tx_pcap, pcap_filename)) == NULL) {
-   RTE_LOG(ERR, PMD, "Couldn't open %s for writing.\n", 
pcap_filename);
-   return -1;
-   }
-
-   return 0;
-}
-
-/*
- * pcap_open_live wrapper function
- */
-static inline int
-open_iface_live(const char *iface, pcap_t **pcap) {
-   *pcap = pcap_open_live(iface, RTE_ETH_PCAP_SNAPLEN,
-   RTE_ETH_PCAP_PROMISC, RTE_ETH_PCAP_TIMEOUT, errbuf);
-
-   if (*pcap == NULL) {
-   RTE_LOG(ERR, PMD, "Couldn't open %s: %s\n", iface, errbuf);
-   return -1;
-   }
-   return 0;
-}
-
 /*
  * Opens an interface for reading and writing
  */
@@ -775,17 +785,6 @@ open_tx_iface(const char *key, const char *value, void 
*extra_args)
 }

 static int
-open_single_iface(const char *iface, pcap_t **pcap)
-{
-   if (open_iface_live(iface, pcap) < 0) {
-   RTE_LOG(ERR, PMD,

[dpdk-dev] [RFC 13/21] net/pcap: reorder header files

2016-07-21 Thread Ferruh Yigit

Remove unused and sort remaining.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 3d85a7f..a142e38 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -33,19 +33,18 @@
  */

 #include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 

 #include 

 #include 

+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
 #define RTE_ETH_PCAP_SNAPSHOT_LEN 65535
 #define RTE_ETH_PCAP_SNAPLEN ETHER_MAX_JUMBO_FRAME_LEN
 #define RTE_ETH_PCAP_PROMISC 1
-- 
2.7.4

[dpdk-dev] [RFC 12/21] net/pcap: make const array static

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index ec25912..3d85a7f 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -104,7 +104,7 @@ struct pmd_devargs {
} queue[RTE_PMD_PCAP_MAX_QUEUES];
 };

-const char *valid_arguments[] = {
+static const char *valid_arguments[] = {
ETH_PCAP_RX_PCAP_ARG,
ETH_PCAP_TX_PCAP_ARG,
ETH_PCAP_RX_IFACE_ARG,
-- 
2.7.4

[dpdk-dev] [RFC 11/21] net/pcap: group stats related fields into a struct

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 72 +
 1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 31eed58..ec25912 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -64,13 +64,17 @@ static struct timeval start_time;
 static uint64_t start_cycles;
 static uint64_t hz;

+struct queue_stat {
+   volatile unsigned long pkts;
+   volatile unsigned long bytes;
+   volatile unsigned long err_pkts;
+};
+
 struct pcap_rx_queue {
pcap_t *pcap;
uint8_t in_port;
struct rte_mempool *mb_pool;
-   volatile unsigned long rx_pkts;
-   volatile unsigned long rx_bytes;
-   volatile unsigned long err_pkts;
+   struct queue_stat rx_stat;
char name[PATH_MAX];
char type[ETH_PCAP_ARG_MAXLEN];
 };
@@ -78,9 +82,7 @@ struct pcap_rx_queue {
 struct pcap_tx_queue {
pcap_dumper_t *dumper;
pcap_t *pcap;
-   volatile unsigned long tx_pkts;
-   volatile unsigned long tx_bytes;
-   volatile unsigned long err_pkts;
+   struct queue_stat tx_stat;
char name[PATH_MAX];
char type[ETH_PCAP_ARG_MAXLEN];
 };
@@ -234,8 +236,8 @@ eth_pcap_rx(void *queue,
num_rx++;
rx_bytes += header.caplen;
}
-   pcap_q->rx_pkts += num_rx;
-   pcap_q->rx_bytes += rx_bytes;
+   pcap_q->rx_stat.pkts += num_rx;
+   pcap_q->rx_stat.bytes += rx_bytes;
return num_rx;
 }

@@ -306,9 +308,9 @@ eth_pcap_tx_dumper(void *queue,
 * we flush the pcap dumper within each burst.
 */
pcap_dump_flush(dumper_q->dumper);
-   dumper_q->tx_pkts += num_tx;
-   dumper_q->tx_bytes += tx_bytes;
-   dumper_q->err_pkts += nb_pkts - num_tx;
+   dumper_q->tx_stat.pkts += num_tx;
+   dumper_q->tx_stat.bytes += tx_bytes;
+   dumper_q->tx_stat.err_pkts += nb_pkts - num_tx;
return num_tx;
 }

@@ -362,9 +364,9 @@ eth_pcap_tx(void *queue,
rte_pktmbuf_free(mbuf);
}

-   tx_queue->tx_pkts += num_tx;
-   tx_queue->tx_bytes += tx_bytes;
-   tx_queue->err_pkts += nb_pkts - num_tx;
+   tx_queue->tx_stat.pkts += num_tx;
+   tx_queue->tx_stat.bytes += tx_bytes;
+   tx_queue->tx_stat.err_pkts += nb_pkts - num_tx;
return num_tx;
 }

@@ -501,7 +503,7 @@ eth_dev_info(struct rte_eth_dev *dev,

 static void
 eth_stats_get(struct rte_eth_dev *dev,
-   struct rte_eth_stats *igb_stats)
+   struct rte_eth_stats *stats)
 {
unsigned i;
unsigned long rx_packets_total = 0, rx_bytes_total = 0;
@@ -511,27 +513,27 @@ eth_stats_get(struct rte_eth_dev *dev,

for (i = 0; i < RTE_ETHDEV_QUEUE_STAT_CNTRS &&
i < dev->data->nb_rx_queues; i++) {
-   igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
-   igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
-   rx_packets_total += igb_stats->q_ipackets[i];
-   rx_bytes_total += igb_stats->q_ibytes[i];
+   stats->q_ipackets[i] = internal->rx_queue[i].rx_stat.pkts;
+   stats->q_ibytes[i] = internal->rx_queue[i].rx_stat.bytes;
+   rx_packets_total += stats->q_ipackets[i];
+   rx_bytes_total += stats->q_ibytes[i];
}

for (i = 0; i < RTE_ETHDEV_QUEUE_STAT_CNTRS &&
i < dev->data->nb_tx_queues; i++) {
-   igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
-   igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
-   igb_stats->q_errors[i] = internal->tx_queue[i].err_pkts;
-   tx_packets_total += igb_stats->q_opackets[i];
-   tx_bytes_total += igb_stats->q_obytes[i];
-   tx_packets_err_total += igb_stats->q_errors[i];
+   stats->q_opackets[i] = internal->tx_queue[i].tx_stat.pkts;
+   stats->q_obytes[i] = internal->tx_queue[i].tx_stat.bytes;
+   stats->q_errors[i] = internal->tx_queue[i].tx_stat.err_pkts;
+   tx_packets_total += stats->q_opackets[i];
+   tx_bytes_total += stats->q_obytes[i];
+   tx_packets_err_total += stats->q_errors[i];
}

-   igb_stats->ipackets = rx_packets_total;
-   igb_stats->ibytes = rx_bytes_total;
-   igb_stats->opackets = tx_packets_total;
-   igb_stats->obytes = tx_bytes_total;
-   igb_stats->oerrors = tx_packets_err_total;
+   stats->ipackets = rx_packets_total;
+   stats->ibytes = rx_bytes_total;
+   stats->opackets = tx_packets_total;
+   stats->obytes = tx_bytes_total;
+   stats->oerrors = tx_packets_err_total;
 }

 static void
@@ -540,13 +542,13 @@ eth_stats_reset(struct rte_eth_dev *dev)
unsigned i;
struct pmd_internals *internal = dev->data->dev_private;

[dpdk-dev] [RFC 10/21] net/pcap: use single_iface variable instead of hardcoded

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 0445c74..31eed58 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -943,6 +943,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
struct rte_kvargs *kvlist;
struct pmd_devargs pcaps = {0};
struct pmd_devargs dumpers = {0};
+   int single_iface = 0;
int ret;

RTE_LOG(INFO, PMD, "Initializing pmd_pcap for %s\n", name);
@@ -963,13 +964,18 @@ rte_pmd_pcap_devinit(const char *name, const char *params)

ret = rte_kvargs_process(kvlist, ETH_PCAP_IFACE_ARG,
_rx_tx_iface, );
+
if (ret < 0)
goto free_kvlist;
dumpers.queue[0].pcap = pcaps.queue[0].pcap;
dumpers.queue[0].name = pcaps.queue[0].name;
dumpers.queue[0].type = pcaps.queue[0].type;
+
+   single_iface = 1;
+
ret = rte_eth_from_pcaps(name, , 1, , 1,
-   kvlist, 1, is_tx_pcap);
+   kvlist, single_iface, is_tx_pcap);
+
goto free_kvlist;
}

@@ -1022,7 +1028,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
goto free_kvlist;

ret = rte_eth_from_pcaps(name, , pcaps.num_of_queue, ,
-   dumpers.num_of_queue, kvlist, 0, is_tx_pcap);
+   dumpers.num_of_queue, kvlist, single_iface, is_tx_pcap);

 free_kvlist:
rte_kvargs_free(kvlist);
-- 
2.7.4

[dpdk-dev] [RFC 09/21] net/pcap: remove duplicated max queue number check

2016-07-21 Thread Ferruh Yigit

Remove duplicated check by reorganizing the code, no functional change.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 49 ++---
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 7524293..0445c74 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -939,11 +939,11 @@ rte_eth_from_pcaps(const char *name, struct pmd_devargs 
*rx_queues,
 static int
 rte_pmd_pcap_devinit(const char *name, const char *params)
 {
-   unsigned using_dumpers = 0;
-   int ret;
+   unsigned int is_rx_pcap = 0, is_tx_pcap = 0;
struct rte_kvargs *kvlist;
struct pmd_devargs pcaps = {0};
struct pmd_devargs dumpers = {0};
+   int ret;

RTE_LOG(INFO, PMD, "Initializing pmd_pcap for %s\n", name);

@@ -969,7 +969,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
dumpers.queue[0].name = pcaps.queue[0].name;
dumpers.queue[0].type = pcaps.queue[0].type;
ret = rte_eth_from_pcaps(name, , 1, , 1,
-   kvlist, 1, using_dumpers);
+   kvlist, 1, is_tx_pcap);
goto free_kvlist;
}

@@ -978,23 +978,21 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
 * pcap file
 */
pcaps.num_of_queue = rte_kvargs_count(kvlist, ETH_PCAP_RX_PCAP_ARG);
-
-   if (pcaps.num_of_queue) {
-   if (pcaps.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
-   pcaps.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
-
-   ret = rte_kvargs_process(kvlist, ETH_PCAP_RX_PCAP_ARG,
-   _rx_pcap, );
-   } else {
+   if (pcaps.num_of_queue)
+   is_rx_pcap = 1;
+   else
pcaps.num_of_queue = rte_kvargs_count(kvlist,
ETH_PCAP_RX_IFACE_ARG);

-   if (pcaps.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
-   pcaps.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
+   if (pcaps.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
+   pcaps.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;

+   if (is_rx_pcap)
+   ret = rte_kvargs_process(kvlist, ETH_PCAP_RX_PCAP_ARG,
+   _rx_pcap, );
+   else
ret = rte_kvargs_process(kvlist, ETH_PCAP_RX_IFACE_ARG,
_rx_iface, );
-   }

if (ret < 0)
goto free_kvlist;
@@ -1004,30 +1002,27 @@ rte_pmd_pcap_devinit(const char *name, const char 
*params)
 * pcap file
 */
dumpers.num_of_queue = rte_kvargs_count(kvlist, ETH_PCAP_TX_PCAP_ARG);
-
-   if (dumpers.num_of_queue) {
-   if (dumpers.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
-   dumpers.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
-
-   ret = rte_kvargs_process(kvlist, ETH_PCAP_TX_PCAP_ARG,
-   _tx_pcap, );
-   using_dumpers = 1;
-   } else {
+   if (dumpers.num_of_queue)
+   is_tx_pcap = 1;
+   else
dumpers.num_of_queue = rte_kvargs_count(kvlist,
ETH_PCAP_TX_IFACE_ARG);

-   if (dumpers.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
-   dumpers.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
+   if (dumpers.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
+   dumpers.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;

+   if (is_tx_pcap)
+   ret = rte_kvargs_process(kvlist, ETH_PCAP_TX_PCAP_ARG,
+   _tx_pcap, );
+   else
ret = rte_kvargs_process(kvlist, ETH_PCAP_TX_IFACE_ARG,
_tx_iface, );
-   }

if (ret < 0)
goto free_kvlist;

ret = rte_eth_from_pcaps(name, , pcaps.num_of_queue, ,
-   dumpers.num_of_queue, kvlist, 0, using_dumpers);
+   dumpers.num_of_queue, kvlist, 0, is_tx_pcap);

 free_kvlist:
rte_kvargs_free(kvlist);
-- 
2.7.4

[dpdk-dev] [RFC 08/21] net/pcap: move comment to correct place

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 3eacb82..7524293 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -822,9 +822,6 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
 * - point eth_dev_data to internals
 * - and point eth_dev structure to new eth_dev_data structure
 */
-   /* NOTE: we'll replace the data element, of originally allocated eth_dev
-* so the rings are local per-process */
-
data->dev_private = *internals;
data->port_id = (*eth_dev)->data->port_id;
snprintf(data->name, sizeof(data->name), "%s", (*eth_dev)->data->name);
@@ -835,6 +832,10 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
strncpy(data->name,
(*eth_dev)->data->name, strlen((*eth_dev)->data->name));

+   /*
+* NOTE: we'll replace the data element, of originally allocated
+* eth_dev so the rings are local per-process
+*/
(*eth_dev)->data = data;
(*eth_dev)->dev_ops = 
(*eth_dev)->driver = NULL;
-- 
2.7.4

[dpdk-dev] [RFC 07/21] net/pcap: don't carry kvlist argument

2016-07-21 Thread Ferruh Yigit

Don't carry kvlist argument into sub function and used it, use kvlist
argument in upper level of call stack.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 94d7c88..3eacb82 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -787,18 +787,10 @@ open_single_iface(const char *iface, pcap_t **pcap)
 static int
 rte_pmd_init_internals(const char *name, const unsigned nb_rx_queues,
const unsigned nb_tx_queues, struct pmd_internals **internals,
-   struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist)
+   struct rte_eth_dev **eth_dev)
 {
struct rte_eth_dev_data *data = NULL;
unsigned int numa_node = rte_socket_id();
-   unsigned k_idx;
-   struct rte_kvargs_pair *pair = NULL;
-
-   for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
-   pair = >pairs[k_idx];
-   if (strstr(pair->key, ETH_PCAP_IFACE_ARG) != NULL)
-   break;
-   }

RTE_LOG(INFO, PMD, "Creating pcap-backed ethdev on numa socket %u\n",
numa_node);
@@ -833,11 +825,6 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
/* NOTE: we'll replace the data element, of originally allocated eth_dev
 * so the rings are local per-process */

-   if (pair == NULL)
-   (*internals)->if_index = 0;
-   else
-   (*internals)->if_index = if_nametoindex(pair->value);
-
data->dev_private = *internals;
data->port_id = (*eth_dev)->data->port_id;
snprintf(data->name, sizeof(data->name), "%s", (*eth_dev)->data->name);
@@ -871,6 +858,8 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
const unsigned nb_tx_queues, struct rte_kvargs *kvlist,
struct pmd_internals **internals, struct rte_eth_dev **eth_dev)
 {
+   struct rte_kvargs_pair *pair = NULL;
+   unsigned k_idx;
unsigned i;

/* do some parameter checking */
@@ -880,7 +869,7 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
return -1;

if (rte_pmd_init_internals(name, nb_rx_queues, nb_tx_queues, internals,
-   eth_dev, kvlist) < 0)
+   eth_dev) < 0)
return -1;

for (i = 0; i < nb_rx_queues; i++) {
@@ -902,6 +891,17 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
tx_queues->queue[i].type);
}

+   for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+   pair = >pairs[k_idx];
+   if (strstr(pair->key, ETH_PCAP_IFACE_ARG) != NULL)
+   break;
+   }
+
+   if (pair == NULL)
+   (*internals)->if_index = 0;
+   else
+   (*internals)->if_index = if_nametoindex(pair->value);
+
return 0;
 }

-- 
2.7.4

[dpdk-dev] [RFC 06/21] net/pcap: don't carry numa_node argument

2016-07-21 Thread Ferruh Yigit

Instead of defining numa_node variable upper level of call stack and
carry into sub function, set it where needs to be used.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 40 +---
 1 file changed, 17 insertions(+), 23 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 14f770c..94d7c88 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -786,13 +786,11 @@ open_single_iface(const char *iface, pcap_t **pcap)

 static int
 rte_pmd_init_internals(const char *name, const unsigned nb_rx_queues,
-   const unsigned nb_tx_queues,
-   const unsigned numa_node,
-   struct pmd_internals **internals,
-   struct rte_eth_dev **eth_dev,
-   struct rte_kvargs *kvlist)
+   const unsigned nb_tx_queues, struct pmd_internals **internals,
+   struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist)
 {
struct rte_eth_dev_data *data = NULL;
+   unsigned int numa_node = rte_socket_id();
unsigned k_idx;
struct rte_kvargs_pair *pair = NULL;

@@ -802,8 +800,8 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
break;
}

-   RTE_LOG(INFO, PMD,
-   "Creating pcap-backed ethdev on numa socket %u\n", 
numa_node);
+   RTE_LOG(INFO, PMD, "Creating pcap-backed ethdev on numa socket %u\n",
+   numa_node);

/* now do all data allocation - for eth_dev structure
 * and internal (private) data
@@ -812,7 +810,8 @@ rte_pmd_init_internals(const char *name, const unsigned 
nb_rx_queues,
if (data == NULL)
goto error;

-   *internals = rte_zmalloc_socket(name, sizeof(**internals), 0, 
numa_node);
+   *internals = rte_zmalloc_socket(name, sizeof(**internals), 0,
+   numa_node);
if (*internals == NULL)
goto error;

@@ -869,9 +868,8 @@ error:
 static int
 rte_eth_from_pcaps_common(const char *name, struct pmd_devargs *rx_queues,
const unsigned nb_rx_queues, struct pmd_devargs *tx_queues,
-   const unsigned nb_tx_queues, const unsigned numa_node,
-   struct rte_kvargs *kvlist, struct pmd_internals **internals,
-   struct rte_eth_dev **eth_dev)
+   const unsigned nb_tx_queues, struct rte_kvargs *kvlist,
+   struct pmd_internals **internals, struct rte_eth_dev **eth_dev)
 {
unsigned i;

@@ -881,8 +879,8 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
if (tx_queues == NULL && nb_tx_queues > 0)
return -1;

-   if (rte_pmd_init_internals(name, nb_rx_queues, nb_tx_queues, numa_node,
-   internals, eth_dev, kvlist) < 0)
+   if (rte_pmd_init_internals(name, nb_rx_queues, nb_tx_queues, internals,
+   eth_dev, kvlist) < 0)
return -1;

for (i = 0; i < nb_rx_queues; i++) {
@@ -910,17 +908,15 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
 static int
 rte_eth_from_pcaps(const char *name, struct pmd_devargs *rx_queues,
const unsigned nb_rx_queues, struct pmd_devargs *tx_queues,
-   const unsigned nb_tx_queues, const unsigned numa_node,
-   struct rte_kvargs *kvlist, int single_iface,
-   unsigned int using_dumpers)
+   const unsigned nb_tx_queues, struct rte_kvargs *kvlist,
+   int single_iface, unsigned int using_dumpers)
 {
struct pmd_internals *internals = NULL;
struct rte_eth_dev *eth_dev = NULL;
int ret;

ret = rte_eth_from_pcaps_common(name, rx_queues, nb_rx_queues,
-   tx_queues, nb_tx_queues, numa_node, kvlist,
-   , _dev);
+   tx_queues, nb_tx_queues, kvlist, , _dev);

if (ret < 0)
return ret;
@@ -942,7 +938,7 @@ rte_eth_from_pcaps(const char *name, struct pmd_devargs 
*rx_queues,
 static int
 rte_pmd_pcap_devinit(const char *name, const char *params)
 {
-   unsigned numa_node, using_dumpers = 0;
+   unsigned using_dumpers = 0;
int ret;
struct rte_kvargs *kvlist;
struct pmd_devargs pcaps = {0};
@@ -950,8 +946,6 @@ rte_pmd_pcap_devinit(const char *name, const char *params)

RTE_LOG(INFO, PMD, "Initializing pmd_pcap for %s\n", name);

-   numa_node = rte_socket_id();
-
gettimeofday(_time, NULL);
start_cycles = rte_get_timer_cycles();
hz = rte_get_timer_hz();
@@ -974,7 +968,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
dumpers.queue[0].name = pcaps.queue[0].name;
dumpers.queue[0].type = pcaps.queue[0].type;
ret = rte_eth_from_pcaps(name, , 1, , 1,
-

[dpdk-dev] [RFC 05/21] net/pcap: update function to reuse it

2016-07-21 Thread Ferruh Yigit

rte_eth_from_pcaps and rte_eth_from_pcaps_n_dumpers functions are very
close, updated rte_eth_from_pcaps function and reused.

No functional update.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 58 +
 1 file changed, 13 insertions(+), 45 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 308b8bc..14f770c 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -908,43 +908,11 @@ rte_eth_from_pcaps_common(const char *name, struct 
pmd_devargs *rx_queues,
 }

 static int
-rte_eth_from_pcaps_n_dumpers(const char *name,
-   struct pmd_devargs *rx_queues,
-   const unsigned nb_rx_queues,
-   struct pmd_devargs *tx_queues,
-   const unsigned nb_tx_queues,
-   const unsigned numa_node,
-   struct rte_kvargs *kvlist)
-{
-   struct pmd_internals *internals = NULL;
-   struct rte_eth_dev *eth_dev = NULL;
-   int ret;
-
-   ret = rte_eth_from_pcaps_common(name, rx_queues, nb_rx_queues,
-   tx_queues, nb_tx_queues, numa_node, kvlist,
-   , _dev);
-
-   if (ret < 0)
-   return ret;
-
-   /* using multiple pcaps/interfaces */
-   internals->single_iface = 0;
-
-   eth_dev->rx_pkt_burst = eth_pcap_rx;
-   eth_dev->tx_pkt_burst = eth_pcap_tx_dumper;
-
-   return 0;
-}
-
-static int
-rte_eth_from_pcaps(const char *name,
-   struct pmd_devargs *rx_queues,
-   const unsigned nb_rx_queues,
-   struct pmd_devargs *tx_queues,
-   const unsigned nb_tx_queues,
-   const unsigned numa_node,
-   struct rte_kvargs *kvlist,
-   int single_iface)
+rte_eth_from_pcaps(const char *name, struct pmd_devargs *rx_queues,
+   const unsigned nb_rx_queues, struct pmd_devargs *tx_queues,
+   const unsigned nb_tx_queues, const unsigned numa_node,
+   struct rte_kvargs *kvlist, int single_iface,
+   unsigned int using_dumpers)
 {
struct pmd_internals *internals = NULL;
struct rte_eth_dev *eth_dev = NULL;
@@ -961,7 +929,11 @@ rte_eth_from_pcaps(const char *name,
internals->single_iface = single_iface;

eth_dev->rx_pkt_burst = eth_pcap_rx;
-   eth_dev->tx_pkt_burst = eth_pcap_tx;
+
+   if (using_dumpers)
+   eth_dev->tx_pkt_burst = eth_pcap_tx_dumper;
+   else
+   eth_dev->tx_pkt_burst = eth_pcap_tx;

return 0;
 }
@@ -1002,7 +974,7 @@ rte_pmd_pcap_devinit(const char *name, const char *params)
dumpers.queue[0].name = pcaps.queue[0].name;
dumpers.queue[0].type = pcaps.queue[0].type;
ret = rte_eth_from_pcaps(name, , 1, , 1,
-   numa_node, kvlist, 1);
+   numa_node, kvlist, 1, using_dumpers);
goto free_kvlist;
}

@@ -1059,12 +1031,8 @@ rte_pmd_pcap_devinit(const char *name, const char 
*params)
if (ret < 0)
goto free_kvlist;

-   if (using_dumpers)
-   ret = rte_eth_from_pcaps_n_dumpers(name, , 
pcaps.num_of_queue,
-   , dumpers.num_of_queue, numa_node, 
kvlist);
-   else
-   ret = rte_eth_from_pcaps(name, , pcaps.num_of_queue, 
,
-   dumpers.num_of_queue, numa_node, kvlist, 0);
+   ret = rte_eth_from_pcaps(name, , pcaps.num_of_queue, ,
+   dumpers.num_of_queue, numa_node, kvlist, 0, using_dumpers);

 free_kvlist:
rte_kvargs_free(kvlist);
-- 
2.7.4

[dpdk-dev] [RFC 04/21] net/pcap: add checks for max queue number

2016-07-21 Thread Ferruh Yigit

Number of queues defined by devargs, a check added to be sure this
number is not bigger than configured queue length.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 34c779a..308b8bc 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -1010,12 +1010,21 @@ rte_pmd_pcap_devinit(const char *name, const char 
*params)
 * We check whether we want to open a RX stream from a real NIC or a
 * pcap file
 */
-   if ((pcaps.num_of_queue = rte_kvargs_count(kvlist, 
ETH_PCAP_RX_PCAP_ARG))) {
+   pcaps.num_of_queue = rte_kvargs_count(kvlist, ETH_PCAP_RX_PCAP_ARG);
+
+   if (pcaps.num_of_queue) {
+   if (pcaps.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
+   pcaps.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
+
ret = rte_kvargs_process(kvlist, ETH_PCAP_RX_PCAP_ARG,
_rx_pcap, );
} else {
pcaps.num_of_queue = rte_kvargs_count(kvlist,
ETH_PCAP_RX_IFACE_ARG);
+
+   if (pcaps.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
+   pcaps.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
+
ret = rte_kvargs_process(kvlist, ETH_PCAP_RX_IFACE_ARG,
_rx_iface, );
}
@@ -1027,14 +1036,22 @@ rte_pmd_pcap_devinit(const char *name, const char 
*params)
 * We check whether we want to open a TX stream to a real NIC or a
 * pcap file
 */
-   if ((dumpers.num_of_queue = rte_kvargs_count(kvlist,
-   ETH_PCAP_TX_PCAP_ARG))) {
+   dumpers.num_of_queue = rte_kvargs_count(kvlist, ETH_PCAP_TX_PCAP_ARG);
+
+   if (dumpers.num_of_queue) {
+   if (dumpers.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
+   dumpers.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
+
ret = rte_kvargs_process(kvlist, ETH_PCAP_TX_PCAP_ARG,
_tx_pcap, );
using_dumpers = 1;
} else {
dumpers.num_of_queue = rte_kvargs_count(kvlist,
ETH_PCAP_TX_IFACE_ARG);
+
+   if (dumpers.num_of_queue > RTE_PMD_PCAP_MAX_QUEUES)
+   dumpers.num_of_queue = RTE_PMD_PCAP_MAX_QUEUES;
+
ret = rte_kvargs_process(kvlist, ETH_PCAP_TX_IFACE_ARG,
_tx_iface, );
}
-- 
2.7.4

[dpdk-dev] [RFC 03/21] net/pcap: reorganize private structs

2016-07-21 Thread Ferruh Yigit

struct rx_pcaps and tx_pcaps used to point parsed devargs, but it is not
clear with current names.

Merged both into single struct and modified struct name and field names.

Functionality not changed, only struct names.

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 123 +++-
 1 file changed, 59 insertions(+), 64 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 877e4c2..34c779a 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -85,21 +85,6 @@ struct pcap_tx_queue {
char type[ETH_PCAP_ARG_MAXLEN];
 };

-struct rx_pcaps {
-   unsigned num_of_rx;
-   pcap_t *pcaps[RTE_PMD_PCAP_MAX_QUEUES];
-   const char *names[RTE_PMD_PCAP_MAX_QUEUES];
-   const char *types[RTE_PMD_PCAP_MAX_QUEUES];
-};
-
-struct tx_pcaps {
-   unsigned num_of_tx;
-   pcap_dumper_t *dumpers[RTE_PMD_PCAP_MAX_QUEUES];
-   pcap_t *pcaps[RTE_PMD_PCAP_MAX_QUEUES];
-   const char *names[RTE_PMD_PCAP_MAX_QUEUES];
-   const char *types[RTE_PMD_PCAP_MAX_QUEUES];
-};
-
 struct pmd_internals {
struct pcap_rx_queue rx_queue[RTE_PMD_PCAP_MAX_QUEUES];
struct pcap_tx_queue tx_queue[RTE_PMD_PCAP_MAX_QUEUES];
@@ -107,6 +92,16 @@ struct pmd_internals {
int single_iface;
 };

+struct pmd_devargs {
+   unsigned num_of_queue;
+   struct devargs_queue {
+   pcap_dumper_t *dumper;
+   pcap_t *pcap;
+   const char *name;
+   const char *type;
+   } queue[RTE_PMD_PCAP_MAX_QUEUES];
+};
+
 const char *valid_arguments[] = {
ETH_PCAP_RX_PCAP_ARG,
ETH_PCAP_TX_PCAP_ARG,
@@ -625,16 +620,16 @@ open_rx_pcap(const char *key, const char *value, void 
*extra_args)
 {
unsigned i;
const char *pcap_filename = value;
-   struct rx_pcaps *pcaps = extra_args;
+   struct pmd_devargs *rx = extra_args;
pcap_t *pcap = NULL;

-   for (i = 0; i < pcaps->num_of_rx; i++) {
+   for (i = 0; i < rx->num_of_queue; i++) {
if (open_single_rx_pcap(pcap_filename, ) < 0)
return -1;

-   pcaps->pcaps[i] = pcap;
-   pcaps->names[i] = pcap_filename;
-   pcaps->types[i] = key;
+   rx->queue[i].pcap = pcap;
+   rx->queue[i].name = pcap_filename;
+   rx->queue[i].type = key;
}

return 0;
@@ -659,16 +654,16 @@ open_tx_pcap(const char *key, const char *value, void 
*extra_args)
 {
unsigned i;
const char *pcap_filename = value;
-   struct tx_pcaps *dumpers = extra_args;
+   struct pmd_devargs *dumpers = extra_args;
pcap_dumper_t *dumper;

-   for (i = 0; i < dumpers->num_of_tx; i++) {
+   for (i = 0; i < dumpers->num_of_queue; i++) {
if (open_single_tx_pcap(pcap_filename, ) < 0)
return -1;

-   dumpers->dumpers[i] = dumper;
-   dumpers->names[i] = pcap_filename;
-   dumpers->types[i] = key;
+   dumpers->queue[i].dumper = dumper;
+   dumpers->queue[i].name = pcap_filename;
+   dumpers->queue[i].type = key;
}

return 0;
@@ -721,15 +716,15 @@ static inline int
 open_rx_tx_iface(const char *key, const char *value, void *extra_args)
 {
const char *iface = value;
-   struct rx_pcaps *pcaps = extra_args;
+   struct pmd_devargs *tx = extra_args;
pcap_t *pcap = NULL;

if (open_single_iface(iface, ) < 0)
return -1;

-   pcaps->pcaps[0] = pcap;
-   pcaps->names[0] = iface;
-   pcaps->types[0] = key;
+   tx->queue[0].pcap = pcap;
+   tx->queue[0].name = iface;
+   tx->queue[0].type = key;

return 0;
 }
@@ -742,15 +737,15 @@ open_rx_iface(const char *key, const char *value, void 
*extra_args)
 {
unsigned i;
const char *iface = value;
-   struct rx_pcaps *pcaps = extra_args;
+   struct pmd_devargs *rx = extra_args;
pcap_t *pcap = NULL;

-   for (i = 0; i < pcaps->num_of_rx; i++) {
+   for (i = 0; i < rx->num_of_queue; i++) {
if (open_single_iface(iface, ) < 0)
return -1;
-   pcaps->pcaps[i] = pcap;
-   pcaps->names[i] = iface;
-   pcaps->types[i] = key;
+   rx->queue[i].pcap = pcap;
+   rx->queue[i].name = iface;
+   rx->queue[i].type = key;
}

return 0;
@@ -764,15 +759,15 @@ open_tx_iface(const char *key, const char *value, void 
*extra_args)
 {
unsigned i;
const char *iface = value;
-   struct tx_pcaps *pcaps = extra_args;
+   struct pmd_devargs *tx = extra_args;
pcap_t *pcap;

-   for (i = 0; i < pcaps->num_of_tx; i++) {
+   for (i = 0; i < tx->num_of_queue; i++) {
if (open_single_iface(iface, ) < 0)

[dpdk-dev] [RFC 02/21] net/pcap: use macros for param string

2016-07-21 Thread Ferruh Yigit

Signed-off-by: Ferruh Yigit 
---
 drivers/net/pcap/rte_eth_pcap.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 6343c0e..877e4c2 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -1091,8 +1091,8 @@ static struct rte_driver pmd_pcap_drv = {

 PMD_REGISTER_DRIVER(pmd_pcap_drv, eth_pcap);
 DRIVER_REGISTER_PARAM_STRING(eth_pcap,
-   "rx_pcap= "
-   "tx_pcap= "
-   "rx_iface= "
-   "tx_iface= "
-   "iface=");
+   ETH_PCAP_RX_PCAP_ARG"= "
+   ETH_PCAP_TX_PCAP_ARG"= "
+   ETH_PCAP_RX_IFACE_ARG"= "
+   ETH_PCAP_TX_IFACE_ARG"= "
+   ETH_PCAP_IFACE_ARG"=");
-- 
2.7.4

[dpdk-dev] [RFC 01/21] net/pcap: create own configuration parameter

2016-07-21 Thread Ferruh Yigit

pcap pmd is using ring pmd configuration parameters to set max number of
queues. This creates an unnecessary dependency and confusion.

Create a new config parameter for pcap PMD:
CONFIG_RTE_PMD_PCAP_MAX_QUEUES

Default value of config is same as ring parameter default.

pcap pmd doesn't need to be configured in a detail to set rx and tx max
queue numbers separately, so using same config item for both queues.

Signed-off-by: Ferruh Yigit 
---
 config/common_base  |  1 +
 drivers/net/pcap/rte_eth_pcap.c | 18 +-
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/config/common_base b/config/common_base
index 7830535..e32bd61 100644
--- a/config/common_base
+++ b/config/common_base
@@ -307,6 +307,7 @@ CONFIG_RTE_PMD_RING_MAX_TX_RINGS=16
 # Compile software PMD backed by PCAP files
 #
 CONFIG_RTE_LIBRTE_PMD_PCAP=n
+CONFIG_RTE_PMD_PCAP_MAX_QUEUES=16

 #
 # Compile link bonding PMD library
diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 7e213eb..6343c0e 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -87,22 +87,22 @@ struct pcap_tx_queue {

 struct rx_pcaps {
unsigned num_of_rx;
-   pcap_t *pcaps[RTE_PMD_RING_MAX_RX_RINGS];
-   const char *names[RTE_PMD_RING_MAX_RX_RINGS];
-   const char *types[RTE_PMD_RING_MAX_RX_RINGS];
+   pcap_t *pcaps[RTE_PMD_PCAP_MAX_QUEUES];
+   const char *names[RTE_PMD_PCAP_MAX_QUEUES];
+   const char *types[RTE_PMD_PCAP_MAX_QUEUES];
 };

 struct tx_pcaps {
unsigned num_of_tx;
-   pcap_dumper_t *dumpers[RTE_PMD_RING_MAX_TX_RINGS];
-   pcap_t *pcaps[RTE_PMD_RING_MAX_RX_RINGS];
-   const char *names[RTE_PMD_RING_MAX_RX_RINGS];
-   const char *types[RTE_PMD_RING_MAX_RX_RINGS];
+   pcap_dumper_t *dumpers[RTE_PMD_PCAP_MAX_QUEUES];
+   pcap_t *pcaps[RTE_PMD_PCAP_MAX_QUEUES];
+   const char *names[RTE_PMD_PCAP_MAX_QUEUES];
+   const char *types[RTE_PMD_PCAP_MAX_QUEUES];
 };

 struct pmd_internals {
-   struct pcap_rx_queue rx_queue[RTE_PMD_RING_MAX_RX_RINGS];
-   struct pcap_tx_queue tx_queue[RTE_PMD_RING_MAX_TX_RINGS];
+   struct pcap_rx_queue rx_queue[RTE_PMD_PCAP_MAX_QUEUES];
+   struct pcap_tx_queue tx_queue[RTE_PMD_PCAP_MAX_QUEUES];
int if_index;
int single_iface;
 };
-- 
2.7.4

[dpdk-dev] [RFC 00/21] pcap pmd improvements

2016-07-21 Thread Ferruh Yigit

No new feature added, code refactored.
This patch targetted for 16.11 release.

Ferruh Yigit (21):
  net/pcap: create own configuration parameter
  net/pcap: use macros for param string
  net/pcap: reorganize private structs
  net/pcap: add checks for max queue number
  net/pcap: update function to reuse it
  net/pcap: don't carry numa_node argument
  net/pcap: don't carry kvlist argument
  net/pcap: move comment to correct place
  net/pcap: remove duplicated max queue number check
  net/pcap: use single_iface variable instead of hardcoded
  net/pcap: group stats related fields into a struct
  net/pcap: make const array static
  net/pcap: reorder header files
  net/pcap: reorder functions
  net/pcap: update how single iface handled
  net/pcap: remove unnecessary check
  net/pcap: remove redundant assignment
  net/pcap: simplify function
  net/pcap: fix missing tx iface assignment
  net/pcap: coding convention updates
  net/pcap: remove rte prefix from static functions

 config/common_base  |   1 +
 drivers/net/pcap/rte_eth_pcap.c | 590 +++-
 2 files changed, 280 insertions(+), 311 deletions(-)

-- 
2.7.4

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 12:45:32PM +0300, Ilya Maximets wrote:
> On 21.07.2016 12:37, Yuanhan Liu wrote:
> > On Thu, Jul 21, 2016 at 11:21:15AM +0300, Ilya Maximets wrote:
> >> If something abnormal happened to QEMU, 'connect()' can block calling
> >> thread (e.g. main thread of OVS) forever or for a really long time.
> >> This can break whole application or block the reconnection thread.
> >>
> >> Example with OVS:
> >>
> >>ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
> >>(gdb) bt
> >>#0  connect () from /lib64/libpthread.so.0
> >>#1  vhost_user_create_client (vsocket=0xa816e0)
> >>#2  rte_vhost_driver_register
> >>#3  netdev_dpdk_vhost_user_construct
> >>#4  netdev_open (name=0xa664b0 "vhost1")
> >>[...]
> >>#11 main
> >>
> >> Fix that by setting non-blocking mode for client sockets for connection.
> >>
> >> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> > 
> > Thanks for spotting and fixing yet another bug!
> > 
> >>  
> >> +static int
> >> +vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
> > 
> > I don't quite understand why this is needed: connect() with O_NONBLOCK
> > flag set is not enough?
> 
> There is a little issue with non-blocking connect() call. Connection
> establishing may be started but '-1' returned with 'errno = EINPROGRESS'.
> In this case we must wait on fd until it will be available for writing.
> After that we need to check current status of connection using getsockopt().
> 
> I don't sure that we're able to get such situation, but it's documented,
> and, I think, we should handle it.
> 
> See 'man connect' for details.

I see. Thanks.

But basically, I don't like the way of introduing yet another
fdset here. I'm wondering we could leverage current fdset code
to achieve that. This might need some work though.

So how about making it simple and stupid at this stage: sleep a
while (maybe 1ms, or maybe 1s) when that happens, and give up
when the connection is still not established?

--yliu

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Yuanhan Liu

On Thu, Jul 21, 2016 at 11:21:15AM +0300, Ilya Maximets wrote:
> If something abnormal happened to QEMU, 'connect()' can block calling
> thread (e.g. main thread of OVS) forever or for a really long time.
> This can break whole application or block the reconnection thread.
> 
> Example with OVS:
> 
>   ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>   (gdb) bt
>   #0  connect () from /lib64/libpthread.so.0
>   #1  vhost_user_create_client (vsocket=0xa816e0)
>   #2  rte_vhost_driver_register
>   #3  netdev_dpdk_vhost_user_construct
>   #4  netdev_open (name=0xa664b0 "vhost1")
>   [...]
>   #11 main
> 
> Fix that by setting non-blocking mode for client sockets for connection.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Thanks for spotting and fixing yet another bug!

>  
> +static int
> +vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)

I don't quite understand why this is needed: connect() with O_NONBLOCK
flag set is not enough?

--yliu

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-21 Thread Tomasz Kulasek

This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
changes in rte_eth_dev and rte_eth_desc_lim structures.

As discussed in that thread:

http://dpdk.org/ml/archives/dev/2015-September/023603.html

Different NIC models depending on HW offload requested might impose
different requirements on packets to be TX-ed in terms of:

 - Max number of fragments per packet allowed
 - Max number of fragments per TSO segments
 - The way pseudo-header checksum should be pre-calculated
 - L3/L4 header fields filling
 - etc.


MOTIVATION:
---

1) Some work cannot (and didn't should) be done in rte_eth_tx_burst.
   However, this work is sometimes required, and now, it's an
   application issue.

2) Different hardware may have different requirements for TX offloads,
   other subset can be supported and so on.

3) Some parameters (eg. number of segments in ixgbe driver) may hung
   device. These parameters may be vary for different devices.

   For example i40e HW allows 8 fragments per packet, but that is after
   TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit.

4) Fields in packet may require different initialization (like eg. will
   require pseudo-header checksum precalculation, sometimes in a
   different way depending on packet type, and so on). Now application
   needs to care about it.

5) Using additional API (rte_eth_tx_prep) before rte_eth_tx_burst let to
   prepare packet burst in acceptable form for specific device.

6) Some additional checks may be done in debug mode keeping tx_burst
   implementation clean.


PROPOSAL:
-

To help user to deal with all these varieties we propose to:

1. Introduce rte_eth_tx_prep() function to do necessary preparations of
   packet burst to be safely transmitted on device for desired HW
   offloads (set/reset checksum field according to the hardware
   requirements) and check HW constraints (number of segments per
   packet, etc).

   While the limitations and requirements may differ for devices, it
   requires to extend rte_eth_dev structure with new function pointer
   "tx_pkt_prep" which can be implemented in the driver to prepare and
   verify packets, in devices specific way, before burst, what should to
   prevent application to send malformed packets.

2. Also new fields will be introduced in rte_eth_desc_lim: 
   nb_seg_max and nb_mtu_seg_max, providing an information about max
   segments in TSO and non-TSO packets acceptable by device.

   This information is useful for application to not create/limit
   malicious packet.


APPLICATION (CASE OF USE):
--

1) Application should to initialize burst of packets to send, set
   required tx offload flags and required fields, like l2_len, l3_len,
   l4_len, and tso_segsz

2) Application passes burst to the rte_eth_tx_prep to check conditions
   required to send packets through the NIC.

3) The result of rte_eth_tx_prep can be used to send valid packets
   and/or restore invalid if function fails.

eg.

for (i = 0; i < nb_pkts; i++) {

/* initialize or process packet */

bufs[i]->tso_segsz = 800;
bufs[i]->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4
| PKT_TX_IP_CKSUM;
bufs[i]->l2_len = sizeof(struct ether_hdr);
bufs[i]->l3_len = sizeof(struct ipv4_hdr);
bufs[i]->l4_len = sizeof(struct tcp_hdr);
}

/* Prepare burst of TX packets */
nb_prep = rte_eth_tx_prep(port, 0, bufs, nb_pkts);

if (nb_prep < nb_pkts) {
printf("tx_prep failed\n");

/* drop or restore invalid packets */

}

/* Send burst of TX packets */
nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_prep);

/* Free any unsent packets. */



Signed-off-by: Tomasz Kulasek 
---
 doc/guides/rel_notes/deprecation.rst |7 +++
 1 file changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index f502f86..485aacb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -41,3 +41,10 @@ Deprecation Notices
 * The mempool functions for single/multi producer/consumer are deprecated and
   will be removed in 16.11.
   It is replaced by rte_mempool_generic_get/put functions.
+
+* In 16.11 ABI changes are plained: the ``rte_eth_dev`` structure will be
+  extended with new function pointer ``tx_pkt_prep`` allowing verification
+  and processing of packet burst to meet HW specific requirements before
+  transmit. Also new fields will be added to the ``rte_eth_desc_lim`` 
structure:
+  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing information about number of
+  segments limit to be transmitted by device for TSO/non-TSO packets.
-- 
1.7.9.5

[dpdk-dev] [PATCH] test_mempool: remove unused mp_ext var

2016-07-21 Thread Santosh Shukla

test_mempool func not using pointer variable 'mp_ext' and incorrectly freed. So
removing ptr var. Now freeing mp_stack var.

Signed-off-by: Santosh Shukla 
---
 app/test/test_mempool.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index 46ad670..3b21cf7 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -506,7 +506,6 @@ test_mempool(void)
 {
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
-   struct rte_mempool *mp_ext = NULL;
struct rte_mempool *mp_stack = NULL;

rte_atomic32_init();
@@ -605,7 +604,7 @@ test_mempool(void)
 err:
rte_mempool_free(mp_nocache);
rte_mempool_free(mp_cache);
-   rte_mempool_free(mp_ext);
+   rte_mempool_free(mp_stack);
return -1;
 }

-- 
1.7.9.5

[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

On 21.07.2016 16:35, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 04:19:35PM +0300, Ilya Maximets wrote:
>> If something abnormal happened to QEMU, 'connect()' can block calling
>> thread (e.g. main thread of OVS) forever or for a really long time.
>> This can break whole application or block the reconnection thread.
>>
>> Example with OVS:
>>
>>  ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>>  (gdb) bt
>>  #0  connect () from /lib64/libpthread.so.0
>>  #1  vhost_user_create_client (vsocket=0xa816e0)
>>  #2  rte_vhost_driver_register
>>  #3  netdev_dpdk_vhost_user_construct
>>  #4  netdev_open (name=0xa664b0 "vhost1")
>>  [...]
>>  #11 main
>>
>> Fix that by setting non-blocking mode for client sockets for connection.
>>
>> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
>>
>> Signed-off-by: Ilya Maximets 
> 
> Acked-by: Yuanhan Liu 
> 
> One help I'd like to ask is that I'd appriciate if you could do the test
> to make sure that your 2 (latest) patches fix the two issues you reported.
> 
> You might have already done that; I just want to make sure.

I've performed the test with 'ofport_request' script before sending patches.
And currently test still works. No leaks of descriptors, no hangs,
no QEMU crashes observed.
Sometimes network device breaks on QEMU side, but it's QEMU issue. In this
case I'm receiving following message from DPDK's vhost:

VHOST_CONFIG: vhost-user client: socket created, fd: 28
VHOST_CONFIG: failed to connect to /vhost1: Resource temporarily unavailable
VHOST_CONFIG: /vhost1: reconnecting...

Before the 'hang' patch there was hang of main thread.

After QEMU restart all works normally. OVS restart not required.

Best regards, Ilya Maximets.

[dpdk-dev] [PATCH] eal: fix check number of bytes from read function

2016-07-21 Thread Thomas Monjalon

Hi,

2016-07-20 16:24, Michal Jastrzebski:
> - if (read(fd, , sizeof(uint64_t)) < 0) {
> +
> + retval = read(fd, , sizeof(uint64_t));
> + if (retval < 0) {
>   RTE_LOG(ERR, EAL, "%s(): cannot read /proc/self/pagemap: %s\n",
>   __func__, strerror(errno));
>   close(fd);
>   return RTE_BAD_PHYS_ADDR;
> + }   else if (retval >= 0 && retval < (int)sizeof(uint64_t)) {

I have 4 comments about the above line:
- the check retval >= 0 is not needed because implied by else
- why not checking retval != sizeof(uint64_t) as it is the exact expected value?
- (int)sizeof(uint64_t) can be replaced by 8 but it's shorter ;)
- only 1 space is required between } and else

> + RTE_LOG(ERR, EAL, "%s(): read %d bytes from /proc/self/pagemap "
> + "but expected %d: %s\n",
> + __func__, retval, (int)sizeof(uint64_t), 
> strerror(errno));

Are you sure errno is meaningful here?

> + close(fd);
> + return RTE_BAD_PHYS_ADDR;
>   }

[dpdk-dev] [PATCH] virtio: fix packet corruption

2016-07-21 Thread Yuanhan Liu

On Tue, Jul 19, 2016 at 02:31:59PM +0200, Olivier Matz wrote:
> The support of virtio-user changed the way the mbuf dma address is
> retrieved, using a physical address in case of virtio-pci and a virtual
> address in case of virtio-user.
> 
> This change introduced some possible memory corruption in packets,
> replacing:
>   m->buf_physaddr + RTE_PKTMBUF_HEADROOM
> by:
>   m->buf_physaddr + m->data_off (through a macro)
> 
> This patch fixes this issue, restoring the original behavior.
> 
> By the way, it also rework the macros, adding a "VIRTIO_" prefix and
> API comments.
> 
> Fixes: f24f8f9fee8a ("net/virtio: allow virtual address to fill vring 
> descriptors")
> 
> Signed-off-by: Olivier Matz 

Thanks for the fix!

Acked-by: Yuanhan Liu 

--yliu

[dpdk-dev] [PATCH 04/12] mbuf: add function to calculate a checksum

2016-07-21 Thread Don Provan

> -Original Message-
> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
> Sent: Thursday, July 21, 2016 3:51 AM
> Subject: Re: [dpdk-dev] [PATCH 04/12] mbuf: add function to calculate a
> checksum
> 
>...
> > +  Added a new function ``rte_pktmbuf_cksum()`` to process the checksum
> > + of  data embedded in an mbuf chain.
> >...
> > +#include 
>
> As a nit, do we need to introduce a dependency for librte_mbuf on librte_net?
> Might be better to put this functionality into librte_net?

That's not a nit at all. This is clearly a net function that has no place in 
the mbuf code.
That should be obvious even before we notice this circular dependency.
-don
dprovan at bivio.net

[dpdk-dev] [PATCH v2] mempool: adjust name string size in related data types

2016-07-21 Thread Olivier Matz



On 07/21/2016 03:47 PM, Zoltan Kiss wrote:
> 
> 
> On 21/07/16 14:40, Olivier Matz wrote:
>> Hi Zoltan,
>>
>>
>> On 07/20/2016 07:16 PM, Zoltan Kiss wrote:
>>> A recent patch brought up an issue about the size of the 'name' fields:
>>>
>>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>>
>>> These relations should be observed:
>>>
>>> 1. Each ring creates a memzone with a prefixed name:
>>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>>>
>>> 2. There are some mempool handlers which create a ring with a prefixed
>>> name:
>>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE -
>>> strlen(RTE_MEMPOOL_MZ_PREFIX)
>>>
>>> 3. A mempool can create up to RTE_MAX_MEMZONE pre and postfixed
>>> memzones:
>>> sprintf(postfix, "_%d", RTE_MAX_MEMZONE)
>>> RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
>>> strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen(postfix)
>>>
>>> Setting all of them to 32 hides this restriction from the application.
>>> This patch decreases the mempool and ring string size to accommodate for
>>> these prefixes, but it doesn't apply the 3rd constraint. Applications
>>> relying on these constants need to be recompiled, otherwise they'll run
>>> into ENAMETOOLONG issues.
>>> The size of the arrays are kept 32 for ABI compatibility, it can be
>>> decreased next time the ABI changes.
>>>
>>> Signed-off-by: Zoltan Kiss 
>>
>> Looks like to be a good compromise for the 16.07 release. One question
>> however: why not taking in account the 3rd constraint? Because it may
>> not completly fix the issue if the mempool is fragmented.
>>
>> We could define RTE_MEMPOOL_NAMESIZE to 24
>>  = 32 - len('mp_') - len('_0123'))
> 
> I was trying to figure out a compile time macro for strlen(postfix), but
> I could not. Your suggestion would work only until someone increases
> RTE_MAX_MEMZONE above . As the likelihood of fragmenting a pool over
> 99 memzones seems small, I did not bother to fix this with an ugly hack,
> but if you think we should include it, let me know!

Ok, looks fair, thanks for the clarification.

Acked-by: Olivier Matz

[dpdk-dev] [PATCH] net/fm10k: fix RSS hash config

2016-07-21 Thread Xiao Wang

Sometimes app just wants to update the RSS hash function and no RSS key
update is needed, but fm10k pmd will return EINVAL for this case.

If the rss_key is NULL, we don't need to check the rss_key_len.

Fixes: 57033cdf8fdc ("fm10k: add PF RSS")

Reported-by: Xueqin Lin 
Signed-off-by: Xiao Wang 
---
 drivers/net/fm10k/fm10k_ethdev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index 144b2de..01f4a72 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -2159,8 +2159,8 @@ fm10k_rss_hash_update(struct rte_eth_dev *dev,

PMD_INIT_FUNC_TRACE();

-   if (rss_conf->rss_key_len < FM10K_RSSRK_SIZE *
-   FM10K_RSSRK_ENTRIES_PER_REG)
+   if (key && (rss_conf->rss_key_len < FM10K_RSSRK_SIZE *
+   FM10K_RSSRK_ENTRIES_PER_REG))
return -EINVAL;

if (hf == 0)
@@ -2202,8 +2202,8 @@ fm10k_rss_hash_conf_get(struct rte_eth_dev *dev,

PMD_INIT_FUNC_TRACE();

-   if (rss_conf->rss_key_len < FM10K_RSSRK_SIZE *
-   FM10K_RSSRK_ENTRIES_PER_REG)
+   if (key && (rss_conf->rss_key_len < FM10K_RSSRK_SIZE *
+   FM10K_RSSRK_ENTRIES_PER_REG))
return -EINVAL;

if (key != NULL)
-- 
1.9.3

[dpdk-dev] [PATCH] EAL:fix memory barrier implementation on IBM POWER

2016-07-21 Thread Thomas Monjalon

2016-07-15 10:30, Chao Zhu:
> On weak memory order architecture like POWER, rte_smp_wmb/rte_smp_rmb
> need to use CPU instructions, not compiler barrier. This patch fixes
> this. Also, to improve performance on PPC64, use light weight sync
> instruction instead of sync instruction.
> 
> Signed-off-by: Chao Zhu 

Applied, thanks

[dpdk-dev] [PATCH] vhost: fix driver unregister for client mode

2016-07-21 Thread Yuanhan Liu

On Wed, Jul 20, 2016 at 11:32:43AM +0300, Ilya Maximets wrote:
> Currently while calling of 'rte_vhost_driver_unregister()' connection
> to QEMU will not be closed. This leads to inability to register driver
> again and reconnect to same virtual machine.
> 
> This scenario is reproducible with OVS. While executing of the following
> command vhost port will be re-created (will be executed
> 'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
> network will be broken and QEMU possibly will crash:
> 
>   ovs-vsctl set Interface vhost1 ofport_request=15
> 
> Fix this by closing all established connections on driver unregister and
> removing of pending connections from reconnection list.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> Signed-off-by: Ilya Maximets 

Again, thanks for the fix.

> ---
> 
> Patch prepared for master branch because dpdk-next-virtio doesn't contain
> commit acbff5c67ea7 ("vhost: fix crash when exceeding file descriptors").
> Porting to dpdk-next-virtio/master is trivial and may be performed on
> demand.

Yeah, my bad, I haven't updated it after rc2, since Thomas no longer
pull request from it.  Anyway, you just remind me that I should have
done that.

>  /**
>   * Unregister the specified vhost socket
>   */
> @@ -672,20 +700,34 @@ rte_vhost_driver_unregister(const char *path)
>  {
>   int i;
>   int count;
> + struct vhost_user_connection *conn;
>  
>   pthread_mutex_lock(_user.mutex);
>  
>   for (i = 0; i < vhost_user.vsocket_cnt; i++) {
> - if (!strcmp(vhost_user.vsockets[i]->path, path)) {
> - if (vhost_user.vsockets[i]->is_server) {
> - fdset_del(_user.fdset,
> - vhost_user.vsockets[i]->listenfd);
> - close(vhost_user.vsockets[i]->listenfd);
> + struct vhost_user_socket *vsocket = vhost_user.vsockets[i];
> +
> + if (!strcmp(vsocket->path, path)) {
> + if (vsocket->is_server) {
> + (void) fdset_del(_user.fdset,
> +  vsocket->listenfd);

I would think the (void) cast is not neceessary here.

> + close(vsocket->listenfd);
>   unlink(path);
> + } else if (vsocket->reconnect) {
> + vhost_user_remove_reconnect(vsocket);
> + }
> +
> + conn = fdset_del(_user.fdset, vsocket->connfd);
> + if (conn) {
> + RTE_LOG(INFO, VHOST_CONFIG, "free connfd = %d"
> + "for device '%s'\n", vsocket->connfd, path);

We should try not to break a log message into several lines, which
hurts "git grep".  Here, it could be:

RTE_LOG(INFO, VHOST_CONFIG,
"free connfd = %d for device '%s'\n",
vsocket->connfd, path);

Besides the two minor nits,

Acked-by: Yuanhan Liu 

--yliu

[dpdk-dev] [PATCH 2/2] example/ipsec-secgw: add support for cryptodev_start

2016-07-21 Thread Hemant Agrawal

The usual device sequence is configure, queue setup and start.
Crypto device should be started before use.

Signed-off-by: Akhil Goyal 
Signed-off-by: Hemant Agrawal 
---
 examples/ipsec-secgw/ipsec-secgw.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/examples/ipsec-secgw/ipsec-secgw.c 
b/examples/ipsec-secgw/ipsec-secgw.c
index 1ca144b..302499c 100644
--- a/examples/ipsec-secgw/ipsec-secgw.c
+++ b/examples/ipsec-secgw/ipsec-secgw.c
@@ -1273,6 +1273,10 @@ cryptodevs_init(void)
_conf, dev_conf.socket_id))
rte_panic("Failed to setup queue %u for "
"cdev_id %u\n", 0, cdev_id);
+   /* Start device */
+   if (rte_cryptodev_start(cdev_id))
+   rte_panic("Failed to start crypto dev for "
+   "cdev_id=%u\n", cdev_id);
}

printf("\n");
-- 
1.9.1

[dpdk-dev] [PATCH 1/2] example/l2fwd-crypto: add support for cryptodev_start

2016-07-21 Thread Hemant Agrawal

The usual device sequence is configure, queue setup and start.
Crypto device should be started before use.

Signed-off-by: Akhil Goyal 
Signed-off-by: Hemant Agrawal 
---
 examples/l2fwd-crypto/main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/examples/l2fwd-crypto/main.c b/examples/l2fwd-crypto/main.c
index dd39cc1..62e1c39 100644
--- a/examples/l2fwd-crypto/main.c
+++ b/examples/l2fwd-crypto/main.c
@@ -1793,6 +1793,12 @@ initialize_cryptodevs(struct l2fwd_crypto_options 
*options, unsigned nb_ports,
if (retval < 0) {
printf("Failed to setup queue pair %u on cryptodev %u",
0, cdev_id);
+   }
+   /* Start device */
+   retval = rte_cryptodev_start(cdev_id);
+   if (retval < 0) {
+   printf("rte_cryptodev_start:err=%d, cdev_id=%u\n",
+  retval, (unsigned)cdev_id);
return -1;
}

-- 
1.9.1

[dpdk-dev] [PATCH] eal: fix parsing of argument of option --lcores

2016-07-21 Thread Thomas Monjalon

Hi,

2016-07-21 14:03, Wei Dai:
> The '-' in lcores set overrides cpu set of following
> lcore set in the argument of EAL option --lcores.
> 
> Fixes: 53e54bf81700 ("eal: new option --lcores for cpu assignment")
> 
> Signed-off-by: Wei Dai 

Thanks for the catch!

> --- a/lib/librte_eal/common/eal_common_options.c
> +++ b/lib/librte_eal/common/eal_common_options.c
> @@ -563,6 +563,7 @@ convert_to_cpuset(rte_cpuset_t *cpusetp,
>   * lcores, cpus could be a single digit/range or a group.
>   * '(' and ')' are necessary if it's a group.
>   * If not supply '@cpus', the value of cpus uses the same as lcores.
> + * The 'a-b' in lcores not within '(' and ')' means a,a+1,...,b-1,b .

It also a range when inside a group.
The difference is the mapping to the cpus.
I think this new comment brings more confusion. Better to skip.

>   * e.g. '1,2@(5-7),(3-5)@(0,2),(0,6),7-8' means start 9 EAL thread as below
>   *   lcore 0 runs on cpuset 0x41 (cpu 0,6)
>   *   lcore 1 runs on cpuset 0x2 (cpu 1)
> @@ -571,6 +572,15 @@ convert_to_cpuset(rte_cpuset_t *cpusetp,
>   *   lcore 6 runs on cpuset 0x41 (cpu 0,6)
>   *   lcore 7 runs on cpuset 0x80 (cpu 7)
>   *   lcore 8 runs on cpuset 0x100 (cpu 8)
> + * e.g. '0-2,(3-5)@(3,4),6@(5,6),7@(5-7)'means start 8 EAL threads as below
> + *   lcore 0 runs on cpuset 0x1 (cpu 0)
> + *   lcore 1 runs on cpuset 0x2 (cpu 1)
> + *   lcore 2 runs on cpuset ox4 (cpu 2)
> + *   lcore 3,4,5 runs on cpuset 0x18 (cpu 3,4)
> + *   lcore 6 runs on cpuset 0x60 (cpu 5,6)
> + *   lcore 7 runs on cpuset 0xe0 (cpu 5,6,7)
> + * The second case is used to test bugfix for lflags not be cleared after use
> + */
>   */

Please do not add a second example just to show how to test your fix.

> @@ -679,6 +689,8 @@ eal_parse_lcores(const char *lcores)
>  sizeof(rte_cpuset_t));
>   }
>  
> + lflags = 0;
> +
>   lcores = end + 1;
>   } while (*end != '\0');

It would have more sense to init lflags at the beginning of the loop
and replace
int lflags = 0;
by
int lflags;

[dpdk-dev] [PATCH v3] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

If something abnormal happened to QEMU, 'connect()' can block calling
thread (e.g. main thread of OVS) forever or for a really long time.
This can break whole application or block the reconnection thread.

Example with OVS:

ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
(gdb) bt
#0  connect () from /lib64/libpthread.so.0
#1  vhost_user_create_client (vsocket=0xa816e0)
#2  rte_vhost_driver_register
#3  netdev_dpdk_vhost_user_construct
#4  netdev_open (name=0xa664b0 "vhost1")
[...]
#11 main

Fix that by setting non-blocking mode for client sockets for connection.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Signed-off-by: Ilya Maximets 
---
This was reproduced with current QEMU master branch
(commit 1ecfb24da987b862f) + patch-set "vhost-user reconnect fixes"
(https://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg01547.html).

OVS was patched to support client mode:
http://openvswitch.org/pipermail/dev/2016-July/074972.html

Following script forces QEMU to fail to initialize vhost because
disconnection occures while device not fully configured:

while true
do
ovs-vsctl set Interface vhost1 ofport_request=125
ovs-vsctl set Interface vhost1 ofport_request=126
done

As a result: QEMU still works, network interface broken and OVS main
 thread stalled inside 'connect()'.

Version 3:
* Removed unnecessary 'getsockopt()' check.

Version 2:
* EINPROGRESS not checked. EISCONN checked instead on
  the next iteration of reconnection loop.



 lib/librte_vhost/vhost_user/vhost-net-user.c | 52 +---
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 62c5ec3..b35594d 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 #include 
@@ -449,6 +450,14 @@ create_unix_socket(const char *path, struct sockaddr_un 
*un, bool is_server)
RTE_LOG(INFO, VHOST_CONFIG, "vhost-user %s: socket created, fd: %d\n",
is_server ? "server" : "client", fd);

+   if (!is_server && fcntl(fd, F_SETFL, O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "vhost-user: can't set nonblocking mode for socket, fd: 
"
+   "%d (%s)\n", fd, strerror(errno));
+   close(fd);
+   return -1;
+   }
+
memset(un, 0, sizeof(*un));
un->sun_family = AF_UNIX;
strncpy(un->sun_path, path, sizeof(un->sun_path));
@@ -516,9 +525,33 @@ struct vhost_user_reconnect_list {
 static struct vhost_user_reconnect_list reconn_list;
 static pthread_t reconn_tid;

+static int
+vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
+{
+   int ret, flags;
+
+   ret = connect(fd, un, sz);
+   if (ret < 0 && errno != EISCONN)
+   return -1;
+
+   flags = fcntl(fd, F_GETFL, 0);
+   if (flags < 0) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't get flags for connfd %d\n", fd);
+   return -2;
+   }
+   if ((flags & O_NONBLOCK) && fcntl(fd, F_SETFL, flags & ~O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't disable nonblocking on fd %d\n", fd);
+   return -2;
+   }
+   return 0;
+}
+
 static void *
 vhost_user_client_reconnect(void *arg __rte_unused)
 {
+   int ret;
struct vhost_user_reconnect *reconn, *next;

while (1) {
@@ -532,13 +565,23 @@ vhost_user_client_reconnect(void *arg __rte_unused)
 reconn != NULL; reconn = next) {
next = TAILQ_NEXT(reconn, next);

-   if (connect(reconn->fd, (struct sockaddr *)>un,
-   sizeof(reconn->un)) < 0)
+   ret = vhost_user_connect_nonblock(reconn->fd,
+   (struct sockaddr *)>un,
+   sizeof(reconn->un));
+   if (ret == -2) {
+   close(reconn->fd);
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "reconnection for fd %d failed\n",
+   reconn->fd);
+   goto remove_fd;
+   }
+   if (ret == -1)
continue;

RTE_LOG(INFO, VHOST_CONFIG,
"%s: connected\n", reconn->vsocket->path);
vhost_user_add_connection(reconn->fd, reconn->vsocket);
+remove_fd:

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

On 21.07.2016 15:58, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 03:42:54PM +0300, Ilya Maximets wrote:
>> On 21.07.2016 15:35, Yuanhan Liu wrote:
>>> On Thu, Jul 21, 2016 at 03:13:14PM +0300, Ilya Maximets wrote:
>>
>> What do you think of it?
>
> I found that we can't check connection status without select/poll
> on it. 'getsockopt()' will return 0 with no errors if connection
> is not still established just like if it was.
> So, I think, the first version of this patch is the only
> acceptable solution.

 Sorry, v2 is acceptable too, because it always calls 'connect()'.
>>>
>>> So you have done the test that it works?
>>
>> No, it's just theory. I don't know how to test this.
>>
>>> I'm more curious to know
>>> could your above case hit the getsockopt() code path, I mean, the
>>> path that errno is set to EINPROGRESS or EISCONN?
>>
>> As I already told, I don't sure that we're able to get EINPROGRESS
>> on our AF_UNIX sockets.
>> In v2 'getsockopt()' check is unnecessary.
> 
> We then have no reason to keep it?

You want me to re-send v2 without 'getsockopt()' check?

[dpdk-dev] [PATCH v2] vhost: fix driver unregister for client mode

2016-07-21 Thread Ilya Maximets

I've fixed leak of file descriptors in 'vhost_user_remove_reconnect()'
and sent v3.

On 21.07.2016 11:31, Ilya Maximets wrote:
> Currently while calling of 'rte_vhost_driver_unregister()' connection
> to QEMU will not be closed. This leads to inability to register driver
> again and reconnect to same virtual machine.
> 
> This scenario is reproducible with OVS. While executing of the following
> command vhost port will be re-created (will be executed
> 'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
> network will be broken and QEMU possibly will crash:
> 
>   ovs-vsctl set Interface vhost1 ofport_request=15
> 
> Fix this by closing all established connections on driver unregister and
> removing of pending connections from reconnection list.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> 
> Acked-by: Yuanhan Liu 
> Signed-off-by: Ilya Maximets 
> ---
>  lib/librte_vhost/vhost_user/fd_man.c | 15 ++--
>  lib/librte_vhost/vhost_user/fd_man.h |  2 +-
>  lib/librte_vhost/vhost_user/vhost-net-user.c | 56 
> 
>  3 files changed, 62 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost_user/fd_man.c 
> b/lib/librte_vhost/vhost_user/fd_man.c
> index c691339..2d3eeb7 100644
> --- a/lib/librte_vhost/vhost_user/fd_man.c
> +++ b/lib/librte_vhost/vhost_user/fd_man.c
> @@ -132,8 +132,10 @@ fdset_init(struct fdset *pfdset)
>   if (pfdset == NULL)
>   return;
>  
> - for (i = 0; i < MAX_FDS; i++)
> + for (i = 0; i < MAX_FDS; i++) {
>   pfdset->fd[i].fd = -1;
> + pfdset->fd[i].dat = NULL;
> + }
>   pfdset->num = 0;
>  }
>  
> @@ -166,14 +168,16 @@ fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, 
> fd_cb wcb, void *dat)
>  
>  /**
>   *  Unregister the fd from the fdset.
> + *  Returns context of a given fd or NULL.
>   */
> -void
> +void *
>  fdset_del(struct fdset *pfdset, int fd)
>  {
>   int i;
> + void *dat = NULL;
>  
>   if (pfdset == NULL || fd == -1)
> - return;
> + return NULL;
>  
>   do {
>   pthread_mutex_lock(>fd_mutex);
> @@ -181,13 +185,17 @@ fdset_del(struct fdset *pfdset, int fd)
>   i = fdset_find_fd(pfdset, fd);
>   if (i != -1 && pfdset->fd[i].busy == 0) {
>   /* busy indicates r/wcb is executing! */
> + dat = pfdset->fd[i].dat;
>   pfdset->fd[i].fd = -1;
>   pfdset->fd[i].rcb = pfdset->fd[i].wcb = NULL;
> + pfdset->fd[i].dat = NULL;
>   pfdset->num--;
>   i = -1;
>   }
>   pthread_mutex_unlock(>fd_mutex);
>   } while (i != -1);
> +
> + return dat;
>  }
>  
>  /**
> @@ -203,6 +211,7 @@ fdset_del_slot(struct fdset *pfdset, int index)
>  
>   pfdset->fd[index].fd = -1;
>   pfdset->fd[index].rcb = pfdset->fd[index].wcb = NULL;
> + pfdset->fd[index].dat = NULL;
>   pfdset->num--;
>  
>   pthread_mutex_unlock(>fd_mutex);
> diff --git a/lib/librte_vhost/vhost_user/fd_man.h 
> b/lib/librte_vhost/vhost_user/fd_man.h
> index 74ecde2..bd66ed1 100644
> --- a/lib/librte_vhost/vhost_user/fd_man.h
> +++ b/lib/librte_vhost/vhost_user/fd_man.h
> @@ -60,7 +60,7 @@ void fdset_init(struct fdset *pfdset);
>  int fdset_add(struct fdset *pfdset, int fd,
>   fd_cb rcb, fd_cb wcb, void *dat);
>  
> -void fdset_del(struct fdset *pfdset, int fd);
> +void *fdset_del(struct fdset *pfdset, int fd);
>  
>  void fdset_event_dispatch(struct fdset *pfdset);
>  
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
> b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index f0ea3a2..8c6a096 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -60,6 +60,7 @@
>  struct vhost_user_socket {
>   char *path;
>   int listenfd;
> + int connfd;
>   bool is_server;
>   bool reconnect;
>  };
> @@ -277,11 +278,13 @@ vhost_user_add_connection(int fd, struct 
> vhost_user_socket *vsocket)
>  
>   RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
>  
> + vsocket->connfd = fd;
>   conn->vsocket = vsocket;
>   conn->vid = vid;
>   ret = fdset_add(_user.fdset, fd, vhost_user_msg_handler,
>   NULL, conn);
>   if (ret < 0) {
> + vsocket->connfd = -1;
>   free(conn);
>   close(fd);
>   RTE_LOG(ERR, VHOST_CONFIG,
> @@ -329,6 +332,7 @@ vhost_user_msg_handler(int connfd, void *dat, int *remove)
>   RTE_LOG(ERR, VHOST_CONFIG,
>   "vhost read incorrect message\n");
>  
> + vsocket->connfd = -1;
>   close(connfd);
>   *remove = 1;
>   free(conn);
> @@ -635,6 +639,7 @@ rte_vhost_driver_register(const char *path, uint64_t 
> flags)
>

[dpdk-dev] [PATCH v3] vhost: fix driver unregister for client mode

2016-07-21 Thread Ilya Maximets

Currently while calling of 'rte_vhost_driver_unregister()' connection
to QEMU will not be closed. This leads to inability to register driver
again and reconnect to same virtual machine.

This scenario is reproducible with OVS. While executing of the following
command vhost port will be re-created (will be executed
'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
network will be broken and QEMU possibly will crash:

ovs-vsctl set Interface vhost1 ofport_request=15

Fix this by closing all established connections on driver unregister and
removing of pending connections from reconnection list.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Signed-off-by: Ilya Maximets 
---

Version 3:
* fixed leak of file descriptors by adding of
  'close(reconn->fd)' to 'vhost_user_remove_reconnect()'

Version 2:
* style fixes.

 lib/librte_vhost/vhost_user/fd_man.c | 15 ++--
 lib/librte_vhost/vhost_user/fd_man.h |  2 +-
 lib/librte_vhost/vhost_user/vhost-net-user.c | 57 
 3 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/fd_man.c 
b/lib/librte_vhost/vhost_user/fd_man.c
index c691339..2d3eeb7 100644
--- a/lib/librte_vhost/vhost_user/fd_man.c
+++ b/lib/librte_vhost/vhost_user/fd_man.c
@@ -132,8 +132,10 @@ fdset_init(struct fdset *pfdset)
if (pfdset == NULL)
return;

-   for (i = 0; i < MAX_FDS; i++)
+   for (i = 0; i < MAX_FDS; i++) {
pfdset->fd[i].fd = -1;
+   pfdset->fd[i].dat = NULL;
+   }
pfdset->num = 0;
 }

@@ -166,14 +168,16 @@ fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, fd_cb 
wcb, void *dat)

 /**
  *  Unregister the fd from the fdset.
+ *  Returns context of a given fd or NULL.
  */
-void
+void *
 fdset_del(struct fdset *pfdset, int fd)
 {
int i;
+   void *dat = NULL;

if (pfdset == NULL || fd == -1)
-   return;
+   return NULL;

do {
pthread_mutex_lock(>fd_mutex);
@@ -181,13 +185,17 @@ fdset_del(struct fdset *pfdset, int fd)
i = fdset_find_fd(pfdset, fd);
if (i != -1 && pfdset->fd[i].busy == 0) {
/* busy indicates r/wcb is executing! */
+   dat = pfdset->fd[i].dat;
pfdset->fd[i].fd = -1;
pfdset->fd[i].rcb = pfdset->fd[i].wcb = NULL;
+   pfdset->fd[i].dat = NULL;
pfdset->num--;
i = -1;
}
pthread_mutex_unlock(>fd_mutex);
} while (i != -1);
+
+   return dat;
 }

 /**
@@ -203,6 +211,7 @@ fdset_del_slot(struct fdset *pfdset, int index)

pfdset->fd[index].fd = -1;
pfdset->fd[index].rcb = pfdset->fd[index].wcb = NULL;
+   pfdset->fd[index].dat = NULL;
pfdset->num--;

pthread_mutex_unlock(>fd_mutex);
diff --git a/lib/librte_vhost/vhost_user/fd_man.h 
b/lib/librte_vhost/vhost_user/fd_man.h
index 74ecde2..bd66ed1 100644
--- a/lib/librte_vhost/vhost_user/fd_man.h
+++ b/lib/librte_vhost/vhost_user/fd_man.h
@@ -60,7 +60,7 @@ void fdset_init(struct fdset *pfdset);
 int fdset_add(struct fdset *pfdset, int fd,
fd_cb rcb, fd_cb wcb, void *dat);

-void fdset_del(struct fdset *pfdset, int fd);
+void *fdset_del(struct fdset *pfdset, int fd);

 void fdset_event_dispatch(struct fdset *pfdset);

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index f0ea3a2..62c5ec3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -60,6 +60,7 @@
 struct vhost_user_socket {
char *path;
int listenfd;
+   int connfd;
bool is_server;
bool reconnect;
 };
@@ -277,11 +278,13 @@ vhost_user_add_connection(int fd, struct 
vhost_user_socket *vsocket)

RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);

+   vsocket->connfd = fd;
conn->vsocket = vsocket;
conn->vid = vid;
ret = fdset_add(_user.fdset, fd, vhost_user_msg_handler,
NULL, conn);
if (ret < 0) {
+   vsocket->connfd = -1;
free(conn);
close(fd);
RTE_LOG(ERR, VHOST_CONFIG,
@@ -329,6 +332,7 @@ vhost_user_msg_handler(int connfd, void *dat, int *remove)
RTE_LOG(ERR, VHOST_CONFIG,
"vhost read incorrect message\n");

+   vsocket->connfd = -1;
close(connfd);
*remove = 1;
free(conn);
@@ -635,6 +639,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
goto out;
memset(vsocket, 0, sizeof(struct vhost_user_socket));
vsocket->path = strdup(path);
+   vsocket->connfd = -1;

if ((flags

[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-21 Thread Jay Rolette

On Thu, Jul 21, 2016 at 3:32 PM, Thomas Monjalon 
wrote:

> 2016-07-21 13:20, Jay Rolette:
> > On Thu, Jul 21, 2016 at 10:33 AM, Ferruh Yigit 
> > wrote:
> > > KNI ethtool is functional and maintained, and it may have users!
> > >
> > > Why just removing it, specially without providing an alternative?
> > > Is is good time to discuss KCP again?
> >
> > Yes, my product uses it.
>
> Your product uses what? KCP? KNI? KNI ethtool?
>

Sorry, that wasn't very clear. It uses KNI + ifconfig to configure the
device/interface in Linux. I'm assuming the "ethtool" bits under discussion
are the same things that make ifconfig work with KNI to the limited extent
it does.

> Seems like we are back to the same discussion we
> > had a few months ago about the KNI situation...
> >
> > It shouldn't be removed unless there is a replacement, ideally one that
> > works with the normal Linux tools like every other network device.
>
> This ethtool module works only for igb and ixgbe!
> There is already no replacement for other drivers.
> Who works on a replacement?
>

Ferruh submitted KCP previously, but you guys didn't like the fact that it
was a kernel module. IIRC, one of the gains from that was simplified
maintenance because you didn't need driver specific support for KNI.
Assuming he's still willing to beat it into shape, we have something that
is already most of the way there.

If people are going to continue to block it because it is a kernel module,
then IMO, it's better to leave the existing support on igx / ixgbe in place
instead of stepping backwards to zero support for ethtool.

> While the code wasn't ready at the time, it was a definite improvement
> over what
> > we have with KNI today.
>

[dpdk-dev] [PATCH] memzone: allow full length name

2016-07-21 Thread Olivier Matz

Hi,

On 07/20/2016 07:16 PM, Zoltan Kiss wrote:
> (strlen(name) == sizeof(mz->name) - 1) is a valid case, change the
> condition to reflect that.
> Move it earlier to avoid lookup with invalid name.
> Change errno to ENAMETOOLONG.
> 
> Fixes: 85cf0079 ("mem: avoid memzone/mempool/ring name truncation")
> 
> Signed-off-by: Zoltan Kiss 
> ---
>  lib/librte_eal/common/eal_common_memzone.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_memzone.c 
> b/lib/librte_eal/common/eal_common_memzone.c
> index 5d28341..1bd0a33 100644
> --- a/lib/librte_eal/common/eal_common_memzone.c
> +++ b/lib/librte_eal/common/eal_common_memzone.c
> @@ -144,6 +144,13 @@ memzone_reserve_aligned_thread_unsafe(const char *name, 
> size_t len,
>   return NULL;
>   }
>  
> + if (strlen(name) > sizeof(mz->name) - 1) {
> + RTE_LOG(DEBUG, EAL, "%s(): memzone <%s>: name too long\n",
> + __func__, name);
> + rte_errno = ENAMETOOLONG;
> + return NULL;
> + }
> +
>   /* zone already exist */
>   if ((memzone_lookup_thread_unsafe(name)) != NULL) {
>   RTE_LOG(DEBUG, EAL, "%s(): memzone <%s> already exists\n",
> @@ -152,13 +159,6 @@ memzone_reserve_aligned_thread_unsafe(const char *name, 
> size_t len,
>   return NULL;
>   }
>  
> - if (strlen(name) >= sizeof(mz->name) - 1) {
> - RTE_LOG(DEBUG, EAL, "%s(): memzone <%s>: name too long\n",
> - __func__, name);
> - rte_errno = EEXIST;
> - return NULL;
> - }
> -
>   /* if alignment is not a power of two */
>   if (align && !rte_is_power_of_2(align)) {
>   RTE_LOG(ERR, EAL, "%s(): Invalid alignment: %u\n", __func__,
> 

Acked-by: Olivier Matz 

Thanks for fixing this.

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

On 21.07.2016 15:35, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 03:13:14PM +0300, Ilya Maximets wrote:

 What do you think of it?
>>>
>>> I found that we can't check connection status without select/poll
>>> on it. 'getsockopt()' will return 0 with no errors if connection
>>> is not still established just like if it was.
>>> So, I think, the first version of this patch is the only
>>> acceptable solution.
>>
>> Sorry, v2 is acceptable too, because it always calls 'connect()'.
> 
> So you have done the test that it works?

No, it's just theory. I don't know how to test this.

> I'm more curious to know
> could your above case hit the getsockopt() code path, I mean, the
> path that errno is set to EINPROGRESS or EISCONN?

As I already told, I don't sure that we're able to get EINPROGRESS
on our AF_UNIX sockets.
In v2 'getsockopt()' check is unnecessary.

[dpdk-dev] [PATCH] app/test: fix refcnt_mbuf_ring size

2016-07-21 Thread Olivier Matz

Hi Jerin,

On 07/18/2016 07:55 AM, Jerin Jacob wrote:
> rte_ring_create expects the size of the ring to
> be a power of 2. REFCNT_RING_SIZE value is not
> power of 2 in-case if RTE_MAX_LCORE == 96.
> Fix it by aligning the size to next power of 2 value.
> 
> Fixes: af75078f ("first public release")
> 
> Signed-off-by: Jerin Jacob 
> ---
>  app/test/test_mbuf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
> index 684ad80..c0823ea 100644
> --- a/app/test/test_mbuf.c
> +++ b/app/test/test_mbuf.c
> @@ -809,7 +809,7 @@ test_refcnt_mbuf(void)
>  
>   if (refcnt_mbuf_ring == NULL &&
>   (refcnt_mbuf_ring = rte_ring_create("refcnt_mbuf_ring",
> - REFCNT_RING_SIZE, SOCKET_ID_ANY,
> + rte_align32pow2(REFCNT_RING_SIZE), SOCKET_ID_ANY,
>   RING_F_SP_ENQ)) == NULL) {
>   printf("%s: cannot allocate " MAKE_STRING(refcnt_mbuf_ring)
>   "\n", __func__);
> 

Acked-by: Olivier Matz

[dpdk-dev] [PATCH v2] mempool: adjust name string size in related data types

2016-07-21 Thread Olivier Matz

Hi Zoltan,


On 07/20/2016 07:16 PM, Zoltan Kiss wrote:
> A recent patch brought up an issue about the size of the 'name' fields:
> 
> 85cf0079 mem: avoid memzone/mempool/ring name truncation
> 
> These relations should be observed:
> 
> 1. Each ring creates a memzone with a prefixed name:
> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
> 
> 2. There are some mempool handlers which create a ring with a prefixed
> name:
> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)
> 
> 3. A mempool can create up to RTE_MAX_MEMZONE pre and postfixed memzones:
> sprintf(postfix, "_%d", RTE_MAX_MEMZONE)
> RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
>   strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen(postfix)
> 
> Setting all of them to 32 hides this restriction from the application.
> This patch decreases the mempool and ring string size to accommodate for
> these prefixes, but it doesn't apply the 3rd constraint. Applications
> relying on these constants need to be recompiled, otherwise they'll run
> into ENAMETOOLONG issues.
> The size of the arrays are kept 32 for ABI compatibility, it can be
> decreased next time the ABI changes.
> 
> Signed-off-by: Zoltan Kiss 

Looks like to be a good compromise for the 16.07 release. One question
however: why not taking in account the 3rd constraint? Because it may
not completly fix the issue if the mempool is fragmented.

We could define RTE_MEMPOOL_NAMESIZE to 24
 = 32 - len('mp_') - len('_0123'))

Thanks,
Olivier

[dpdk-dev] [PATCH] eal: fix check number of bytes from read function

2016-07-21 Thread Sergio Gonzalez Monroy

On 20/07/2016 15:24, Michal Jastrzebski wrote:
> In rte_mem_virt2phy: Value returned from a function and indicating the
> number of bytes was ignored. This could cause a wrong pfn (page frame
> number) mask read from pagemap file.
> When read returns less than the number of sizeof(uint64_t) bytes,
> function rte_mem_virt2phy returns error.
>
> Coverity issue: 13212
> Fixes: 40b966a211ab ("ivshmem: library changes for mmaping using
> ivshmem").
>
> Signed-off-by: Michal Jastrzebski 
> ---
>   lib/librte_eal/linuxapp/eal/eal_memory.c |   12 ++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c 
> b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index 42a29fa..05769fb 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -158,7 +158,7 @@ rte_mem_lock_page(const void *virt)
>   phys_addr_t
>   rte_mem_virt2phy(const void *virtaddr)
>   {
> - int fd;
> + int fd, retval;
>   uint64_t page, physaddr;
>   unsigned long virt_pfn;
>   int page_size;
> @@ -209,11 +209,19 @@ rte_mem_virt2phy(const void *virtaddr)
>   close(fd);
>   return RTE_BAD_PHYS_ADDR;
>   }
> - if (read(fd, , sizeof(uint64_t)) < 0) {
> +
> + retval = read(fd, , sizeof(uint64_t));
> + if (retval < 0) {
>   RTE_LOG(ERR, EAL, "%s(): cannot read /proc/self/pagemap: %s\n",
>   __func__, strerror(errno));
>   close(fd);
>   return RTE_BAD_PHYS_ADDR;
> + }   else if (retval >= 0 && retval < (int)sizeof(uint64_t)) {

Just a couple of nits, retval >= 0 it's already implicit, no need to do 
that check.

> + RTE_LOG(ERR, EAL, "%s(): read %d bytes from /proc/self/pagemap "
> + "but expected %d: %s\n",
> + __func__, retval, (int)sizeof(uint64_t), 
> strerror(errno));
> + close(fd);

Another nit, we could just close(fd) right after read, regardless of 
read being success or error as
we close(fd) also on success just before exiting the function.

Other than that:

Acked-by: Sergio Gonzalez Monroy 

> + return RTE_BAD_PHYS_ADDR;
>   }
>   
>   /*

[dpdk-dev] [PATCH] test_mempool: remove unused mp_ext var

2016-07-21 Thread Thomas Monjalon

2016-07-21 15:28, Olivier Matz:
> Hi Santosh,
> 
> On 07/21/2016 01:49 PM, Santosh Shukla wrote:
> > test_mempool func not using pointer variable 'mp_ext' and incorrectly 
> > freed. So
> > removing ptr var. Now freeing mp_stack var.
> > 
> > Signed-off-by: Santosh Shukla 
> > ---
> >  app/test/test_mempool.c |3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
> > index 46ad670..3b21cf7 100644
> > --- a/app/test/test_mempool.c
> > +++ b/app/test/test_mempool.c
> > @@ -506,7 +506,6 @@ test_mempool(void)
> >  {
> > struct rte_mempool *mp_cache = NULL;
> > struct rte_mempool *mp_nocache = NULL;
> > -   struct rte_mempool *mp_ext = NULL;
> > struct rte_mempool *mp_stack = NULL;
> >  
> > rte_atomic32_init();
> > @@ -605,7 +604,7 @@ test_mempool(void)
> >  err:
> > rte_mempool_free(mp_nocache);
> > rte_mempool_free(mp_cache);
> > -   rte_mempool_free(mp_ext);
> > +   rte_mempool_free(mp_stack);
> > return -1;
> >  }
> >  
> > 
> 
> Strange, it seems these modifications were present in latest patch from
> David Hunt (v6). Maybe a bad manipulation during the push?

Yes sorry, it seems to be a wrong manipulation, the v5 was pushed.
This patch seems to fix my error, thanks.

> The "Fixes:" line should be added though.
> 
> Acked-by: Olivier Matz

[dpdk-dev] [PATCH] test_mempool: remove unused mp_ext var

2016-07-21 Thread Olivier Matz

Hi Santosh,

On 07/21/2016 01:49 PM, Santosh Shukla wrote:
> test_mempool func not using pointer variable 'mp_ext' and incorrectly freed. 
> So
> removing ptr var. Now freeing mp_stack var.
> 
> Signed-off-by: Santosh Shukla 
> ---
>  app/test/test_mempool.c |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
> index 46ad670..3b21cf7 100644
> --- a/app/test/test_mempool.c
> +++ b/app/test/test_mempool.c
> @@ -506,7 +506,6 @@ test_mempool(void)
>  {
>   struct rte_mempool *mp_cache = NULL;
>   struct rte_mempool *mp_nocache = NULL;
> - struct rte_mempool *mp_ext = NULL;
>   struct rte_mempool *mp_stack = NULL;
>  
>   rte_atomic32_init();
> @@ -605,7 +604,7 @@ test_mempool(void)
>  err:
>   rte_mempool_free(mp_nocache);
>   rte_mempool_free(mp_cache);
> - rte_mempool_free(mp_ext);
> + rte_mempool_free(mp_stack);
>   return -1;
>  }
>  
> 

Strange, it seems these modifications were present in latest patch from
David Hunt (v6). Maybe a bad manipulation during the push?

The "Fixes:" line should be added though.

Acked-by: Olivier Matz

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-21 Thread Wiles, Keith


> On Jul 21, 2016, at 10:06 AM, Neil Horman  wrote:
> 
> On Thu, Jul 21, 2016 at 02:09:19PM +, Wiles, Keith wrote:
>> 
>>> On Jul 21, 2016, at 8:54 AM, Neil Horman  wrote:
>>> 
>>> On Wed, Jul 20, 2016 at 10:32:28PM +, Wiles, Keith wrote:
 
> On Jul 20, 2016, at 3:16 PM, Neil Horman  wrote:
> 
> On Wed, Jul 20, 2016 at 07:47:32PM +, Wiles, Keith wrote:
>> 
>>> On Jul 20, 2016, at 12:48 PM, Neil Horman  wrote:
>>> 
>>> On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
 2016-07-20 13:09, Neil Horman:
> From: Neil Horman 
> 
> John Mcnamara and I were discussing enhacing the validate_abi script 
> to build
> the dpdk tree faster with multiple jobs.  Theres no reason not to do 
> it, so this
> implements that requirement.  It uses a MAKE_JOBS variable that can 
> be set by
> the user to limit the job count.  By default the job count is set to 
> the number
> of online cpus.
 
 Please could you use the variable name DPDK_MAKE_JOBS?
 This name is already used in scripts/test-build.sh.
 
>>> Sure
>>> 
> +if [ -z "$MAKE_JOBS" ]
> +then
> + # This counts the number of cpus on the system
> + MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> +fi
 
 Is lscpu common enough?
 
>>> I'm not sure how to answer that.  lscpu is part of the util-linux 
>>> package, which
>>> is part of any base install.  Theres a variant for BSD, but I'm not 
>>> sure how
>>> common it is there.
>>> Neil
>>> 
 Another acceptable default would be just "-j" without any number.
 It would make the number of jobs unlimited.
>> 
>> I think the best is just use -j as it tries to use the correct number of 
>> jobs based on the number of cores, right?
>> 
> -j with no argument (or -j 0), is sort of, maybe what you want.  With 
> either of
> those options, make will just issue jobs as fast as it processes 
> dependencies.
> Dependent on how parallel the build is, that can lead to tons of waiting 
> process
> (i.e. more than your number of online cpus), which can actually hurt your 
> build
> time.
 
 I read the manual and looked at the code, which supports your statement. 
 (I think I had some statement on stack overflow and the last time I 
 believe anything on the internet :-) I have not seen a lot of differences 
 in compile times with -j on my system. Mostly I suspect it is the number 
 of paths in the dependency, cores and memory on the system.
 
 I have 72 lcores or 2 sockets, 18 cores per socket. Xeon 2.3Ghz cores.
 
 $ export RTE_TARGET=x86_64-native-linuxapp-gcc 
 
 $ time make install T=${RTE_TARGET}
 real   0m59.445s user  0m27.344s sys   0m7.040s
 
 $ time make install T=${RTE_TARGET} -j
 real   0m26.584s user  0m14.380s sys   0m5.120s
 
 # Remove the x86_64-native-linuxapp-gcc
 
 $ time make install T=${RTE_TARGET} -j 72
 real   0m23.454s user  0m10.832s sys   0m4.664s
 
 $ time make install T=${RTE_TARGET} -j 8
 real   0m23.812s user  0m10.672s sys   0m4.276s
 
 cd x86_64-native-linuxapp-gcc
 $ make clean
 $ time make
 real   0m28.539s user  0m9.820s sys0m3.620s
 
 # Do a make clean between each build.
 
 $ time make -j
 real   0m7.217s user   0m6.532s sys0m2.332s
 
 $ time make -j 8
 real   0m8.256s user   0m6.472s sys0m2.456s
 
 $ time make -j 72
 real   0m6.866s user   0m6.184s sys0m2.216s
 
 Just the real time numbers in the following table.
 
 processes real Time   depdirs
no -j 59.4sYes
  -j 8 23.8sYes
 -j 7223.5sYes
   -j   26.5sYes
 
no -j 28.5s No
  -j 8   8.2s No
 -j 72  6.8s No
   -j 7.2s No
 
 Looks like the depdirs build time on my system:
 $ make clean -j
 $ rm .depdirs
 $ time make -j
 real   0m23.734s user  0m11.228s sys   0m4.844s
 
 About 16 seconds, which is not a lot of savings. Now the difference from 
 no -j to -j is a lot, but the difference between -j and -j  is 
 not a huge saving. This leads me back to over engineering the problem when 
 ?-j? would work just as well here.
 
 Even on my MacBook Pro i7 system the difference is not that much 1m8s 
 without depdirs build for -j in a VirtualBox with all 4 cores 8G RAM. 
 Compared to 1m13s with -j 4 option.
 
 I just wonder if it makes a lot of sense to use cpuinfo in this given case

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets



On 21.07.2016 15:10, Ilya Maximets wrote:
> On 21.07.2016 14:40, Yuanhan Liu wrote:
>> On Thu, Jul 21, 2016 at 02:14:59PM +0300, Ilya Maximets wrote:
 Hmm, how about this fixup:
 --
 diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
 b/lib/librte_vhost/vhost_user/vhost-net-user.c
 index 8626d13..b0f45e6 100644
 --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
 +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
 @@ -537,18 +537,7 @@ vhost_user_connect_nonblock(int fd, struct sockaddr 
 *un, size_t sz)
errno = EINVAL;
  
ret = connect(fd, un, sz);
 -  if (ret == -1 && errno != EINPROGRESS)
 -  return -1;
 -  if (ret == 0)
 -  goto connected;
 -
 -  FD_ZERO();
 -  FD_SET(fd, );
 -
 -  ret = select(fd + 1, NULL, , NULL, );
 -  if (!ret)
 -  errno = ETIMEDOUT;
 -  if (ret != 1)
 +  if (ret < 0 && errno != EISCONN)
return -1;
  
ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
 @@ -558,7 +547,6 @@ vhost_user_connect_nonblock(int fd, struct sockaddr 
 *un, size_t sz)
return -1;
}
  
 -connected:
flags = fcntl(fd, F_GETFL, 0);
if (flags < 0) {
RTE_LOG(ERR, VHOST_CONFIG,
 --
 ?

 We will not check the EINPROGRESS, but subsequent 'connect()' will return
 EISCONN if connection already established. getsockopt() is kept just in
 case. Subsequent 'connect()' will happen on the next iteration of
 reconnection cycle (1 second sleep).
>>>
>>> I've sent v2 with this changes.
>>
>> Thanks. But still, it doesn't look clean to me. I was thinking following
>> might be cleaner?
>>
>> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c
>> b/lib/librte_vhost/vhost_user/vhost-net-user.
>> index f0f92f8..c0ef290 100644
>> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
>> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
>> @@ -532,6 +532,10 @@ vhost_user_client_reconnect(void *arg __rte_unused)
>>  reconn != NULL; reconn = next) {
>> next = TAILQ_NEXT(reconn, next);
>> 
>> +   if (reconn->conn_inprogress) {
>> +   /* do connect check here */
>> +   }
>> +
>> if (connect(reconn->fd, (struct sockaddr 
>> *)>un,
>> sizeof(reconn->un)) < 0)
>> continue;
>> @@ -605,6 +609,7 @@ vhost_user_create_client(struct vhost_user_socket 
>> *vsocket)
>> reconn->un = un;
>> reconn->fd = fd;
>> reconn->vsocket = vsocket;
>> +   reconn->conn_inprogress = errno == EINPROGRESS;
>> pthread_mutex_lock(_list.mutex);
>> TAILQ_INSERT_TAIL(_list.head, reconn, next);
>> pthread_mutex_unlock(_list.mutex);
>>
>> It's just a rough diff, hopefully it shows my idea clearly. And of
>> course, we should not call connect() anymore when conn_inprogress
>> is set.
>>
>> What do you think of it?
> 
> I found that we can't check connection status without select/poll
> on it. 'getsockopt()' will return 0 with no errors if connection
> is not still established just like if it was.
> So, I think, the first version of this patch is the only
> acceptable solution.

Sorry, v2 is acceptable too, because it always calls 'connect()'.

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

On 21.07.2016 14:40, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 02:14:59PM +0300, Ilya Maximets wrote:
>>> Hmm, how about this fixup:
>>> --
>>> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
>>> b/lib/librte_vhost/vhost_user/vhost-net-user.c
>>> index 8626d13..b0f45e6 100644
>>> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
>>> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
>>> @@ -537,18 +537,7 @@ vhost_user_connect_nonblock(int fd, struct sockaddr 
>>> *un, size_t sz)
>>> errno = EINVAL;
>>>  
>>> ret = connect(fd, un, sz);
>>> -   if (ret == -1 && errno != EINPROGRESS)
>>> -   return -1;
>>> -   if (ret == 0)
>>> -   goto connected;
>>> -
>>> -   FD_ZERO();
>>> -   FD_SET(fd, );
>>> -
>>> -   ret = select(fd + 1, NULL, , NULL, );
>>> -   if (!ret)
>>> -   errno = ETIMEDOUT;
>>> -   if (ret != 1)
>>> +   if (ret < 0 && errno != EISCONN)
>>> return -1;
>>>  
>>> ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
>>> @@ -558,7 +547,6 @@ vhost_user_connect_nonblock(int fd, struct sockaddr 
>>> *un, size_t sz)
>>> return -1;
>>> }
>>>  
>>> -connected:
>>> flags = fcntl(fd, F_GETFL, 0);
>>> if (flags < 0) {
>>> RTE_LOG(ERR, VHOST_CONFIG,
>>> --
>>> ?
>>>
>>> We will not check the EINPROGRESS, but subsequent 'connect()' will return
>>> EISCONN if connection already established. getsockopt() is kept just in
>>> case. Subsequent 'connect()' will happen on the next iteration of
>>> reconnection cycle (1 second sleep).
>>
>> I've sent v2 with this changes.
> 
> Thanks. But still, it doesn't look clean to me. I was thinking following
> might be cleaner?
> 
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c
> b/lib/librte_vhost/vhost_user/vhost-net-user.
> index f0f92f8..c0ef290 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -532,6 +532,10 @@ vhost_user_client_reconnect(void *arg __rte_unused)
>  reconn != NULL; reconn = next) {
> next = TAILQ_NEXT(reconn, next);
> 
> +   if (reconn->conn_inprogress) {
> +   /* do connect check here */
> +   }
> +
> if (connect(reconn->fd, (struct sockaddr 
> *)>un,
> sizeof(reconn->un)) < 0)
> continue;
> @@ -605,6 +609,7 @@ vhost_user_create_client(struct vhost_user_socket 
> *vsocket)
> reconn->un = un;
> reconn->fd = fd;
> reconn->vsocket = vsocket;
> +   reconn->conn_inprogress = errno == EINPROGRESS;
> pthread_mutex_lock(_list.mutex);
> TAILQ_INSERT_TAIL(_list.head, reconn, next);
> pthread_mutex_unlock(_list.mutex);
> 
> It's just a rough diff, hopefully it shows my idea clearly. And of
> course, we should not call connect() anymore when conn_inprogress
> is set.
> 
> What do you think of it?

I found that we can't check connection status without select/poll
on it. 'getsockopt()' will return 0 with no errors if connection
is not still established just like if it was.
So, I think, the first version of this patch is the only
acceptable solution.

Best regards, Ilya Maximets.

[dpdk-dev] [PATCH v5] eal: out-of-bounds write

2016-07-21 Thread Thomas Monjalon

2016-07-21 12:01, Mrozowicz, SlawomirX:
> Hi Thomas,
> 
> As I understand Sergio suggested to come back to the solution similar to v1.
> Could you comment or better take decision which solution should be applied, 
> please.
> 
> Best Regards,
> S?awomir 
> 
> 
> >-Original Message-
> >From: Gonzalez Monroy, Sergio
> >On 20/06/2016 11:09, Thomas Monjalon wrote:
> >> 2016-06-20 10:38, Sergio Gonzalez Monroy:
> >>> On 20/06/2016 10:14, Thomas Monjalon wrote:
> > +   RTE_LOG(ERR, EAL,
> > +   "All memory segments exhausted by IVSHMEM. "
>  There is no evidence that it is related to IVSHMEM.
>  "Not enough memory segments." would be more appropriate.
> >>> Actually we would hit this issue when all memsegs have been used by
> >IVSHMEM.
> >>> So I think the message is accurate.
> >> I think it's saner to avoid mixing "potential root cause of a use
> >> case" and "accurate description of the error".
> >> One day, the root cause could be different and the message will become
> >wrong.
> >> Here there is not enough memory segment.
> >>
> >
> >Right.
> >So the whole point of doing the check before the loop was to display the 
> >error
> >message with its specific cause.
> >
> >I would think that if the code changes and the message is not accurate then 
> >it
> >should also be updated.
> >
> >So if folks prefer a more generic error message probably we don't need the
> >check before the loop and just change the check condition inside the loop 
> >that
> >would end up printing the generic error message (after the loop).
> >
> >Basically v1 would do that.
> >http://dpdk.org/dev/patchwork/patch/12241/

At this point of 16.07 we can apply the v1 if you agree.
The message about IVSHMEM will be totally wrong when the ivshmem specific
code will be removed.
If we need more error messages, feel free to send another patch.

[dpdk-dev] [PATCH v2] mempool: adjust name string size in related data types

2016-07-21 Thread Zoltan Kiss



On 21/07/16 14:40, Olivier Matz wrote:
> Hi Zoltan,
>
>
> On 07/20/2016 07:16 PM, Zoltan Kiss wrote:
>> A recent patch brought up an issue about the size of the 'name' fields:
>>
>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>
>> These relations should be observed:
>>
>> 1. Each ring creates a memzone with a prefixed name:
>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>>
>> 2. There are some mempool handlers which create a ring with a prefixed
>> name:
>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)
>>
>> 3. A mempool can create up to RTE_MAX_MEMZONE pre and postfixed memzones:
>> sprintf(postfix, "_%d", RTE_MAX_MEMZONE)
>> RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
>>  strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen(postfix)
>>
>> Setting all of them to 32 hides this restriction from the application.
>> This patch decreases the mempool and ring string size to accommodate for
>> these prefixes, but it doesn't apply the 3rd constraint. Applications
>> relying on these constants need to be recompiled, otherwise they'll run
>> into ENAMETOOLONG issues.
>> The size of the arrays are kept 32 for ABI compatibility, it can be
>> decreased next time the ABI changes.
>>
>> Signed-off-by: Zoltan Kiss 
>
> Looks like to be a good compromise for the 16.07 release. One question
> however: why not taking in account the 3rd constraint? Because it may
> not completly fix the issue if the mempool is fragmented.
>
> We could define RTE_MEMPOOL_NAMESIZE to 24
>  = 32 - len('mp_') - len('_0123'))

I was trying to figure out a compile time macro for strlen(postfix), but 
I could not. Your suggestion would work only until someone increases 
RTE_MAX_MEMZONE above . As the likelihood of fragmenting a pool over 
99 memzones seems small, I did not bother to fix this with an ugly hack, 
but if you think we should include it, let me know!

>
> Thanks,
> Olivier
>

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-21 Thread Neil Horman

On Thu, Jul 21, 2016 at 03:22:45PM +, Wiles, Keith wrote:
> 
> > On Jul 21, 2016, at 10:06 AM, Neil Horman  wrote:
> > 
> > On Thu, Jul 21, 2016 at 02:09:19PM +, Wiles, Keith wrote:
> >> 
> >>> On Jul 21, 2016, at 8:54 AM, Neil Horman  wrote:
> >>> 
> >>> On Wed, Jul 20, 2016 at 10:32:28PM +, Wiles, Keith wrote:
>  
> > On Jul 20, 2016, at 3:16 PM, Neil Horman  
> > wrote:
> > 
> > On Wed, Jul 20, 2016 at 07:47:32PM +, Wiles, Keith wrote:
> >> 
> >>> On Jul 20, 2016, at 12:48 PM, Neil Horman  
> >>> wrote:
> >>> 
> >>> On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
>  2016-07-20 13:09, Neil Horman:
> > From: Neil Horman 
> > 
> > John Mcnamara and I were discussing enhacing the validate_abi 
> > script to build
> > the dpdk tree faster with multiple jobs.  Theres no reason not to 
> > do it, so this
> > implements that requirement.  It uses a MAKE_JOBS variable that can 
> > be set by
> > the user to limit the job count.  By default the job count is set 
> > to the number
> > of online cpus.
>  
>  Please could you use the variable name DPDK_MAKE_JOBS?
>  This name is already used in scripts/test-build.sh.
>  
> >>> Sure
> >>> 
> > +if [ -z "$MAKE_JOBS" ]
> > +then
> > +   # This counts the number of cpus on the system
> > +   MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> > +fi
>  
>  Is lscpu common enough?
>  
> >>> I'm not sure how to answer that.  lscpu is part of the util-linux 
> >>> package, which
> >>> is part of any base install.  Theres a variant for BSD, but I'm not 
> >>> sure how
> >>> common it is there.
> >>> Neil
> >>> 
>  Another acceptable default would be just "-j" without any number.
>  It would make the number of jobs unlimited.
> >> 
> >> I think the best is just use -j as it tries to use the correct number 
> >> of jobs based on the number of cores, right?
> >> 
> > -j with no argument (or -j 0), is sort of, maybe what you want.  With 
> > either of
> > those options, make will just issue jobs as fast as it processes 
> > dependencies.
> > Dependent on how parallel the build is, that can lead to tons of 
> > waiting process
> > (i.e. more than your number of online cpus), which can actually hurt 
> > your build
> > time.
>  
>  I read the manual and looked at the code, which supports your statement. 
>  (I think I had some statement on stack overflow and the last time I 
>  believe anything on the internet :-) I have not seen a lot of 
>  differences in compile times with -j on my system. Mostly I suspect it 
>  is the number of paths in the dependency, cores and memory on the system.
>  
>  I have 72 lcores or 2 sockets, 18 cores per socket. Xeon 2.3Ghz cores.
>  
>  $ export RTE_TARGET=x86_64-native-linuxapp-gcc 
>  
>  $ time make install T=${RTE_TARGET}
>  real 0m59.445s user  0m27.344s sys   0m7.040s
>  
>  $ time make install T=${RTE_TARGET} -j
>  real 0m26.584s user  0m14.380s sys   0m5.120s
>  
>  # Remove the x86_64-native-linuxapp-gcc
>  
>  $ time make install T=${RTE_TARGET} -j 72
>  real 0m23.454s user  0m10.832s sys   0m4.664s
>  
>  $ time make install T=${RTE_TARGET} -j 8
>  real 0m23.812s user  0m10.672s sys   0m4.276s
>  
>  cd x86_64-native-linuxapp-gcc
>  $ make clean
>  $ time make
>  real 0m28.539s user  0m9.820s sys0m3.620s
>  
>  # Do a make clean between each build.
>  
>  $ time make -j
>  real 0m7.217s user   0m6.532s sys0m2.332s
>  
>  $ time make -j 8
>  real 0m8.256s user   0m6.472s sys0m2.456s
>  
>  $ time make -j 72
>  real 0m6.866s user   0m6.184s sys0m2.216s
>  
>  Just the real time numbers in the following table.
>  
>  processes real Time   depdirs
> no -j 59.4sYes
>   -j 8 23.8sYes
>  -j 7223.5sYes
>    -j   26.5sYes
>  
> no -j 28.5s No
>   -j 8   8.2s No
>  -j 72  6.8s No
>    -j 7.2s No
>  
>  Looks like the depdirs build time on my system:
>  $ make clean -j
>  $ rm .depdirs
>  $ time make -j
>  real 0m23.734s user  0m11.228s sys   0m4.844s
>  
>  About 16 seconds, which is not a lot of savings. Now the difference from 
>  no -j to -j is a lot, but the difference between -j and -j  
>  is not a huge saving. This leads me back to over engineering the problem 
>  when ?-j? would

[dpdk-dev] [PATCH 05/12] mbuf: add new Rx checksum mbuf flags

2016-07-21 Thread Stephen Hemminger

On Thu, 21 Jul 2016 10:08:23 +0200
Olivier Matz  wrote:

> +/**
> + * Deprecated.
> + * Checking this flag alone is deprecated: check the 2 bits of
> + * PKT_RX_L4_CKSUM_MASK.
> + * This flag was set when the L4 checksum of a packet was detected as
> + * wrong by the hardware.
> + */
> +#define PKT_RX_L4_CKSUM_BAD  (1ULL << 3)
> +
> +/**
> + * Deprecated.
> + * Checking this flag alone is deprecated: check the 2 bits of
> + * PKT_RX_IP_CKSUM_MASK.
> + * This flag was set when the IP checksum of a packet was detected as
> + * wrong by the hardware.
> + */
> +#define PKT_RX_IP_CKSUM_BAD  (1ULL << 4)

I think you should use the GCC deprecated attribute, not sure how though

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

On 21.07.2016 13:37, Ilya Maximets wrote:
> 
> 
> On 21.07.2016 13:13, Yuanhan Liu wrote:
>> On Thu, Jul 21, 2016 at 12:45:32PM +0300, Ilya Maximets wrote:
>>> On 21.07.2016 12:37, Yuanhan Liu wrote:
 On Thu, Jul 21, 2016 at 11:21:15AM +0300, Ilya Maximets wrote:
> If something abnormal happened to QEMU, 'connect()' can block calling
> thread (e.g. main thread of OVS) forever or for a really long time.
> This can break whole application or block the reconnection thread.
>
> Example with OVS:
>
>   ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>   (gdb) bt
>   #0  connect () from /lib64/libpthread.so.0
>   #1  vhost_user_create_client (vsocket=0xa816e0)
>   #2  rte_vhost_driver_register
>   #3  netdev_dpdk_vhost_user_construct
>   #4  netdev_open (name=0xa664b0 "vhost1")
>   [...]
>   #11 main
>
> Fix that by setting non-blocking mode for client sockets for connection.
>
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

 Thanks for spotting and fixing yet another bug!

>  
> +static int
> +vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)

 I don't quite understand why this is needed: connect() with O_NONBLOCK
 flag set is not enough?
>>>
>>> There is a little issue with non-blocking connect() call. Connection
>>> establishing may be started but '-1' returned with 'errno = EINPROGRESS'.
>>> In this case we must wait on fd until it will be available for writing.
>>> After that we need to check current status of connection using getsockopt().
>>>
>>> I don't sure that we're able to get such situation, but it's documented,
>>> and, I think, we should handle it.
>>>
>>> See 'man connect' for details.
>>
>> I see. Thanks.
>>
>> But basically, I don't like the way of introduing yet another
>> fdset here. I'm wondering we could leverage current fdset code
>> to achieve that. This might need some work though.
>>
>> So how about making it simple and stupid at this stage: sleep a
>> while (maybe 1ms, or maybe 1s) when that happens, and give up
>> when the connection is still not established?
> 
> Hmm, how about this fixup:
> --
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
> b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index 8626d13..b0f45e6 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -537,18 +537,7 @@ vhost_user_connect_nonblock(int fd, struct sockaddr *un, 
> size_t sz)
>   errno = EINVAL;
>  
>   ret = connect(fd, un, sz);
> - if (ret == -1 && errno != EINPROGRESS)
> - return -1;
> - if (ret == 0)
> - goto connected;
> -
> - FD_ZERO();
> - FD_SET(fd, );
> -
> - ret = select(fd + 1, NULL, , NULL, );
> - if (!ret)
> - errno = ETIMEDOUT;
> - if (ret != 1)
> + if (ret < 0 && errno != EISCONN)
>   return -1;
>  
>   ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
> @@ -558,7 +547,6 @@ vhost_user_connect_nonblock(int fd, struct sockaddr *un, 
> size_t sz)
>   return -1;
>   }
>  
> -connected:
>   flags = fcntl(fd, F_GETFL, 0);
>   if (flags < 0) {
>   RTE_LOG(ERR, VHOST_CONFIG,
> --
> ?
> 
> We will not check the EINPROGRESS, but subsequent 'connect()' will return
> EISCONN if connection already established. getsockopt() is kept just in
> case. Subsequent 'connect()' will happen on the next iteration of
> reconnection cycle (1 second sleep).

I've sent v2 with this changes.

Best regards, Ilya Maximets.

[dpdk-dev] [PATCH v2] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

If something abnormal happened to QEMU, 'connect()' can block calling
thread (e.g. main thread of OVS) forever or for a really long time.
This can break whole application or block the reconnection thread.

Example with OVS:

ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
(gdb) bt
#0  connect () from /lib64/libpthread.so.0
#1  vhost_user_create_client (vsocket=0xa816e0)
#2  rte_vhost_driver_register
#3  netdev_dpdk_vhost_user_construct
#4  netdev_open (name=0xa664b0 "vhost1")
[...]
#11 main

Fix that by setting non-blocking mode for client sockets for connection.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Signed-off-by: Ilya Maximets 
---
This was reproduced with current QEMU master branch
(commit 1ecfb24da987b862f) + patch-set "vhost-user reconnect fixes"
(https://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg01547.html).

OVS was patched to support client mode:
http://openvswitch.org/pipermail/dev/2016-July/074972.html

Following script forces QEMU to fail to initialize vhost because
disconnection occures while device not fully configured:

while true
do
ovs-vsctl set Interface vhost1 ofport_request=125
ovs-vsctl set Interface vhost1 ofport_request=126
done

As a result: QEMU still works, network interface broken and OVS main
 thread stalled inside 'connect()'.

Version 2:
* EINPROGRESS not checked. EISCONN checked instead on
  the next iteration of reconnection loop.

 lib/librte_vhost/vhost_user/vhost-net-user.c | 62 ++--
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8c6a096..63e0840 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 #include 
@@ -449,6 +450,14 @@ create_unix_socket(const char *path, struct sockaddr_un 
*un, bool is_server)
RTE_LOG(INFO, VHOST_CONFIG, "vhost-user %s: socket created, fd: %d\n",
is_server ? "server" : "client", fd);

+   if (!is_server && fcntl(fd, F_SETFL, O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "vhost-user: can't set nonblocking mode for socket, fd: 
"
+   "%d (%s)\n", fd, strerror(errno));
+   close(fd);
+   return -1;
+   }
+
memset(un, 0, sizeof(*un));
un->sun_family = AF_UNIX;
strncpy(un->sun_path, path, sizeof(un->sun_path));
@@ -516,9 +525,43 @@ struct vhost_user_reconnect_list {
 static struct vhost_user_reconnect_list reconn_list;
 static pthread_t reconn_tid;

+static int
+vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
+{
+   int ret, flags, so_error;
+   socklen_t len = sizeof(so_error);
+
+   errno = EINVAL;
+
+   ret = connect(fd, un, sz);
+   if (ret < 0 && errno != EISCONN)
+   return -1;
+
+   ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
+   if (ret < 0 || so_error) {
+   if (!ret)
+   errno = so_error;
+   return -1;
+   }
+
+   flags = fcntl(fd, F_GETFL, 0);
+   if (flags < 0) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't get flags for connfd %d\n", fd);
+   return -2;
+   }
+   if ((flags & O_NONBLOCK) && fcntl(fd, F_SETFL, flags & ~O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't disable nonblocking on fd %d\n", fd);
+   return -2;
+   }
+   return 0;
+}
+
 static void *
 vhost_user_client_reconnect(void *arg __rte_unused)
 {
+   int ret;
struct vhost_user_reconnect *reconn, *next;

while (1) {
@@ -532,13 +575,23 @@ vhost_user_client_reconnect(void *arg __rte_unused)
 reconn != NULL; reconn = next) {
next = TAILQ_NEXT(reconn, next);

-   if (connect(reconn->fd, (struct sockaddr *)>un,
-   sizeof(reconn->un)) < 0)
+   ret = vhost_user_connect_nonblock(reconn->fd,
+   (struct sockaddr *)>un,
+   sizeof(reconn->un));
+   if (ret == -2) {
+   close(reconn->fd);
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "reconnection for fd %d failed\n",
+   reconn->fd);
+   goto remove_fd;
+   }
+   if (ret == -1)
continue;

RTE_LOG(INFO,

[dpdk-dev] [PATCH V2] doc: fix vhost setup in tep-termination app guide

2016-07-21 Thread Mark Kavanagh

- Fix vhost setup flags
- Add minor edits to improve readability and consistency

---

v2: - revert file mode changes made erroneously in v1

Signed-off-by: Mark Kavanagh 
---
 doc/guides/sample_app_ug/tep_termination.rst | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/guides/sample_app_ug/tep_termination.rst 
b/doc/guides/sample_app_ug/tep_termination.rst
index 2d86a03..c3d1e97 100644
--- a/doc/guides/sample_app_ug/tep_termination.rst
+++ b/doc/guides/sample_app_ug/tep_termination.rst
@@ -59,8 +59,8 @@ This allows network isolation, QOS, etc to be provided on a 
per client basis.
 In a typical setup, the network overlay tunnel is terminated at the 
Virtual/Tunnel End Point (VEP/TEP).
 The TEP is normally located at the physical host level ideally in the software 
switch.
 Due to processing constraints and the inevitable bottleneck that the switch
-becomes the ability to offload overlay support features becomes an important 
requirement.
-Intel? XL710 10/40 G Ethernet network card provides hardware filtering
+becomes, the ability to offload overlay support features becomes an important 
requirement.
+Intel? XL710 10/40 Gigabit Ethernet network card provides hardware filtering
 and offload capabilities to support overlay networks implementations such as 
MAC in UDP and MAC in GRE.

 Sample Code Overview
@@ -131,14 +131,14 @@ Compiling the Sample Code

 .. code-block:: console

-CONFIG_RTE_LIBRTE_VHOST=n
+CONFIG_RTE_LIBRTE_VHOST=y

 vhost user is turned on by default in the configure file 
config/common_linuxapp.
 To enable vhost cuse, disable vhost user.

 .. code-block:: console

-CONFIG_RTE_LIBRTE_VHOST_USER=y
+CONFIG_RTE_LIBRTE_VHOST_USER=n

  After vhost is enabled and the implementation is selected, build the 
vhost library.

-- 
1.9.3

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-21 Thread Wiles, Keith


> On Jul 21, 2016, at 8:54 AM, Neil Horman  wrote:
> 
> On Wed, Jul 20, 2016 at 10:32:28PM +, Wiles, Keith wrote:
>> 
>>> On Jul 20, 2016, at 3:16 PM, Neil Horman  wrote:
>>> 
>>> On Wed, Jul 20, 2016 at 07:47:32PM +, Wiles, Keith wrote:
 
> On Jul 20, 2016, at 12:48 PM, Neil Horman  wrote:
> 
> On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
>> 2016-07-20 13:09, Neil Horman:
>>> From: Neil Horman 
>>> 
>>> John Mcnamara and I were discussing enhacing the validate_abi script to 
>>> build
>>> the dpdk tree faster with multiple jobs.  Theres no reason not to do 
>>> it, so this
>>> implements that requirement.  It uses a MAKE_JOBS variable that can be 
>>> set by
>>> the user to limit the job count.  By default the job count is set to 
>>> the number
>>> of online cpus.
>> 
>> Please could you use the variable name DPDK_MAKE_JOBS?
>> This name is already used in scripts/test-build.sh.
>> 
> Sure
> 
>>> +if [ -z "$MAKE_JOBS" ]
>>> +then
>>> +   # This counts the number of cpus on the system
>>> +   MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
>>> +fi
>> 
>> Is lscpu common enough?
>> 
> I'm not sure how to answer that.  lscpu is part of the util-linux 
> package, which
> is part of any base install.  Theres a variant for BSD, but I'm not sure 
> how
> common it is there.
> Neil
> 
>> Another acceptable default would be just "-j" without any number.
>> It would make the number of jobs unlimited.
 
 I think the best is just use -j as it tries to use the correct number of 
 jobs based on the number of cores, right?
 
>>> -j with no argument (or -j 0), is sort of, maybe what you want.  With 
>>> either of
>>> those options, make will just issue jobs as fast as it processes 
>>> dependencies.
>>> Dependent on how parallel the build is, that can lead to tons of waiting 
>>> process
>>> (i.e. more than your number of online cpus), which can actually hurt your 
>>> build
>>> time.
>> 
>> I read the manual and looked at the code, which supports your statement. (I 
>> think I had some statement on stack overflow and the last time I believe 
>> anything on the internet :-) I have not seen a lot of differences in compile 
>> times with -j on my system. Mostly I suspect it is the number of paths in 
>> the dependency, cores and memory on the system.
>> 
>> I have 72 lcores or 2 sockets, 18 cores per socket. Xeon 2.3Ghz cores.
>> 
>> $ export RTE_TARGET=x86_64-native-linuxapp-gcc 
>> 
>> $ time make install T=${RTE_TARGET}
>> real 0m59.445s user  0m27.344s sys   0m7.040s
>> 
>> $ time make install T=${RTE_TARGET} -j
>> real 0m26.584s user  0m14.380s sys   0m5.120s
>> 
>> # Remove the x86_64-native-linuxapp-gcc
>> 
>> $ time make install T=${RTE_TARGET} -j 72
>> real 0m23.454s user  0m10.832s sys   0m4.664s
>> 
>> $ time make install T=${RTE_TARGET} -j 8
>> real 0m23.812s user  0m10.672s sys   0m4.276s
>> 
>> cd x86_64-native-linuxapp-gcc
>> $ make clean
>> $ time make
>> real 0m28.539s user  0m9.820s sys0m3.620s
>> 
>> # Do a make clean between each build.
>> 
>> $ time make -j
>> real 0m7.217s user   0m6.532s sys0m2.332s
>> 
>> $ time make -j 8
>> real 0m8.256s user   0m6.472s sys0m2.456s
>> 
>> $ time make -j 72
>> real 0m6.866s user   0m6.184s sys0m2.216s
>> 
>> Just the real time numbers in the following table.
>> 
>> processes real Time   depdirs
>> no -j 59.4sYes
>>   -j 8 23.8sYes
>>  -j 7223.5sYes
>>-j   26.5sYes
>> 
>> no -j 28.5s No
>>   -j 8   8.2s No
>>  -j 72  6.8s No
>>-j 7.2s No
>> 
>> Looks like the depdirs build time on my system:
>> $ make clean -j
>> $ rm .depdirs
>> $ time make -j
>> real 0m23.734s user  0m11.228s sys   0m4.844s
>> 
>> About 16 seconds, which is not a lot of savings. Now the difference from no 
>> -j to -j is a lot, but the difference between -j and -j  is not a 
>> huge saving. This leads me back to over engineering the problem when ?-j? 
>> would work just as well here.
>> 
>> Even on my MacBook Pro i7 system the difference is not that much 1m8s 
>> without depdirs build for -j in a VirtualBox with all 4 cores 8G RAM. 
>> Compared to 1m13s with -j 4 option.
>> 
>> I just wonder if it makes a lot of sense to use cpuinfo in this given case 
>> if it turns out to be -j works with the 80% rule?
>> 
> It may, but that seems to be reason to me to just set DPDK_MAKE_JOBS=0, and
> you'll get that behavior

Just to be sure, ?make -j 0? is not a valid argument to the -j option. It looks 
like you have to do ?-j? or ?-j N? or no option where N != 0

I think we just use -j which gets us the 80% rule and the best performance 
without counting cores.

> 
> Neil
> 
>> On

[dpdk-dev] [PATCH V2] doc: fix vhost setup in tep-termination app guide

2016-07-21 Thread Mark Kavanagh

- Fix vhost setup flags
- Add minor edits to improve readability and consistency

---

v2: - revert file mode changes made erroneously in v1

Signed-off-by: Mark Kavanagh 
---
 doc/guides/sample_app_ug/tep_termination.rst | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
 mode change 100644 => 100755 doc/guides/sample_app_ug/tep_termination.rst

diff --git a/doc/guides/sample_app_ug/tep_termination.rst 
b/doc/guides/sample_app_ug/tep_termination.rst
old mode 100644
new mode 100755
index 2d86a03..c3d1e97
--- a/doc/guides/sample_app_ug/tep_termination.rst
+++ b/doc/guides/sample_app_ug/tep_termination.rst
@@ -59,8 +59,8 @@ This allows network isolation, QOS, etc to be provided on a 
per client basis.
 In a typical setup, the network overlay tunnel is terminated at the 
Virtual/Tunnel End Point (VEP/TEP).
 The TEP is normally located at the physical host level ideally in the software 
switch.
 Due to processing constraints and the inevitable bottleneck that the switch
-becomes the ability to offload overlay support features becomes an important 
requirement.
-Intel? XL710 10/40 G Ethernet network card provides hardware filtering
+becomes, the ability to offload overlay support features becomes an important 
requirement.
+Intel? XL710 10/40 Gigabit Ethernet network card provides hardware filtering
 and offload capabilities to support overlay networks implementations such as 
MAC in UDP and MAC in GRE.

 Sample Code Overview
@@ -131,14 +131,14 @@ Compiling the Sample Code

 .. code-block:: console

-CONFIG_RTE_LIBRTE_VHOST=n
+CONFIG_RTE_LIBRTE_VHOST=y

 vhost user is turned on by default in the configure file 
config/common_linuxapp.
 To enable vhost cuse, disable vhost user.

 .. code-block:: console

-CONFIG_RTE_LIBRTE_VHOST_USER=y
+CONFIG_RTE_LIBRTE_VHOST_USER=n

  After vhost is enabled and the implementation is selected, build the 
vhost library.

-- 
1.9.3

[dpdk-dev] [PATCH] eal: fix parsing of argument of option --lcores

2016-07-21 Thread Wei Dai

The '-' in lcores set overrides cpu set of following
lcore set in the argument of EAL option --lcores.

Fixes: 53e54bf81700 ("eal: new option --lcores for cpu assignment")

Signed-off-by: Wei Dai 
---
 lib/librte_eal/common/eal_common_options.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_options.c 
b/lib/librte_eal/common/eal_common_options.c
index 0a594d7..96eb1a9 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -563,6 +563,7 @@ convert_to_cpuset(rte_cpuset_t *cpusetp,
  * lcores, cpus could be a single digit/range or a group.
  * '(' and ')' are necessary if it's a group.
  * If not supply '@cpus', the value of cpus uses the same as lcores.
+ * The 'a-b' in lcores not within '(' and ')' means a,a+1,...,b-1,b .
  * e.g. '1,2@(5-7),(3-5)@(0,2),(0,6),7-8' means start 9 EAL thread as below
  *   lcore 0 runs on cpuset 0x41 (cpu 0,6)
  *   lcore 1 runs on cpuset 0x2 (cpu 1)
@@ -571,6 +572,15 @@ convert_to_cpuset(rte_cpuset_t *cpusetp,
  *   lcore 6 runs on cpuset 0x41 (cpu 0,6)
  *   lcore 7 runs on cpuset 0x80 (cpu 7)
  *   lcore 8 runs on cpuset 0x100 (cpu 8)
+ * e.g. '0-2,(3-5)@(3,4),6@(5,6),7@(5-7)'means start 8 EAL threads as below
+ *   lcore 0 runs on cpuset 0x1 (cpu 0)
+ *   lcore 1 runs on cpuset 0x2 (cpu 1)
+ *   lcore 2 runs on cpuset ox4 (cpu 2)
+ *   lcore 3,4,5 runs on cpuset 0x18 (cpu 3,4)
+ *   lcore 6 runs on cpuset 0x60 (cpu 5,6)
+ *   lcore 7 runs on cpuset 0xe0 (cpu 5,6,7)
+ * The second case is used to test bugfix for lflags not be cleared after use
+ */
  */
 static int
 eal_parse_lcores(const char *lcores)
@@ -679,6 +689,8 @@ eal_parse_lcores(const char *lcores)
   sizeof(rte_cpuset_t));
}

+   lflags = 0;
+
lcores = end + 1;
} while (*end != '\0');

-- 
2.5.5

[dpdk-dev] [PATCH] net/i40e: fix out-of-bounds writes during vector Rx

2016-07-21 Thread Ilya Maximets

From: Sergey Dyasly 

Rx loop inside _recv_raw_pkts_vec() ignores nb_pkts argument and always
tries to receive RTE_I40E_VPMD_RX_BURST (32) packets. This is a violation
of rte_eth_rx_burst() API and can lead to memory corruption (out-of-bounds
writes to struct rte_mbuf **rx_pkts) if nb_pkts is less than 32.

Fix this by actually using nb_pkts inside the loop.

Fixes: 9ed94e5bb04e ("i40e: add vector Rx")

Signed-off-by: Sergey Dyasly 
Acked-by: Ilya Maximets 
---
 drivers/net/i40e/i40e_rxtx_vec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 05cb415..51fb282 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -269,7 +269,7 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct 
rte_mbuf **rx_pkts,
 * D. fill info. from desc to mbuf
 */

-   for (pos = 0, nb_pkts_recd = 0; pos < RTE_I40E_VPMD_RX_BURST;
+   for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
pos += RTE_I40E_DESCS_PER_LOOP,
rxdp += RTE_I40E_DESCS_PER_LOOP) {
__m128i descs[RTE_I40E_DESCS_PER_LOOP];
-- 
2.7.4

[dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment

2016-07-21 Thread Kerlin, MarcinX

Hi Amin,

> -Original Message-
> From: Tootoonchian, Amin
> Sent: Wednesday, July 20, 2016 5:08 PM
> To: Kerlin, MarcinX 
> Cc: dev at dpdk.org; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH] ethdev: ensure consistent port id assignment
> 
> Hi Marcin,
> 
> Comments inline:
> 
> > -Original Message-
> > From: Kerlin, MarcinX
> > Sent: Wednesday, July 20, 2016 1:51 AM
> > To: Tootoonchian, Amin 
> > Cc: dev at dpdk.org; thomas.monjalon at 6wind.com
> > Subject: RE: [PATCH] ethdev: ensure consistent port id assignment
> >
> > Hi Amin,
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Tootoonchian,
> > > Amin
> > > Sent: Tuesday, July 12, 2016 4:01 AM
> > > To: thomas.monjalon at 6wind.com
> > > Cc: dev at dpdk.org
> > > Subject: [dpdk-dev] [PATCH] ethdev: ensure consistent port id
> > > assignment
> > >
> > > The rte_eth_dev_allocate() code has an implicit assumption that the
> > > port id assignment in the secondary process is consistent with that
> > > of the primary. The current code breaks if the enumeration of
> > > ethdevs in primary and secondary processes are not identical (e.g.,
> > > when the black/whitelist and vdev args of the primary and secondary
> > > do not match, or when the primary dynamically adds/removes ethdevs).
> > >
> > > To fix this problem, rte_eth_dev_allocate() now looks up port id by
> > > name in a secondary process (making it explicit that ethdevs can
> > > only be created and initialized by the primary process). Upon
> > > releasing a port, the primary process zeros out eth_dev->data to
> > > avoid false positives in port id lookup by rte_eth_dev_get_port_by_name().
> > >
> > > Signed-off-by: Amin Tootoonchian 
> > > ---
> > >  lib/librte_ether/rte_ethdev.c | 44
> > > +
> > > --
> > >  1 file changed, 30 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/lib/librte_ether/rte_ethdev.c
> > > b/lib/librte_ether/rte_ethdev.c index
> > > 0a6e3f1..1801f57 100644
> > > --- a/lib/librte_ether/rte_ethdev.c
> > > +++ b/lib/librte_ether/rte_ethdev.c
> > > @@ -195,25 +195,37 @@ rte_eth_dev_allocate(const char *name, enum
> > > rte_eth_dev_type type)
> > >   uint8_t port_id;
> > >   struct rte_eth_dev *eth_dev;
> > >
> > > - port_id = rte_eth_dev_find_free_port();
> > > - if (port_id == RTE_MAX_ETHPORTS) {
> > > - RTE_PMD_DEBUG_TRACE("Reached maximum number of
> > > Ethernet ports\n");
> > > - return NULL;
> > > - }
> > > -
> > >   if (rte_eth_dev_data == NULL)
> > >   rte_eth_dev_data_alloc();
> > >
> > > - if (rte_eth_dev_allocated(name) != NULL) {
> > > - RTE_PMD_DEBUG_TRACE("Ethernet Device with name %s
> > > already allocated!\n",
> > > - name);
> > > - return NULL;
> > > + if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> > > + port_id = rte_eth_dev_find_free_port();
> > > + if (port_id == RTE_MAX_ETHPORTS) {
> > > + RTE_PMD_DEBUG_TRACE("Reached maximum number
> > > of Ethernet ports\n");
> > > + return NULL;
> > > + }
> > > +
> > > + if (rte_eth_dev_allocated(name) != NULL) {
> > > + RTE_PMD_DEBUG_TRACE("Ethernet Device with name
> > > %s already allocated!\n",
> > > + name);
> > > + return NULL;
> > > + }
> > > + } else {
> > > + if (rte_eth_dev_get_port_by_name(name, _id) != 0) {
> >
> >
> > I was working also on this problem but I didn't send patch yet, so I
> > did review of your code.
> >
> > Condition (rte_eth_dev_get_port_by_name(name, _id) != 0) will
> > always fail.
> > Secondary process enter here and he will be looking for him name but
> > has not yet added and the application return NULL here e.g. we will
> > run app with device name pcap1 but this device is not on list and function
> return null.
> 
> This is the intended behavior with this patch. Ports are to be created only 
> by the
> primary process. This is required for correct operation IMO, because if we
> allow secondary processes to create ports dynamically (and locally use
> conflicting port ids) without any synchronization mechanism, they're
> guaranteed to overwrite each other's rte_eth_dev_data.

Thanks Amin for clarification,
I had another approach, that rte_eth_devices and rte_eth_dev_data should have 
different offset of port_id and secondary process can also add devices.

as I now understand with this patch we will not be able do something like:
Primary:
./test-pmd -c 0xf  -n 4 --socket-mem='512,0'  -w 03:00.1 -w 03:00.0  
--proc-type=primary --file-prefix=xz1 -- -i
Secondary: 
./test-pmd -c 0xf0 --socket-mem='512,0' -n 4 -v -b 03:00.1 -b 03:00.0 
--vdev 'eth_pcap0,rx_pcap=/var/log/device1.pcap,tx_pcap=/var/log/device2.pcap'
--proc-type=secondary --file-prefix=xz1 -- -i 

Because secondary processes "Ports are to be created only by the primary

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-21 Thread Rahul Lakkireddy

Hi Adrien,

The proposal looks very good.  It satisfies most of the features
supported by Chelsio NICs.  We are looking for suggestions on exposing
more additional features supported by Chelsio NICs via this API.

Chelsio NICs have two regions in which filters can be placed -
Maskfull and Maskless regions.  As their names imply, maskfull region
can accept masks to match a range of values; whereas, maskless region
don't accept any masks and hence perform a more strict exact-matches.
Filters without masks can also be placed in maskfull region.  By
default, maskless region have higher priority over the maskfull region.
However, the priority between the two regions is configurable.

Please suggest on how we can let the apps configure in which region
filters must be placed and set the corresponding priority accordingly
via this API.

More comments below.

On Tuesday, July 07/05/16, 2016 at 20:16:46 +0200, Adrien Mazarguil wrote:
> Hi All,
> 
[...]

> 
> ``ETH``
> ^^^
> 
> Matches an Ethernet header.
> 
> - ``dst``: destination MAC.
> - ``src``: source MAC.
> - ``type``: EtherType.
> - ``tags``: number of 802.1Q/ad tags defined.
> - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> 
>  - ``tpid``: Tag protocol identifier.
>  - ``tci``: Tag control information.
> 
> ``IPV4``
> 
> 
> Matches an IPv4 header.
> 
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tos``: ToS/DSCP field.
> - ``ttl``: TTL field.
> - ``proto``: protocol number for the next layer.
> 
> ``IPV6``
> 
> 
> Matches an IPv6 header.
> 
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tc``: traffic class field.
> - ``nh``: Next header field (protocol).
> - ``hop_limit``: hop limit field (TTL).
> 
> ``ICMP``
> 
> 
> Matches an ICMP header.
> 
> - TBD.
> 
> ``UDP``
> ^^^
> 
> Matches a UDP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - ``length``: UDP length.
> - ``checksum``: UDP checksum.
> 
> .. raw:: pdf
> 
>PageBreak
> 
> ``TCP``
> ^^^
> 
> Matches a TCP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - All other TCP fields and bits.
> 
> ``VXLAN``
> ^
> 
> Matches a VXLAN header.
> 
> - TBD.
> 

In addition to above matches, Chelsio NICs have some additional
features:

- Match based on unicast DST-MAC, multicast DST-MAC, broadcast DST-MAC.
  Also, there is a match criteria available called 'promisc' - which
  matches packets that are not destined for the interface, but had
  been received by the hardware due to interface being in promiscuous
  mode.

- Match FCoE packets.

- Match IP Fragmented packets.

- Match range of physical ports on the NIC in a single rule via masks.
  For ex: match all UDP packets coming on ports 3 and 4 out of 4
  ports available on the NIC.

- Match range of Physical Functions (PFs) on the NIC in a single rule
  via masks. For ex: match all traffic coming on several PFs.

Please suggest on how we can expose the above features to DPDK apps via
this API.

[...]

> 
> Actions
> ~~~
> 
> Each possible action is represented by a type. Some have associated
> configuration structures. Several actions combined in a list can be affected
> to a flow rule. That list is not ordered.
> 
> At least one action must be defined in a filter rule in order to do
> something with matched packets.
> 
> - Actions are defined with ``struct rte_flow_action``.
> - A list of actions is defined with ``struct rte_flow_actions``.
> 
> They fall in three categories:
> 
> - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
>   processing matched packets by subsequent flow rules, unless overridden
>   with PASSTHRU.
> 
> - Non terminating actions (PASSTHRU, DUP) that leave matched packets up for
>   additional processing by subsequent flow rules.
> 
> - Other non terminating meta actions that do not affect the fate of packets
>   (END, VOID, ID, COUNT).
> 
> When several actions are combined in a flow rule, they should all have
> different types (e.g. dropping a packet twice is not possible). However
> considering the VOID type is an exception to this rule, the defined behavior
> is for PMDs to only take into account the last action of a given type found
> in the list. PMDs still perform error checking on the entire list.
> 
> *Note that PASSTHRU is the only action able to override a terminating rule.*
> 

Chelsio NICs can support an action 'switch' which can re-direct
matched packets from one port to another port in hardware.  In addition,
it can also optionally:

1. Perform header rewrites (src-mac/dst-mac rewrite, src-mac/dst-mac
   swap, vlan add/remove/rewrite).

2. Perform NAT'ing in hardware (4-tuple rewrite).

before sending it out on the wire [1].

To meet the above requirements, we'd need a way to pass sub-actions
to action 'switch' and a way to pass extra info (such as new
src-mac/dst-mac, new vlan, new 4-tuple for NAT) to rewrite
corresponding

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets



On 21.07.2016 13:13, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 12:45:32PM +0300, Ilya Maximets wrote:
>> On 21.07.2016 12:37, Yuanhan Liu wrote:
>>> On Thu, Jul 21, 2016 at 11:21:15AM +0300, Ilya Maximets wrote:
 If something abnormal happened to QEMU, 'connect()' can block calling
 thread (e.g. main thread of OVS) forever or for a really long time.
 This can break whole application or block the reconnection thread.

 Example with OVS:

ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
(gdb) bt
#0  connect () from /lib64/libpthread.so.0
#1  vhost_user_create_client (vsocket=0xa816e0)
#2  rte_vhost_driver_register
#3  netdev_dpdk_vhost_user_construct
#4  netdev_open (name=0xa664b0 "vhost1")
[...]
#11 main

 Fix that by setting non-blocking mode for client sockets for connection.

 Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
>>>
>>> Thanks for spotting and fixing yet another bug!
>>>
  
 +static int
 +vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
>>>
>>> I don't quite understand why this is needed: connect() with O_NONBLOCK
>>> flag set is not enough?
>>
>> There is a little issue with non-blocking connect() call. Connection
>> establishing may be started but '-1' returned with 'errno = EINPROGRESS'.
>> In this case we must wait on fd until it will be available for writing.
>> After that we need to check current status of connection using getsockopt().
>>
>> I don't sure that we're able to get such situation, but it's documented,
>> and, I think, we should handle it.
>>
>> See 'man connect' for details.
> 
> I see. Thanks.
> 
> But basically, I don't like the way of introduing yet another
> fdset here. I'm wondering we could leverage current fdset code
> to achieve that. This might need some work though.
> 
> So how about making it simple and stupid at this stage: sleep a
> while (maybe 1ms, or maybe 1s) when that happens, and give up
> when the connection is still not established?

Hmm, how about this fixup:
--
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8626d13..b0f45e6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -537,18 +537,7 @@ vhost_user_connect_nonblock(int fd, struct sockaddr *un, 
size_t sz)
errno = EINVAL;

ret = connect(fd, un, sz);
-   if (ret == -1 && errno != EINPROGRESS)
-   return -1;
-   if (ret == 0)
-   goto connected;
-
-   FD_ZERO();
-   FD_SET(fd, );
-
-   ret = select(fd + 1, NULL, , NULL, );
-   if (!ret)
-   errno = ETIMEDOUT;
-   if (ret != 1)
+   if (ret < 0 && errno != EISCONN)
return -1;

ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
@@ -558,7 +547,6 @@ vhost_user_connect_nonblock(int fd, struct sockaddr *un, 
size_t sz)
return -1;
}

-connected:
flags = fcntl(fd, F_GETFL, 0);
if (flags < 0) {
RTE_LOG(ERR, VHOST_CONFIG,
--
?

We will not check the EINPROGRESS, but subsequent 'connect()' will return
EISCONN if connection already established. getsockopt() is kept just in
case. Subsequent 'connect()' will happen on the next iteration of
reconnection cycle (1 second sleep).

Best regards, Ilya Maximets.

[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-21 Thread Jay Rolette

On Thu, Jul 21, 2016 at 10:33 AM, Ferruh Yigit 
wrote:

> On 7/20/2016 5:07 PM, Thomas Monjalon wrote:
> > The out-of-tree kernel code must be avoided.
> > Moreover there is no good reason to keep this legacy feature
> > which is only partially supported.
> >
> > As described earlier in this plan:
> >   http://dpdk.org/ml/archives/dev/2016-July/043606.html
> > it will help to keep PCI ids in PMD code.
> >
> > Signed-off-by: Thomas Monjalon 
> > ---
> >  doc/guides/rel_notes/deprecation.rst | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> > index f502f86..9cadf6a 100644
> > --- a/doc/guides/rel_notes/deprecation.rst
> > +++ b/doc/guides/rel_notes/deprecation.rst
> > @@ -41,3 +41,10 @@ Deprecation Notices
> >  * The mempool functions for single/multi producer/consumer are
> deprecated and
> >will be removed in 16.11.
> >It is replaced by rte_mempool_generic_get/put functions.
> > +
> > +* The ethtool support will be removed from KNI in 16.11.
> > +  It is implemented only for igb and ixgbe.
> > +  It is really hard to maintain because it requires some out-of-tree
> kernel
> > +  code to be duplicated in this kernel module.
> > +  Removing this partial support will help to restrict the PCI id
> definitions
> > +  to the PMD code.
> >
>
> KNI ethtool is functional and maintained, and it may have users!
>
> Why just removing it, specially without providing an alternative?
> Is is good time to discuss KCP again?
>

Yes, my product uses it. Seems like we are back to the same discussion we
had a few months ago about the KNI situation...

It shouldn't be removed unless there is a replacement, ideally one that
works with the normal Linux tools like every other network device. While
the code wasn't ready at the time, it was a definite improvement over what
we have with KNI today.

Jay

[dpdk-dev] [PATCH V2] doc: fix vhost setup in tep-termination app guide

2016-07-21 Thread Kavanagh, Mark B

Please disregard - correct version of patch to follow.

Cheers,
Mark

>-Original Message-
>From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mark Kavanagh
>Sent: Thursday, July 21, 2016 2:04 PM
>To: dev at dpdk.org
>Subject: [dpdk-dev] [PATCH V2] doc: fix vhost setup in tep-termination app 
>guide
>
>- Fix vhost setup flags
>- Add minor edits to improve readability and consistency
>
>---
>
>v2: - revert file mode changes made erroneously in v1
>
>Signed-off-by: Mark Kavanagh 
>---
> doc/guides/sample_app_ug/tep_termination.rst | 8 
> 1 file changed, 4 insertions(+), 4 deletions(-)
> mode change 100644 => 100755 doc/guides/sample_app_ug/tep_termination.rst
>
>diff --git a/doc/guides/sample_app_ug/tep_termination.rst
>b/doc/guides/sample_app_ug/tep_termination.rst
>old mode 100644
>new mode 100755
>index 2d86a03..c3d1e97
>--- a/doc/guides/sample_app_ug/tep_termination.rst
>+++ b/doc/guides/sample_app_ug/tep_termination.rst
>@@ -59,8 +59,8 @@ This allows network isolation, QOS, etc to be provided on a 
>per client
>basis.
> In a typical setup, the network overlay tunnel is terminated at the 
> Virtual/Tunnel End Point
>(VEP/TEP).
> The TEP is normally located at the physical host level ideally in the 
> software switch.
> Due to processing constraints and the inevitable bottleneck that the switch
>-becomes the ability to offload overlay support features becomes an important 
>requirement.
>-Intel? XL710 10/40 G Ethernet network card provides hardware filtering
>+becomes, the ability to offload overlay support features becomes an important 
>requirement.
>+Intel? XL710 10/40 Gigabit Ethernet network card provides hardware filtering
> and offload capabilities to support overlay networks implementations such as 
> MAC in UDP and
>MAC in GRE.
>
> Sample Code Overview
>@@ -131,14 +131,14 @@ Compiling the Sample Code
>
> .. code-block:: console
>
>-CONFIG_RTE_LIBRTE_VHOST=n
>+CONFIG_RTE_LIBRTE_VHOST=y
>
> vhost user is turned on by default in the configure file 
> config/common_linuxapp.
> To enable vhost cuse, disable vhost user.
>
> .. code-block:: console
>
>-CONFIG_RTE_LIBRTE_VHOST_USER=y
>+CONFIG_RTE_LIBRTE_VHOST_USER=n
>
>  After vhost is enabled and the implementation is selected, build the 
> vhost library.
>
>--
>1.9.3

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-21 Thread Take Ceara

Hi Beilei,

On Wed, Jul 20, 2016 at 3:59 AM, Xing, Beilei  wrote:
> Hi Ceara,
>
>> -Original Message-
>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> Sent: Tuesday, July 19, 2016 10:59 PM
>> To: Xing, Beilei 
>> Cc: Zhang, Helin ; Wu, Jingjing
>> ; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710
>> NICs for some RX mbuf sizes
>>
>> Hi Beilei,
>>
>> I changed the way I run testmpd to:
>>
>> testpmd -c 0x331 -w :82:00.0 -w :83:00.0 -- --mbuf-size 1152 
>> --rss-ip -
>> -rxq=2 --txpkts 1024 -i
>>
>> As far as I understand this will allocate mbufs with the same size I was 
>> using
>> in my test (--mbuf-size seems to include the mbuf headroom therefore 1152
>> = 1024 + 128 headroom).
>>
>> testpmd> start tx_first
>>   io packet forwarding - CRC stripping disabled - packets/burst=32
>>   nb forwarding cores=1 - nb forwarding ports=2
>>   RX queues=2 - RX desc=128 - RX free threshold=32
>>   RX threshold registers: pthresh=8 hthresh=8 wthresh=0
>>   TX queues=1 - TX desc=512 - TX free threshold=32
>>   TX threshold registers: pthresh=32 hthresh=0 wthresh=0
>>   TX RS bit threshold=32 - TXQ flags=0xf01
>> testpmd> show port stats all
>>
>>    NIC statistics for port 0
>> 
>>   RX-packets: 18817613   RX-missed: 5  RX-bytes:  19269115888
>>   RX-errors: 0
>>   RX-nombuf:  0
>>   TX-packets: 18818064   TX-errors: 0  TX-bytes:  19269567464
>>
>> ##
>> ##
>>
>>    NIC statistics for port 1
>> 
>>   RX-packets: 18818392   RX-missed: 5  RX-bytes:  19269903360
>>   RX-errors: 0
>>   RX-nombuf:  0
>>   TX-packets: 18817979   TX-errors: 0  TX-bytes:  19269479424
>>
>> ##
>> ##
>>
>> Ttraffic is sent/received. However, I couldn't find any way to verify that 
>> the
>> incoming mbufs actually have the mbuf->hash.rss field set except for starting
>> test-pmd with gdb and setting a breakpoint in the io fwd engine. After doing
>> that I noticed that none of the incoming packets has the PKT_RX_RSS_HASH
>> flag set in ol_flags... I guess for some reason test-pmd doesn't actually
>> configure RSS in this case but I fail to see where.
>>
>
> Actually there's a way to check mbuf->hash.rss, you need set forward mode to 
> "rxonly", and set verbose to 1.
> I run testpmd with the configuration you used, and found i40e RSS works well.
> With the following steps, you can see RSS hash value and receive queue, and 
> PKT_RX_RSS_HASH is set too.
> I think you can use the same way to check what you want.
>
> ./testpmd -c f -n 4 -- -i --coremask=0xe --rxq=16 --txq=16 
> --mbuf-size 1152 --rss-ip --txpkts 1024
> testpmd> set verbose 1
> testpmd> set fwd rxonly
> testpmd> start
> testpmd> port 0/queue 1: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb
>  - Receive queue=0x1
>   PKT_RX_RSS_HASH
> port 0/queue 0: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb_segs=1 - RSS hash=0x4e949f23 - RSS queue=0x0Unknown packet type
>  - Receive queue=0x0
>   PKT_RX_RSS_HASH
> port 0/queue 8: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb_segs=1 - RSS hash=0xa3c78b2b - RSS queue=0x8Unknown packet type
>  - Receive queue=0x8
>   PKT_RX_RSS_HASH
> port 0/queue 5: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb_segs=1 - RSS hash=0xe29b3d36 - RSS queue=0x5Unknown packet type
>  - Receive queue=0x5
>   PKT_RX_RSS_HASH
>

Following your testpmd example run I managed to replicate the problem
on my dpdk 16.04 setup like this:

I have two X710 adapters connected back to back:
$ ./tools/dpdk_nic_bind.py -s

Network devices using DPDK-compatible driver

:01:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=
:81:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=

The firmware of the two adapters is up to date with the latest
version: 5.04 (f5.0.40043 a1.5 n5.04 e24cd)

I run testpmd with mbuf-size 1152 and txpktsize 1024 such that upon
receival the whole mbuf (except headroom) is filled.
I enabled RX IP checksum in hw and RX RSS hashing for UDP.
With test-pmd forward mode "rxonly" and verbose 1 I see that incoming
packets have PKT_RX_RSS_HASH set but the hash value is 0.

./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
--coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024
--enable-rx-cksum --rss-udp
[...]
testpmd> set verbose 1
Change verbose level from 0 to 1
testpmd> set fwd rxonly
Set rxonly packet forwarding mode
testpmd> start tx_first
  rxonly packet forwarding - CRC stripping

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

On 21.07.2016 12:37, Yuanhan Liu wrote:
> On Thu, Jul 21, 2016 at 11:21:15AM +0300, Ilya Maximets wrote:
>> If something abnormal happened to QEMU, 'connect()' can block calling
>> thread (e.g. main thread of OVS) forever or for a really long time.
>> This can break whole application or block the reconnection thread.
>>
>> Example with OVS:
>>
>>  ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
>>  (gdb) bt
>>  #0  connect () from /lib64/libpthread.so.0
>>  #1  vhost_user_create_client (vsocket=0xa816e0)
>>  #2  rte_vhost_driver_register
>>  #3  netdev_dpdk_vhost_user_construct
>>  #4  netdev_open (name=0xa664b0 "vhost1")
>>  [...]
>>  #11 main
>>
>> Fix that by setting non-blocking mode for client sockets for connection.
>>
>> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> 
> Thanks for spotting and fixing yet another bug!
> 
>>  
>> +static int
>> +vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
> 
> I don't quite understand why this is needed: connect() with O_NONBLOCK
> flag set is not enough?

There is a little issue with non-blocking connect() call. Connection
establishing may be started but '-1' returned with 'errno = EINPROGRESS'.
In this case we must wait on fd until it will be available for writing.
After that we need to check current status of connection using getsockopt().

I don't sure that we're able to get such situation, but it's documented,
and, I think, we should handle it.

See 'man connect' for details.

Best regards, Ilya Maximets.

[dpdk-dev] [PATCH v5] eal: out-of-bounds write

2016-07-21 Thread Mrozowicz, SlawomirX

Hi Thomas,

As I understand Sergio suggested to come back to the solution similar to v1.
Could you comment or better take decision which solution should be applied, 
please.

Best Regards,
S?awomir 


>-Original Message-
>From: Gonzalez Monroy, Sergio
>Sent: Monday, June 20, 2016 1:29 PM
>To: Thomas Monjalon 
>Cc: Mrozowicz, SlawomirX ;
>dev at dpdk.org; david.marchand at 6wind.com
>Subject: Re: [dpdk-dev] [PATCH v5] eal: out-of-bounds write
>
>On 20/06/2016 11:09, Thomas Monjalon wrote:
>> 2016-06-20 10:38, Sergio Gonzalez Monroy:
>>> On 20/06/2016 10:14, Thomas Monjalon wrote:
> + RTE_LOG(ERR, EAL,
> + "All memory segments exhausted by IVSHMEM. "
 There is no evidence that it is related to IVSHMEM.
 "Not enough memory segments." would be more appropriate.
>>> Actually we would hit this issue when all memsegs have been used by
>IVSHMEM.
>>> So I think the message is accurate.
>> I think it's saner to avoid mixing "potential root cause of a use
>> case" and "accurate description of the error".
>> One day, the root cause could be different and the message will become
>wrong.
>> Here there is not enough memory segment.
>>
>
>Right.
>So the whole point of doing the check before the loop was to display the error
>message with its specific cause.
>
>I would think that if the code changes and the message is not accurate then it
>should also be updated.
>
>So if folks prefer a more generic error message probably we don't need the
>check before the loop and just change the check condition inside the loop that
>would end up printing the generic error message (after the loop).
>
>Basically v1 would do that.
>http://dpdk.org/dev/patchwork/patch/12241/
>
>Sergio

[dpdk-dev] [PATCH v4 2/2] examples/ipsec-secgw: add sample configuration files

2016-07-21 Thread Fan Zhang

This patch adds two sample configuration files to ipsec-secgw sample
application. The sample configuration files shows how to set-up systems
back-to-back that would forward traffic through an IPsec tunnel.

Signed-off-by: Fan Zhang 
---
 examples/ipsec-secgw/ep0.cfg | 160 +++
 examples/ipsec-secgw/ep1.cfg | 160 +++
 2 files changed, 320 insertions(+)
 create mode 100644 examples/ipsec-secgw/ep0.cfg
 create mode 100644 examples/ipsec-secgw/ep1.cfg

diff --git a/examples/ipsec-secgw/ep0.cfg b/examples/ipsec-secgw/ep0.cfg
new file mode 100644
index 000..299aa9e
--- /dev/null
+++ b/examples/ipsec-secgw/ep0.cfg
@@ -0,0 +1,160 @@
+###
+#   IPSEC-SECGW Endpoint sample configuration
+#
+#   The main purpose of this file is to show how to configure two systems
+#   back-to-back that would forward traffic through an IPsec tunnel. This
+#   file is the Endpoint 0 configuration. To use this configuration file,
+#   add the following command-line option:
+#
+#   -f ./ep0.cfg
+#
+###
+
+#SP IPv4 rules
+sp ipv4 out esp protect 5 pri 1 dst 192.168.105.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 6 pri 1 dst 192.168.106.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 10 pri 1 dst 192.168.175.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 11 pri 1 dst 192.168.176.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 15 pri 1 dst 192.168.200.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 16 pri 1 dst 192.168.201.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 25 pri 1 dst 192.168.55.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp protect 26 pri 1 dst 192.168.56.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 out esp bypass pri 1 dst 192.168.240.0/24 sport 0:65535 dport 0:65535
+sp ipv4 out esp bypass pri 1 dst 192.168.241.0/24 sport 0:65535 dport 0:65535
+
+sp ipv4 in esp protect 105 pri 1 dst 192.168.115.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 106 pri 1 dst 192.168.116.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 110 pri 1 dst 192.168.185.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 111 pri 1 dst 192.168.186.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 115 pri 1 dst 192.168.210.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 116 pri 1 dst 192.168.211.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 115 pri 1 dst 192.168.210.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 125 pri 1 dst 192.168.65.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 125 pri 1 dst 192.168.65.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp protect 126 pri 1 dst 192.168.66.0/24 sport 0:65535 dport 
0:65535
+sp ipv4 in esp bypass pri 1 dst 192.168.245.0/24 sport 0:65535 dport 0:65535
+sp ipv4 in esp bypass pri 1 dst 192.168.246.0/24 sport 0:65535 dport 0:65535
+
+#SP IPv6 rules
+sp ipv6 out esp protect 5 pri 1 dst :::::::/96 
\
+sport 0:65535 dport 0:65535
+sp ipv6 out esp protect 6 pri 1 dst :::::::/96 
\
+sport 0:65535 dport 0:65535
+sp ipv6 out esp protect 10 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+sp ipv6 out esp protect 11 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+sp ipv6 out esp protect 25 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+sp ipv6 out esp protect 26 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+
+sp ipv6 in esp protect 15 pri 1 dst :::::::/96 
\
+sport 0:65535 dport 0:65535
+sp ipv6 in esp protect 16 pri 1 dst :::::::/96 
\
+sport 0:65535 dport 0:65535
+sp ipv6 in esp protect 110 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+sp ipv6 in esp protect 111 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+sp ipv6 in esp protect 125 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+sp ipv6 in esp protect 126 pri 1 dst 
:::::::/96 \
+sport 0:65535 dport 0:65535
+
+#SA rules
+sa out 5 cipher_algo aes-128-cbc cipher_key 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 \
+auth_algo sha1-hmac auth_key 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 \
+mode ipv4-tunnel src 172.16.1.5 dst 172.16.2.5
+
+sa out 6 cipher_algo aes-128-cbc cipher_key a0:a0:a0:a0:a0:a0:a0:a0:a0:a0:a0:\
+a0:a0:a0:a0:a0 auth_algo sha1-hmac auth_key a0:a0:a0:a0:a0:a0:a0:a0:a0:a0:a0:\
+a0:a0:a0:a0:a0:a0:a0:a0:a0 mode ipv4-tunnel src 172.16.1.6 dst 172.16.2.6
+
+sa out 10 cipher_algo aes-128-cbc cipher_key a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:\
+a1:a1:a1:a1:a1 auth_algo sha1-hmac

[dpdk-dev] [PATCH v4 1/2] examples/ipsec-secgw: add configuration file support

2016-07-21 Thread Fan Zhang

This patch adds the configuration file support to ipsec_secgw
sample application. Instead of hard-coded rules, the users can
specify their own SP, SA, and routing rules in the configuration
file. An command line option "-f" is added to pass the
configuration file location to the application.

Configuration item formats:

SP rule format:
sp   esp \
  

SA rule format:
sa   \
  

Routing rule format:
rt

Signed-off-by: Fan Zhang 
---
 doc/guides/sample_app_ug/ipsec_secgw.rst | 845 +--
 examples/ipsec-secgw/Makefile|   1 +
 examples/ipsec-secgw/ipsec-secgw.c   |  58 ++-
 examples/ipsec-secgw/ipsec.h |  14 +-
 examples/ipsec-secgw/parser.c| 599 ++
 examples/ipsec-secgw/parser.h| 116 +
 examples/ipsec-secgw/rt.c| 255 --
 examples/ipsec-secgw/sa.c| 737 +--
 examples/ipsec-secgw/sp4.c   | 538 
 examples/ipsec-secgw/sp6.c   | 539 +---
 10 files changed, 2383 insertions(+), 1319 deletions(-)
 create mode 100644 examples/ipsec-secgw/parser.c
 create mode 100644 examples/ipsec-secgw/parser.h

diff --git a/doc/guides/sample_app_ug/ipsec_secgw.rst 
b/doc/guides/sample_app_ug/ipsec_secgw.rst
index fcb33c2..5cce2fe 100644
--- a/doc/guides/sample_app_ug/ipsec_secgw.rst
+++ b/doc/guides/sample_app_ug/ipsec_secgw.rst
@@ -122,7 +122,7 @@ The application has a number of command line options::
 -p PORTMASK -P -u PORTMASK
 --config (port,queue,lcore)[,(port,queue,lcore]
 --single-sa SAIDX
-   --ep0|--ep1
+-f CONFIG_FILE_PATH

 Where:

@@ -142,14 +142,11 @@ Where:
 on both Inbound and Outbound. This option is meant for 
debugging/performance
 purposes.

-*   ``--ep0``: configure the app as Endpoint 0.
+*   ``-f CONFIG_FILE_PATH``: the full path of text-based file containing all
+configuration items for running the application (See Configuration file
+syntax section below). ``-f CONFIG_FILE_PATH`` **must** be specified.
+**ONLY** the UNIX format configuration file is accepted.

-*   ``--ep1``: configure the app as Endpoint 1.
-
-Either one of ``--ep0`` or ``--ep1`` **must** be specified.
-The main purpose of these options is to easily configure two systems
-back-to-back that would forward traffic through an IPsec tunnel (see
-:ref:`figure_ipsec_endpoints`).

 The mapping of lcores to port/queues is similar to other l3fwd applications.

@@ -157,7 +154,8 @@ For example, given the following command line::

 ./build/ipsec-secgw -l 20,21 -n 4 --socket-mem 0,2048   \
--vdev "cryptodev_null_pmd" -- -p 0xf -P -u 0x3  \
-   --config="(0,0,20),(1,0,20),(2,0,21),(3,0,21)" --ep0 \
+   --config="(0,0,20),(1,0,20),(2,0,21),(3,0,21)"   \
+   -f /path/to/config_file  \

 where each options means:

@@ -194,8 +192,12 @@ where each options means:
 |  |   |   |   
|
 
+--+---+---+---+

-*   The ``--ep0`` options configures the app with a given set of SP, SA and 
Routing
-entries as explained below in more detail.
+*   The ``-f /path/to/config_file`` option enables the application read and
+parse the configuration file specified, and configures the application
+with a given set of SP, SA and Routing entries accordingly. The syntax of
+the configuration file will be explained below in more detail. Please
+**note** the parser only accepts UNIX format text file. Other formats
+such as DOS/MAC format will cause a parse error.

 Refer to the *DPDK Getting Started Guide* for general information on running
 applications and the Environment Abstraction Layer (EAL) options.
@@ -219,496 +221,357 @@ For example, something like the following command line:
 --vdev "cryptodev_aesni_mb_pmd" --vdev "cryptodev_null_pmd" \
-- \
 -p 0xf -P -u 0x3 --config="(0,0,20),(1,0,20),(2,0,21),(3,0,21)" \
---ep0
+-f sample.cfg


 Configurations
 --

-The following sections provide some details on the default values used to
-initialize the SP, SA and Routing tables.
-Currently all configuration information is hard coded into the application.
+The following sections provide the syntax of configurations to initialize
+your SP, SA and Routing tables.
+Configurations shall be specified in the configuration file to be passed to
+the application. The file is then parsed by the application. The successful
+parsing will result in the appropriate rules being applied to the tables
+accordingly.

-The following image illustrate a few of the concepts regarding IPSec, such
-as protected/unprotected and inbound/outbound

[dpdk-dev] [PATCH v4 0/2] examples/ipsec_secgw: add configuration file support

2016-07-21 Thread Fan Zhang

This patchset adds the configuration file supported to ipsec_secgw
sample application. Two sample configuration files, ep0.cfg and ep1.cfg
are also added to show how to configure two systems back-to-back that 
would forward traffic through an IPsec tunnel

v4 change:
- rebase the patchset on top of 
  "examples/ipsec-secgw: fix GCC 4.5.x build error"
  (http://dpdk.org/dev/patchwork/patch/14895/)
- add cipher_key and auth_key configuration options to SA rules
- updated documentation for the new options

v3 change:
- fix 32-bit compilation error

v2 changes:
- fix configuration file parsing error.
- update doc to remove whitespace tailing errors.

Fan Zhang (2):
  examples/ipsec-secgw: add configuration file support
  examples/ipsec-secgw: add sample configuration files

 doc/guides/sample_app_ug/ipsec_secgw.rst | 845 +--
 examples/ipsec-secgw/Makefile|   1 +
 examples/ipsec-secgw/ep0.cfg | 160 ++
 examples/ipsec-secgw/ep1.cfg | 160 ++
 examples/ipsec-secgw/ipsec-secgw.c   |  58 ++-
 examples/ipsec-secgw/ipsec.h |  14 +-
 examples/ipsec-secgw/parser.c| 599 ++
 examples/ipsec-secgw/parser.h| 116 +
 examples/ipsec-secgw/rt.c| 255 --
 examples/ipsec-secgw/sa.c| 737 +--
 examples/ipsec-secgw/sp4.c   | 538 
 examples/ipsec-secgw/sp6.c   | 539 +---
 12 files changed, 2703 insertions(+), 1319 deletions(-)
 create mode 100644 examples/ipsec-secgw/ep0.cfg
 create mode 100644 examples/ipsec-secgw/ep1.cfg
 create mode 100644 examples/ipsec-secgw/parser.c
 create mode 100644 examples/ipsec-secgw/parser.h

-- 
2.5.5

[dpdk-dev] [PATCH] vhost: fix driver unregister for client mode

2016-07-21 Thread Ilya Maximets

Thanks. Fixed.

Best regards, Ilya Maximets.

On 21.07.2016 11:24, Yuanhan Liu wrote:
> On Wed, Jul 20, 2016 at 11:32:43AM +0300, Ilya Maximets wrote:
>> Currently while calling of 'rte_vhost_driver_unregister()' connection
>> to QEMU will not be closed. This leads to inability to register driver
>> again and reconnect to same virtual machine.
>>
>> This scenario is reproducible with OVS. While executing of the following
>> command vhost port will be re-created (will be executed
>> 'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
>> network will be broken and QEMU possibly will crash:
>>
>>  ovs-vsctl set Interface vhost1 ofport_request=15
>>
>> Fix this by closing all established connections on driver unregister and
>> removing of pending connections from reconnection list.
>>
>> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
>> Signed-off-by: Ilya Maximets 
> 
> Again, thanks for the fix.
> 
>> ---
>>
>> Patch prepared for master branch because dpdk-next-virtio doesn't contain
>> commit acbff5c67ea7 ("vhost: fix crash when exceeding file descriptors").
>> Porting to dpdk-next-virtio/master is trivial and may be performed on
>> demand.
> 
> Yeah, my bad, I haven't updated it after rc2, since Thomas no longer
> pull request from it.  Anyway, you just remind me that I should have
> done that.
> 
>>  /**
>>   * Unregister the specified vhost socket
>>   */
>> @@ -672,20 +700,34 @@ rte_vhost_driver_unregister(const char *path)
>>  {
>>  int i;
>>  int count;
>> +struct vhost_user_connection *conn;
>>  
>>  pthread_mutex_lock(_user.mutex);
>>  
>>  for (i = 0; i < vhost_user.vsocket_cnt; i++) {
>> -if (!strcmp(vhost_user.vsockets[i]->path, path)) {
>> -if (vhost_user.vsockets[i]->is_server) {
>> -fdset_del(_user.fdset,
>> -vhost_user.vsockets[i]->listenfd);
>> -close(vhost_user.vsockets[i]->listenfd);
>> +struct vhost_user_socket *vsocket = vhost_user.vsockets[i];
>> +
>> +if (!strcmp(vsocket->path, path)) {
>> +if (vsocket->is_server) {
>> +(void) fdset_del(_user.fdset,
>> + vsocket->listenfd);
> 
> I would think the (void) cast is not neceessary here.
> 
>> +close(vsocket->listenfd);
>>  unlink(path);
>> +} else if (vsocket->reconnect) {
>> +vhost_user_remove_reconnect(vsocket);
>> +}
>> +
>> +conn = fdset_del(_user.fdset, vsocket->connfd);
>> +if (conn) {
>> +RTE_LOG(INFO, VHOST_CONFIG, "free connfd = %d"
>> +"for device '%s'\n", vsocket->connfd, path);
> 
> We should try not to break a log message into several lines, which
> hurts "git grep".  Here, it could be:
> 
>   RTE_LOG(INFO, VHOST_CONFIG,
>   "free connfd = %d for device '%s'\n",
>   vsocket->connfd, path);
> 
> Besides the two minor nits,
> 
> Acked-by: Yuanhan Liu 
> 
>   --yliu
> 
>

[dpdk-dev] [PATCH v2] vhost: fix driver unregister for client mode

2016-07-21 Thread Ilya Maximets

Currently while calling of 'rte_vhost_driver_unregister()' connection
to QEMU will not be closed. This leads to inability to register driver
again and reconnect to same virtual machine.

This scenario is reproducible with OVS. While executing of the following
command vhost port will be re-created (will be executed
'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
network will be broken and QEMU possibly will crash:

ovs-vsctl set Interface vhost1 ofport_request=15

Fix this by closing all established connections on driver unregister and
removing of pending connections from reconnection list.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Acked-by: Yuanhan Liu 
Signed-off-by: Ilya Maximets 
---
 lib/librte_vhost/vhost_user/fd_man.c | 15 ++--
 lib/librte_vhost/vhost_user/fd_man.h |  2 +-
 lib/librte_vhost/vhost_user/vhost-net-user.c | 56 
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/fd_man.c 
b/lib/librte_vhost/vhost_user/fd_man.c
index c691339..2d3eeb7 100644
--- a/lib/librte_vhost/vhost_user/fd_man.c
+++ b/lib/librte_vhost/vhost_user/fd_man.c
@@ -132,8 +132,10 @@ fdset_init(struct fdset *pfdset)
if (pfdset == NULL)
return;

-   for (i = 0; i < MAX_FDS; i++)
+   for (i = 0; i < MAX_FDS; i++) {
pfdset->fd[i].fd = -1;
+   pfdset->fd[i].dat = NULL;
+   }
pfdset->num = 0;
 }

@@ -166,14 +168,16 @@ fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, fd_cb 
wcb, void *dat)

 /**
  *  Unregister the fd from the fdset.
+ *  Returns context of a given fd or NULL.
  */
-void
+void *
 fdset_del(struct fdset *pfdset, int fd)
 {
int i;
+   void *dat = NULL;

if (pfdset == NULL || fd == -1)
-   return;
+   return NULL;

do {
pthread_mutex_lock(>fd_mutex);
@@ -181,13 +185,17 @@ fdset_del(struct fdset *pfdset, int fd)
i = fdset_find_fd(pfdset, fd);
if (i != -1 && pfdset->fd[i].busy == 0) {
/* busy indicates r/wcb is executing! */
+   dat = pfdset->fd[i].dat;
pfdset->fd[i].fd = -1;
pfdset->fd[i].rcb = pfdset->fd[i].wcb = NULL;
+   pfdset->fd[i].dat = NULL;
pfdset->num--;
i = -1;
}
pthread_mutex_unlock(>fd_mutex);
} while (i != -1);
+
+   return dat;
 }

 /**
@@ -203,6 +211,7 @@ fdset_del_slot(struct fdset *pfdset, int index)

pfdset->fd[index].fd = -1;
pfdset->fd[index].rcb = pfdset->fd[index].wcb = NULL;
+   pfdset->fd[index].dat = NULL;
pfdset->num--;

pthread_mutex_unlock(>fd_mutex);
diff --git a/lib/librte_vhost/vhost_user/fd_man.h 
b/lib/librte_vhost/vhost_user/fd_man.h
index 74ecde2..bd66ed1 100644
--- a/lib/librte_vhost/vhost_user/fd_man.h
+++ b/lib/librte_vhost/vhost_user/fd_man.h
@@ -60,7 +60,7 @@ void fdset_init(struct fdset *pfdset);
 int fdset_add(struct fdset *pfdset, int fd,
fd_cb rcb, fd_cb wcb, void *dat);

-void fdset_del(struct fdset *pfdset, int fd);
+void *fdset_del(struct fdset *pfdset, int fd);

 void fdset_event_dispatch(struct fdset *pfdset);

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index f0ea3a2..8c6a096 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -60,6 +60,7 @@
 struct vhost_user_socket {
char *path;
int listenfd;
+   int connfd;
bool is_server;
bool reconnect;
 };
@@ -277,11 +278,13 @@ vhost_user_add_connection(int fd, struct 
vhost_user_socket *vsocket)

RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);

+   vsocket->connfd = fd;
conn->vsocket = vsocket;
conn->vid = vid;
ret = fdset_add(_user.fdset, fd, vhost_user_msg_handler,
NULL, conn);
if (ret < 0) {
+   vsocket->connfd = -1;
free(conn);
close(fd);
RTE_LOG(ERR, VHOST_CONFIG,
@@ -329,6 +332,7 @@ vhost_user_msg_handler(int connfd, void *dat, int *remove)
RTE_LOG(ERR, VHOST_CONFIG,
"vhost read incorrect message\n");

+   vsocket->connfd = -1;
close(connfd);
*remove = 1;
free(conn);
@@ -635,6 +639,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
goto out;
memset(vsocket, 0, sizeof(struct vhost_user_socket));
vsocket->path = strdup(path);
+   vsocket->connfd = -1;

if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
@@ -664,6 +669,29 @@ out:

[dpdk-dev] [PATCH] vhost: fix connect hang in client mode

2016-07-21 Thread Ilya Maximets

If something abnormal happened to QEMU, 'connect()' can block calling
thread (e.g. main thread of OVS) forever or for a really long time.
This can break whole application or block the reconnection thread.

Example with OVS:

ovs_rcu(urcu2)|WARN|blocked 512000 ms waiting for main to quiesce
(gdb) bt
#0  connect () from /lib64/libpthread.so.0
#1  vhost_user_create_client (vsocket=0xa816e0)
#2  rte_vhost_driver_register
#3  netdev_dpdk_vhost_user_construct
#4  netdev_open (name=0xa664b0 "vhost1")
[...]
#11 main

Fix that by setting non-blocking mode for client sockets for connection.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")

Signed-off-by: Ilya Maximets 
---

This was reproduced with current QEMU master branch
(commit 1ecfb24da987b862f) + patch-set "vhost-user reconnect fixes"
(https://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg01547.html).

OVS was patched to support client mode:
http://openvswitch.org/pipermail/dev/2016-July/074972.html

Following script forces QEMU to fail to initialize vhost because
disconnection occures while device not fully configured:

while true
do
ovs-vsctl set Interface vhost1 ofport_request=125
ovs-vsctl set Interface vhost1 ofport_request=126
done

As a result: QEMU still works, network interface broken and OVS main
 thread stalled inside 'connect()'.

 lib/librte_vhost/vhost_user/vhost-net-user.c | 77 ++--
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index f0f92f8..08a5f6b 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 #include 
@@ -449,6 +450,14 @@ create_unix_socket(const char *path, struct sockaddr_un 
*un, bool is_server)
RTE_LOG(INFO, VHOST_CONFIG, "vhost-user %s: socket created, fd: %d\n",
is_server ? "server" : "client", fd);

+   if (!is_server && fcntl(fd, F_SETFL, O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "vhost-user: can't set nonblocking mode for socket, fd: 
"
+   "%d (%s)\n", fd, strerror(errno));
+   close(fd);
+   return -1;
+   }
+
memset(un, 0, sizeof(*un));
un->sun_family = AF_UNIX;
strncpy(un->sun_path, path, sizeof(un->sun_path));
@@ -516,9 +525,58 @@ struct vhost_user_reconnect_list {
 static struct vhost_user_reconnect_list reconn_list;
 static pthread_t reconn_tid;

+static int
+vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
+{
+   int ret, flags, so_error;
+   socklen_t len = sizeof(so_error);
+   fd_set fdset;
+   /* 5 second timeout */
+   struct timeval tv = {.tv_sec = 5, .tv_usec = 0};
+
+   errno = EINVAL;
+
+   ret = connect(fd, un, sz);
+   if (ret == -1 && errno != EINPROGRESS)
+   return -1;
+   if (ret == 0)
+   goto connected;
+
+   FD_ZERO();
+   FD_SET(fd, );
+
+   ret = select(fd + 1, NULL, , NULL, );
+   if (!ret)
+   errno = ETIMEDOUT;
+   if (ret != 1)
+   return -1;
+
+   ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, _error, );
+   if (ret < 0 || so_error) {
+   if (!ret)
+   errno = so_error;
+   return -1;
+   }
+
+connected:
+   flags = fcntl(fd, F_GETFL, 0);
+   if (flags < 0) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't get flags for connfd %d\n", fd);
+   return -2;
+   }
+   if ((flags & O_NONBLOCK) && fcntl(fd, F_SETFL, flags & ~O_NONBLOCK)) {
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "can't disable nonblocking on fd %d\n", fd);
+   return -2;
+   }
+   return 0;
+}
+
 static void *
 vhost_user_client_reconnect(void *arg __rte_unused)
 {
+   int ret;
struct vhost_user_reconnect *reconn, *next;

while (1) {
@@ -532,13 +590,23 @@ vhost_user_client_reconnect(void *arg __rte_unused)
 reconn != NULL; reconn = next) {
next = TAILQ_NEXT(reconn, next);

-   if (connect(reconn->fd, (struct sockaddr *)>un,
-   sizeof(reconn->un)) < 0)
+   ret = vhost_user_connect_nonblock(reconn->fd,
+   (struct sockaddr *)>un,
+   sizeof(reconn->un));
+   if (ret == -2) {
+   close(reconn->fd);
+   RTE_LOG(ERR, VHOST_CONFIG,
+   "reconnection for fd %d failed\n",
+

[dpdk-dev] [PATCH] doc: deprecate vhost-cuse

2016-07-21 Thread Loftus, Ciara

> Subject: [dpdk-dev] [PATCH] doc: deprecate vhost-cuse
> 
> Vhost-cuse was invented before vhost-user exist. The both are actually
> doing the same thing: a vhost-net implementation in user space. But they
> are not exactly the same thing.
> 
> Firstly, vhost-cuse is harder for use; no one seems to care it, either.
> Furthermore, since v2.1, a large majority of development effort has gone
> to vhost-user. For example, we extended the vhost-user spec to add the
> multiple queue support. We also added the vhost-user live migration at
> v16.04 and the latest one, vhost-user reconnect that allows vhost app
> restart without restarting the guest. Both of them are very important
> features for product usage and none of them works for vhost-cuse.
> 
> You now see that the difference between vhost-user and vhost-cuse is
> big (and will be bigger and bigger as time moves forward), that you
> should never use vhost-cuse, that we should drop it completely.
> 
> The remove would also result to a much cleaner code base, allowing us
> to do all kinds of extending easier.
> 
> So here to mark vhost-cuse as deprecated in this release and will be
> removed in the next release (v16.11).
> 
> Signed-off-by: Yuanhan Liu 
> ---
>  doc/guides/rel_notes/deprecation.rst | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index f502f86..ee99558 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -41,3 +41,7 @@ Deprecation Notices
>  * The mempool functions for single/multi producer/consumer are
> deprecated and
>will be removed in 16.11.
>It is replaced by rte_mempool_generic_get/put functions.
> +
> +* The vhost-cuse will be removed in 16.11. Since v2.1, a large majority of
> +  development effort has gone to vhost-user, such as multiple-queue, live
> +  migration, reconnect etc. Therefore, vhost-user should be used instead.
> --
> 1.9.0

Acked-by: Ciara Loftus

[dpdk-dev] [PATCH] app/test: disable filtering with stripped binary

2016-07-21 Thread Olivier Matz



On 07/19/2016 06:53 PM, Thomas Monjalon wrote:
> The unavailable tests are filtered out by autotest by looking for
> the symbols in the binary:
> 
> PCI autotest:  Skipped [Not Available]   [00m 00s]
> Malloc autotest:   Success   [00m 00s]
> 
> It results to skip everything if the binary has no symbol (stripped):
> 
> PCI autotest:  Skipped [Not Available]   [00m 00s]
> Malloc autotest:   Skipped [Not Available]   [00m 00s]
> 
> This case is handled by getting back to the old behaviour if the binary
> has no symbol information:
> 
> PCI autotest:  Fail [Not found]  [00m 00s]
> Malloc autotest:   Success   [00m 00s]
> 
> Fixes: d553c8f2b1a2 ("app/test: filter out unavailable tests")
> 
> Signed-off-by: Thomas Monjalon 

Tested-by: Olivier Matz

1 2 >

1 - 100 of 129 matches

Mail list logo