date:20160720

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Wiles, Keith

> On Jul 20, 2016, at 3:16 PM, Neil Horman  wrote:
> 
> On Wed, Jul 20, 2016 at 07:47:32PM +, Wiles, Keith wrote:
>> 
>>> On Jul 20, 2016, at 12:48 PM, Neil Horman  wrote:
>>> 
>>> On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
 2016-07-20 13:09, Neil Horman:
> From: Neil Horman 
> 
> John Mcnamara and I were discussing enhacing the validate_abi script to 
> build
> the dpdk tree faster with multiple jobs.  Theres no reason not to do it, 
> so this
> implements that requirement.  It uses a MAKE_JOBS variable that can be 
> set by
> the user to limit the job count.  By default the job count is set to the 
> number
> of online cpus.

 Please could you use the variable name DPDK_MAKE_JOBS?
 This name is already used in scripts/test-build.sh.

>>> Sure
>>> 
> +if [ -z "$MAKE_JOBS" ]
> +then
> + # This counts the number of cpus on the system
> + MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> +fi

 Is lscpu common enough?

>>> I'm not sure how to answer that.  lscpu is part of the util-linux package, 
>>> which
>>> is part of any base install.  Theres a variant for BSD, but I'm not sure how
>>> common it is there.
>>> Neil
>>> 
 Another acceptable default would be just "-j" without any number.
 It would make the number of jobs unlimited.
>> 
>> I think the best is just use -j as it tries to use the correct number of 
>> jobs based on the number of cores, right?
>> 
> -j with no argument (or -j 0), is sort of, maybe what you want.  With either 
> of
> those options, make will just issue jobs as fast as it processes dependencies.
> Dependent on how parallel the build is, that can lead to tons of waiting 
> process
> (i.e. more than your number of online cpus), which can actually hurt your 
> build
> time.

I read the manual and looked at the code, which supports your statement. (I 
think I had some statement on stack overflow and the last time I believe 
anything on the internet :-) I have not seen a lot of differences in compile 
times with -j on my system. Mostly I suspect it is the number of paths in the 
dependency, cores and memory on the system.

I have 72 lcores or 2 sockets, 18 cores per socket. Xeon 2.3Ghz cores.

$ export RTE_TARGET=x86_64-native-linuxapp-gcc 

$ time make install T=${RTE_TARGET}
real0m59.445s user  0m27.344s sys   0m7.040s

$ time make install T=${RTE_TARGET} -j
real0m26.584s user  0m14.380s sys   0m5.120s

# Remove the x86_64-native-linuxapp-gcc

$ time make install T=${RTE_TARGET} -j 72
real0m23.454s user  0m10.832s sys   0m4.664s

$ time make install T=${RTE_TARGET} -j 8
real0m23.812s user  0m10.672s sys   0m4.276s

cd x86_64-native-linuxapp-gcc
$ make clean
$ time make
real0m28.539s user  0m9.820s sys0m3.620s

# Do a make clean between each build.

$ time make -j
real0m7.217s user   0m6.532s sys0m2.332s

$ time make -j 8
real0m8.256s user   0m6.472s sys0m2.456s

$ time make -j 72
real0m6.866s user   0m6.184s sys0m2.216s

Just the real time numbers in the following table.

processes real Time   depdirs
 no -j 59.4sYes
   -j 8 23.8sYes
  -j 7223.5sYes
-j   26.5sYes

 no -j 28.5s No
   -j 8   8.2s No
  -j 72  6.8s No
-j 7.2s No

Looks like the depdirs build time on my system:
$ make clean -j
$ rm .depdirs
$ time make -j
real0m23.734s user  0m11.228s sys   0m4.844s

About 16 seconds, which is not a lot of savings. Now the difference from no -j 
to -j is a lot, but the difference between -j and -j  is not a huge 
saving. This leads me back to over engineering the problem when ?-j? would work 
just as well here.

Even on my MacBook Pro i7 system the difference is not that much 1m8s without 
depdirs build for -j in a VirtualBox with all 4 cores 8G RAM. Compared to 1m13s 
with -j 4 option.

I just wonder if it makes a lot of sense to use cpuinfo in this given case if 
it turns out to be -j works with the 80% rule?

On some other project with a lot more files like the FreeBSD or Linux distro, 
yes it would make a fair amount of real time difference.

Keith

> 
> While its fine in los of cases, its not always fine, and with this
> implementation you can still opt in to that behavior by setting 
> DPDK_MAKE_JOBS=0
> 
> Neil
> 
>>

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Thomas Monjalon

2016-07-20 19:47, Wiles, Keith:
> On Jul 20, 2016, at 12:48 PM, Neil Horman  wrote:
> > On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
> >> 2016-07-20 13:09, Neil Horman:
> >>> From: Neil Horman 
> >>> +if [ -z "$MAKE_JOBS" ]
> >>> +then
> >>> + # This counts the number of cpus on the system
> >>> + MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> >>> +fi
> >> 
> >> Is lscpu common enough?
> >> 
> > I'm not sure how to answer that.  lscpu is part of the util-linux package, 
> > which
> > is part of any base install.  Theres a variant for BSD, but I'm not sure how
> > common it is there.
> > Neil
> > 
> >> Another acceptable default would be just "-j" without any number.
> >> It would make the number of jobs unlimited.
> 
> I think the best is just use -j as it tries to use the correct number of jobs 
> based on the number of cores, right?

No Keith, -j alone use as much jobs as it can create, i.e. much more than
the number of CPUs.
I have no measure but I remember it is less efficient than giving a number
based on available CPUs (with a multiply factor to avoid idling between jobs).
For a default value, both approaches are fine.

[dpdk-dev] [PATCH] vhost: fix driver unregister for client mode

2016-07-20 Thread Yuanhan Liu

On Wed, Jul 20, 2016 at 11:32:43AM +0300, Ilya Maximets wrote:
> Currently while calling of 'rte_vhost_driver_unregister()' connection
> to QEMU will not be closed. This leads to inability to register driver
> again and reconnect to same virtual machine.
> 
> This scenario is reproducible with OVS. While executing of the following
> command vhost port will be re-created (will be executed
> 'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
> network will be broken and QEMU possibly will crash:
> 
>   ovs-vsctl set Interface vhost1 ofport_request=15
> 
> Fix this by closing all established connections on driver unregister and
> removing of pending connections from reconnection list.
> 
> Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
> Signed-off-by: Ilya Maximets 

Ilya, Thanks a lot for the test and fix. It looks good to me. But
somehow I would like have a test tomorrow.

--yliu

[dpdk-dev] [PATCH] timer: fix break list when timer_cb reset running timer

2016-07-20 Thread Sanford, Robert

Hi Hiroyuki,

I am reviewing your 3 timer patches.
Can you please explain in more detail your use-case that results in a
problem?

For example, is it when timer A's callback tries to reset
(rte_timer_reset) timer B?
If yes, is timer B in PENDING state and likely to expire soon?

--
Thanks,
Robert


On 7/17/16 2:08 PM, "Hiroyuki Mikita"  wrote:

>When timer_cb resets another running timer on the same lcore,
>the list of expired timers is chained to the pending-list.
>This commit prevents a running timer from being reset
>by not its own timer_cb.
>
>Signed-off-by: Hiroyuki Mikita 
>---
> lib/librte_timer/rte_timer.c | 12 ++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
>diff --git a/lib/librte_timer/rte_timer.c b/lib/librte_timer/rte_timer.c
>index 3dcdab5..00e80c9 100644
>--- a/lib/librte_timer/rte_timer.c
>+++ b/lib/librte_timer/rte_timer.c
>@@ -69,6 +69,9 @@ struct priv_timer {
> 
>   unsigned prev_lcore;  /**< used for lcore round robin */
> 
>+  /** running timer on this lcore now */
>+  struct rte_timer *running_tim;
>+
> #ifdef RTE_LIBRTE_TIMER_DEBUG
>   /** per-lcore statistics */
>   struct rte_timer_debug_stats stats;
>@@ -135,9 +138,12 @@ timer_set_config_state(struct rte_timer *tim,
>   while (success == 0) {
>   prev_status.u32 = tim->status.u32;
> 
>-  /* timer is running on another core, exit */
>+  /* timer is running on another core
>+   * or ready to run on local core, exit
>+   */
>   if (prev_status.state == RTE_TIMER_RUNNING &&
>-  prev_status.owner != (uint16_t)lcore_id)
>+  (prev_status.owner != (uint16_t)lcore_id ||
>+   tim != priv_timer[lcore_id].running_tim))
>   return -1;
> 
>   /* timer is being configured on another core */
>@@ -580,6 +586,7 @@ void rte_timer_manage(void)
>   for (tim = run_first_tim; tim != NULL; tim = next_tim) {
>   next_tim = tim->sl_next[0];
>   priv_timer[lcore_id].updated = 0;
>+  priv_timer[lcore_id].running_tim = tim;
> 
>   /* execute callback function with list unlocked */
>   tim->f(tim, tim->arg);
>@@ -610,6 +617,7 @@ void rte_timer_manage(void)
>   rte_spinlock_unlock(&priv_timer[lcore_id].list_lock);
>   }
>   }
>+  priv_timer[lcore_id].running_tim = NULL;
> }
> 
> /* dump statistics about timers */
>-- 
>2.7.4
>

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Wiles, Keith


> On Jul 20, 2016, at 12:48 PM, Neil Horman  wrote:
> 
> On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
>> 2016-07-20 13:09, Neil Horman:
>>> From: Neil Horman 
>>> 
>>> John Mcnamara and I were discussing enhacing the validate_abi script to 
>>> build
>>> the dpdk tree faster with multiple jobs.  Theres no reason not to do it, so 
>>> this
>>> implements that requirement.  It uses a MAKE_JOBS variable that can be set 
>>> by
>>> the user to limit the job count.  By default the job count is set to the 
>>> number
>>> of online cpus.
>> 
>> Please could you use the variable name DPDK_MAKE_JOBS?
>> This name is already used in scripts/test-build.sh.
>> 
> Sure
> 
>>> +if [ -z "$MAKE_JOBS" ]
>>> +then
>>> +   # This counts the number of cpus on the system
>>> +   MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
>>> +fi
>> 
>> Is lscpu common enough?
>> 
> I'm not sure how to answer that.  lscpu is part of the util-linux package, 
> which
> is part of any base install.  Theres a variant for BSD, but I'm not sure how
> common it is there.
> Neil
> 
>> Another acceptable default would be just "-j" without any number.
>> It would make the number of jobs unlimited.

I think the best is just use -j as it tries to use the correct number of jobs 
based on the number of cores, right?

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Thomas Monjalon

2016-07-20 13:09, Neil Horman:
> From: Neil Horman 
> 
> John Mcnamara and I were discussing enhacing the validate_abi script to build
> the dpdk tree faster with multiple jobs.  Theres no reason not to do it, so 
> this
> implements that requirement.  It uses a MAKE_JOBS variable that can be set by
> the user to limit the job count.  By default the job count is set to the 
> number
> of online cpus.

Please could you use the variable name DPDK_MAKE_JOBS?
This name is already used in scripts/test-build.sh.

> +if [ -z "$MAKE_JOBS" ]
> +then
> + # This counts the number of cpus on the system
> + MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> +fi

Is lscpu common enough?

Another acceptable default would be just "-j" without any number.
It would make the number of jobs unlimited.

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-20 Thread Adrien Mazarguil

Hi Sugesh,

Please see below.

On Wed, Jul 20, 2016 at 04:32:50PM +, Chandran, Sugesh wrote:
[...]
> > > How about a hardware flow flag in packet descriptor that set when the
> > > packets hits any hardware rule. This way software doesn?t worry
> > > /blocked by a hardware rule . Even though there is an additional
> > > overhead of validating this flag, software datapath can identify the
> > hardware processed packets easily.
> > > This way the packets traverses the software fallback path until the
> > > rule configuration is complete. This flag avoids setting ID action for 
> > > every
> > hardware flow that are configuring.
> > 
> > That makes sense. I see it as a sort of single bit ID but it could be
> > implemented through a different action for less capable devices. PMDs that
> > support 32 bit IDs could reuse the same code for both features.
> > 
> > I understand you'd prefer having this feature always present, however we
> > already know that not all PMDs/devices support it, and like everything else
> > this is a kind of offload that needs to be explicitly requested by the
> > application as it may not be needed.
> > 
> > If we go with the separate action, then perhaps it would make sense to
> > rename "ID" to "MARK" to make things clearer:
> > 
> >  RTE_FLOW_ACTION_TYPE_FLAG /* Flag packets processed by flow rule. */
> > 
> >  RTE_FLOW_ACTION_TYPE_MARK /* Attach a 32 bit value to a packet. */
> > 
> > I guess the result of the FLAG action would be something in ol_flag.
> > 
> [Sugesh] This looks fine for me.

Great, I will update the specification accordingly.

> > Thoughts?
> > 
> [Sugesh] Two more queries that I missed out in the earlier comments are,
> Support for PTYPE :- Intel NICs can report packet type in mbuf.
> This can be used by software for the packet processing. Is generic API
> capable of handling that as well? 

Yes, however no PTYPE action has been defined for this (yet). It is only a
matter of adding one.

Currently packet type recognition is enabled per port using a separate API,
so correct me if I'm wrong but I am not aware of any adapter with the
ability to enable it per flow rule, so I do not think such an action needs
to be defined from the start. We may add it later.

> RSS hashing support :- Just to confirm, the RSS flow action allows application
> to decide the header fields to produce the hash. This gives
> programmability on load sharing across different queues. The
> application can program the NIC to calculate the RSS hash only using mac or 
> mac+ ip or 
> ip only using this.

I'd say yes but from your summary, I'm not sure we share the same idea of
what the RSS action is supposed to do, so here is mine.

Like all flow rules, the pattern part of the RSS action only filters the
packets on which the action will be performed.

The rss_conf parameter (struct rte_eth_rss_conf) only provides a key and a
RSS hash function to use (ETH_RSS_IPV4, ETH_RSS_NONFRAG_IPV6_UDP, etc).

Nothing prevents the RSS hash function from being applied to protocol
headers which are not necessarily present in the flow rule pattern. These
are two independent things, e.g. you could have a pattern matching IPv4
packets yet perform RSS hashing only on UDP headers.

Finally, the RSS action configuration only affects packets coming from this
flow rule. It is not performed on the device globally so packets which are
not matched are not affected by RSS processing. As a result it might not be
possible to configure two flow rules specifying incompatible RSS actions
simultaneously if the underlying device supports only a single global RSS
context.

Are we on the same page?

-- 
Adrien Mazarguil
6WIND

[dpdk-dev] DPDK on Xen maintenance

2016-07-20 Thread Thomas Monjalon

2016-07-13 10:30, Jan Blunck:
> On Di, 2016-07-12 at 11:34 +0200, Thomas Monjalon wrote:
> > Hi all,
> > 
> > We are facing some issues with Xen dom0.
> > Some were fixed in RC2:
> > http://dpdk.org/ml/archives/dev/2016-July/043760.html
> > and there still have some other issues.
> > 
> > It seems Xen is becoming less attractive:
> > - we do not have a lot of test reports or feedbacks
> > - there is no maintainer for DPDK on Xen
> > - we are still waiting for the Xen netfront PMD
> 
> Although progress is slow I'm still working on upstreaming the Xen
> netfront PMD. I will continue to maintain it afterwards of course.

OK thanks for the update.
So you are working on using DPDK in Domu.

It seems that nobody is interested in using DPDK in Dom0.
There is still a bug in 16.07:
http://dpdk.org/ml/archives/dev/2016-July/044207.html

> > I wonder wether it still makes sense to maintain this code or not?
> > In case of positive reply, it would be nice to have the name of
> > someone responsible for this code, i.e. a maintainer.

Given that we have no feedback on Xen Dom0, no maintainer, some bugs,
and it makes memory allocation improvements harder,
it sounds reasonnable to drop this code.

[dpdk-dev] [PATCH] doc: announce ivshmem support removal

2016-07-20 Thread Thomas Monjalon

There was a prior call with an explanation of what needs to be done:
http://dpdk.org/ml/archives/dev/2016-June/040844.html
- Qemu patch upstreamed
- IVSHMEM PCI device managed by a PCI driver
- No DPDK objects (ring/mempool) allocated by EAL

As nobody seems interested, it is time to remove this code which
makes EAL improvements harder.

Signed-off-by: Thomas Monjalon 
---
 doc/guides/rel_notes/deprecation.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index 9cadf6a..1ef8460 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -42,6 +42,9 @@ Deprecation Notices
   will be removed in 16.11.
   It is replaced by rte_mempool_generic_get/put functions.

+* The ``rte_ivshmem`` feature (including library and EAL code) will be removed
+  in 16.11 because it has some design issues which are not planned to be fixed.
+
 * The ethtool support will be removed from KNI in 16.11.
   It is implemented only for igb and ixgbe.
   It is really hard to maintain because it requires some out-of-tree kernel
-- 
2.7.0

[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-20 Thread Zoltan Kiss



On 20/07/16 14:37, Olivier Matz wrote:
> Hi,
>
> On 07/20/2016 02:41 PM, Zoltan Kiss wrote:
>>
>>
>> On 19/07/16 17:17, Olivier Matz wrote:
>>> Hi Zoltan,
>>>
>>> On 07/19/2016 05:59 PM, Zoltan Kiss wrote:


 On 19/07/16 16:37, Olivier Matz wrote:
> Hi Zoltan,
>
> On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
>> A recent fix brought up an issue about the size of the 'name' fields:
>>
>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>
>> These relations should be observed:
>>
>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE -
>> strlen(RTE_MEMPOOL_MZ_PREFIX)
>>
>> Setting all of them to 32 hides this restriction from the application.
>> This patch increases the memzone string size to accomodate for these
>> prefixes, and the same happens with the ring name string. The ABI
>> needs to
>> be broken to fix this API issue, this way doesn't break applications
>> previously not failing due to the truncating bug now fixed.
>>
>> Signed-off-by: Zoltan Kiss 
>
> I agree it is a problem for an application because it cannot know what
> is the maximum name length. On the other hand, breaking the ABI for
> this
> looks a bit overkill. Maybe we could reduce RTE_MEMPOOL_NAMESIZE and
> RTE_RING_NAMESIZE instead of increasing RTE_MEMZONE_NAMESIZE? That way,
> we could keep the ABI as is.

 But that would break the ABI too, wouldn't it? Unless you keep the array
 the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
>>>
>>> Yes, that was the idea.
>>>
 And even then, the API breaks anyway. There are applications - I have at
 least some - which use all 32 bytes to store the name. Decrease that
 would cause headache to change the naming scheme, because it's a 30
 character long id, and chopping the last few chars would cause name
 collisions and annoying bugs.
>>>
>>> Before my patch (85cf0079), long names were silently truncated when
>>> mempool created its ring and/or memzones. Now, it returns an error.
>>
>> With 16.04 an application could operate as expected if the first 26
>> character were unique. Your patch revealed the problem that characters
>> after these were left out of the name. Now applications fail where this
>> never been a bug because their naming scheme guarantees the uniqueness
>> on the first 26 chars (or makes it very unlikely)
>> Where the first 26 is not unique, it failed earlier too, because at
>> memzone creation it checks for duplicate names.
>
> Yes, I understand that there is a behavior change for applications using
> names larger than 26 between 16.04 and 16.07. I also understand that
> there is no way for an application to know what is the maximum usable
> size for a mempool or a ring.
>
>
>>> I'm not getting why changing the struct to something like below would
>>> break the API, since it would already return an error today.
>>>
>>>#define RTE_MEMPOOL_NAMESIZE \
>>
>> Wait, this would mean applications need to recompile to use the smaller
>> value. AFAIK that's an ABI break too, right? At the moment I don't see a
>> way to fix this without breaking the ABI
>
> With this modification, if you don't recompile the application, its
> behavior will still be the same as today -> it will return ENAMETOOLONG.
> If you recompile it, the application will be aware of the maximum
> length. To me, it seems to be a acceptable compromise for this release.
>
> The patch you're proposing also changes the ABI of librte_ring and
> librte_eal, which cannot be accepted for the release.

Ok, I've sent a new version with this approach.

>
>
>>
>>>(RTE_MEMZONE_NAMESIZE - sizeof(pool_prefix) - sizeof(ring prefix))
>>>struct rte_mempool {
>>>union {
>>>  char name[RTE_MEMPOOL_NAMESIZE];
>>>  char pad[32];
>>>};
>>>...
>>>}
>>>
>>> Anyway, it may not be the proper solution since it supposes that a
>>> mempool includes a ring based on a memzone, which is not always true now
>>> with mempool handlers.
>>
>> Oh, as we dug deeper it gets better!
>> Indeed, we don't necessarily have this ring + memzone pair underneath,
>> but the user is not aware of that, and I think we should keep it that
>> way. It should only care that the string passed shouldn't be bigger than
>> a certain amount.
>
> Yes. What I'm just saying here is that it's not a good solution to write
> in the #define that "a mempool is based on a ring + a memzone", because
> if some someone adds a new mempool handler replacing the ring, and also
> creating a memzone prefixed by something larger than "rg_", we will have
> to break the ABI again.

If someone adds a new handler, (s)he needs to keep in mind what's the 
max size for pool name, and any derived object using that name as a base 
should check if it fits.

>
>
>> Also, even though we don't necessarily have the

[dpdk-dev] [PATCH] memzone: allow full length name

2016-07-20 Thread Zoltan Kiss

(strlen(name) == sizeof(mz->name) - 1) is a valid case, change the
condition to reflect that.
Move it earlier to avoid lookup with invalid name.
Change errno to ENAMETOOLONG.

Fixes: 85cf0079 ("mem: avoid memzone/mempool/ring name truncation")

Signed-off-by: Zoltan Kiss 
---
 lib/librte_eal/common/eal_common_memzone.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memzone.c 
b/lib/librte_eal/common/eal_common_memzone.c
index 5d28341..1bd0a33 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -144,6 +144,13 @@ memzone_reserve_aligned_thread_unsafe(const char *name, 
size_t len,
return NULL;
}

+   if (strlen(name) > sizeof(mz->name) - 1) {
+   RTE_LOG(DEBUG, EAL, "%s(): memzone <%s>: name too long\n",
+   __func__, name);
+   rte_errno = ENAMETOOLONG;
+   return NULL;
+   }
+
/* zone already exist */
if ((memzone_lookup_thread_unsafe(name)) != NULL) {
RTE_LOG(DEBUG, EAL, "%s(): memzone <%s> already exists\n",
@@ -152,13 +159,6 @@ memzone_reserve_aligned_thread_unsafe(const char *name, 
size_t len,
return NULL;
}

-   if (strlen(name) >= sizeof(mz->name) - 1) {
-   RTE_LOG(DEBUG, EAL, "%s(): memzone <%s>: name too long\n",
-   __func__, name);
-   rte_errno = EEXIST;
-   return NULL;
-   }
-
/* if alignment is not a power of two */
if (align && !rte_is_power_of_2(align)) {
RTE_LOG(ERR, EAL, "%s(): Invalid alignment: %u\n", __func__,
-- 
1.9.1

[dpdk-dev] [PATCH v2] mempool: adjust name string size in related data types

2016-07-20 Thread Zoltan Kiss

A recent patch brought up an issue about the size of the 'name' fields:

85cf0079 mem: avoid memzone/mempool/ring name truncation

These relations should be observed:

1. Each ring creates a memzone with a prefixed name:
RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)

2. There are some mempool handlers which create a ring with a prefixed
name:
RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)

3. A mempool can create up to RTE_MAX_MEMZONE pre and postfixed memzones:
sprintf(postfix, "_%d", RTE_MAX_MEMZONE)
RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen(postfix)

Setting all of them to 32 hides this restriction from the application.
This patch decreases the mempool and ring string size to accommodate for
these prefixes, but it doesn't apply the 3rd constraint. Applications
relying on these constants need to be recompiled, otherwise they'll run
into ENAMETOOLONG issues.
The size of the arrays are kept 32 for ABI compatibility, it can be
decreased next time the ABI changes.

Signed-off-by: Zoltan Kiss 
---

Notes:
v2: keep arrays 32 bytes and decrease the max sizes to maintain ABI
compatibility

 lib/librte_mempool/rte_mempool.h | 11 +--
 lib/librte_ring/rte_ring.h   | 12 ++--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 4a8fbb1..059ad9e 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -123,7 +123,9 @@ struct rte_mempool_objsz {
/**< Total size of an object (header + elt + trailer). */
 };

-#define RTE_MEMPOOL_NAMESIZE 32 /**< Maximum length of a memory pool. */
+/**< Maximum length of a memory pool's name. */
+#define RTE_MEMPOOL_NAMESIZE (RTE_RING_NAMESIZE - \
+ sizeof(RTE_MEMPOOL_MZ_PREFIX) + 1)
 #define RTE_MEMPOOL_MZ_PREFIX "MP_"

 /* "MP_" */
@@ -208,7 +210,12 @@ struct rte_mempool_memhdr {
  * The RTE mempool structure.
  */
 struct rte_mempool {
-   char name[RTE_MEMPOOL_NAMESIZE]; /**< Name of mempool. */
+   /*
+* Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
+* compatibility requirements, it could be changed to
+* RTE_MEMPOOL_NAMESIZE next time the ABI changes
+*/
+   char name[RTE_MEMZONE_NAMESIZE]; /**< Name of mempool. */
union {
void *pool_data; /**< Ring or pool to store objects. */
uint64_t pool_id;/**< External mempool identifier. */
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index eb45e41..0e22e69 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -100,6 +100,7 @@ extern "C" {
 #include 
 #include 
 #include 
+#include 

 #define RTE_TAILQ_RING_NAME "RTE_RING"

@@ -126,8 +127,10 @@ struct rte_ring_debug_stats {
 } __rte_cache_aligned;
 #endif

-#define RTE_RING_NAMESIZE 32 /**< The maximum length of a ring name. */
 #define RTE_RING_MZ_PREFIX "RG_"
+/**< The maximum length of a ring name. */
+#define RTE_RING_NAMESIZE (RTE_MEMZONE_NAMESIZE - \
+  sizeof(RTE_RING_MZ_PREFIX) + 1)

 #ifndef RTE_RING_PAUSE_REP_COUNT
 #define RTE_RING_PAUSE_REP_COUNT 0 /**< Yield after pause num of times, no 
yield
@@ -147,7 +150,12 @@ struct rte_memzone; /* forward declaration, so as not to 
require memzone.h */
  * a problem.
  */
 struct rte_ring {
-   char name[RTE_RING_NAMESIZE];/**< Name of the ring. */
+   /*
+* Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
+* compatibility requirements, it could be changed to RTE_RING_NAMESIZE
+* next time the ABI changes
+*/
+   char name[RTE_MEMZONE_NAMESIZE];/**< Name of the ring. */
int flags;   /**< Flags supplied at creation. */
const struct rte_memzone *memzone;
/**< Memzone, if any, containing the rte_ring */
-- 
1.9.1

[dpdk-dev] [PATCH v2] mempool: fix lack of free registration

2016-07-20 Thread Zoltan Kiss

The new mempool handler interface forgets to register the free() function
of the ops. Introduced in this patch:

Fixes: 449c49b93a6b ("mempool: support handler operations")

Signed-off-by: Zoltan Kiss 
Acked-by: Olivier Matz 
---

Notes:
v2: fix commit message

 lib/librte_mempool/rte_mempool_ops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_mempool/rte_mempool_ops.c 
b/lib/librte_mempool/rte_mempool_ops.c
index fd0b64c..5f24de2 100644
--- a/lib/librte_mempool/rte_mempool_ops.c
+++ b/lib/librte_mempool/rte_mempool_ops.c
@@ -81,6 +81,7 @@ rte_mempool_register_ops(const struct rte_mempool_ops *h)
ops = &rte_mempool_ops_table.ops[ops_index];
snprintf(ops->name, sizeof(ops->name), "%s", h->name);
ops->alloc = h->alloc;
+   ops->free = h->free;
ops->enqueue = h->enqueue;
ops->dequeue = h->dequeue;
ops->get_count = h->get_count;
-- 
1.9.1

[dpdk-dev] [PATCH v2] net/i40e: remove weak symbols

2016-07-20 Thread Zoltan Kiss

Using weak symbols have a few issues with static linking:

- normally the linker searches the .o files already linked, if your weak
  one is there, it won't check if there is a strong version
- unless the symbol is directly referred, but it's not the right thing to
  rely on
- or --whole-archive specified in the command line, which pulls in a lot
  of unwanted stuff
- whole-archive also makes libtool dropping the library from the command
  line, which is what should happen with dynamic linking, but not with
  static (observed on Ubuntu 14.04). This is an important bug if you
  build a static library depending on DPDK

This patch merges the two version and make the behaviour rely on the
config.

If we can agree in the concept, I can send a series to fix this for the
other weak functions.

Signed-off-by: Zoltan Kiss 
---

Notes:
v2: fix commit message

 drivers/net/i40e/i40e_rxtx.c | 36 +++-
 drivers/net/i40e/i40e_rxtx_vec.c | 36 
 2 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index d3cfb98..ad34d3a 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3278,10 +3278,44 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 }

 /* Stubs needed for linkage when CONFIG_RTE_I40E_INC_VECTOR is set to 'n' */
-int __attribute__((weak))
+int __attribute__((cold))
 i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev __rte_unused *dev)
 {
+#ifndef RTE_LIBRTE_I40E_INC_VECTOR
return -1;
+#else
+#ifndef RTE_LIBRTE_IEEE1588
+   struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
+   struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
+
+   /* need SSE4.1 support */
+   if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+   return -1;
+
+#ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
+   /* whithout rx ol_flags, no VP flag report */
+   if (rxmode->hw_vlan_strip != 0 ||
+   rxmode->hw_vlan_extend != 0)
+   return -1;
+#endif /* RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE */
+
+   /* no fdir support */
+   if (fconf->mode != RTE_FDIR_MODE_NONE)
+   return -1;
+
+/* - no csum error report support
+* - no header split support
+*/
+   if (rxmode->hw_ip_checksum == 1 ||
+   rxmode->header_split == 1)
+   return -1;
+
+   return 0;
+#else
+   RTE_SET_USED(dev);
+   return -1;
+#endif /* RTE_LIBRTE_IEEE1588 */
+#endif /* RTE_LIBRTE_I40E_INC_VECTOR */
 }

 uint16_t __attribute__((weak))
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 05cb415..983b2c0 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -723,39 +723,3 @@ i40e_txq_vec_setup(struct i40e_tx_queue __rte_unused *txq)
 {
return 0;
 }
-
-int __attribute__((cold))
-i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
-{
-#ifndef RTE_LIBRTE_IEEE1588
-   struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
-   struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
-
-   /* need SSE4.1 support */
-   if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-   return -1;
-
-#ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
-   /* whithout rx ol_flags, no VP flag report */
-   if (rxmode->hw_vlan_strip != 0 ||
-   rxmode->hw_vlan_extend != 0)
-   return -1;
-#endif
-
-   /* no fdir support */
-   if (fconf->mode != RTE_FDIR_MODE_NONE)
-   return -1;
-
-/* - no csum error report support
-* - no header split support
-*/
-   if (rxmode->hw_ip_checksum == 1 ||
-   rxmode->header_split == 1)
-   return -1;
-
-   return 0;
-#else
-   RTE_SET_USED(dev);
-   return -1;
-#endif
-}
-- 
1.9.1

[dpdk-dev] [PATCH] doc: announce KNI ethtool removal

2016-07-20 Thread Thomas Monjalon

The out-of-tree kernel code must be avoided.
Moreover there is no good reason to keep this legacy feature
which is only partially supported.

As described earlier in this plan:
http://dpdk.org/ml/archives/dev/2016-July/043606.html
it will help to keep PCI ids in PMD code.

Signed-off-by: Thomas Monjalon 
---
 doc/guides/rel_notes/deprecation.rst | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index f502f86..9cadf6a 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -41,3 +41,10 @@ Deprecation Notices
 * The mempool functions for single/multi producer/consumer are deprecated and
   will be removed in 16.11.
   It is replaced by rte_mempool_generic_get/put functions.
+
+* The ethtool support will be removed from KNI in 16.11.
+  It is implemented only for igb and ixgbe.
+  It is really hard to maintain because it requires some out-of-tree kernel
+  code to be duplicated in this kernel module.
+  Removing this partial support will help to restrict the PCI id definitions
+  to the PMD code.
-- 
2.7.0

[dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment

2016-07-20 Thread Tootoonchian, Amin

> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, July 20, 2016 8:12 AM
> To: Tootoonchian, Amin 
> Cc: dev at dpdk.org; Kerlin, MarcinX 
> Subject: Re: [dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment
> 
> Hi,
> 
> 2016-07-20 15:07, Tootoonchian, Amin:
> > Thomas, your thoughts?
> 
> I have 2 thoughts:
> - it is too big for 16.07
> - it is related to multi-process mechanism, maintained by Sergio ;)
> 
> Sorry I won't look at it shortly.

[Adding Sergio to the thread for review.]

The actual change of this patch is quite small (see below) but I understand 
that it may be too late for 16.07:

* The main addition is getting the port id using rte_eth_dev_get_port_by_name 
for secondary processes in rte_eth_dev_allocate. Also zeroing out 
rte_eth_dev_data after freeing a port to avoid false positives in 
rte_eth_dev_get_port_by_name.

* The rest of the patch moves parts of rte_eth_dev_allocate to be executed only 
by the primary.

[dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev structure

2016-07-20 Thread Thomas Monjalon

2016-07-20 15:13, Ananyev, Konstantin:
> Hi Thomas,
> 
> > Hi,
> > 
> > This patch announces an interesting change in the DPDK design.
> > 
> > 2016-07-20 16:24, Tomasz Kulasek:
> > > This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
> > > changes in rte_eth_dev and rte_eth_desc_lim structures.
> > >
> > > In 16.11, we plan to introduce rte_eth_tx_prep() function to do
> > > necessary preparations of packet burst to be safely transmitted on
> > > device for desired HW offloads (set/reset checksum field according to
> > > the hardware requirements) and check HW constraints (number of
> > > segments per packet, etc).
> > >
> > > While the limitations and requirements may differ for devices, it
> > > requires to extend rte_eth_dev structure with new function pointer
> > > "tx_pkt_prep" which can be implemented in the driver to prepare and
> > > verify packets, in devices specific way, before burst, what should to
> > > prevent application to send malformed packets.
> > >
> > > Also new fields will be introduced in rte_eth_desc_lim: nb_seg_max and
> > > nb_mtu_seg_max, providing an information about max segments in TSO and
> > > non TSO packets acceptable by device.
> > 
> > We cannot acknowledge such notice without a prior design discussion.
> > Please explain why you plan to work on this change and give a draft of the 
> > new structures (a RFC patch would be ideal).
> 
> I think it is not really a deprecation note, but announce ABI change for 
> rte_ethdev.h structures.

An ABI break requires a deprecation notice. So it is :)

> The plan is to implement what was proposed & discussed the following thread:
> http://dpdk.org/ml/archives/dev/2015-September/023603.html

Please could you summarize it here?

[dpdk-dev] NIC speed capabilities

2016-07-20 Thread Thomas Monjalon

Hi drivers maintainers,

The first line of the table in doc/guides/nics/overview.rst is
outstandingly empty!
Since the commit http://dpdk.org/commit/e274f57, it is possible
to announce the speed capabilities of each port.
The first implementation was only trying to fill some realistic values
at the driver level.

Please complete this work to have the exact values depending of the
device model.

Thank you

[dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment

2016-07-20 Thread Thomas Monjalon

Hi,

2016-07-20 15:07, Tootoonchian, Amin:
> Thomas, your thoughts?

I have 2 thoughts:
- it is too big for 16.07
- it is related to multi-process mechanism, maintained by Sergio ;)

Sorry I won't look at it shortly.

[dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev structure

2016-07-20 Thread Thomas Monjalon

Hi,

This patch announces an interesting change in the DPDK design.

2016-07-20 16:24, Tomasz Kulasek:
> This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
> changes in rte_eth_dev and rte_eth_desc_lim structures.
> 
> In 16.11, we plan to introduce rte_eth_tx_prep() function to do
> necessary preparations of packet burst to be safely transmitted on
> device for desired HW offloads (set/reset checksum field according to
> the hardware requirements) and check HW constraints (number of segments
> per packet, etc).
> 
> While the limitations and requirements may differ for devices, it
> requires to extend rte_eth_dev structure with new function pointer
> "tx_pkt_prep" which can be implemented in the driver to prepare and
> verify packets, in devices specific way, before burst, what should to
> prevent application to send malformed packets.
> 
> Also new fields will be introduced in rte_eth_desc_lim: nb_seg_max and
> nb_mtu_seg_max, providing an information about max segments in TSO and
> non TSO packets acceptable by device.

We cannot acknowledge such notice without a prior design discussion.
Please explain why you plan to work on this change and give a draft of
the new structures (a RFC patch would be ideal).

Thanks

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-20 Thread Chandran, Sugesh

Hi Adrien,

Sorry for the late reply.
Snipped out the parts we agreed.

Regards
_Sugesh


> -Original Message-
> From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> Sent: Monday, July 18, 2016 4:00 PM
> To: Chandran, Sugesh 
> Cc: dev at dpdk.org; Thomas Monjalon ;
> Zhang, Helin ; Wu, Jingjing
> ; Rasesh Mody ; Ajit
> Khaparde ; Rahul Lakkireddy
> ; Lu, Wenzhuo ;
> Jan Medala ; John Daley ; Chen,
> Jing D ; Ananyev, Konstantin
> ; Matej Vido ;
> Alejandro Lucero ; Sony Chacko
> ; Jerin Jacob
> ; De Lara Guarch, Pablo
> ; Olga Shern ;
> Chilikin, Andrey 
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> On Mon, Jul 18, 2016 at 01:26:09PM +, Chandran, Sugesh wrote:
> > Hi Adrien,
> > Thank you for getting back on this.
> > Please find my comments below.
> 
> Hi Sugesh,
> 
> Same for me, removed again the parts we agree on.
> 
> [...]
> > > For your above example, the application cannot assume a rule is
> > > added/deleted as long as the PMD has not completed the related
> > > operation, which means keeping the SW rule/fallback in place in the
> > > meantime. Should handle security concerns as long as after removing
> > > a rule, packets end up in a default queue entirely processed by SW.
> > > Obviously this may worsen response time.
> > >
> > > The ID action can help with this. By knowing which rule a received
> > > packet is associated with, processing can be temporarily offloaded
> > > by another thread without much complexity.
> > [Sugesh] Setting ID for every flow may not viable especially when the
> > size of ID is small(just only 8 bits). I am not sure is this a valid case 
> > though.
> 
> Agreed, I'm not saying this solution works for all devices, particularly those
> that do not support ID at all.
> 
> > How about a hardware flow flag in packet descriptor that set when the
> > packets hits any hardware rule. This way software doesn?t worry
> > /blocked by a hardware rule . Even though there is an additional
> > overhead of validating this flag, software datapath can identify the
> hardware processed packets easily.
> > This way the packets traverses the software fallback path until the
> > rule configuration is complete. This flag avoids setting ID action for every
> hardware flow that are configuring.
> 
> That makes sense. I see it as a sort of single bit ID but it could be
> implemented through a different action for less capable devices. PMDs that
> support 32 bit IDs could reuse the same code for both features.
> 
> I understand you'd prefer having this feature always present, however we
> already know that not all PMDs/devices support it, and like everything else
> this is a kind of offload that needs to be explicitly requested by the
> application as it may not be needed.
> 
> If we go with the separate action, then perhaps it would make sense to
> rename "ID" to "MARK" to make things clearer:
> 
>  RTE_FLOW_ACTION_TYPE_FLAG /* Flag packets processed by flow rule. */
> 
>  RTE_FLOW_ACTION_TYPE_MARK /* Attach a 32 bit value to a packet. */
> 
> I guess the result of the FLAG action would be something in ol_flag.
> 
[Sugesh] This looks fine for me.
> Thoughts?
> 
[Sugesh] Two more queries that I missed out in the earlier comments are,
Support for PTYPE :- Intel NICs can report packet type in mbuf.
This can be used by software for the packet processing. Is generic API
capable of handling that as well? 
RSS hashing support :- Just to confirm, the RSS flow action allows application
to decide the header fields to produce the hash. This gives
programmability on load sharing across different queues. The
application can program the NIC to calculate the RSS hash only using mac or 
mac+ ip or 
ip only using this.


> > > I think applications have to implement SW fallbacks all the time, as
> > > even some sort of guarantee on the flow rule processing time may not
> > > be enough to avoid misdirected packets and related security issues.
> > [Sugesh] Software fallback will be there always. However I am little
> > bit confused on the way software going to identify the packets that
> > are already hardware processed . I feel we need some notification in the
> packet itself, when a hardware rule hits. ID/flag/any other options?
> 
> Yeah I think so too, as long as it is optional because we cannot assume all
> PMDs will support it.
> 
> > > Let's wait for applications to start using this API and then
> > > consider an extra set of asynchronous / real-time functions when the
> > > need arises. It should not impact the way rules are specified
> > [Sugesh] Sure. I think the rule definition may not impact with this.
> 
> Thanks for your comments.
> 
> --
> Adrien Mazarguil
> 6WIND

[dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev structure

2016-07-20 Thread Tomasz Kulasek

This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
changes in rte_eth_dev and rte_eth_desc_lim structures.

In 16.11, we plan to introduce rte_eth_tx_prep() function to do
necessary preparations of packet burst to be safely transmitted on
device for desired HW offloads (set/reset checksum field according to
the hardware requirements) and check HW constraints (number of segments
per packet, etc).

While the limitations and requirements may differ for devices, it
requires to extend rte_eth_dev structure with new function pointer
"tx_pkt_prep" which can be implemented in the driver to prepare and
verify packets, in devices specific way, before burst, what should to
prevent application to send malformed packets.

Also new fields will be introduced in rte_eth_desc_lim: nb_seg_max and
nb_mtu_seg_max, providing an information about max segments in TSO and
non TSO packets acceptable by device.

Signed-off-by: Tomasz Kulasek 
---
 doc/guides/rel_notes/deprecation.rst |7 +++
 1 file changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index f502f86..485aacb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -41,3 +41,10 @@ Deprecation Notices
 * The mempool functions for single/multi producer/consumer are deprecated and
   will be removed in 16.11.
   It is replaced by rte_mempool_generic_get/put functions.
+
+* In 16.11 ABI changes are plained: the ``rte_eth_dev`` structure will be
+  extended with new function pointer ``tx_pkt_prep`` allowing verification
+  and processing of packet burst to meet HW specific requirements before
+  transmit. Also new fields will be added to the ``rte_eth_desc_lim`` 
structure:
+  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing information about number of
+  segments limit to be transmitted by device for TSO/non-TSO packets.
-- 
1.7.9.5

[dpdk-dev] [PATCH] eal: fix check number of bytes from read function

2016-07-20 Thread Michal Jastrzebski

In rte_mem_virt2phy: Value returned from a function and indicating the
number of bytes was ignored. This could cause a wrong pfn (page frame
number) mask read from pagemap file.
When read returns less than the number of sizeof(uint64_t) bytes,
function rte_mem_virt2phy returns error.

Coverity issue: 13212
Fixes: 40b966a211ab ("ivshmem: library changes for mmaping using
ivshmem").

Signed-off-by: Michal Jastrzebski 
---
 lib/librte_eal/linuxapp/eal/eal_memory.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c 
b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 42a29fa..05769fb 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -158,7 +158,7 @@ rte_mem_lock_page(const void *virt)
 phys_addr_t
 rte_mem_virt2phy(const void *virtaddr)
 {
-   int fd;
+   int fd, retval;
uint64_t page, physaddr;
unsigned long virt_pfn;
int page_size;
@@ -209,11 +209,19 @@ rte_mem_virt2phy(const void *virtaddr)
close(fd);
return RTE_BAD_PHYS_ADDR;
}
-   if (read(fd, &page, sizeof(uint64_t)) < 0) {
+
+   retval = read(fd, &page, sizeof(uint64_t));
+   if (retval < 0) {
RTE_LOG(ERR, EAL, "%s(): cannot read /proc/self/pagemap: %s\n",
__func__, strerror(errno));
close(fd);
return RTE_BAD_PHYS_ADDR;
+   }   else if (retval >= 0 && retval < (int)sizeof(uint64_t)) {
+   RTE_LOG(ERR, EAL, "%s(): read %d bytes from /proc/self/pagemap "
+   "but expected %d: %s\n",
+   __func__, retval, (int)sizeof(uint64_t), 
strerror(errno));
+   close(fd);
+   return RTE_BAD_PHYS_ADDR;
}

/*
-- 
1.7.9.5

[dpdk-dev] [PATCH] unify tools naming

2016-07-20 Thread Thomas Monjalon

The following tools may be installed system-wise.
It may be cleaner and more convenient to find them with the same
dpdk- prefix (especially for autocompletion).
Moreover, the script dpdk_nic_bind.py deserves a new name because it is
not restricted to NICs and can be used for e.g. crypto.

These files are renamed:
pmdinfogen   -> dpdk-pmdinfogen
pmdinfo.py   -> dpdk-pmdinfo.py
dpdk_pdump   -> dpdk-pdump
dpdk_proc_info   -> dpdk-procinfo
dpdk_nic_bind.py -> dpdk-devbind.py
setup.sh -> dpdk-setup.sh

The tools pmdinfogen, pmdinfo.py and dpdk_pdump are new in 16.07.

The scripts dpdk_nic_bind.py and setup.sh may have been used with
previous releases by end users. That's why a symbolic link still
provide the old name in the installed tools directory.

Signed-off-by: Thomas Monjalon 
---

It would be good to have this rename in 16.07 before that pmdinfo.py
and dpdk_pdump are widely used.

A possible addition to this patch could be renaming the test apps:
test -> dpdk-test
testacl  -> dpdk-testacl
testpipeline -> dpdk-testpipeline
testpmd  -> dpdk-testpmd

---
 MAINTAINERS  |  2 +-
 app/pdump/Makefile   |  2 +-
 app/proc_info/Makefile   |  2 +-
 buildtools/pmdinfogen/Makefile   |  2 +-
 doc/guides/faq/faq.rst   |  2 +-
 doc/guides/linux_gsg/build_dpdk.rst  | 14 +++---
 doc/guides/linux_gsg/nic_perf_intel_platform.rst |  6 +++---
 doc/guides/linux_gsg/quick_start.rst | 12 ++--
 doc/guides/nics/bnx2x.rst|  4 ++--
 doc/guides/nics/cxgbe.rst|  4 ++--
 doc/guides/nics/ena.rst  |  2 +-
 doc/guides/nics/enic.rst |  6 +++---
 doc/guides/nics/i40e.rst |  4 ++--
 doc/guides/nics/intel_vf.rst |  8 
 doc/guides/nics/nfp.rst  |  8 
 doc/guides/nics/qede.rst |  2 +-
 doc/guides/nics/thunderx.rst | 16 
 doc/guides/nics/virtio.rst   |  2 +-
 doc/guides/prog_guide/dev_kit_build_system.rst   | 14 +++---
 doc/guides/rel_notes/release_16_07.rst   |  3 +++
 doc/guides/sample_app_ug/pdump.rst   | 14 +++---
 doc/guides/sample_app_ug/proc_info.rst   |  8 
 doc/guides/testpmd_app_ug/testpmd_funcs.rst  | 10 +-
 doc/guides/xen/pkt_switch.rst|  2 +-
 lib/librte_eal/common/eal_common_options.c   |  4 ++--
 mk/internal/rte.compile-pre.mk   |  2 +-
 mk/rte.sdkinstall.mk | 16 ++--
 mk/rte.sdktest.mk|  4 ++--
 tools/{dpdk_nic_bind.py => dpdk-devbind.py}  |  0
 tools/{pmdinfo.py => dpdk-pmdinfo.py}|  4 +---
 tools/{setup.sh => dpdk-setup.sh}| 24 
 31 files changed, 104 insertions(+), 99 deletions(-)
 rename tools/{dpdk_nic_bind.py => dpdk-devbind.py} (100%)
 rename tools/{pmdinfo.py => dpdk-pmdinfo.py} (99%)
 rename tools/{setup.sh => dpdk-setup.sh} (95%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2996b09..3bfcc9f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -70,7 +70,7 @@ F: scripts/validate-abi.sh

 Driver information
 F: buildtools/pmdinfogen/
-F: tools/pmdinfo.py
+F: tools/dpdk-pmdinfo.py


 Environment Abstraction Layer
diff --git a/app/pdump/Makefile b/app/pdump/Makefile
index d85bb08..536198f 100644
--- a/app/pdump/Makefile
+++ b/app/pdump/Makefile
@@ -33,7 +33,7 @@ include $(RTE_SDK)/mk/rte.vars.mk

 ifeq ($(CONFIG_RTE_LIBRTE_PDUMP),y)

-APP = dpdk_pdump
+APP = dpdk-pdump

 CFLAGS += $(WERROR_FLAGS)

diff --git a/app/proc_info/Makefile b/app/proc_info/Makefile
index 33e058e..e051e03 100644
--- a/app/proc_info/Makefile
+++ b/app/proc_info/Makefile
@@ -31,7 +31,7 @@

 include $(RTE_SDK)/mk/rte.vars.mk

-APP = dpdk_proc_info
+APP = dpdk-procinfo

 CFLAGS += $(WERROR_FLAGS)

diff --git a/buildtools/pmdinfogen/Makefile b/buildtools/pmdinfogen/Makefile
index 3885d3b..bd8f900 100644
--- a/buildtools/pmdinfogen/Makefile
+++ b/buildtools/pmdinfogen/Makefile
@@ -34,7 +34,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 #
 # library name
 #
-HOSTAPP = pmdinfogen
+HOSTAPP = dpdk-pmdinfogen

 #
 # all sources are stored in SRCS-y
diff --git a/doc/guides/faq/faq.rst b/doc/guides/faq/faq.rst
index 3228b92..8d1ea6c 100644
--- a/doc/guides/faq/faq.rst
+++ b/doc/guides/faq/faq.rst
@@ -50,7 +50,7 @@ When you stop and restart the test application, it looks to 
see if the pages are
 If you look in the directory, you will see ``n`` number of 2M pages files. If 
you specified 1024, you will see 1024 page files.
 These are then placed in memory segments to get contiguous memory.

-If you need to change the number of pages, it is easier to first remove the 
pages. The tools/setup.sh sc

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Neil Horman

On Wed, Jul 20, 2016 at 07:47:32PM +, Wiles, Keith wrote:
> 
> > On Jul 20, 2016, at 12:48 PM, Neil Horman  wrote:
> > 
> > On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
> >> 2016-07-20 13:09, Neil Horman:
> >>> From: Neil Horman 
> >>> 
> >>> John Mcnamara and I were discussing enhacing the validate_abi script to 
> >>> build
> >>> the dpdk tree faster with multiple jobs.  Theres no reason not to do it, 
> >>> so this
> >>> implements that requirement.  It uses a MAKE_JOBS variable that can be 
> >>> set by
> >>> the user to limit the job count.  By default the job count is set to the 
> >>> number
> >>> of online cpus.
> >> 
> >> Please could you use the variable name DPDK_MAKE_JOBS?
> >> This name is already used in scripts/test-build.sh.
> >> 
> > Sure
> > 
> >>> +if [ -z "$MAKE_JOBS" ]
> >>> +then
> >>> + # This counts the number of cpus on the system
> >>> + MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> >>> +fi
> >> 
> >> Is lscpu common enough?
> >> 
> > I'm not sure how to answer that.  lscpu is part of the util-linux package, 
> > which
> > is part of any base install.  Theres a variant for BSD, but I'm not sure how
> > common it is there.
> > Neil
> > 
> >> Another acceptable default would be just "-j" without any number.
> >> It would make the number of jobs unlimited.
> 
> I think the best is just use -j as it tries to use the correct number of jobs 
> based on the number of cores, right?
> 
-j with no argument (or -j 0), is sort of, maybe what you want.  With either of
those options, make will just issue jobs as fast as it processes dependencies.
Dependent on how parallel the build is, that can lead to tons of waiting process
(i.e. more than your number of online cpus), which can actually hurt your build
time.

While its fine in los of cases, its not always fine, and with this
implementation you can still opt in to that behavior by setting DPDK_MAKE_JOBS=0

Neil

>

[dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev structure

2016-07-20 Thread Kulasek, TomaszX



> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, July 20, 2016 17:22
> To: Ananyev, Konstantin 
> Cc: Kulasek, TomaszX ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev
> structure
> 
> 2016-07-20 15:13, Ananyev, Konstantin:
> > Hi Thomas,
> >
> > > Hi,
> > >
> > > This patch announces an interesting change in the DPDK design.
> > >
> > > 2016-07-20 16:24, Tomasz Kulasek:
> > > > This is an ABI deprecation notice for DPDK 16.11 in librte_ether
> > > > about changes in rte_eth_dev and rte_eth_desc_lim structures.
> > > >
> > > > In 16.11, we plan to introduce rte_eth_tx_prep() function to do
> > > > necessary preparations of packet burst to be safely transmitted on
> > > > device for desired HW offloads (set/reset checksum field according
> > > > to the hardware requirements) and check HW constraints (number of
> > > > segments per packet, etc).
> > > >
> > > > While the limitations and requirements may differ for devices, it
> > > > requires to extend rte_eth_dev structure with new function pointer
> > > > "tx_pkt_prep" which can be implemented in the driver to prepare
> > > > and verify packets, in devices specific way, before burst, what
> > > > should to prevent application to send malformed packets.
> > > >
> > > > Also new fields will be introduced in rte_eth_desc_lim: nb_seg_max
> > > > and nb_mtu_seg_max, providing an information about max segments in
> > > > TSO and non TSO packets acceptable by device.
> > >
> > > We cannot acknowledge such notice without a prior design discussion.
> > > Please explain why you plan to work on this change and give a draft of
> the new structures (a RFC patch would be ideal).
> >
> > I think it is not really a deprecation note, but announce ABI change for
> rte_ethdev.h structures.
> 
> An ABI break requires a deprecation notice. So it is :)
> 
> > The plan is to implement what was proposed & discussed the following
> thread:
> > http://dpdk.org/ml/archives/dev/2015-September/023603.html
> 
> Please could you summarize it here?

Hi Thomas,

The implementation of rte_eth_tx_prep() will be similar to the 
rte_eth_tx_burst(), passing same arguments to the driver, so packets can be 
checked and modified before real burst will be done.

The API for new function will be implemented in the fallowed way:

+/**
+ * Process a burst of output packets on a transmit queue of an Ethernet device.
+ *
+ * The rte_eth_tx_prep() function is invoked to prepare output packets to be
+ * transmitted on the output queue *queue_id* of the Ethernet device designated
+ * by its *port_id*.
+ * The *nb_pkts* parameter is the number of packets to be prepared which are
+ * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
+ * allocated from a pool created with rte_pktmbuf_pool_create().
+ * For each packet to send, the rte_eth_tx_prep() function performs
+ * the following operations:
+ *
+ * - Check if packet meets devices requirements for tx offloads.
+ *
+ * - Check limitations about number of segments.
+ *
+ * - Check additional requirements when debug is enabled.
+ *
+ * - Update and/or reset required checksums when tx offload is set for packet.
+ *
+ * The rte_eth_tx_prep() function returns the number of packets ready it
+ * actually sent. A return value equal to *nb_pkts* means that all packets
+ * are valid and ready to be sent.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the transmit queue through which output packets must be
+ *   sent.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_pkts
+ *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
+ *   which contain the output packets.
+ * @param nb_pkts
+ *   The maximum number of packets to process.
+ * @return
+ *   The number of packets correct and ready to be sent. The return value can 
be
+ *   less than the value of the *tx_pkts* parameter when some packet doesn't
+ *   meet devices requirements with rte_errno set appropriately.
+ */
+static inline uint16_t
+rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts,
+   uint16_t nb_pkts)
+{
+   struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+   if (!dev->tx_pkt_prep) {
+   rte_errno = -ENOTSUP;
+   return 0;
+   }
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+   if (queue_id >= dev->data->nb_tx_queues) {
+   RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
+   rte_errno = -EINVAL;
+   return 0;
+   }
+#endif
+
+   return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id], tx_pkts, 
nb_pkts);
+}

Tomasz

[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-20 Thread Olivier Matz

Hi,

On 07/20/2016 02:41 PM, Zoltan Kiss wrote:
> 
> 
> On 19/07/16 17:17, Olivier Matz wrote:
>> Hi Zoltan,
>>
>> On 07/19/2016 05:59 PM, Zoltan Kiss wrote:
>>>
>>>
>>> On 19/07/16 16:37, Olivier Matz wrote:
 Hi Zoltan,

 On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
> A recent fix brought up an issue about the size of the 'name' fields:
>
> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>
> These relations should be observed:
>
> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE -
> strlen(RTE_MEMPOOL_MZ_PREFIX)
>
> Setting all of them to 32 hides this restriction from the application.
> This patch increases the memzone string size to accomodate for these
> prefixes, and the same happens with the ring name string. The ABI
> needs to
> be broken to fix this API issue, this way doesn't break applications
> previously not failing due to the truncating bug now fixed.
>
> Signed-off-by: Zoltan Kiss 

 I agree it is a problem for an application because it cannot know what
 is the maximum name length. On the other hand, breaking the ABI for
 this
 looks a bit overkill. Maybe we could reduce RTE_MEMPOOL_NAMESIZE and
 RTE_RING_NAMESIZE instead of increasing RTE_MEMZONE_NAMESIZE? That way,
 we could keep the ABI as is.
>>>
>>> But that would break the ABI too, wouldn't it? Unless you keep the array
>>> the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
>>
>> Yes, that was the idea.
>>
>>> And even then, the API breaks anyway. There are applications - I have at
>>> least some - which use all 32 bytes to store the name. Decrease that
>>> would cause headache to change the naming scheme, because it's a 30
>>> character long id, and chopping the last few chars would cause name
>>> collisions and annoying bugs.
>>
>> Before my patch (85cf0079), long names were silently truncated when
>> mempool created its ring and/or memzones. Now, it returns an error.
> 
> With 16.04 an application could operate as expected if the first 26
> character were unique. Your patch revealed the problem that characters
> after these were left out of the name. Now applications fail where this
> never been a bug because their naming scheme guarantees the uniqueness
> on the first 26 chars (or makes it very unlikely)
> Where the first 26 is not unique, it failed earlier too, because at
> memzone creation it checks for duplicate names.

Yes, I understand that there is a behavior change for applications using
names larger than 26 between 16.04 and 16.07. I also understand that
there is no way for an application to know what is the maximum usable
size for a mempool or a ring.


>> I'm not getting why changing the struct to something like below would
>> break the API, since it would already return an error today.
>>
>>#define RTE_MEMPOOL_NAMESIZE \
> 
> Wait, this would mean applications need to recompile to use the smaller
> value. AFAIK that's an ABI break too, right? At the moment I don't see a
> way to fix this without breaking the ABI

With this modification, if you don't recompile the application, its
behavior will still be the same as today -> it will return ENAMETOOLONG.
If you recompile it, the application will be aware of the maximum
length. To me, it seems to be a acceptable compromise for this release.

The patch you're proposing also changes the ABI of librte_ring and
librte_eal, which cannot be accepted for the release.


> 
>>(RTE_MEMZONE_NAMESIZE - sizeof(pool_prefix) - sizeof(ring prefix))
>>struct rte_mempool {
>>union {
>>  char name[RTE_MEMPOOL_NAMESIZE];
>>  char pad[32];
>>};
>>...
>>}
>>
>> Anyway, it may not be the proper solution since it supposes that a
>> mempool includes a ring based on a memzone, which is not always true now
>> with mempool handlers.
> 
> Oh, as we dug deeper it gets better!
> Indeed, we don't necessarily have this ring + memzone pair underneath,
> but the user is not aware of that, and I think we should keep it that
> way. It should only care that the string passed shouldn't be bigger than
> a certain amount.

Yes. What I'm just saying here is that it's not a good solution to write
in the #define that "a mempool is based on a ring + a memzone", because
if some someone adds a new mempool handler replacing the ring, and also
creating a memzone prefixed by something larger than "rg_", we will have
to break the ABI again.


> Also, even though we don't necessarily have the ring, we still reserve
> memzone's in rte_mempool_populate_default(). And their name has a 3
> letter prefix, and a "_%d" postfix, where the %d could be as much as
> RTE_MAX_MEMZONE in worst case (2560 by default) So actually:
> 
> RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
> strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen("_2560")
> 
> 
> As a side note, there is another bug around here: r

[dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev structure

2016-07-20 Thread Ananyev, Konstantin

Hi Thomas,

> Hi,
> 
> This patch announces an interesting change in the DPDK design.
> 
> 2016-07-20 16:24, Tomasz Kulasek:
> > This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
> > changes in rte_eth_dev and rte_eth_desc_lim structures.
> >
> > In 16.11, we plan to introduce rte_eth_tx_prep() function to do
> > necessary preparations of packet burst to be safely transmitted on
> > device for desired HW offloads (set/reset checksum field according to
> > the hardware requirements) and check HW constraints (number of
> > segments per packet, etc).
> >
> > While the limitations and requirements may differ for devices, it
> > requires to extend rte_eth_dev structure with new function pointer
> > "tx_pkt_prep" which can be implemented in the driver to prepare and
> > verify packets, in devices specific way, before burst, what should to
> > prevent application to send malformed packets.
> >
> > Also new fields will be introduced in rte_eth_desc_lim: nb_seg_max and
> > nb_mtu_seg_max, providing an information about max segments in TSO and
> > non TSO packets acceptable by device.
> 
> We cannot acknowledge such notice without a prior design discussion.
> Please explain why you plan to work on this change and give a draft of the 
> new structures (a RFC patch would be ideal).

I think it is not really a deprecation note, but announce ABI change for 
rte_ethdev.h structures.
The plan is to implement what was proposed & discussed the following thread:
http://dpdk.org/ml/archives/dev/2015-September/023603.html

Konstantin

[dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment

2016-07-20 Thread Tootoonchian, Amin

Hi Marcin,

Comments inline:

> -Original Message-
> From: Kerlin, MarcinX
> Sent: Wednesday, July 20, 2016 1:51 AM
> To: Tootoonchian, Amin 
> Cc: dev at dpdk.org; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH] ethdev: ensure consistent port id assignment
> 
> Hi Amin,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Tootoonchian,
> > Amin
> > Sent: Tuesday, July 12, 2016 4:01 AM
> > To: thomas.monjalon at 6wind.com
> > Cc: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH] ethdev: ensure consistent port id
> > assignment
> >
> > The rte_eth_dev_allocate() code has an implicit assumption that the
> > port id assignment in the secondary process is consistent with that of
> > the primary. The current code breaks if the enumeration of ethdevs in
> > primary and secondary processes are not identical (e.g., when the
> > black/whitelist and vdev args of the primary and secondary do not
> > match, or when the primary dynamically adds/removes ethdevs).
> >
> > To fix this problem, rte_eth_dev_allocate() now looks up port id by
> > name in a secondary process (making it explicit that ethdevs can only
> > be created and initialized by the primary process). Upon releasing a
> > port, the primary process zeros out eth_dev->data to avoid false
> > positives in port id lookup by rte_eth_dev_get_port_by_name().
> >
> > Signed-off-by: Amin Tootoonchian 
> > ---
> >  lib/librte_ether/rte_ethdev.c | 44
> > +
> > --
> >  1 file changed, 30 insertions(+), 14 deletions(-)
> >
> > diff --git a/lib/librte_ether/rte_ethdev.c
> > b/lib/librte_ether/rte_ethdev.c index
> > 0a6e3f1..1801f57 100644
> > --- a/lib/librte_ether/rte_ethdev.c
> > +++ b/lib/librte_ether/rte_ethdev.c
> > @@ -195,25 +195,37 @@ rte_eth_dev_allocate(const char *name, enum
> > rte_eth_dev_type type)
> > uint8_t port_id;
> > struct rte_eth_dev *eth_dev;
> >
> > -   port_id = rte_eth_dev_find_free_port();
> > -   if (port_id == RTE_MAX_ETHPORTS) {
> > -   RTE_PMD_DEBUG_TRACE("Reached maximum number of
> > Ethernet ports\n");
> > -   return NULL;
> > -   }
> > -
> > if (rte_eth_dev_data == NULL)
> > rte_eth_dev_data_alloc();
> >
> > -   if (rte_eth_dev_allocated(name) != NULL) {
> > -   RTE_PMD_DEBUG_TRACE("Ethernet Device with name %s
> > already allocated!\n",
> > -   name);
> > -   return NULL;
> > +   if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> > +   port_id = rte_eth_dev_find_free_port();
> > +   if (port_id == RTE_MAX_ETHPORTS) {
> > +   RTE_PMD_DEBUG_TRACE("Reached maximum number
> > of Ethernet ports\n");
> > +   return NULL;
> > +   }
> > +
> > +   if (rte_eth_dev_allocated(name) != NULL) {
> > +   RTE_PMD_DEBUG_TRACE("Ethernet Device with name
> > %s already allocated!\n",
> > +   name);
> > +   return NULL;
> > +   }
> > +   } else {
> > +   if (rte_eth_dev_get_port_by_name(name, &port_id) != 0) {
> 
> 
> I was working also on this problem but I didn't send patch yet, so I did 
> review of
> your code.
> 
> Condition (rte_eth_dev_get_port_by_name(name, &port_id) != 0) will always
> fail.
> Secondary process enter here and he will be looking for him name but has not
> yet added and the application return NULL here e.g. we will run app with 
> device
> name pcap1 but this device is not on list and function return null.

This is the intended behavior with this patch. Ports are to be created only by 
the primary process. This is required for correct operation IMO, because if we 
allow secondary processes to create ports dynamically (and locally use 
conflicting port ids) without any synchronization mechanism, they're guaranteed 
to overwrite each other's rte_eth_dev_data.

> > +   RTE_PMD_DEBUG_TRACE("Ethernet Device with name
> > %s could not be found!\n",
> > +   name);
> > +   return NULL;
> > +   }
> > }
> >
> > eth_dev = &rte_eth_devices[port_id];
> > eth_dev->data = &rte_eth_dev_data[port_id];
> > -   snprintf(eth_dev->data->name, sizeof(eth_dev->data->name), "%s",
> > name);
> > -   eth_dev->data->port_id = port_id;
> > +
> > +   if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> > +   snprintf(eth_dev->data->name, sizeof(eth_dev->data->name),
> > "%s", name);
> > +   eth_dev->data->port_id = port_id;
> > +   }
> > +
> 
> 
> 1. rte_eth_dev_data[port_id] -> port id should be shifted because secondary
> process
> overwrite e.g. first position which is common with primary process, so should 
> be
> add at the end
>
> 2. If this condition is true only for primary it means that secondary process 
> can't
> add own name.
> So this excludes with above line: "if (rte_eth_dev_get_port_by_name(name,
> &port_id) != 0)"?

No, with this patch,

[dpdk-dev] [PATCH v2] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Neil Horman

John Mcnamara and I were discussing enhacing the validate_abi script to build
the dpdk tree faster with multiple jobs.  Theres no reason not to do it, so this
implements that requirement.  It uses a MAKE_JOBS variable that can be set by
the user to limit the job count.  By default the job count is set to the number
of online cpus.

Signed-off-by: Neil Horman 
CC: Thomas Monjalon 
CC: "Mcnamara, John" 

---
Change notes

v2) switch variable to DPDK_MAKE_JOBS
make provision for no existance of lscpu
---
 scripts/validate-abi.sh | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/scripts/validate-abi.sh b/scripts/validate-abi.sh
index c36ad61..feda6c8 100755
--- a/scripts/validate-abi.sh
+++ b/scripts/validate-abi.sh
@@ -97,6 +97,17 @@ fixup_config() {
 #trap on ctrl-c to clean up
 trap cleanup_and_exit SIGINT

+if [ -z "$DPDK_MAKE_JOBS" ]
+then
+   # This counts the number of cpus on the system
+   if [ -e /usr/bin/lscpu ]
+   then
+   DPDK_MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
+   else
+   DPDK_MAKE_JOBS=1
+   fi
+fi
+
 #Save the current branch
 CURRENT_BRANCH=`git branch | grep \* | cut -d' ' -f2`

@@ -183,7 +194,7 @@ log "INFO" "Configuring DPDK $TAG1"
 make config T=$TARGET O=$TARGET > $VERBOSE 2>&1

 log "INFO" "Building DPDK $TAG1. This might take a moment"
-make O=$TARGET > $VERBOSE 2>&1
+make -j$DPDK_MAKE_JOBS O=$TARGET > $VERBOSE 2>&1

 if [ $? -ne 0 ]
 then
@@ -214,7 +225,7 @@ log "INFO" "Configuring DPDK $TAG2"
 make config T=$TARGET O=$TARGET > $VERBOSE 2>&1

 log "INFO" "Building DPDK $TAG2. This might take a moment"
-make O=$TARGET > $VERBOSE 2>&1
+make -j$DPDK_MAKE_JOBS O=$TARGET > $VERBOSE 2>&1

 if [ $? -ne 0 ]
 then
-- 
2.5.5

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Tan, Jianfeng



On 7/20/2016 2:13 PM, Yuanhan Liu wrote:
> On Wed, Jul 20, 2016 at 01:50:34PM +0800, Tan, Jianfeng wrote:
>>
>> On 7/20/2016 12:38 PM, Yuanhan Liu wrote:
>>> On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
 We find significant perfermance drop introduced by below commit,
 when vhost example is started with --mergeable 0 and inside vm,
 kernel virtio-net driver is used to do ip based forwarding.

 The root cause is that below commit adds support for
 VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
 mergeable is disabled, it triggers big_packets path of virtio-net
 driver. In this path, virtio driver uses 19 desc with 18 4K-sized
 pages to receive each packet, so that it can receive a big packet
 with size of 64K. But QEMU only creates 256 desc entries for each
 vq, which results in that only 13 packets can be received. VM
 kernel can quickly handle those packets and go to sleep (HLT).

 As QEMU has no option to set the desc entries of a vq, so here,
 we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
 with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
 disable tso of vhost example, to avoid VM kernel virtio driver
 go into big_packets path.

 Fixes: 859b480d5afd ("vhost: add guest offload setting")
>>> And here you are patching vhost example to try to fix an "issue"
>>> in vhost lib, this is __logically__ wrong.
>>>
>>> --yliu
>> This is not an issue from vhost lib's perspective, vhost lib should provide
>> all features it supports by default.
> Bingo.., that's why "Fixes: 859b480d5afd ... " is wrong to me.
>
>> Applications can enable/disable
>> features according to their own requirements.
> Yes, application can, but application normally doesn't do that. And
> as stated in my early reply, the qemu is the place you should go for
> all those options enabling/disabling, but not vhost (not vhost-example).
>
> I think it's sometimes more handy if we can do that by introducing
> some vhost-example options, and I guess that's why those options are
> given.
>
> In another word, there is nothing wrong about the commit 859b480d5afd,
> if you want to "fix" anything here, following commit is something
> we need fix:
>
>  Fixes: 9fd72e3cbd29 ("examples/vhost: add virtio offload")
>
> Because that commit just partially disables some TSO related features,
> letting the virtio net driver goes to the slow path.

Great, I see. And thanks for detailed clarification. I'll send v2.

>
>> And the vhost example after
>> this commit just triggers a slow path of virtio driver. So this fix just
>> makes sure vhost example does not go into the slow path by default.
> I have made a statement in the first time, that I am not object to
> have this patch at all.
>
> Meanwhile, the right "fix" is you need disable all TSO related features
> from QEMU, in such way, we should see no such issue from all vhost
> application, but not only this one, the one we used mostly internally.
>
> As you can see, it's more about the usage.

Yes, I agree this is the BKM we should adopt and recommend users to use.

Thanks,
Jianfeng

>
>> By the way, if a fix patch should only involve those commits it will change?
> IMO, logically, yes.
>
>   --yliu

[dpdk-dev] SRIOV hot unplug

2016-07-20 Thread Tetsuya Mukawa

Hi Eli,

On 2016/07/19 20:18, Eli Britstein wrote:
> 
> 
>> -Original Message-
>> From: Tetsuya Mukawa [mailto:mukawa at igel.co.jp]
>> Sent: Tuesday, 19 July, 2016 5:49 AM
>> To: Eli Britstein; dev at dpdk.org
>> Cc: Iremonger, Bernard
>> Subject: Re: [dpdk-dev] SRIOV hot unplug
>>
>> Hi Eli,
>>
>> On 2016/07/18 17:47, Eli Britstein wrote:
>>> Hi Bernard,
>>>
>>> Thank you for your answer.
>>> However, to do this, I have to have some communication protocol to the
>> VM's application in order for it to do this sequence and acknowledge that it 
>> is
>> now safe to proceed with detaching the device.
>>> This implies some kind of integration from the host side, which I would like
>> to avoid.
>>
>> I guess you should have some kind of communication channel to notice the
>> hotpluging events from host to VM.
> [Eli Britstein]
> In order just to notice the hotplugging events inside the VM, I can use add 
> some udev action in the VM, in /etc/udev/rules.d/XXX
> However, those are asynchronous events. The host proceeds with unplugging 
> without waiting for the VM to acknowledge it.

Yes, it will be asynchronous.

> 
>>
>>> Do you think might there be any other way for the application to handle
>> such event in a smooth way?
>>
>> So far, I guess having one more virtio-net device will be easiest way.
> [Eli Britstein]
> Could you please elaborate your meaning? How do you mean to use this extra 
> virtio-net device?

If you have one more network device, you can send and receive some kind
of control messages between host and VM.
It may be okay to use general socket API for this additional network
device. So I guess it's not so difficult to additionally implement it.

Thanks,
Tetsuya

> 
> To clarify: I would like to have my bond device automatically set the vNIC as 
> its primary, and close/remove the VF before it is unplugged.
> 
> Thanks,
> Eli
> 
>>
>> Thanks,
>> Tetsuya
>>
>>>
>>> Thanks,
>>> Eli
>>>
 -Original Message-
 From: Iremonger, Bernard [mailto:bernard.iremonger at intel.com]
 Sent: Sunday, 17 July, 2016 11:53 PM
 To: Eli Britstein; dev at dpdk.org
 Cc: Iremonger, Bernard
 Subject: RE: SRIOV hot unplug

 Hi Eli,

 The DPDK application in the VM should remove the slave device from
 the bond device, stop, close and detach the device in the VM before
 doing "virsh detach-device" from the host.

 Regards,

 Bernard.


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Eli Britstein
> Sent: Sunday, July 17, 2016 9:58 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] SRIOV hot unplug
>
> Hello,
>
> A DPDK application with a DPDK bond device, with 2 slaves: one vnic,
> and another is a SRIOV VF connected as a pathrough.
> The bond device is configured as ACTIVE/BACKUP, and the primary is
> the VF slave.
> Now, I do "virsh detach-device" from the host, and the DPDK process
> in the VM gets segmentation fault, as it tries to poll an address
> that is not mmaped anymore.
> I wonder if this flow is supposed to be supported by DPDK, or not.
> Please advise.
>
> Thanks,
> Eli
> 
> --
> 
> ---
> This email and any files transmitted and/or attachments with it are
> confidential and proprietary information of Toga Networks Ltd., and
> intended solely for the use of the individual or entity to whom they
> are addressed.
> If you have received this email in error please notify the system
>> manager.
> This message contains confidential information of Toga Networks
> Ltd., and is intended only for the individual named. If you are not
> the named addressee you should not disseminate, distribute or copy
> this e-mail. Please notify the sender immediately by e-mail if you
> have received this e-mail by mistake and delete this e-mail from
> your system. If you are not the intended recipient you are notified
> that disclosing, copying, distributing or taking any action in
> reliance on the
 contents of this information is strictly prohibited.
> 
> --
> 
> --
>>>
>>> --
>>> --
>>> - This email and any files transmitted and/or attachments with it
>>> are confidential and proprietary information of Toga Networks Ltd.,
>>> and intended solely for the use of the individual or entity to whom they are
>> addressed.
>>> If you have received this email in error please notify the system
>>> manager. This message contains confidential infor

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Yuanhan Liu

On Wed, Jul 20, 2016 at 01:50:34PM +0800, Tan, Jianfeng wrote:
> 
> 
> On 7/20/2016 12:38 PM, Yuanhan Liu wrote:
> >On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
> >>We find significant perfermance drop introduced by below commit,
> >>when vhost example is started with --mergeable 0 and inside vm,
> >>kernel virtio-net driver is used to do ip based forwarding.
> >>
> >>The root cause is that below commit adds support for
> >>VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
> >>mergeable is disabled, it triggers big_packets path of virtio-net
> >>driver. In this path, virtio driver uses 19 desc with 18 4K-sized
> >>pages to receive each packet, so that it can receive a big packet
> >>with size of 64K. But QEMU only creates 256 desc entries for each
> >>vq, which results in that only 13 packets can be received. VM
> >>kernel can quickly handle those packets and go to sleep (HLT).
> >>
> >>As QEMU has no option to set the desc entries of a vq, so here,
> >>we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
> >>with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
> >>disable tso of vhost example, to avoid VM kernel virtio driver
> >>go into big_packets path.
> >>
> >>Fixes: 859b480d5afd ("vhost: add guest offload setting")
> >And here you are patching vhost example to try to fix an "issue"
> >in vhost lib, this is __logically__ wrong.
> >
> > --yliu
> 
> This is not an issue from vhost lib's perspective, vhost lib should provide
> all features it supports by default.

Bingo.., that's why "Fixes: 859b480d5afd ... " is wrong to me.

> Applications can enable/disable
> features according to their own requirements.

Yes, application can, but application normally doesn't do that. And
as stated in my early reply, the qemu is the place you should go for
all those options enabling/disabling, but not vhost (not vhost-example).

I think it's sometimes more handy if we can do that by introducing
some vhost-example options, and I guess that's why those options are
given.

In another word, there is nothing wrong about the commit 859b480d5afd,
if you want to "fix" anything here, following commit is something
we need fix:

Fixes: 9fd72e3cbd29 ("examples/vhost: add virtio offload")

Because that commit just partially disables some TSO related features,
letting the virtio net driver goes to the slow path.

> And the vhost example after
> this commit just triggers a slow path of virtio driver. So this fix just
> makes sure vhost example does not go into the slow path by default.

I have made a statement in the first time, that I am not object to
have this patch at all.

Meanwhile, the right "fix" is you need disable all TSO related features
from QEMU, in such way, we should see no such issue from all vhost
application, but not only this one, the one we used mostly internally.

As you can see, it's more about the usage.

> By the way, if a fix patch should only involve those commits it will change?

IMO, logically, yes.

--yliu

[dpdk-dev] [PATCH] rte_delay_us can be replaced with user function

2016-07-20 Thread jozma...@cisco.com

From: Jozef Martiniak 

When running single-core, some drivers tend to call rte_delay_us for a
long time, and that is causing packet drops.
To avoid this, rte_delay_us can be replaced with user-defined delay
function with:

void rte_delay_us_callback_register(void(*userfunc)(unsigned));

When userfunc==NULL, build-in blocking delay function is restored.

Signed-off-by: Jozef Martiniak 
---
 app/test/test_cycles.c | 45 ++
 lib/librte_eal/bsdapp/eal/rte_eal_version.map  |  7 
 lib/librte_eal/common/eal_common_timer.c   | 22 ++-
 lib/librte_eal/common/include/generic/rte_cycles.h | 15 +++-
 lib/librte_eal/linuxapp/eal/rte_eal_version.map|  7 
 5 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/app/test/test_cycles.c b/app/test/test_cycles.c
index f6c043a..7415ac8 100644
--- a/app/test/test_cycles.c
+++ b/app/test/test_cycles.c
@@ -90,3 +90,48 @@ test_cycles(void)
 }

 REGISTER_TEST_COMMAND(cycles_autotest, test_cycles);
+
+/*
+ * rte_delay_us_callback test
+ *
+ * - check if callback is correctly registered/unregistered
+ *
+ */
+
+static int pattern;
+static void my_rte_delay_us(unsigned us)
+{
+pattern += us;
+}
+
+static int
+test_user_delay_us(void)
+{
+pattern = 0;
+
+rte_delay_us(2);
+if (pattern != 0)
+return -1;
+
+/* register custom delay function */
+rte_delay_us_callback_register(my_rte_delay_us);
+
+rte_delay_us(2);
+if (pattern != 2)
+return -1;
+
+rte_delay_us(3);
+if (pattern != 5)
+return -1;
+
+/* restore original delay function */
+rte_delay_us_callback_register(NULL);
+
+rte_delay_us(3);
+if (pattern != 5)
+return -1;
+
+return 0;
+}
+
+REGISTER_TEST_COMMAND(user_delay_us, test_user_delay_us);
diff --git a/lib/librte_eal/bsdapp/eal/rte_eal_version.map 
b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
index a335e04..2f83ed0 100644
--- a/lib/librte_eal/bsdapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
@@ -162,3 +162,10 @@ DPDK_16.07 {
rte_thread_setname;

 } DPDK_16.04;
+
+DPDK_16.11 {
+   global:
+
+   rte_delay_us_callback_register;
+
+} DPDK_16.07;
diff --git a/lib/librte_eal/common/eal_common_timer.c 
b/lib/librte_eal/common/eal_common_timer.c
index c4227cd..ddfa9d1 100644
--- a/lib/librte_eal/common/eal_common_timer.c
+++ b/lib/librte_eal/common/eal_common_timer.c
@@ -47,8 +47,11 @@
 /* The frequency of the RDTSC timer resolution */
 static uint64_t eal_tsc_resolution_hz;

-void
-rte_delay_us(unsigned us)
+/* Pointer to user delay function */
+void (*rte_delay_us)(unsigned) = NULL;
+
+static void
+rte_delay_us_block(unsigned us)
 {
const uint64_t start = rte_get_timer_cycles();
const uint64_t ticks = (uint64_t)us * rte_get_timer_hz() / 1E6;
@@ -84,3 +87,18 @@ set_tsc_freq(void)
RTE_LOG(DEBUG, EAL, "TSC frequency is ~%" PRIu64 " KHz\n", freq / 1000);
eal_tsc_resolution_hz = freq;
 }
+
+void rte_delay_us_callback_register(void (*userfunc)(unsigned))
+{
+if (userfunc == NULL)
+rte_delay_us = rte_delay_us_block;
+else
+rte_delay_us = userfunc;
+}
+
+static void __attribute__((constructor))
+rte_timer_init(void)
+{
+/* set rte_delay_us_block as a delay function */
+rte_delay_us_callback_register(NULL);
+}
diff --git a/lib/librte_eal/common/include/generic/rte_cycles.h 
b/lib/librte_eal/common/include/generic/rte_cycles.h
index 8cc21f2..7a45b58 100644
--- a/lib/librte_eal/common/include/generic/rte_cycles.h
+++ b/lib/librte_eal/common/include/generic/rte_cycles.h
@@ -182,13 +182,16 @@ rte_get_timer_hz(void)
 }

 /**
+ *
  * Wait at least us microseconds.
+ * This function can be replaced with user-defined function using
+ * rte_delay_us_callback_register
  *
  * @param us
  *   The number of microseconds to wait.
  */
 void
-rte_delay_us(unsigned us);
+(*rte_delay_us)(unsigned us);

 /**
  * Wait at least ms milliseconds.
@@ -202,4 +205,14 @@ rte_delay_ms(unsigned ms)
rte_delay_us(ms * 1000);
 }

+/**
+ * Replace rte_delay_us with user defined function.
+ *
+ * @param userfunc
+ *   User function which replaces rte_delay_us. NULL restores
+ *   buildin block delay function.
+ */
+void rte_delay_us_callback_register(void(*userfunc)(unsigned));
+
+
 #endif /* _RTE_CYCLES_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map 
b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
index db8c984..4ba30ed 100644
--- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
@@ -166,3 +166,10 @@ DPDK_16.07 {
rte_thread_setname;

 } DPDK_16.04;
+
+DPDK_16.11 {
+   global:
+
+   rte_delay_us_callback_register;
+
+} DPDK_16.07;
-- 
2.1.4

[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-20 Thread Richardson, Bruce



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Wednesday, July 20, 2016 2:37 PM
> To: Zoltan Kiss ; Zoltan Kiss
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mempool: adjust name string size in
> related data types
> 
> Hi,
> 
> On 07/20/2016 02:41 PM, Zoltan Kiss wrote:
> >
> >
> > On 19/07/16 17:17, Olivier Matz wrote:
> >> Hi Zoltan,
> >>
> >> On 07/19/2016 05:59 PM, Zoltan Kiss wrote:
> >>>
> >>>
> >>> On 19/07/16 16:37, Olivier Matz wrote:
>  Hi Zoltan,
> 
>  On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
> > A recent fix brought up an issue about the size of the 'name'
> fields:
> >
> > 85cf0079 mem: avoid memzone/mempool/ring name truncation
> >
> > These relations should be observed:
> >
> > RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
> > strlen(RTE_RING_MZ_PREFIX) RTE_MEMPOOL_NAMESIZE <=
> > RTE_RING_NAMESIZE -
> > strlen(RTE_MEMPOOL_MZ_PREFIX)
> >
> > Setting all of them to 32 hides this restriction from the
> application.
> > This patch increases the memzone string size to accomodate for
> > these prefixes, and the same happens with the ring name string.
> > The ABI needs to be broken to fix this API issue, this way doesn't
> > break applications previously not failing due to the truncating
> > bug now fixed.
> >
> > Signed-off-by: Zoltan Kiss 
> 
>  I agree it is a problem for an application because it cannot know
>  what is the maximum name length. On the other hand, breaking the
>  ABI for this looks a bit overkill. Maybe we could reduce
>  RTE_MEMPOOL_NAMESIZE and RTE_RING_NAMESIZE instead of increasing
>  RTE_MEMZONE_NAMESIZE? That way, we could keep the ABI as is.
> >>>
> >>> But that would break the ABI too, wouldn't it? Unless you keep the
> >>> array the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
> >>
> >> Yes, that was the idea.
> >>
> >>> And even then, the API breaks anyway. There are applications - I
> >>> have at least some - which use all 32 bytes to store the name.
> >>> Decrease that would cause headache to change the naming scheme,
> >>> because it's a 30 character long id, and chopping the last few chars
> >>> would cause name collisions and annoying bugs.
> >>
> >> Before my patch (85cf0079), long names were silently truncated when
> >> mempool created its ring and/or memzones. Now, it returns an error.
> >
> > With 16.04 an application could operate as expected if the first 26
> > character were unique. Your patch revealed the problem that characters
> > after these were left out of the name. Now applications fail where
> > this never been a bug because their naming scheme guarantees the
> > uniqueness on the first 26 chars (or makes it very unlikely) Where the
> > first 26 is not unique, it failed earlier too, because at memzone
> > creation it checks for duplicate names.
> 
> Yes, I understand that there is a behavior change for applications using
> names larger than 26 between 16.04 and 16.07. I also understand that there
> is no way for an application to know what is the maximum usable size for a
> mempool or a ring.
> 
> 
> >> I'm not getting why changing the struct to something like below would
> >> break the API, since it would already return an error today.
> >>
> >>#define RTE_MEMPOOL_NAMESIZE \
> >
> > Wait, this would mean applications need to recompile to use the
> > smaller value. AFAIK that's an ABI break too, right? At the moment I
> > don't see a way to fix this without breaking the ABI
> 
> With this modification, if you don't recompile the application, its
> behavior will still be the same as today -> it will return ENAMETOOLONG.
> If you recompile it, the application will be aware of the maximum length.
> To me, it seems to be a acceptable compromise for this release.
> 
> The patch you're proposing also changes the ABI of librte_ring and
> librte_eal, which cannot be accepted for the release.
> 
> 
> >
> >>(RTE_MEMZONE_NAMESIZE - sizeof(pool_prefix) - sizeof(ring
> prefix))
> >>struct rte_mempool {
> >>union {
> >>  char name[RTE_MEMPOOL_NAMESIZE];
> >>  char pad[32];
> >>};
> >>...
> >>}
> >>
> >> Anyway, it may not be the proper solution since it supposes that a
> >> mempool includes a ring based on a memzone, which is not always true
> >> now with mempool handlers.
> >
> > Oh, as we dug deeper it gets better!
> > Indeed, we don't necessarily have this ring + memzone pair underneath,
> > but the user is not aware of that, and I think we should keep it that
> > way. It should only care that the string passed shouldn't be bigger
> > than a certain amount.
> 
> Yes. What I'm just saying here is that it's not a good solution to write
> in the #define that "a mempool is based on a ring + a memzone", because if
> some someone adds a new mempool handler replacing the ring, and also
> creating a memzone prefix

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Tan, Jianfeng



On 7/20/2016 12:38 PM, Yuanhan Liu wrote:
> On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
>> We find significant perfermance drop introduced by below commit,
>> when vhost example is started with --mergeable 0 and inside vm,
>> kernel virtio-net driver is used to do ip based forwarding.
>>
>> The root cause is that below commit adds support for
>> VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
>> mergeable is disabled, it triggers big_packets path of virtio-net
>> driver. In this path, virtio driver uses 19 desc with 18 4K-sized
>> pages to receive each packet, so that it can receive a big packet
>> with size of 64K. But QEMU only creates 256 desc entries for each
>> vq, which results in that only 13 packets can be received. VM
>> kernel can quickly handle those packets and go to sleep (HLT).
>>
>> As QEMU has no option to set the desc entries of a vq, so here,
>> we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
>> with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
>> disable tso of vhost example, to avoid VM kernel virtio driver
>> go into big_packets path.
>>
>> Fixes: 859b480d5afd ("vhost: add guest offload setting")
> And here you are patching vhost example to try to fix an "issue"
> in vhost lib, this is __logically__ wrong.
>
>   --yliu

This is not an issue from vhost lib's perspective, vhost lib should 
provide all features it supports by default. Applications can 
enable/disable features according to their own requirements. And the 
vhost example after this commit just triggers a slow path of virtio 
driver. So this fix just makes sure vhost example does not go into the 
slow path by default.

By the way, if a fix patch should only involve those commits it will change?

Thanks,
Jianfeng

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Neil Horman

On Wed, Jul 20, 2016 at 07:40:49PM +0200, Thomas Monjalon wrote:
> 2016-07-20 13:09, Neil Horman:
> > From: Neil Horman 
> > 
> > John Mcnamara and I were discussing enhacing the validate_abi script to 
> > build
> > the dpdk tree faster with multiple jobs.  Theres no reason not to do it, so 
> > this
> > implements that requirement.  It uses a MAKE_JOBS variable that can be set 
> > by
> > the user to limit the job count.  By default the job count is set to the 
> > number
> > of online cpus.
> 
> Please could you use the variable name DPDK_MAKE_JOBS?
> This name is already used in scripts/test-build.sh.
> 
Sure

> > +if [ -z "$MAKE_JOBS" ]
> > +then
> > +   # This counts the number of cpus on the system
> > +   MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
> > +fi
> 
> Is lscpu common enough?
> 
I'm not sure how to answer that.  lscpu is part of the util-linux package, which
is part of any base install.  Theres a variant for BSD, but I'm not sure how
common it is there.
Neil

> Another acceptable default would be just "-j" without any number.
> It would make the number of jobs unlimited.

[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-20 Thread Zoltan Kiss

On 19/07/16 17:17, Olivier Matz wrote:
> Hi Zoltan,
>
> On 07/19/2016 05:59 PM, Zoltan Kiss wrote:
>>
>>
>> On 19/07/16 16:37, Olivier Matz wrote:
>>> Hi Zoltan,
>>>
>>> On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
 A recent fix brought up an issue about the size of the 'name' fields:

 85cf0079 mem: avoid memzone/mempool/ring name truncation

 These relations should be observed:

 RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
 RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE -
 strlen(RTE_MEMPOOL_MZ_PREFIX)

 Setting all of them to 32 hides this restriction from the application.
 This patch increases the memzone string size to accomodate for these
 prefixes, and the same happens with the ring name string. The ABI
 needs to
 be broken to fix this API issue, this way doesn't break applications
 previously not failing due to the truncating bug now fixed.

 Signed-off-by: Zoltan Kiss 
>>>
>>> I agree it is a problem for an application because it cannot know what
>>> is the maximum name length. On the other hand, breaking the ABI for this
>>> looks a bit overkill. Maybe we could reduce RTE_MEMPOOL_NAMESIZE and
>>> RTE_RING_NAMESIZE instead of increasing RTE_MEMZONE_NAMESIZE? That way,
>>> we could keep the ABI as is.
>>
>> But that would break the ABI too, wouldn't it? Unless you keep the array
>> the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
>
> Yes, that was the idea.
>
>> And even then, the API breaks anyway. There are applications - I have at
>> least some - which use all 32 bytes to store the name. Decrease that
>> would cause headache to change the naming scheme, because it's a 30
>> character long id, and chopping the last few chars would cause name
>> collisions and annoying bugs.
>
> Before my patch (85cf0079), long names were silently truncated when
> mempool created its ring and/or memzones. Now, it returns an error.

With 16.04 an application could operate as expected if the first 26 
character were unique. Your patch revealed the problem that characters 
after these were left out of the name. Now applications fail where this 
never been a bug because their naming scheme guarantees the uniqueness 
on the first 26 chars (or makes it very unlikely)
Where the first 26 is not unique, it failed earlier too, because at 
memzone creation it checks for duplicate names.

>
> I'm not getting why changing the struct to something like below would
> break the API, since it would already return an error today.
>
>#define RTE_MEMPOOL_NAMESIZE \

Wait, this would mean applications need to recompile to use the smaller 
value. AFAIK that's an ABI break too, right? At the moment I don't see a 
way to fix this without breaking the ABI

>(RTE_MEMZONE_NAMESIZE - sizeof(pool_prefix) - sizeof(ring prefix))
>struct rte_mempool {
>union {
>  char name[RTE_MEMPOOL_NAMESIZE];
>  char pad[32];
>};
>...
>}
>
> Anyway, it may not be the proper solution since it supposes that a
> mempool includes a ring based on a memzone, which is not always true now
> with mempool handlers.

Oh, as we dug deeper it gets better!
Indeed, we don't necessarily have this ring + memzone pair underneath, 
but the user is not aware of that, and I think we should keep it that 
way. It should only care that the string passed shouldn't be bigger than 
a certain amount.
Also, even though we don't necessarily have the ring, we still reserve 
memzone's in rte_mempool_populate_default(). And their name has a 3 
letter prefix, and a "_%d" postfix, where the %d could be as much as 
RTE_MAX_MEMZONE in worst case (2560 by default) So actually:

RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE - 
strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen("_2560")

As a side note, there is another bug around here: rte_ring_create() 
doesn't check for name duplications. However the user of the library can 
lookup based on the name with rte_ring_lookup(), and it will return the 
first ring with that name

>
>>> It would even be better to get rid of this static char[] for the
>>> structure names and replace it by an allocated const char *. I didn't
>>> check it's feasible for memzones. What do you think?
>>
>> It would work too, but I don't think it would help a lot. We would still
>> need max sizes for the names. Storing them somewhere else won't help us
>> in this problem.
>
> Why should we have a maximum length for the names?

What happens if an application loads DPDK and create a mempool with a 
name string 2 million characters long? Maybe nothing we should worry 
about, but in general I think unlimited length function parameters are 
problematic at the very least. The length should be passed at least 
(which also creates a max due to the size of the param). But I think it 
would be just easier to have these maximums set, observing the above 
constrains.

>
>
> Thanks,
> Olivier
>

[dpdk-dev] [PATCH] doc: add cryptodev shared library version to release notes

2016-07-20 Thread Pablo de Lara

Signed-off-by: Pablo de Lara 
---
 doc/guides/rel_notes/release_16_07.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/guides/rel_notes/release_16_07.rst 
b/doc/guides/rel_notes/release_16_07.rst
index d3a144f..82ac5bb 100644
--- a/doc/guides/rel_notes/release_16_07.rst
+++ b/doc/guides/rel_notes/release_16_07.rst
@@ -321,6 +321,7 @@ The libraries prepended with a plus sign were incremented 
in this version.
  librte_acl.so.2
  librte_cfgfile.so.2
  librte_cmdline.so.2
+ librte_cryptodev.so.1
  librte_distributor.so.1
  librte_eal.so.2
  librte_hash.so.2
-- 
2.7.4

[dpdk-dev] [PATCH] validate_abi: build faster by augmenting make with job count

2016-07-20 Thread Neil Horman

From: Neil Horman 

John Mcnamara and I were discussing enhacing the validate_abi script to build
the dpdk tree faster with multiple jobs.  Theres no reason not to do it, so this
implements that requirement.  It uses a MAKE_JOBS variable that can be set by
the user to limit the job count.  By default the job count is set to the number
of online cpus.

Signed-off-by: Neil Horman 
CC: Thomas Monjalon 
CC: "Mcnamara, John" 
---
 scripts/validate-abi.sh | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/scripts/validate-abi.sh b/scripts/validate-abi.sh
index c36ad61..1c9627b 100755
--- a/scripts/validate-abi.sh
+++ b/scripts/validate-abi.sh
@@ -97,6 +97,12 @@ fixup_config() {
 #trap on ctrl-c to clean up
 trap cleanup_and_exit SIGINT

+if [ -z "$MAKE_JOBS" ]
+then
+   # This counts the number of cpus on the system
+   MAKE_JOBS=`lscpu -p=cpu | grep -v "#" | wc -l`
+fi
+
 #Save the current branch
 CURRENT_BRANCH=`git branch | grep \* | cut -d' ' -f2`

@@ -183,7 +189,7 @@ log "INFO" "Configuring DPDK $TAG1"
 make config T=$TARGET O=$TARGET > $VERBOSE 2>&1

 log "INFO" "Building DPDK $TAG1. This might take a moment"
-make O=$TARGET > $VERBOSE 2>&1
+make -j$MAKE_JOBS O=$TARGET > $VERBOSE 2>&1

 if [ $? -ne 0 ]
 then
@@ -214,7 +220,7 @@ log "INFO" "Configuring DPDK $TAG2"
 make config T=$TARGET O=$TARGET > $VERBOSE 2>&1

 log "INFO" "Building DPDK $TAG2. This might take a moment"
-make O=$TARGET > $VERBOSE 2>&1
+make -j$MAKE_JOBS O=$TARGET > $VERBOSE 2>&1

 if [ $? -ne 0 ]
 then
-- 
2.5.5

[dpdk-dev] [PATCH 01/10] ethdev: add a generic flow and new behavior switch to fdir

2016-07-20 Thread Thomas Monjalon

2016-02-26 01:17, Wu, Jingjing:
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > 2016-02-25 15:03, Rahul Lakkireddy:
> > > On Wednesday, February 02/24/16, 2016 at 14:17:58 -0800, Thomas Monjalon 
> > > wrote:
> > > > > In our case, it is possible to match several flows
> > > > > in a single rule.  For example, it's possible to set an ethernet, 
> > > > > vlan,
> > > > > ip and tcp/udp flows all in a single rule.  We can specify all of 
> > > > > these
> > > > > flows in a single raw input flow, which can then be passed to cxgbe 
> > > > > flow
> > > > > director to set the corresponding filter.
> > > >
> > > > I feel we need to define what is an API.
> > > > If the application wants to call something specific to the NIC, why 
> > > > using
> > > > the ethdev API? You just have to include cxgbe.h.
> > >
> > > Well, in that sense, flow-director is also very intel specific, no ?
> > 
> > Yes. I think the term "flow director" comes from Intel.
> > 
> > > What we are trying to do is make flow-director generic
> > 
> > So let's stop calling it flow director.
> > We are talking about filtering, right?
> 
> Are you suggesting chelsio to define a new filter type?
> 
> > Why is it so complex? We are talking about packet filtering, not rocket 
> > science!
> >
> The complex is due to different NICs different behavior :-)
> As I know, it is a common way to use used-define data pass specific infor to 
> driver.
> 
> Even flow director is concept from Intel's NIC, but I think it is the generic 
> one comparing
> with other kinds of filters. So I think that's why Rahul choose it to add 
> their kind of filters.
> As I know enic driver also uses flow director API to support their filters.
[...]
> Any better suggestion?

We have a more generic proposal now:
http://dpdk.org/ml/archives/dev/2016-July/043365.html
Rahul, does it fit your needs?

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-20 Thread Adrien Mazarguil

Hi Wenzhuo,

On Wed, Jul 20, 2016 at 02:16:51AM +, Lu, Wenzhuo wrote:
[...]
> > So, today an application cannot combine N-tuple and FDIR flow rules and get 
> > a
> > reliable outcome, unless it is designed for specific devices with a known
> > behavior.
> > 
> > > What's the right behavior of PMD if APP want to create a flow director 
> > > rule
> > which has a higher or even equal priority than an existing n-tuple rule? 
> > Should
> > PMD return fail?
> > 
> > First remember applications only deal with the generic API, PMDs are
> > responsible for choosing the most appropriate HW implementation to use
> > according to the requested flow rules (FDIR, N-tuple or anything else).
> > 
> > For the specific case of FDIR vs N-tuple, if the underlying HW supports 
> > both I do
> > not see why the PMD would create a N-tuple rule. Doesn't FDIR support
> > everything N-tuple can do and much more?
> Talking about the filters, fdir can cover n-tuple. I think that's why i40e 
> only supports fdir but not n-tuple. But n-tuple has its own highlight. As we 
> know, at least on intel NICs, fdir only supports per device mask. But n-tuple 
> can support per rule mask.
> As every pattern has spec and mask both, we cannot guarantee the masks are 
> same. I think ixgbe will try to use n-tuple first if can. Because even the 
> masks are different, we can support them all.

OK, makes sense. In that case existing rules may indeed prevent subsequent
ones from getting created if their priority is wrong. I do not think there
is a way around that if the application needs this exact ordering.

> > Assuming such a thing happened anyway, that the PMD had to create a rule
> > using a high priority filter type and that the application requests the 
> > creation of a
> > rule that can only be done using a lower priority filter type, but also 
> > requested a
> > higher priority for that rule, then yes, it should obviously fail.
> > 
> > That is, unless the PMD can perform some kind of workaround to have both.
> > 
> > > If so, do we need more fail reasons? According to this RFC, I think we 
> > > need
> > return " EEXIST: collision with an existing rule. ", but it's not very 
> > clear, APP
> > doesn't know the problem is priority, maybe more detailed reason is helpful.
> > 
> > Possibly, I've defined a basic set of errors, there are quite a number of 
> > errno
> > values to choose from. However I think we should not define too many values.
> > In my opinion the basic set covers every possible failure:
> > 
> > - EINVAL: invalid format, rule is broken or cannot be understood by the PMD
> >   anyhow.
> > 
> > - ENOTSUP: pattern/actions look fine but something in the requested rule is
> >   not supported and thus cannot be applied.
> > 
> > - EEXIST: pattern/actions are fine and could have been applied if only some
> >   other rule did not prevent the PMD to do it (I see it as the closest thing
> >   to "ETOOBAD" which unfortunately does not exist).
> > 
> > - ENOMEM: like EEXIST, except it is due to the lack of resources not because
> >   of another rule. I wasn't sure which of ENOMEM or ENOSPC was better but
> >   settled on ENOMEM as it is well known. Still open to debate.
> > 
> > Errno values are only useful to get a rough idea of the reason, and another
> > mechanism is needed to pinpoint the exact problem for debugging/reporting
> > purposes, something like:
> > 
> >  enum rte_flow_error_type {
> >  RTE_FLOW_ERROR_TYPE_NONE,
> >  RTE_FLOW_ERROR_TYPE_UNKNOWN,
> >  RTE_FLOW_ERROR_TYPE_PRIORITY,
> >  RTE_FLOW_ERROR_TYPE_PATTERN,
> >  RTE_FLOW_ERROR_TYPE_ACTION,
> >  };
> > 
> >  struct rte_flow_error {
> >  enum rte_flow_error_type type;
> >  void *offset; /* Points to the exact pattern item or action. */
> >  const char *message;
> >  };
> When we are using a CLI and it fails, normally it will let us know which 
> parameter is not appropriate. So, I think it?s a good idea to have this error 
> structure :)

Agreed.

> > Then either provide an optional struct rte_flow_error pointer to
> > rte_flow_validate(), or a separate function (rte_flow_analyze()?), since
> > processing this may be quite expensive and applications may not care about 
> > the
> > exact reason.
> Agree the processing may be too expensive. Maybe we can say it's optional to 
> return error details. And that's a good question that what APP should do if 
> creating the rule fails. I believe normally it will choose handle the rule by 
> itself. But I think it's not bad to feedback more. Or even the APP want to 
> adjust the rules, it cannot be an option for lack of info.

All right then, I'll add it to the specification.

 int
 rte_flow_validate(uint8_t port_id,
   const struct rte_flow_pattern *pattern,
   const struct rte_flow_actions *actions,
   struct rte_flow_error *error);

With error possibly NULL if the application does not care. Is it fine for
you?

[...]
> > > > > > - PMDs, no

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Yuanhan Liu

On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
> We find significant perfermance drop introduced by below commit,
> when vhost example is started with --mergeable 0 and inside vm,
> kernel virtio-net driver is used to do ip based forwarding.
> 
> The root cause is that below commit adds support for
> VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
> mergeable is disabled, it triggers big_packets path of virtio-net
> driver. In this path, virtio driver uses 19 desc with 18 4K-sized
> pages to receive each packet, so that it can receive a big packet
> with size of 64K. But QEMU only creates 256 desc entries for each
> vq, which results in that only 13 packets can be received. VM
> kernel can quickly handle those packets and go to sleep (HLT).
> 
> As QEMU has no option to set the desc entries of a vq, so here,
> we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
> with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
> disable tso of vhost example, to avoid VM kernel virtio driver
> go into big_packets path.
> 
> Fixes: 859b480d5afd ("vhost: add guest offload setting")

And here you are patching vhost example to try to fix an "issue"
in vhost lib, this is __logically__ wrong.

--yliu
> 
> Reported-by: Qian Xu 
> Signed-off-by: Jianfeng Tan 
> ---
>  examples/vhost/main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index 3b98f42..92a9823 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -327,6 +327,8 @@ port_init(uint8_t port)
>   if (enable_tso == 0) {
>   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4);
>   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO6);
> + rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_GUEST_TSO4);
> + rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_GUEST_TSO6);
>   }
>  
>   rx_rings = (uint16_t)dev_info.max_rx_queues;
> -- 
> 2.7.4

[dpdk-dev] [PATCH] maintainers: split networking and crypto drivers

2016-07-20 Thread Thomas Monjalon

There are now 2 different sections for drivers/net/ and drivers/crypto/.
It makes possible to declare some dedicated git trees.

Signed-off-by: Thomas Monjalon 
---
 MAINTAINERS | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9c76352..2996b09 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -242,8 +242,8 @@ F: app/test/test_cryptodev*
 F: examples/l2fwd-crypto/


-Drivers

+Networking Drivers
+--

 Link bonding
 M: Declan Doherty 
@@ -386,10 +386,14 @@ F: doc/guides/nics/pcap_ring.rst
 F: app/test/test_pmd_ring.c
 F: app/test/test_pmd_ring_perf.c

-Null PMD
+Null Networking PMD
 M: Tetsuya Mukawa 
 F: drivers/net/null/

+
+Crypto Drivers
+--
+
 Intel AES-NI GCM PMD
 M: Declan Doherty 
 F: drivers/crypto/aesni_gcm/
-- 
2.7.0

[dpdk-dev] [dpdk-announce] IMPORTANT discussion about Rx filtering API

2016-07-20 Thread Thomas Monjalon

Hi,

There is a proposal in progress to totally rework the configuration
of Rx filtering (including flow director, RSS, etc):
http://dpdk.org/ml/archives/dev/2016-July/043365.html
The target is to integrate this API in 16.11 and migrate the drivers
as fast as possible.

There were some good comments already but only from people interested
in Cavium, Mellanox and Intel devices.
Please developers and maintainers from Amazon, Broadcom, Chelsio, Cisco,
Netronome, QLogic and others, we are waiting your feedbacks.
The intent is to be as generic as possible. But if unfortunately this
new API do not interface well with the hardware you care of, it will
be really harder to fix it later.

Last question: can you commit to migrate your driver in the 16.11
or 17.02 timeframe?

Please answer in the original thread, not this one. This is only
an announce to get more attention on the thread
"Generic flow director/filtering/classification API"

[dpdk-dev] [PATCH v2 2/2] bnx2x: fix fw mempool name length

2016-07-20 Thread Rasesh Mody

This patch fixes following error:
   EAL: Detected 36 lcore(s)
   EAL: Probing VFIO support...
   PMD: bnxt_rte_pmd_init() called for (null)
   EAL: PCI device :08:00.0 on NUMA socket 0
   EAL:   probe driver: 14e4:16a1 rte_bnx2x_pmd
   EAL: PCI device :08:00.1 on NUMA socket 0
   EAL:   probe driver: 14e4:16a1 rte_bnx2x_pmd
   Lcore 0: RX port 0
   Lcore 1: RX port 1
   Initializing port 0... EAL: Error - exiting with code: 1
 Cause: Cannot configure device: err=-6, port=0

Fixes: 540a2110 ("bnx2x: driver core")
Fixes: 85cf0079 ("mem: avoid memzone/mempool/ring name truncation")

Signed-off-by: Rasesh Mody 
---
 drivers/net/bnx2x/bnx2x.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bnx2x/bnx2x.c b/drivers/net/bnx2x/bnx2x.c
index a059858..a49a07f 100644
--- a/drivers/net/bnx2x/bnx2x.c
+++ b/drivers/net/bnx2x/bnx2x.c
@@ -8886,7 +8886,7 @@ int bnx2x_alloc_hsi_mem(struct bnx2x_softc *sc)
 /***/

if (bnx2x_dma_alloc(sc, FW_BUF_SIZE, &sc->gz_buf_dma,
- "fw_dec_buf", RTE_CACHE_LINE_SIZE) != 0) {
+ "fw_buf", RTE_CACHE_LINE_SIZE) != 0) {
sc->spq = NULL;
sc->sp = NULL;
sc->eq = NULL;
-- 
1.7.10.3

[dpdk-dev] [PATCH v2 1/2] bnx2x: disable fast path interrupts

2016-07-20 Thread Rasesh Mody

Disable fastpath interrupts and remove unneeded delay in
bnx2x_interrupt_action(). This patch fixes and prevents performance
degradation (upto 50% drop) for BNX2X PMD.

Fixes: 540a2110 ("bnx2x: driver core")

Signed-off-by: Rasesh Mody 
---
 drivers/net/bnx2x/bnx2x.c|2 +-
 drivers/net/bnx2x/bnx2x_ethdev.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x.c b/drivers/net/bnx2x/bnx2x.c
index 95fbad8..a059858 100644
--- a/drivers/net/bnx2x/bnx2x.c
+++ b/drivers/net/bnx2x/bnx2x.c
@@ -4507,7 +4507,7 @@ static void bnx2x_handle_fp_tq(struct bnx2x_fastpath *fp, 
int scan_fp)
}

bnx2x_ack_sb(sc, fp->igu_sb_id, USTORM_ID,
-  le16toh(fp->fp_hc_idx), IGU_INT_ENABLE, 1);
+  le16toh(fp->fp_hc_idx), IGU_INT_DISABLE, 1);
 }

 /*
diff --git a/drivers/net/bnx2x/bnx2x_ethdev.c b/drivers/net/bnx2x/bnx2x_ethdev.c
index c8d2bf2..f97dd5b 100644
--- a/drivers/net/bnx2x/bnx2x_ethdev.c
+++ b/drivers/net/bnx2x/bnx2x_ethdev.c
@@ -107,8 +107,8 @@ bnx2x_interrupt_action(struct rte_eth_dev *dev)

PMD_DEBUG_PERIODIC_LOG(INFO, "Interrupt handled");

-   if (bnx2x_intr_legacy(sc, 0))
-   DELAY_MS(250);
+   bnx2x_intr_legacy(sc, 0);
+
if (sc->periodic_flags & PERIODIC_GO)
bnx2x_periodic_callout(sc);
link_status = REG_RD(sc, sc->link_params.shmem_base +
-- 
1.7.10.3

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Yuanhan Liu

On Wed, Jul 20, 2016 at 10:44:13AM +0800, Tan, Jianfeng wrote:
> Hi Yuanhan,
> 
> On 7/20/2016 9:44 AM, Yuanhan Liu wrote:
> >On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
> >>We find significant perfermance drop introduced by below commit,
> >>when vhost example is started with --mergeable 0 and inside vm,
> >>kernel virtio-net driver is used to do ip based forwarding.
> >>
> >>The root cause is that below commit adds support for
> >>VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
> >>mergeable is disabled, it triggers big_packets path of virtio-net
> >>driver. In this path, virtio driver uses 19 desc with 18 4K-sized
> >>pages to receive each packet, so that it can receive a big packet
> >>with size of 64K. But QEMU only creates 256 desc entries for each
> >>vq, which results in that only 13 packets can be received. VM
> >>kernel can quickly handle those packets and go to sleep (HLT).
> >>
> >>As QEMU has no option to set the desc entries of a vq, so here,
> >>we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
> >>with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
> >>disable tso of vhost example, to avoid VM kernel virtio driver
> >>go into big_packets path.
> >>
> >>Fixes: 859b480d5afd ("vhost: add guest offload setting")
> >>
> >>Reported-by: Qian Xu 
> >>Signed-off-by: Jianfeng Tan 
> >We could apply this patch, but I don't think it actually fix anything:
> >
> >- it doesn't fix other vhost applications, say OVS, which is for sure
> >   way more widly used than vhost-example.
> 
> If I remember it correctly, OVS will enable mergeable.

Yes, and actually, vhost-example also should have enabled it by default.
Meanwhile, all features could be enabled/disabled by user.

> >
> >- it doesn't even fix it when tso is enabled and mergeable-rx is disabled
> >   with this vhost-example.
> 
> But we'd better avoid users go into such doubt that performance drops
> because of that commit under the case tso=off,mergeable=off, right?

I doubt people would actually use vhost-example (besides developer like
us), meaning they can NOT see the benifit from this patch; it also means
that user __does__ go into doubt that performance drops for the case
tso=off,mergeable=off.

Actually, it looks wrong to me to fiddle with those flags in the vhost-example.
If you want to disable tso, you should go disable it on the qemu side,
with something like:

csum=off,gso=off,guest_tso4=off,guest_tso6=off,...

--yliu

[dpdk-dev] [PATCH v4 00/10] Fix build errors related to exported headers

2016-07-20 Thread Thomas Monjalon

2016-07-15 22:03, Bruce Richardson:
> On Wed, Jul 13, 2016 at 03:02:37PM +0200, Adrien Mazarguil wrote:
> > DPDK uses GNU C language extensions in most of its code base. This is fine
> > for internal source files whose compilation flags are controlled by DPDK,
> > however user applications that use exported "public" headers may experience
> > compilation failures when enabling strict error/standard checks (-std and
> > -pedantic for instance).
> > 
> > Exported headers are installed system-wide and must be as clean as possible
> > so applications do not have to resort to workarounds.
> > 
> > This patchset affects exported headers only, compilation problems are
> > addressed as follows:
> > 
> > - Adding the __extension__ keyword to nonstandard constructs (same method
> >   as existing libraries when there is no other choice).
> > - Adding the __extension__ keyword to C11 constructs to remain compatible
> >   with pure C99.
> > - Adding missing includes so exported files can be included out of order
> >   and on their own.
> > - Fixing GNU printf-like variadic macros as there is no magic keyword for
> >   these.
> > 
> 
> Having upgraded to Fedora 24, I'm seeing quite a few errors compiling with gcc
> 6.1.1 in debug mode. Applying this patchset seems to really cut down on those
> errors, so may need to be applied for 16.07 release.

It is reducing the number of warnings but do not completely solve it, right?
It is a very good patchset but it needs to be validated with a large number
of compilers and options. That's why I think it is too late for 16.07.

[dpdk-dev] [PATCH] vhost: fix driver unregister for client mode

2016-07-20 Thread Ilya Maximets

Currently while calling of 'rte_vhost_driver_unregister()' connection
to QEMU will not be closed. This leads to inability to register driver
again and reconnect to same virtual machine.

This scenario is reproducible with OVS. While executing of the following
command vhost port will be re-created (will be executed
'rte_vhost_driver_register()' followed by 'rte_vhost_driver_unregister()')
network will be broken and QEMU possibly will crash:

ovs-vsctl set Interface vhost1 ofport_request=15

Fix this by closing all established connections on driver unregister and
removing of pending connections from reconnection list.

Fixes: 64ab701c3d1e ("vhost: add vhost-user client mode")
Signed-off-by: Ilya Maximets 
---

Patch prepared for master branch because dpdk-next-virtio doesn't contain
commit acbff5c67ea7 ("vhost: fix crash when exceeding file descriptors").
Porting to dpdk-next-virtio/master is trivial and may be performed on
demand.

 lib/librte_vhost/vhost_user/fd_man.c | 15 ++--
 lib/librte_vhost/vhost_user/fd_man.h |  2 +-
 lib/librte_vhost/vhost_user/vhost-net-user.c | 56 
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/vhost_user/fd_man.c 
b/lib/librte_vhost/vhost_user/fd_man.c
index c691339..2d3eeb7 100644
--- a/lib/librte_vhost/vhost_user/fd_man.c
+++ b/lib/librte_vhost/vhost_user/fd_man.c
@@ -132,8 +132,10 @@ fdset_init(struct fdset *pfdset)
if (pfdset == NULL)
return;

-   for (i = 0; i < MAX_FDS; i++)
+   for (i = 0; i < MAX_FDS; i++) {
pfdset->fd[i].fd = -1;
+   pfdset->fd[i].dat = NULL;
+   }
pfdset->num = 0;
 }

@@ -166,14 +168,16 @@ fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, fd_cb 
wcb, void *dat)

 /**
  *  Unregister the fd from the fdset.
+ *  Returns context of a given fd or NULL.
  */
-void
+void *
 fdset_del(struct fdset *pfdset, int fd)
 {
int i;
+   void *dat = NULL;

if (pfdset == NULL || fd == -1)
-   return;
+   return NULL;

do {
pthread_mutex_lock(&pfdset->fd_mutex);
@@ -181,13 +185,17 @@ fdset_del(struct fdset *pfdset, int fd)
i = fdset_find_fd(pfdset, fd);
if (i != -1 && pfdset->fd[i].busy == 0) {
/* busy indicates r/wcb is executing! */
+   dat = pfdset->fd[i].dat;
pfdset->fd[i].fd = -1;
pfdset->fd[i].rcb = pfdset->fd[i].wcb = NULL;
+   pfdset->fd[i].dat = NULL;
pfdset->num--;
i = -1;
}
pthread_mutex_unlock(&pfdset->fd_mutex);
} while (i != -1);
+
+   return dat;
 }

 /**
@@ -203,6 +211,7 @@ fdset_del_slot(struct fdset *pfdset, int index)

pfdset->fd[index].fd = -1;
pfdset->fd[index].rcb = pfdset->fd[index].wcb = NULL;
+   pfdset->fd[index].dat = NULL;
pfdset->num--;

pthread_mutex_unlock(&pfdset->fd_mutex);
diff --git a/lib/librte_vhost/vhost_user/fd_man.h 
b/lib/librte_vhost/vhost_user/fd_man.h
index 74ecde2..bd66ed1 100644
--- a/lib/librte_vhost/vhost_user/fd_man.h
+++ b/lib/librte_vhost/vhost_user/fd_man.h
@@ -60,7 +60,7 @@ void fdset_init(struct fdset *pfdset);
 int fdset_add(struct fdset *pfdset, int fd,
fd_cb rcb, fd_cb wcb, void *dat);

-void fdset_del(struct fdset *pfdset, int fd);
+void *fdset_del(struct fdset *pfdset, int fd);

 void fdset_event_dispatch(struct fdset *pfdset);

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index f0ea3a2..f0f92f8 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -60,6 +60,7 @@
 struct vhost_user_socket {
char *path;
int listenfd;
+   int connfd;
bool is_server;
bool reconnect;
 };
@@ -277,11 +278,13 @@ vhost_user_add_connection(int fd, struct 
vhost_user_socket *vsocket)

RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);

+   vsocket->connfd = fd;
conn->vsocket = vsocket;
conn->vid = vid;
ret = fdset_add(&vhost_user.fdset, fd, vhost_user_msg_handler,
NULL, conn);
if (ret < 0) {
+   vsocket->connfd = -1;
free(conn);
close(fd);
RTE_LOG(ERR, VHOST_CONFIG,
@@ -329,6 +332,7 @@ vhost_user_msg_handler(int connfd, void *dat, int *remove)
RTE_LOG(ERR, VHOST_CONFIG,
"vhost read incorrect message\n");

+   vsocket->connfd = -1;
close(connfd);
*remove = 1;
free(conn);
@@ -635,6 +639,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
goto out;
memset(vsocket, 0, sizeof(struct vhost_user_socket))

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Tan, Jianfeng

Hi Yuanhan,

On 7/20/2016 9:44 AM, Yuanhan Liu wrote:
> On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
>> We find significant perfermance drop introduced by below commit,
>> when vhost example is started with --mergeable 0 and inside vm,
>> kernel virtio-net driver is used to do ip based forwarding.
>>
>> The root cause is that below commit adds support for
>> VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
>> mergeable is disabled, it triggers big_packets path of virtio-net
>> driver. In this path, virtio driver uses 19 desc with 18 4K-sized
>> pages to receive each packet, so that it can receive a big packet
>> with size of 64K. But QEMU only creates 256 desc entries for each
>> vq, which results in that only 13 packets can be received. VM
>> kernel can quickly handle those packets and go to sleep (HLT).
>>
>> As QEMU has no option to set the desc entries of a vq, so here,
>> we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
>> with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
>> disable tso of vhost example, to avoid VM kernel virtio driver
>> go into big_packets path.
>>
>> Fixes: 859b480d5afd ("vhost: add guest offload setting")
>>
>> Reported-by: Qian Xu 
>> Signed-off-by: Jianfeng Tan 
> We could apply this patch, but I don't think it actually fix anything:
>
> - it doesn't fix other vhost applications, say OVS, which is for sure
>way more widly used than vhost-example.

If I remember it correctly, OVS will enable mergeable.

>
> - it doesn't even fix it when tso is enabled and mergeable-rx is disabled
>with this vhost-example.

But we'd better avoid users go into such doubt that performance drops 
because of that commit under the case tso=off,mergeable=off, right?

Thanks,
Jianfeng

>
> Thanks for the good root-cause, btw!
>
>   --yliu

[dpdk-dev] [PATCH] eal: fix rte_intr_dp_is_en() check

2016-07-20 Thread Thomas Monjalon

2016-07-20 07:03, Liang, Cunming:
>Probably a clean way is not to handle device external interrupt event in 
> EAL
>interrupt thread (intr mb may have some problem). The EAL interrupt thread 
> is
>only used to postpone the delay execution or other background interrupt
>(e.g. alarm). Then misc/non-misc can be combined,  while requiring APP to
>detect the interrupt causes. 

I am not sure it was a good idea to have a thread for the link interrupt.
It may be simpler and cleaner to let the application do the pthread_create.

[dpdk-dev] [PATCH v2] doc: announce ABI change for mbuf structure

2016-07-20 Thread Ferruh Yigit

On 7/20/2016 8:16 AM, Olivier Matz wrote:
> For 16.11, the mbuf structure will be modified implying ABI breakage.
> Some discussions already took place here:
> http://www.dpdk.org/dev/patchwork/patch/12878/
> 
> Signed-off-by: Olivier Matz 
> ---
> 
> v1->v2:
> - reword the sentences to keep things more open, as suggested by Bruce
> 
>  doc/guides/rel_notes/deprecation.rst | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst 
> b/doc/guides/rel_notes/deprecation.rst
> index f502f86..b9f5a93 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -41,3 +41,9 @@ Deprecation Notices
>  * The mempool functions for single/multi producer/consumer are deprecated and
>will be removed in 16.11.
>It is replaced by rte_mempool_generic_get/put functions.
> +
> +* ABI changes are planned for 16.11 in the ``rte_mbuf`` structure: some 
> fields
> +  may be reordered to facilitate the writing of ``data_off``, ``refcnt``, and
> +  ``nb_segs`` in one operation, because some platforms have an overhead if 
> the
> +  store address is not naturally aligned. Other mbuf fields, such as the
> +  ``port`` field, may be moved or removed as part of this mbuf work.
> 

Not directly related to this patch, but generally for deprecation
notices, does it make sense to tag explicitly which library effected, like:

* librte_mbuf [perhaps with version here]:
  Explanation about deprecation ...

For this case it is more clear which library effected, but sometimes
that is not obvious from deprecation notice.

Also when checked for if specific library effected, it is harder to find
with current notes.

Thanks,
ferruh

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Yuanhan Liu

On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
> We find significant perfermance drop introduced by below commit,
> when vhost example is started with --mergeable 0 and inside vm,
> kernel virtio-net driver is used to do ip based forwarding.
> 
> The root cause is that below commit adds support for
> VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when
> mergeable is disabled, it triggers big_packets path of virtio-net
> driver. In this path, virtio driver uses 19 desc with 18 4K-sized
> pages to receive each packet, so that it can receive a big packet
> with size of 64K. But QEMU only creates 256 desc entries for each
> vq, which results in that only 13 packets can be received. VM
> kernel can quickly handle those packets and go to sleep (HLT).
> 
> As QEMU has no option to set the desc entries of a vq, so here,
> we disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6
> with VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we
> disable tso of vhost example, to avoid VM kernel virtio driver
> go into big_packets path.
> 
> Fixes: 859b480d5afd ("vhost: add guest offload setting")
> 
> Reported-by: Qian Xu 
> Signed-off-by: Jianfeng Tan 

We could apply this patch, but I don't think it actually fix anything:

- it doesn't fix other vhost applications, say OVS, which is for sure
  way more widly used than vhost-example.

- it doesn't even fix it when tso is enabled and mergeable-rx is disabled
  with this vhost-example.

Thanks for the good root-cause, btw!

--yliu

[dpdk-dev] [PATCH v2] doc: announce ABI change for mbuf structure

2016-07-20 Thread Olivier Matz

For 16.11, the mbuf structure will be modified implying ABI breakage.
Some discussions already took place here:
http://www.dpdk.org/dev/patchwork/patch/12878/

Signed-off-by: Olivier Matz 
---

v1->v2:
- reword the sentences to keep things more open, as suggested by Bruce

 doc/guides/rel_notes/deprecation.rst | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index f502f86..b9f5a93 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -41,3 +41,9 @@ Deprecation Notices
 * The mempool functions for single/multi producer/consumer are deprecated and
   will be removed in 16.11.
   It is replaced by rte_mempool_generic_get/put functions.
+
+* ABI changes are planned for 16.11 in the ``rte_mbuf`` structure: some fields
+  may be reordered to facilitate the writing of ``data_off``, ``refcnt``, and
+  ``nb_segs`` in one operation, because some platforms have an overhead if the
+  store address is not naturally aligned. Other mbuf fields, such as the
+  ``port`` field, may be moved or removed as part of this mbuf work.
-- 
2.8.1

[dpdk-dev] [PATCH] doc: note a pitfall on reconnect feature

2016-07-20 Thread Yuanhan Liu

On Tue, Jul 19, 2016 at 01:57:11PM +, Mcnamara, John wrote:
> > -  Note: the "reconnect" feature requires **QEMU v2.7** (or above).
> > +  .. Note::
> > + * The "reconnect" feature requires **QEMU v2.7** (or above).
> > +
> > + * The vhost supported features must be exactly the same before and
> > +   after the restart. For example, if TSO is disabled and then
> > enabled,
> > +   nothing will work and issues undefined might happen.
> 
> s/ issues undefined / undefined issues /

Thomas, mind to fix it while apply?

> Apart from that:
> 
> Acked-by: John McNamara 

Thanks!

--yliu

[dpdk-dev] SRIOV hot unplug

2016-07-20 Thread Eli Britstein

Integrating Tetsuya's last comments on top of the latest message and reply.

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Eli Britstein
> Sent: Tuesday, 19 July, 2016 5:34 PM
> To: Declan Doherty; Tetsuya Mukawa; dev at dpdk.org
> Cc: Iremonger, Bernard
> Subject: Re: [dpdk-dev] SRIOV hot unplug
>
>
>
> > -Original Message-
> > From: Declan Doherty [mailto:declan.doherty at intel.com]
> > Sent: Tuesday, 19 July, 2016 5:15 PM
> > To: Eli Britstein; Tetsuya Mukawa; dev at dpdk.org
> > Cc: Iremonger, Bernard
> > Subject: Re: [dpdk-dev] SRIOV hot unplug
> >
> > On 07/19/2016 12:18 PM, Eli Britstein wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Tetsuya Mukawa [mailto:mukawa at igel.co.jp]
> > >> Sent: Tuesday, 19 July, 2016 5:49 AM
> > >> To: Eli Britstein; dev at dpdk.org
> > >> Cc: Iremonger, Bernard
> > >> Subject: Re: [dpdk-dev] SRIOV hot unplug
> > >>
> > >> Hi Eli,
> > >>
> > >> On 2016/07/18 17:47, Eli Britstein wrote:
> > >>> Hi Bernard,
> > >>>
> > >>> Thank you for your answer.
> > >>> However, to do this, I have to have some communication protocol to
> > >>> the
> > >> VM's application in order for it to do this sequence and
> > >> acknowledge that it is now safe to proceed with detaching the device.
> > >>> This implies some kind of integration from the host side, which I
> > >>> would like
> > >> to avoid.
> > >>
> > >> I guess you should have some kind of communication channel to
> > >> notice the hotpluging events from host to VM.
> > > [Eli Britstein]
> > > In order just to notice the hotplugging events inside the VM, I can
> > > use add some udev action in the VM, in /etc/udev/rules.d/XXX
> > > However,
> > those are asynchronous events. The host proceeds with unplugging
> > without waiting for the VM to acknowledge it.
> > >
[Tetsuya] Yes, it will be asynchronous.

> > >>
> > >>> Do you think might there be any other way for the application to
> > >>> handle
> > >> such event in a smooth way?
> > >>
> > >> So far, I guess having one more virtio-net device will be easiest way.
> > > [Eli Britstein]
> > > Could you please elaborate your meaning? How do you mean to use this
> > extra virtio-net device?
> > >
[Tetsuya] If you have one more network device, you can send and receive some 
kind of control messages between host and VM.
It may be okay to use general socket API for this additional network device. So 
I guess it's not so difficult to additionally implement it.

[Eli Britstein] Again, my goal is to do the hot unplug *without* any 
change/integration in the host side, just from the guest side.

Thanks, Eli

Thanks,
Tetsuya
> > > To clarify: I would like to have my bond device automatically set
> > > the vNIC as
> > its primary, and close/remove the VF before it is unplugged.
> > >
> > > Thanks,
> > > Eli
> >
> > The only that I can think of which would allow bonding to support this
> > is if you could force a link down notification to the VF. The bonding
> > driver should automatically then fail over to the vNIC, but again this
> > is asynchronous event so I'm not sure how you could confirm this
> > action from the host other than by monitoring traffic?
> [Eli Britstein]
> How can I force link down notification to the VF?
>
> >
> > Also I don't know if the DPDK environment has been tested to support
> > the removal of dynamic hot-plugging you are suggesting via the udev
> action.
> > There are no hooks that I'm aware of which would detect that the PCI
> > device is no longer available and then clean the resources away in DPDK.
> >
> > >
> > 
> > >
> >
> > Also could you please drop the footer/disclaimer on replies to the
> > mailing list (see http://dpdk.org/ml)
>
> --
> ---
> This email and any files transmitted and/or attachments with it are
> confidential and proprietary information of Toga Networks Ltd., and
> intended solely for the use of the individual or entity to whom they are
> addressed.
> If you have received this email in error please notify the system manager.
> This message contains confidential information of Toga Networks Ltd., and is
> intended only for the individual named. If you are not the named addressee
> you should not disseminate, distribute or copy this e-mail. Please notify the
> sender immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. If you are not the intended recipient
> you are notified that disclosing, copying, distributing or taking any action 
> in
> reliance on the contents of this information is strictly prohibited.
> --
> --
>
> --
> ---

[dpdk-dev] [PATCH] eal: fix rte_intr_dp_is_en() check

2016-07-20 Thread Liang, Cunming

Hi Thomas,

> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, July 20, 2016 4:42 PM
> To: Liang, Cunming 
> Cc: Yong Wang ; dev at dpdk.org;
> david.marchand at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH] eal: fix rte_intr_dp_is_en() check
> 
> 2016-07-20 07:03, Liang, Cunming:
> >Probably a clean way is not to handle device external interrupt event in 
> > EAL
> >interrupt thread (intr mb may have some problem). The EAL interrupt 
> > thread
> is
> >only used to postpone the delay execution or other background interrupt
> >(e.g. alarm). Then misc/non-misc can be combined,  while requiring APP to
> >detect the interrupt causes.
> 
> I am not sure it was a good idea to have a thread for the link interrupt.
> It may be simpler and cleaner to let the application do the pthread_create.
The EAL intr thread is reserved for all interrupt before, is not design for 
link interrupt
on the first day. Personally I vote to remove link interrupt from interrupt 
thread.
-Cunming

[dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment

2016-07-20 Thread Kerlin, MarcinX

Hi Amin,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Tootoonchian, Amin
> Sent: Tuesday, July 12, 2016 4:01 AM
> To: thomas.monjalon at 6wind.com
> Cc: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] ethdev: ensure consistent port id assignment
> 
> The rte_eth_dev_allocate() code has an implicit assumption that the port id
> assignment in the secondary process is consistent with that of the primary. 
> The
> current code breaks if the enumeration of ethdevs in primary and secondary
> processes are not identical (e.g., when the black/whitelist and vdev args of 
> the
> primary and secondary do not match, or when the primary dynamically
> adds/removes ethdevs).
> 
> To fix this problem, rte_eth_dev_allocate() now looks up port id by name in a
> secondary process (making it explicit that ethdevs can only be created and
> initialized by the primary process). Upon releasing a port, the primary 
> process
> zeros out eth_dev->data to avoid false positives in port id lookup by
> rte_eth_dev_get_port_by_name().
> 
> Signed-off-by: Amin Tootoonchian 
> ---
>  lib/librte_ether/rte_ethdev.c | 44 +
> --
>  1 file changed, 30 insertions(+), 14 deletions(-)
> 
> diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c 
> index
> 0a6e3f1..1801f57 100644
> --- a/lib/librte_ether/rte_ethdev.c
> +++ b/lib/librte_ether/rte_ethdev.c
> @@ -195,25 +195,37 @@ rte_eth_dev_allocate(const char *name, enum
> rte_eth_dev_type type)
>   uint8_t port_id;
>   struct rte_eth_dev *eth_dev;
> 
> - port_id = rte_eth_dev_find_free_port();
> - if (port_id == RTE_MAX_ETHPORTS) {
> - RTE_PMD_DEBUG_TRACE("Reached maximum number of
> Ethernet ports\n");
> - return NULL;
> - }
> -
>   if (rte_eth_dev_data == NULL)
>   rte_eth_dev_data_alloc();
> 
> - if (rte_eth_dev_allocated(name) != NULL) {
> - RTE_PMD_DEBUG_TRACE("Ethernet Device with name %s
> already allocated!\n",
> - name);
> - return NULL;
> + if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> + port_id = rte_eth_dev_find_free_port();
> + if (port_id == RTE_MAX_ETHPORTS) {
> + RTE_PMD_DEBUG_TRACE("Reached maximum number
> of Ethernet ports\n");
> + return NULL;
> + }
> +
> + if (rte_eth_dev_allocated(name) != NULL) {
> + RTE_PMD_DEBUG_TRACE("Ethernet Device with name
> %s already allocated!\n",
> + name);
> + return NULL;
> + }
> + } else {
> + if (rte_eth_dev_get_port_by_name(name, &port_id) != 0) {


I was working also on this problem but I didn't send patch yet, so I did review 
of your code.

Condition (rte_eth_dev_get_port_by_name(name, &port_id) != 0) will always fail.
Secondary process enter here and he will be looking for him name but has not 
yet added
and the application return NULL here e.g. we will run app with device name 
pcap1 but this 
device is not on list and function return null.


> + RTE_PMD_DEBUG_TRACE("Ethernet Device with name
> %s could not be found!\n",
> + name);
> + return NULL;
> + }
>   }
> 
>   eth_dev = &rte_eth_devices[port_id];
>   eth_dev->data = &rte_eth_dev_data[port_id];
> - snprintf(eth_dev->data->name, sizeof(eth_dev->data->name), "%s",
> name);
> - eth_dev->data->port_id = port_id;
> +
> + if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> + snprintf(eth_dev->data->name, sizeof(eth_dev->data->name),
> "%s", name);
> + eth_dev->data->port_id = port_id;
> + }
> +


1. rte_eth_dev_data[port_id] -> port id should be shifted because secondary 
process 
overwrite e.g. first position which is common with primary process, so should 
be add at the end

2. If this condition is true only for primary it means that secondary process 
can't add own name.
So this excludes with above line: "if (rte_eth_dev_get_port_by_name(name, 
&port_id) != 0)"?

I will send also my patch soon and we can compare and prepare a common version. 
We should 
keep in mind also the hot plugging.


>   eth_dev->attached = DEV_ATTACHED;
>   eth_dev->dev_type = type;
>   nb_ports++;
> @@ -293,8 +305,10 @@ rte_eth_dev_init(struct rte_pci_driver *pci_drv,
>   pci_drv->name,
>   (unsigned) pci_dev->id.vendor_id,
>   (unsigned) pci_dev->id.device_id);
> - if (rte_eal_process_type() == RTE_PROC_PRIMARY)
> + if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
>   rte_free(eth_dev->data->dev_private);
> + memset(eth_dev->data, 0, sizeof(*rte_eth_dev_data));
> + }
>   rte_eth_dev_release_port(eth_dev);
>   return diag;
>  }
> @@ -330,8 +344,

[dpdk-dev] [PATCH] doc: fix vhost setup in tep-termination app guide

2016-07-20 Thread Kavanagh, Mark B

>
>Hi Mark,

Hi Jianfeng,

Thanks for your comments - I've responded inline.

Cheers,
Mark

>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mark Kavanagh
>> Sent: Tuesday, July 19, 2016 11:32 PM
>> To: dev at dpdk.org
>> Subject: [dpdk-dev] [PATCH] doc: fix vhost setup in tep-termination app
>> guide
>>
>> - Fix vhost setup flags
>> - Add minor edits to improve readability and consistency
>>
>> Signed-off-by: Mark Kavanagh 
>> ---
>>  doc/guides/sample_app_ug/tep_termination.rst | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>  mode change 100644 => 100755
>> doc/guides/sample_app_ug/tep_termination.rst
>>
>> diff --git a/doc/guides/sample_app_ug/tep_termination.rst
>> b/doc/guides/sample_app_ug/tep_termination.rst
>> old mode 100644
>> new mode 100755
>> index 2d86a03..c3d1e97
>> --- a/doc/guides/sample_app_ug/tep_termination.rst
>> +++ b/doc/guides/sample_app_ug/tep_termination.rst
>> @@ -59,8 +59,8 @@ This allows network isolation, QOS, etc to be provided
>> on a per client basis.
>>  In a typical setup, the network overlay tunnel is terminated at the
>> Virtual/Tunnel End Point (VEP/TEP).
>>  The TEP is normally located at the physical host level ideally in the 
>> software
>> switch.
>>  Due to processing constraints and the inevitable bottleneck that the switch
>> -becomes the ability to offload overlay support features becomes an
>> important requirement.
>> -Intel? XL710 10/40 G Ethernet network card provides hardware filtering
>> +becomes, the ability to offload overlay support features becomes an
>> important requirement.
>> +Intel? XL710 10/40 Gigabit Ethernet network card provides hardware
>> filtering
>>  and offload capabilities to support overlay networks implementations such
>> as MAC in UDP and MAC in GRE.
>>
>>  Sample Code Overview
>> @@ -131,14 +131,14 @@ Compiling the Sample Code
>>
>>  .. code-block:: console
>>
>> -CONFIG_RTE_LIBRTE_VHOST=n
>> +CONFIG_RTE_LIBRTE_VHOST=y
>>
>>  vhost user is turned on by default in the configure file
>> config/common_linuxapp.
>>  To enable vhost cuse, disable vhost user.
>>
>>  .. code-block:: console
>>
>> -CONFIG_RTE_LIBRTE_VHOST_USER=y
>> +CONFIG_RTE_LIBRTE_VHOST_USER=n
>
>Actually, this example does not necessarily disable vhost-user. It can work 
>with vhost-user.

Okay; however, the instruction that I modified relates to the case where vhost 
cuse is the preferred option (or at least that's how it comes across upon 
reading).
With that in mind, CONFIG_RTE_LIBRTE_VHOST_USER should be set to 'n', and not 
'y', as previously described.

>I also notice that there are many comments inside this example to indicate it 
>works with
>vhost-cuse. I think we need to correct this. 

Do you mean to say that the example app does not work with vhost cuse? What 
should be corrected? 

>By the way, vhost-cuse could be deprecated in
>the future, please refer here, 
>http://dpdk.org/ml/archives/dev/2016-July/044080.html

Thanks Jianfeng, I'm aware of this; however, as that will not be the case in 
the v16.07 release, I think that the cuse instructions should still be updated.
>
>Thanks,
>Jianfeng
>
>>
>>   After vhost is enabled and the implementation is selected, build the 
>> vhost
>> library.
>>
>> --
>> 1.9.3

[dpdk-dev] [PATCH] eal: fix rte_intr_dp_is_en() check

2016-07-20 Thread Liang, Cunming

Hi Yong,

> -Original Message-
> From: Yong Wang [mailto:yongwang at vmware.com]
> Sent: Wednesday, July 20, 2016 5:59 AM
> To: Thomas Monjalon 
> Cc: dev at dpdk.org; Liang, Cunming ;
> david.marchand at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH] eal: fix rte_intr_dp_is_en() check
> 
> 
> > On Jul 18, 2016, at 2:21 AM, Thomas Monjalon 
> wrote:
> >
> > Hi Yong,
> >
> > I think the interrupt management should be simpler.
> > If you want to invest some time to rework this API, you
> > are very welcome.
> >
> >
> > 2016-07-18 14:19, Liang, Cunming:
> >> Hi Yong,
> >>
> >> rte_intr_dp_is_en() returns true when rte_intr_efd_enable() (the way to
> >> enable data-path interrupt) sets a number of event fds.
> >> In this case, "intr_conf.rxq=1" configuration causes "nb_efd=1". The
> >> value comes from RTE_MIN($nb_efd, 1) from data-path, but not from link
> >> event.
> >> Per link event, you shouldn't use rte_intr_dp_is_en() as the indication.
> >> As igb_uio only has a single vector, when the conflict(both intr_rxq and
> >> intr_lsc turn on) happens, the intr_rxq has high priority than intr_lsc
> >> as default PMD behavior.
> >> Reference as PG 3.1.9 note in
> >> https://urldefense.proofpoint.com/v2/url?u=http-
> 3A__dpdk.org_doc_guides_prog-5Fguide_env-5Fabstraction-
> 5Flayer.html&d=CwICAg&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-
> uEs&r=44mSO5N5yEs4CeCdtQE0xt0F7J0p67_mApYVAzyYms0&m=xFO005GdZMA
> 9a6IC2kvhKYaXW7Rhl3dlcom5HfOSy2g&s=yERFYJTe6TBWFCltmR3xikmxkLKutbot
> 5BfFxK30tZ0&e=
> >>
> >> Regards,
> >> Cunning
> 
> Thanks Cunming and Thomas for the feedback.  I agree with Thomas that the rx
> queue interrupt APIs could be simplified and it?s not clear to me if the API 
> user
> needs to be concerned about rte_intr_cap_multiple(), rte_intr_allow_others(),
> rte_intr_dp_is_en(), etc. if all he needs is to enable intr with certain # of 
> vectors
> and map the vectors to either efds or callbacks.  I feel it?s simpler just 
> returning
> the number of vectors the system can support if it cannot satisfy nb_efd 
> vectors
> and let the application decide how should this be handled. 

It's welcome to make a simpler intr API. At the beginning of introducing rxq 
intr, it
didn't attract much attention, the coordination problem of EAL intr thread 
remains.

Status quo:

- The 3 APIs you talked is the helper function for PMD developer, not for end 
user.
   EAL provides efds/vectors mapping, hides vfio/uio difference. The 
causes/vectors
   mapping is handled by PMD. Then it's not necessary for the end user to aware
   intr vector, just take aware of event by queue id.

   Your idea on exposing vectors to end user is an option. In terms of it, the 
APP
   has to understand the causes of vectors, misc(e.g. lsc, mb, etc) vs 
non-misc(rxq).
   In particular case(e.g. mb), it seems APP is not the right person to turn 
on/off.
   If all are acceptable, ethdev can provide new API for causes to vector 
mapping.

   There's one note for replacing the 3 API provided to PMD. PMD is blind
   to the user space IO infrastructure, vfio or uio. The way for PMD to 
configure
   mapping for legacy/MSI(e.g. uio case) and MSIX(e.g. vfio case) isn't the 
same.

>The link event and rxq
> interrupt init/uninit path also looks quite different and some unification 
> will be
> helpful.

- 'intr_conf' and interrupt thread for misc
   To turn on lsc intr, it requires 'intr_conf.lsc=1'. It cases ethdev to 
register handler
   for further processing in the interrupt thread. It's not necessary for rxq 
intr.

   Probably a clean way is not to handle device external interrupt event in EAL
   interrupt thread (intr mb may have some problem). The EAL interrupt thread is
   only used to postpone the delay execution or other background interrupt
   (e.g. alarm). Then misc/non-misc can be combined,  while requiring APP to
   detect the interrupt causes. 

> 
> The API documentation can also use some more clarification.  For example,
> without reading the program guide, it?s easy to miss the fact that when the
> device is bound to UIO, the implementation will try to enable rxq instead of 
> lsc
> when both intr_conf.lsc and intr_conf.rxq is set to 1. I can imagine there 
> are cases
> where an application would prefer to enable link event instead of rxq 
> interrupt in
> such cases and currently there is no easy way to achieve that.  Or is there 
> any
> particular reason such a preference is chosen?

In uio, if do care lsc. The suggestion is to only turn on intr_conf.lsc. It's 
just a default
behavior definition for the single vector tradeoff. Which don't want to 
introduce
another configuration. The uio msix patch was expected to upstream on that 
time...
The uio single vector problem probably will remain for a long time. So welcome 
to
rework these stuffs. 

Thanks,
Cunming

> 
> >>
> >> On 7/15/2016 8:36 AM, Yong Wang wrote:
> >>> When binding a device to igb_uio with intr_conf.rxq set to 1, nb_efd
> >>> is 1 (for link event) but rte_intr_dp

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Xu, Qian Q

Tested-by: Qian Xu 

- Test Commit: 8f6f24342281f59de0df7bd976a32f714d39b9a9
- OS/Kernel: Fedora 21/4.1.13
- GCC: gcc (GCC) 4.9.2 20141101 (Red Hat 4.9.2-1)
- CPU: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10
- NIC: Intel(R) Ethernet Controller X710 for 10GbE SFP+
- Total 2 cases, 2 passed, 0 failed. 

Test Case1: Virtio-net IPV4 fwd performance with mergable=off

Summary: 
Launch the vhost-switch sample, and launch VM with 2 virtio-net devices, let 2 
virtio-net run IPV4 fwd, send traffic to the NIC port and let the traffic go 
through 2 virtio-net devices. Check the performance.

Details: 
1. Bind one port to igb_uio. 
2. Run vhost switch sample with mergeable=0, disable mergeable. 
taskset -c 18-19 ./examples/vhost/build/vhost-switch -c 0xc -n 4 --huge-dir 
/mnt/huge --socket-mem 1024,1024 -- -p 0x1 --mergeable 0 --vm2vm 0
3. Launch VM: 
taskset -c 22-23 \
/root/qemu-versions/qemu-2.6.0/x86_64-softmmu/qemu-system-x86_64 -name vm1 \
-cpu host -enable-kvm -m 2048 -object 
memory-backend-file,id=mem,size=2048M,mem-path=/mnt/huge,share=on -numa 
node,memdev=mem -mem-prealloc \
-smp cores=4 -drive file=/home/img/vm1.img  \
-chardev socket,id=char0,path=./vhost-net \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=on \
-chardev socket,id=char1,path=./vhost-net \
-netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \
-device virtio-net-pci,mac=52:54:00:00:00:02,netdev=mynet2,mrg_rxbuf=on \
-netdev tap,id=ipvm1,ifname=tap3,script=/etc/qemu-ifup -device 
rtl8139,netdev=ipvm1,id=net0,mac=00:00:00:00:10:01 \
-vnc :3 -daemonize
4. Set IPV4 fwd rules in VM: 
virtio1=$1
virtio2=$2
systemctl stop firewalld.service
systemctl disable firewalld.service
systemctl stop ip6tables.service
systemctl disable ip6tables.service
systemctl stop iptables.service
systemctl disable iptables.service
systemctl stop NetworkManager.service
echo 1 >/proc/sys/net/ipv4/ip_forward
ip addr add 192.168.1.2/24 dev $virtio1
ip neigh add 192.168.1.1 lladdr 00:00:10:00:24:00 dev $virtio1
ip link set dev $virtio1 up

ip addr add 192.168.2.2/24 dev $virtio2
ip neigh add 192.168.2.1 lladdr 00:00:10:00:24:01 dev $virtio2
ip link set dev $virtio2 up

5. Send traffic to NIC and see the performance back from virtio2. The 
performance is back with the patch. 

Test Case2: Virtio-net IPV4 fwd performance with mergable=on
Similar steps, just one feature set is different, set mergable=1 in the 
vhost-switch sample, then the performance is good as before. 



Thanks
Qian

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Jianfeng Tan
Sent: Tuesday, July 19, 2016 9:53 PM
To: dev at dpdk.org
Cc: yuanhan.liu at linux.intel.com; Wang, Zhihong ; 
Tan, Jianfeng 
Subject: [dpdk-dev] [PATCH] examples/vhost: fix perf regression

We find significant perfermance drop introduced by below commit, when vhost 
example is started with --mergeable 0 and inside vm, kernel virtio-net driver 
is used to do ip based forwarding.

The root cause is that below commit adds support for
VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when mergeable is 
disabled, it triggers big_packets path of virtio-net driver. In this path, 
virtio driver uses 19 desc with 18 4K-sized pages to receive each packet, so 
that it can receive a big packet with size of 64K. But QEMU only creates 256 
desc entries for each vq, which results in that only 13 packets can be 
received. VM kernel can quickly handle those packets and go to sleep (HLT).

As QEMU has no option to set the desc entries of a vq, so here, we disable 
VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6 with VIRTIO_NET_F_HOST_TSO4 
and VIRTIO_NET_F_HOST_TSO6 when we disable tso of vhost example, to avoid VM 
kernel virtio driver go into big_packets path.

Fixes: 859b480d5afd ("vhost: add guest offload setting")

Reported-by: Qian Xu 
Signed-off-by: Jianfeng Tan 
---
 examples/vhost/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 
3b98f42..92a9823 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -327,6 +327,8 @@ port_init(uint8_t port)
if (enable_tso == 0) {
rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4);
rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO6);
+   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_GUEST_TSO4);
+   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_GUEST_TSO6);
}

rx_rings = (uint16_t)dev_info.max_rx_queues;
--
2.7.4

[dpdk-dev] [PATCH] examples/vhost: fix perf regression

2016-07-20 Thread Xu, Qian Q

My comments below. 

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tan, Jianfeng
Sent: Wednesday, July 20, 2016 10:44 AM
To: Yuanhan Liu 
Cc: dev at dpdk.org; Wang, Zhihong 
Subject: Re: [dpdk-dev] [PATCH] examples/vhost: fix perf regression

Hi Yuanhan,

On 7/20/2016 9:44 AM, Yuanhan Liu wrote:
> On Tue, Jul 19, 2016 at 01:53:11PM +, Jianfeng Tan wrote:
>> We find significant perfermance drop introduced by below commit, when 
>> vhost example is started with --mergeable 0 and inside vm, kernel 
>> virtio-net driver is used to do ip based forwarding.
>>
>> The root cause is that below commit adds support for
>> VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6, and when 
>> mergeable is disabled, it triggers big_packets path of virtio-net 
>> driver. In this path, virtio driver uses 19 desc with 18 4K-sized 
>> pages to receive each packet, so that it can receive a big packet 
>> with size of 64K. But QEMU only creates 256 desc entries for each vq, 
>> which results in that only 13 packets can be received. VM kernel can 
>> quickly handle those packets and go to sleep (HLT).
>>
>> As QEMU has no option to set the desc entries of a vq, so here, we 
>> disable VIRTIO_NET_F_GUEST_TSO4 and VIRTIO_NET_F_GUEST_TSO6 with 
>> VIRTIO_NET_F_HOST_TSO4 and VIRTIO_NET_F_HOST_TSO6 when we disable tso 
>> of vhost example, to avoid VM kernel virtio driver go into 
>> big_packets path.
>>
>> Fixes: 859b480d5afd ("vhost: add guest offload setting")
>>
>> Reported-by: Qian Xu 
>> Signed-off-by: Jianfeng Tan 
> We could apply this patch, but I don't think it actually fix anything:
>
> - it doesn't fix other vhost applications, say OVS, which is for sure
>way more widly used than vhost-example.

If I remember it correctly, OVS will enable mergeable.

>
> - it doesn't even fix it when tso is enabled and mergeable-rx is disabled
>with this vhost-example.

But we'd better avoid users go into such doubt that performance drops because 
of that commit under the case tso=off,mergeable=off, right?

Normally, when people enable TSO, they should turn on mergeable, if they don't 
turn on mergeable, then please don't expect high performance, 
so this is not a problem. They may get low performance due to the improper 
settings. 

As to a complete fix for the issue, we may need go back to the TSO feature 
design for vhost, currently, the feature negotiation code is in the 
application, 
but it's better to be considered in the vhost/virtio library so that 
application doesn't need to check/set the feature. But now it's too late for 
the complete fix, 
so the workaround is ok for this release from my view. 

Thanks,
Jianfeng

>
> Thanks for the good root-cause, btw!
>
>   --yliu

[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-20 Thread Lu, Wenzhuo

Hi Adrien,


> -Original Message-
> From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> Sent: Tuesday, July 19, 2016 9:12 PM
> To: Lu, Wenzhuo
> Cc: dev at dpdk.org; Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody;
> Ajit Khaparde; Rahul Lakkireddy; Jan Medala; John Daley; Chen, Jing D; 
> Ananyev,
> Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> Guarch, Pablo; Olga Shern
> Subject: Re: [RFC] Generic flow director/filtering/classification API
> 
> On Tue, Jul 19, 2016 at 08:11:48AM +, Lu, Wenzhuo wrote:
> > Hi Adrien,
> > Thanks for your clarification.  Most of my questions are clear, but still
> something may need to be discussed, comment below.
> 
> Hi Wenzhuo,
> 
> Please see below.
> 
> [...]
> > > > > Requirements for a new API:
> > > > >
> > > > > - Flexible and extensible without causing API/ABI problems for 
> > > > > existing
> > > > >   applications.
> > > > > - Should be unambiguous and easy to use.
> > > > > - Support existing filtering features and actions listed in `Filter 
> > > > > types`_.
> > > > > - Support packet alteration.
> > > > > - In case of overlapping filters, their priority should be well 
> > > > > documented.
> > > > Does that mean we don't guarantee the consistent of priority? The
> > > > priority can
> > > be different on different NICs. So the behavior of the actions  can be
> different.
> > > Right?
> > >
> > > No, the intent is precisely to define what happens in order to get a
> > > consistent result across different devices, and document cases with
> undefined behavior.
> > > There must be no room left for interpretation.
> > >
> > > For example, the API must describe what happens when two overlapping
> > > filters (e.g. one matching an Ethernet header, another one matching
> > > an IP header) match a given packet at a given priority level.
> > >
> > > It is documented in section 4.1.1 (priorities) as "undefined behavior".
> > > Applications remain free to do it and deal with consequences, at
> > > least they know they cannot expect a consistent outcome, unless they
> > > use different priority levels for both rules, see also 4.4.5 (flow rules 
> > > priority).
> > >
> > > > Seems the users still need to aware the some details of the HW? Do
> > > > we need
> > > to add the negotiation for the priority?
> > >
> > > Priorities as defined in this document may not be directly mappable
> > > to HW capabilities (e.g. HW does not support enough priorities, or
> > > that some corner case make them not work as described), in which
> > > case the PMD may choose to simulate priorities (again 4.4.5), as
> > > long as the end result follows the specification.
> > >
> > > So users must not be aware of some HW details, the PMD does and must
> > > perform the needed workarounds to suit their expectations. Users may
> > > only be impacted by errors while attempting to create rules that are
> > > either unsupported or would cause them (or existing rules) to diverge from
> the spec.
> > The problem is sometime the priority of the filters is fixed according
> > to
> > > HW's implementation. For example, on ixgbe, n-tuple has a higher
> > > priority than flow director.
> 
> As a side note I did not know that N-tuple had a higher priority than flow
> director on ixgbe, priorities among filter types do not seem to be documented 
> at
> all in DPDK. This is one of the reasons I think we need a generic API to 
> handle
> flow configuration.
Totally agree with you. We haven't documented the info well enough. And even we 
do that, users have to study the details of every NIC, it can still make the 
filters very hard to use. I believe a generic API is very helpful here :)

> 
> 
> So, today an application cannot combine N-tuple and FDIR flow rules and get a
> reliable outcome, unless it is designed for specific devices with a known
> behavior.
> 
> > What's the right behavior of PMD if APP want to create a flow director rule
> which has a higher or even equal priority than an existing n-tuple rule? 
> Should
> PMD return fail?
> 
> First remember applications only deal with the generic API, PMDs are
> responsible for choosing the most appropriate HW implementation to use
> according to the requested flow rules (FDIR, N-tuple or anything else).
> 
> For the specific case of FDIR vs N-tuple, if the underlying HW supports both 
> I do
> not see why the PMD would create a N-tuple rule. Doesn't FDIR support
> everything N-tuple can do and much more?
Talking about the filters, fdir can cover n-tuple. I think that's why i40e only 
supports fdir but not n-tuple. But n-tuple has its own highlight. As we know, 
at least on intel NICs, fdir only supports per device mask. But n-tuple can 
support per rule mask.
As every pattern has spec and mask both, we cannot guarantee the masks are 
same. I think ixgbe will try to use n-tuple first if can. Because even the 
masks are different, we can support them all.

> 
> Assuming such a thing happened any

[dpdk-dev] [PATCH] net/enic: heed VLAN strip flag in device configure function

2016-07-20 Thread David Harton (dharton)



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of John Daley (johndale)
> Sent: Tuesday, July 19, 2016 6:44 PM
> To: dev at dpdk.org
> Cc: bruce.richardson at intel.com; John Daley (johndale)  cisco.com>
> Subject: [dpdk-dev] [PATCH] net/enic: heed VLAN strip flag in device
> configure function
> 
> The configure function enicpmd_dev_configure() was not paying attention to
> the rxmode VLAN strip bit. Set the VLAN strip mode according to the bit.
> 
> Fixes: fefed3d1e62c ("enic: new driver")
> 
> Signed-off-by: John Daley 
> ---
> 
> Although not critical, it is low risk and would be nice to have in 16_07.
> 
>  drivers/net/enic/enic_ethdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/enic/enic_ethdev.c
> b/drivers/net/enic/enic_ethdev.c index 59082d2..47b07c9 100644
> --- a/drivers/net/enic/enic_ethdev.c
> +++ b/drivers/net/enic/enic_ethdev.c
> @@ -356,6 +356,7 @@ static int enicpmd_dev_configure(struct rte_eth_dev
> *eth_dev)
>   eth_dev->data->dev_conf.rxmode.split_hdr_size);
>   }
> 
> + enicpmd_vlan_offload_set(eth_dev, ETH_VLAN_STRIP_MASK);
>   enic->hw_ip_checksum = eth_dev->data-
> >dev_conf.rxmode.hw_ip_checksum;
>   return 0;
>  }
> --
> 2.7.0

Reviewed-by: David Harton 
Tested-by: David Harton

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-20 Thread Xing, Beilei

Hi Ceara,

> -Original Message-
> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
> Sent: Tuesday, July 19, 2016 10:59 PM
> To: Xing, Beilei 
> Cc: Zhang, Helin ; Wu, Jingjing
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710
> NICs for some RX mbuf sizes
> 
> Hi Beilei,
> 
> I changed the way I run testmpd to:
> 
> testpmd -c 0x331 -w :82:00.0 -w :83:00.0 -- --mbuf-size 1152 --rss-ip 
> -
> -rxq=2 --txpkts 1024 -i
> 
> As far as I understand this will allocate mbufs with the same size I was using
> in my test (--mbuf-size seems to include the mbuf headroom therefore 1152
> = 1024 + 128 headroom).
> 
> testpmd> start tx_first
>   io packet forwarding - CRC stripping disabled - packets/burst=32
>   nb forwarding cores=1 - nb forwarding ports=2
>   RX queues=2 - RX desc=128 - RX free threshold=32
>   RX threshold registers: pthresh=8 hthresh=8 wthresh=0
>   TX queues=1 - TX desc=512 - TX free threshold=32
>   TX threshold registers: pthresh=32 hthresh=0 wthresh=0
>   TX RS bit threshold=32 - TXQ flags=0xf01
> testpmd> show port stats all
> 
>    NIC statistics for port 0
> 
>   RX-packets: 18817613   RX-missed: 5  RX-bytes:  19269115888
>   RX-errors: 0
>   RX-nombuf:  0
>   TX-packets: 18818064   TX-errors: 0  TX-bytes:  19269567464
> 
> ##
> ##
> 
>    NIC statistics for port 1
> 
>   RX-packets: 18818392   RX-missed: 5  RX-bytes:  19269903360
>   RX-errors: 0
>   RX-nombuf:  0
>   TX-packets: 18817979   TX-errors: 0  TX-bytes:  19269479424
> 
> ##
> ##
> 
> Ttraffic is sent/received. However, I couldn't find any way to verify that the
> incoming mbufs actually have the mbuf->hash.rss field set except for starting
> test-pmd with gdb and setting a breakpoint in the io fwd engine. After doing
> that I noticed that none of the incoming packets has the PKT_RX_RSS_HASH
> flag set in ol_flags... I guess for some reason test-pmd doesn't actually
> configure RSS in this case but I fail to see where.
> 

Actually there's a way to check mbuf->hash.rss, you need set forward mode to 
"rxonly", and set verbose to 1.
I run testpmd with the configuration you used, and found i40e RSS works well.
With the following steps, you can see RSS hash value and receive queue, and 
PKT_RX_RSS_HASH is set too.
I think you can use the same way to check what you want.

./testpmd -c f -n 4 -- -i --coremask=0xe --rxq=16 --txq=16 --mbuf-size 
1152 --rss-ip --txpkts 1024
testpmd> set verbose 1
testpmd> set fwd rxonly
testpmd> start
testpmd> port 0/queue 1: received 1 packets
  src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - nb
 - Receive queue=0x1
  PKT_RX_RSS_HASH
port 0/queue 0: received 1 packets
  src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
nb_segs=1 - RSS hash=0x4e949f23 - RSS queue=0x0Unknown packet type
 - Receive queue=0x0
  PKT_RX_RSS_HASH
port 0/queue 8: received 1 packets
  src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
nb_segs=1 - RSS hash=0xa3c78b2b - RSS queue=0x8Unknown packet type
 - Receive queue=0x8
  PKT_RX_RSS_HASH
port 0/queue 5: received 1 packets
  src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
nb_segs=1 - RSS hash=0xe29b3d36 - RSS queue=0x5Unknown packet type
 - Receive queue=0x5
  PKT_RX_RSS_HASH

/Beilei

> Thanks,
> Dumitru
>

[dpdk-dev] [PATCH] doc: fix vhost setup in tep-termination app guide

2016-07-20 Thread Tan, Jianfeng

Hi Mark,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mark Kavanagh
> Sent: Tuesday, July 19, 2016 11:32 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] doc: fix vhost setup in tep-termination app
> guide
> 
> - Fix vhost setup flags
> - Add minor edits to improve readability and consistency
> 
> Signed-off-by: Mark Kavanagh 
> ---
>  doc/guides/sample_app_ug/tep_termination.rst | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>  mode change 100644 => 100755
> doc/guides/sample_app_ug/tep_termination.rst
> 
> diff --git a/doc/guides/sample_app_ug/tep_termination.rst
> b/doc/guides/sample_app_ug/tep_termination.rst
> old mode 100644
> new mode 100755
> index 2d86a03..c3d1e97
> --- a/doc/guides/sample_app_ug/tep_termination.rst
> +++ b/doc/guides/sample_app_ug/tep_termination.rst
> @@ -59,8 +59,8 @@ This allows network isolation, QOS, etc to be provided
> on a per client basis.
>  In a typical setup, the network overlay tunnel is terminated at the
> Virtual/Tunnel End Point (VEP/TEP).
>  The TEP is normally located at the physical host level ideally in the 
> software
> switch.
>  Due to processing constraints and the inevitable bottleneck that the switch
> -becomes the ability to offload overlay support features becomes an
> important requirement.
> -Intel? XL710 10/40 G Ethernet network card provides hardware filtering
> +becomes, the ability to offload overlay support features becomes an
> important requirement.
> +Intel? XL710 10/40 Gigabit Ethernet network card provides hardware
> filtering
>  and offload capabilities to support overlay networks implementations such
> as MAC in UDP and MAC in GRE.
> 
>  Sample Code Overview
> @@ -131,14 +131,14 @@ Compiling the Sample Code
> 
>  .. code-block:: console
> 
> -CONFIG_RTE_LIBRTE_VHOST=n
> +CONFIG_RTE_LIBRTE_VHOST=y
> 
>  vhost user is turned on by default in the configure file
> config/common_linuxapp.
>  To enable vhost cuse, disable vhost user.
> 
>  .. code-block:: console
> 
> -CONFIG_RTE_LIBRTE_VHOST_USER=y
> +CONFIG_RTE_LIBRTE_VHOST_USER=n

Actually, this example does not necessarily disable vhost-user. It can work 
with vhost-user. I also notice that there are many comments inside this example 
to indicate it works with vhost-cuse. I think we need to correct this. By the 
way, vhost-cuse could be deprecated in the future, please refer here, 
http://dpdk.org/ml/archives/dev/2016-July/044080.html

Thanks,
Jianfeng

> 
>   After vhost is enabled and the implementation is selected, build the 
> vhost
> library.
> 
> --
> 1.9.3

[dpdk-dev] [PATCH] mk: fix static link with glibc < 2.17

2016-07-20 Thread Gu, YongjieX

Tested-by: Yongjie Gu 



- Check patch: success

- Apply patch: success

- compilation: success

 OS: fedora20

 GCC: gcc_x86-64, 4.8.3

 ICC: 16.0.2

Commit: 
608487f3fc96704271c624d0f3fe9d7fb2187aea

 i686-native-linuxapp-icc: compile pass

 x86_64-native-linuxapp-gcc-combined: compile pass

 i686-native-linuxapp-gcc: compile pass

 x86_64-native-linuxapp-gcc: compile pass

 x86_64-native-linuxapp-icc: compile pass

 x86_64-native-linuxapp-gcc-debug: compile pass

 x86_64-native-linuxapp-gcc-shared: compile pass

 x86_64-native-linuxapp-clang: compile pass



 OS: UB1204

 GCC: 4.6.3

 Kernel: 3.8.0-29

Commit: 
608487f3fc96704271c624d0f3fe9d7fb2187aea

X86_64-ivshmem-linuxapp-gcc: compile pass







Thanks

Yongjie



-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Thomas Monjalon
Sent: Monday, July 18, 2016 5:44 PM
To: dev at dpdk.org
Subject: [dpdk-dev] [PATCH] mk: fix static link with glibc < 2.17



There is an error when linking static EAL library with an application:



eal_alarm.c:(.text+0xd7): undefined reference to `clock_gettime'

eal_alarm.c:(.text+0x20f): undefined reference to `clock_gettime'

eal_timer.c:(.text+0x108): undefined reference to `clock_gettime'

eal_timer.c:(.text+0x146): undefined reference to `clock_gettime'



The function clock_gettime() is in librt for old glibc.



Fixes: 281948b4753e ("mk: fix missing librt dependencies")



Signed-off-by: Thomas Monjalon mailto:thomas.monjalon at 6wind.com>>

---

mk/rte.app.mk | 1 +

1 file changed, 1 insertion(+)



diff --git a/mk/rte.app.mk b/mk/rte.app.mk index 886dbdd..eb28e11 100644

--- a/mk/rte.app.mk

+++ b/mk/rte.app.mk

@@ -151,6 +151,7 @@ _LDLIBS-y += --no-whole-archive  ifeq 
($(CONFIG_RTE_BUILD_SHARED_LIB),n)

# The static libraries do not know their dependencies.

# So linking with static library requires explicit dependencies.

+_LDLIBS-$(CONFIG_RTE_LIBRTE_EAL)+= -lrt

_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)  += -lm

_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)  += -lrt

_LDLIBS-$(CONFIG_RTE_LIBRTE_METER)  += -lm

--

2.7.0

[dpdk-dev] [PATCH v2] i40e:Fix for wrong publish of card speed (was mixed between 10G and 40G)

2016-07-20 Thread Wu, Jingjing



> -Original Message-
> From: Ido Barnea (ibarnea) [mailto:ibarnea at cisco.com]
> Sent: Monday, July 18, 2016 5:57 PM
> To: dev at dpdk.org
> Cc: Wu, Jingjing
> Subject: [PATCH v2] i40e:Fix for wrong publish of card speed (was mixed
> between 10G and 40G)
> 
> Signed-off-by: Ido Barnea 
> 
Acked-by: Jingjing Wu

[dpdk-dev] [PATCH v4] doc: flow bifurcation guide on Linux

2016-07-20 Thread Wu, Jingjing



> -Original Message-
> From: Mcnamara, John
> Sent: Wednesday, July 20, 2016 12:35 AM
> To: Wu, Jingjing
> Cc: dev at dpdk.org; Liu, Yong; Zhang, Helin; Iremonger, Bernard
> Subject: RE: [PATCH v4] doc: flow bifurcation guide on Linux
> 
> 
> 
> > -Original Message-
> > From: Wu, Jingjing
> > Sent: Tuesday, July 19, 2016 4:31 AM
> > To: Mcnamara, John 
> > Cc: dev at dpdk.org; Wu, Jingjing ; Liu, Yong
> > ; Zhang, Helin 
> > Subject: [PATCH v4] doc: flow bifurcation guide on Linux
> >
> > Flow Bifurcation is a mechanism which uses features of advanced
> > Ethernet devices to split traffic between queues. It provides the
> > capability to let the kernel driver and DPDK driver co-exist and take
> advantage of both.
> >
> > It is achieved by using SR-IOV and the NIC's advanced filtering. This
> > patch describes Flow Bifurcation and adds the user guide for ixgbe and
> > i40e NICs.
> >
> > Signed-off-by: Jingjing Wu 
> 
> 
> 
> 
> This patch is dependent on:
> [PATCH v5 1/2] doc: live migration of VM with Virtio and VF
> 
> The following patch is also dependent on it:
> 
> [PATCH v3 1/2] doc: live migration of VM with vhost_user on host
> 
> There is now a merge conflict between your patch and the second patch
> because index.rst has been changed.
> 
> Maybe the 2 patches above could be combined and you can rebase your
> patch against it.
> 
Sure, I can definitely rebase if above one or two patches are merged. Thanks.

> But as for the patch itself:
> 
> 
> Acked-by: John McNamara

70 matches

Mail list logo