[dpdk-dev] [PATCH v2] net/i40e: remove weak symbols

2016-07-26 Thread Zoltan Kiss
Hi,

I forgot to add the i40e maintainers as well. Would anyone like to weigh in?

Regards,

Zoltan

On 22/07/16 12:34, Zoltan Kiss wrote:
>
>
> On 21/07/16 19:58, bynes adam wrote:
>> On Wed, Jul 20, 2016 at 06:11:16PM +0100, Zoltan Kiss wrote:
>> Hi, Kiss
>>> Using weak symbols have a few issues with static linking:
>>>
>>> - normally the linker searches the .o files already linked, if your weak
>>>   one is there, it won't check if there is a strong version
>>> - unless the symbol is directly referred, but it's not the right
>>> thing to
>>>   rely on
>>> - or --whole-archive specified in the command line, which pulls in a lot
>>>   of unwanted stuff
>> --whole-archive on the other hand can ensure all the object files are
>> linked, and the strong symbol will take precedence over weak symbol. So we
>> don't need to take care of the sequence of the object files in the ar or
>> between ar.
>
> --whole-archive is primarily for shared libraries, just as weak symbols.
> When people do static linking, their reason is often to avoid carrying
> around a big library while they only use a fraction of it.
>
>>> - whole-archive also makes libtool dropping the library from the command
>>>   line, which is what should happen with dynamic linking, but not with
>>>   static (observed on Ubuntu 14.04). This is an important bug if you
>>>   build a static library depending on DPDK
>> you mean the libtool bug for --whole-archive?
>> if it does, you shouldn't using the libtool,
>
> GNU libtool is a quite common tool for building libraries, it's a bit
> painful sometimes, but it works. What else do you recommend? I mean,
> something which is proven to be better?
>
>> BTW what's the circumstance you must use the libtool. using you own
>> makefile for the library or APP.
>
> Building libraries which depend on DPDK
>
>>>
>>> This patch merges the two version and make the behaviour rely on the
>>> config.
>>>
>>> If we can agree in the concept, I can send a series to fix this for the
>>> other weak functions.
>>>
>>> Signed-off-by: Zoltan Kiss 
>>> ---
>>>
>>> Notes:
>>> v2: fix commit message
>>>
>>>  drivers/net/i40e/i40e_rxtx.c | 36 +++-
>>>  drivers/net/i40e/i40e_rxtx_vec.c | 36 
>>>  2 files changed, 35 insertions(+), 37 deletions(-)
>>>
>> From the original design, we follow the rule, we don't want the Macro in
>> the file to separate the different Rx/Tx path for disabled/enabled vector
>> configuration.
>
> And why is it better to compile two versions of the same function when
> you already know at the time of compilation which one you want to use?
>
>> Adam Bynes
>>


[dpdk-dev] net/pcap: set rte_errno on TX error

2016-07-25 Thread Zoltan Kiss
This returns the error code provided by pcap_sendpacket()

Signed-off-by: Zoltan Kiss 

diff --git a/drivers/net/pcap/rte_eth_pcap.c b/drivers/net/pcap/rte_eth_pcap.c
index 7e213eb..0899bac 100644
--- a/drivers/net/pcap/rte_eth_pcap.c
+++ b/drivers/net/pcap/rte_eth_pcap.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include <rte_errno.h>

 #include 

@@ -360,8 +361,10 @@ eth_pcap_tx(void *queue,
}
}

-   if (unlikely(ret != 0))
+   if (unlikely(ret != 0)) {
+   rte_errno = ret;
break;
+   }
num_tx++;
tx_bytes += mbuf->pkt_len;
rte_pktmbuf_free(mbuf);
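
A minimal sketch of how a caller could consume the new rte_errno value;
the port/queue setup and the send_burst helper name are illustrative, not
part of the patch:

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_errno.h>

static void
send_burst(uint8_t port_id, uint16_t queue_id,
           struct rte_mbuf **pkts, uint16_t n)
{
        uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, n);

        /* rte_errno is only meaningful here if the burst stopped early */
        if (sent < n && rte_errno != 0)
                printf("pcap TX stopped after %u packets: error %d\n",
                       sent, rte_errno);
        /* the unsent mbufs (pkts[sent]..pkts[n-1]) stay owned by the caller */
}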


[dpdk-dev] [PATCH v2] net/i40e: remove weak symbols

2016-07-22 Thread Zoltan Kiss


On 21/07/16 19:58, bynes adam wrote:
> On Wed, Jul 20, 2016 at 06:11:16PM +0100, Zoltan Kiss wrote:
> Hi, Kiss
>> Using weak symbols have a few issues with static linking:
>>
>> - normally the linker searches the .o files already linked, if your weak
>>   one is there, it won't check if there is a strong version
>> - unless the symbol is directly referred, but it's not the right thing to
>>   rely on
>> - or --whole-archive specified in the command line, which pulls in a lot
>>   of unwanted stuff
> --whole-archive on the other hand can ensure all the object files are linked,
> and the strong symbol will take precedence over weak symbol. So we don't need
> to take care of the sequence of the object files in the ar or between ar.

--whole-archive is primarily for shared libraries, just as weak symbols. 
When people do static linking, their reason is often to avoid carrying 
around a big library while they only use a fraction of it.

>> - whole-archive also makes libtool dropping the library from the command
>>   line, which is what should happen with dynamic linking, but not with
>>   static (observed on Ubuntu 14.04). This is an important bug if you
>>   build a static library depending on DPDK
> you mean the libtool bug for --whole-archive?
> if it does, you shouldn't using the libtool,

GNU libtool is quite a common tool for building libraries; it's a bit 
painful sometimes, but it works. What else do you recommend? I mean, 
something which is proven to be better?

> BTW what's the circumstance you must use the libtool. using you own makefile
> for the library or APP.

Building libraries which depend on DPDK

>>
>> This patch merges the two version and make the behaviour rely on the
>> config.
>>
>> If we can agree in the concept, I can send a series to fix this for the
>> other weak functions.
>>
>> Signed-off-by: Zoltan Kiss 
>> ---
>>
>> Notes:
>> v2: fix commit message
>>
>>  drivers/net/i40e/i40e_rxtx.c | 36 +++-
>>  drivers/net/i40e/i40e_rxtx_vec.c | 36 
>>  2 files changed, 35 insertions(+), 37 deletions(-)
>>
> From the original design, we follow the rule, we don't want the Macro in the
> file to separate the different Rx/Tx path for disabled/enabled vector
> configuration.

And why is it better to compile two versions of the same function when 
you already know at the time of compilation which one you want to use?

> Adam Bynes
>


[dpdk-dev] [PATCH v2] mempool: adjust name string size in related data types

2016-07-21 Thread Zoltan Kiss


On 21/07/16 14:40, Olivier Matz wrote:
> Hi Zoltan,
>
>
> On 07/20/2016 07:16 PM, Zoltan Kiss wrote:
>> A recent patch brought up an issue about the size of the 'name' fields:
>>
>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>
>> These relations should be observed:
>>
>> 1. Each ring creates a memzone with a prefixed name:
>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>>
>> 2. There are some mempool handlers which create a ring with a prefixed
>> name:
>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)
>>
>> 3. A mempool can create up to RTE_MAX_MEMZONE pre and postfixed memzones:
>> sprintf(postfix, "_%d", RTE_MAX_MEMZONE)
>> RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
>>  strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen(postfix)
>>
>> Setting all of them to 32 hides this restriction from the application.
>> This patch decreases the mempool and ring string size to accommodate for
>> these prefixes, but it doesn't apply the 3rd constraint. Applications
>> relying on these constants need to be recompiled, otherwise they'll run
>> into ENAMETOOLONG issues.
>> The size of the arrays are kept 32 for ABI compatibility, it can be
>> decreased next time the ABI changes.
>>
>> Signed-off-by: Zoltan Kiss 
>
> Looks like to be a good compromise for the 16.07 release. One question
> however: why not taking in account the 3rd constraint? Because it may
> not completly fix the issue if the mempool is fragmented.
>
> We could define RTE_MEMPOOL_NAMESIZE to 24
>  = 32 - len('mp_') - len('_0123'))

I was trying to figure out a compile-time macro for strlen(postfix), but 
I could not. Your suggestion would work only until someone increases 
RTE_MAX_MEMZONE above 9999. As the likelihood of fragmenting a pool over 
99 memzones seems small, I did not bother to fix this with an ugly hack, 
but if you think we should include it, let me know!
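
For reference, strlen(postfix) could be approximated at compile time with
a digit-count macro; this is only a sketch, DIGITS and MZ_POSTFIX_LEN are
made-up names and not part of any posted patch:

/* compile-time digit count of a positive integer constant, good up to
 * five digits (enough for the default RTE_MAX_MEMZONE of 2560) */
#define DIGITS(n) ((n) < 10 ? 1 : (n) < 100 ? 2 : (n) < 1000 ? 3 : \
                   (n) < 10000 ? 4 : 5)
/* worst-case length of the "_%d" memzone postfix */
#define MZ_POSTFIX_LEN (1 + DIGITS(RTE_MAX_MEMZONE))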

>
> Thanks,
> Olivier
>


[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-20 Thread Zoltan Kiss


On 20/07/16 14:37, Olivier Matz wrote:
> Hi,
>
> On 07/20/2016 02:41 PM, Zoltan Kiss wrote:
>>
>>
>> On 19/07/16 17:17, Olivier Matz wrote:
>>> Hi Zoltan,
>>>
>>> On 07/19/2016 05:59 PM, Zoltan Kiss wrote:
>>>>
>>>>
>>>> On 19/07/16 16:37, Olivier Matz wrote:
>>>>> Hi Zoltan,
>>>>>
>>>>> On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
>>>>>> A recent fix brought up an issue about the size of the 'name' fields:
>>>>>>
>>>>>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>>>>>
>>>>>> These relations should be observed:
>>>>>>
>>>>>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>>>>>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE -
>>>>>> strlen(RTE_MEMPOOL_MZ_PREFIX)
>>>>>>
>>>>>> Setting all of them to 32 hides this restriction from the application.
>>>>>> This patch increases the memzone string size to accomodate for these
>>>>>> prefixes, and the same happens with the ring name string. The ABI
>>>>>> needs to
>>>>>> be broken to fix this API issue, this way doesn't break applications
>>>>>> previously not failing due to the truncating bug now fixed.
>>>>>>
>>>>>> Signed-off-by: Zoltan Kiss 
>>>>>
>>>>> I agree it is a problem for an application because it cannot know what
>>>>> is the maximum name length. On the other hand, breaking the ABI for
>>>>> this
>>>>> looks a bit overkill. Maybe we could reduce RTE_MEMPOOL_NAMESIZE and
>>>>> RTE_RING_NAMESIZE instead of increasing RTE_MEMZONE_NAMESIZE? That way,
>>>>> we could keep the ABI as is.
>>>>
>>>> But that would break the ABI too, wouldn't it? Unless you keep the array
>>>> the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
>>>
>>> Yes, that was the idea.
>>>
>>>> And even then, the API breaks anyway. There are applications - I have at
>>>> least some - which use all 32 bytes to store the name. Decrease that
>>>> would cause headache to change the naming scheme, because it's a 30
>>>> character long id, and chopping the last few chars would cause name
>>>> collisions and annoying bugs.
>>>
>>> Before my patch (85cf0079), long names were silently truncated when
>>> mempool created its ring and/or memzones. Now, it returns an error.
>>
>> With 16.04 an application could operate as expected if the first 26
>> character were unique. Your patch revealed the problem that characters
>> after these were left out of the name. Now applications fail where this
>> never been a bug because their naming scheme guarantees the uniqueness
>> on the first 26 chars (or makes it very unlikely)
>> Where the first 26 is not unique, it failed earlier too, because at
>> memzone creation it checks for duplicate names.
>
> Yes, I understand that there is a behavior change for applications using
> names larger than 26 between 16.04 and 16.07. I also understand that
> there is no way for an application to know what is the maximum usable
> size for a mempool or a ring.
>
>
>>> I'm not getting why changing the struct to something like below would
>>> break the API, since it would already return an error today.
>>>
>>>#define RTE_MEMPOOL_NAMESIZE \
>>
>> Wait, this would mean applications need to recompile to use the smaller
>> value. AFAIK that's an ABI break too, right? At the moment I don't see a
>> way to fix this without breaking the ABI
>
> With this modification, if you don't recompile the application, its
> behavior will still be the same as today -> it will return ENAMETOOLONG.
> If you recompile it, the application will be aware of the maximum
> length. To me, it seems to be a acceptable compromise for this release.
>
> The patch you're proposing also changes the ABI of librte_ring and
> librte_eal, which cannot be accepted for the release.

Ok, I've sent a new version with this approach.

>
>
>>
>>>(RTE_MEMZONE_NAMESIZE - sizeof(pool_prefix) - sizeof(ring prefix))
>>>struct rte_mempool {
>>>union {
>>>  char name[RTE_MEMPOOL_NAMESIZE];
>>>  char pad[32];
>>>};
>>>...
>>>}
>>>
>>> Anyway, it may not be the proper solution since it supposes that a
>>> mempool includes a ring based on a memzone, which is not always true now
>>> with mempool handlers.

[dpdk-dev] [PATCH] memzone: allow full length name

2016-07-20 Thread Zoltan Kiss
(strlen(name) == sizeof(mz->name) - 1) is a valid case, change the
condition to reflect that.
Move it earlier to avoid lookup with invalid name.
Change errno to ENAMETOOLONG.

Fixes: 85cf0079 ("mem: avoid memzone/mempool/ring name truncation")

Signed-off-by: Zoltan Kiss 
---
 lib/librte_eal/common/eal_common_memzone.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index 5d28341..1bd0a33 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -144,6 +144,13 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
return NULL;
}

+   if (strlen(name) > sizeof(mz->name) - 1) {
+   RTE_LOG(DEBUG, EAL, "%s(): memzone <%s>: name too long\n",
+   __func__, name);
+   rte_errno = ENAMETOOLONG;
+   return NULL;
+   }
+
/* zone already exist */
if ((memzone_lookup_thread_unsafe(name)) != NULL) {
RTE_LOG(DEBUG, EAL, "%s(): memzone <%s> already exists\n",
@@ -152,13 +159,6 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
return NULL;
}

-   if (strlen(name) >= sizeof(mz->name) - 1) {
-   RTE_LOG(DEBUG, EAL, "%s(): memzone <%s>: name too long\n",
-   __func__, name);
-   rte_errno = EEXIST;
-   return NULL;
-   }
-
/* if alignment is not a power of two */
if (align && !rte_is_power_of_2(align)) {
RTE_LOG(ERR, EAL, "%s(): Invalid alignment: %u\n", __func__,
-- 
1.9.1
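
A quick sketch of the behaviour after this fix, assuming EAL is already
initialized (the names below are just 31- and 32-character examples):

#include <rte_memzone.h>
#include <rte_errno.h>

static void
demo(void)
{
        const struct rte_memzone *mz;

        /* 31 characters: the longest name that fits, now accepted */
        mz = rte_memzone_reserve("0123456789012345678901234567890",
                                 4096, SOCKET_ID_ANY, 0);
        /* mz != NULL on success */

        /* 32 characters: still rejected, now with the right errno */
        mz = rte_memzone_reserve("0123456789012345678901234567890X",
                                 4096, SOCKET_ID_ANY, 0);
        /* mz == NULL and rte_errno == ENAMETOOLONG (was EEXIST) */
}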



[dpdk-dev] [PATCH v2] mempool: adjust name string size in related data types

2016-07-20 Thread Zoltan Kiss
A recent patch brought up an issue about the size of the 'name' fields:

85cf0079 mem: avoid memzone/mempool/ring name truncation

These relations should be observed:

1. Each ring creates a memzone with a prefixed name:
RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)

2. There are some mempool handlers which create a ring with a prefixed
name:
RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)

3. A mempool can create up to RTE_MAX_MEMZONE pre and postfixed memzones:
sprintf(postfix, "_%d", RTE_MAX_MEMZONE)
RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE -
strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen(postfix)

Setting all of them to 32 hides this restriction from the application.
This patch decreases the mempool and ring string size to accommodate
these prefixes, but it doesn't apply the 3rd constraint. Applications
relying on these constants need to be recompiled, otherwise they'll run
into ENAMETOOLONG issues.
The size of the arrays is kept at 32 for ABI compatibility; it can be
decreased next time the ABI changes.

Signed-off-by: Zoltan Kiss 
---

Notes:
v2: keep arrays 32 bytes and decrease the max sizes to maintain ABI
compatibility

 lib/librte_mempool/rte_mempool.h | 11 +--
 lib/librte_ring/rte_ring.h   | 12 ++--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 4a8fbb1..059ad9e 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -123,7 +123,9 @@ struct rte_mempool_objsz {
/**< Total size of an object (header + elt + trailer). */
 };

-#define RTE_MEMPOOL_NAMESIZE 32 /**< Maximum length of a memory pool. */
+/**< Maximum length of a memory pool's name. */
+#define RTE_MEMPOOL_NAMESIZE (RTE_RING_NAMESIZE - \
+ sizeof(RTE_MEMPOOL_MZ_PREFIX) + 1)
 #define RTE_MEMPOOL_MZ_PREFIX "MP_"

 /* "MP_" */
@@ -208,7 +210,12 @@ struct rte_mempool_memhdr {
  * The RTE mempool structure.
  */
 struct rte_mempool {
-   char name[RTE_MEMPOOL_NAMESIZE]; /**< Name of mempool. */
+   /*
+* Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
+* compatibility requirements, it could be changed to
+* RTE_MEMPOOL_NAMESIZE next time the ABI changes
+*/
+   char name[RTE_MEMZONE_NAMESIZE]; /**< Name of mempool. */
union {
void *pool_data; /**< Ring or pool to store objects. */
uint64_t pool_id;/**< External mempool identifier. */
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index eb45e41..0e22e69 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -100,6 +100,7 @@ extern "C" {
 #include 
 #include 
 #include 
+#include <rte_memzone.h>

 #define RTE_TAILQ_RING_NAME "RTE_RING"

@@ -126,8 +127,10 @@ struct rte_ring_debug_stats {
 } __rte_cache_aligned;
 #endif

-#define RTE_RING_NAMESIZE 32 /**< The maximum length of a ring name. */
 #define RTE_RING_MZ_PREFIX "RG_"
+/**< The maximum length of a ring name. */
+#define RTE_RING_NAMESIZE (RTE_MEMZONE_NAMESIZE - \
+  sizeof(RTE_RING_MZ_PREFIX) + 1)

 #ifndef RTE_RING_PAUSE_REP_COUNT
 #define RTE_RING_PAUSE_REP_COUNT 0 /**< Yield after pause num of times, no yield
@@ -147,7 +150,12 @@ struct rte_memzone; /* forward declaration, so as not to require memzone.h */
  * a problem.
  */
 struct rte_ring {
-   char name[RTE_RING_NAMESIZE];/**< Name of the ring. */
+   /*
+* Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
+* compatibility requirements, it could be changed to RTE_RING_NAMESIZE
+* next time the ABI changes
+*/
+   char name[RTE_MEMZONE_NAMESIZE];/**< Name of the ring. */
int flags;   /**< Flags supplied at creation. */
const struct rte_memzone *memzone;
/**< Memzone, if any, containing the rte_ring */
-- 
1.9.1



[dpdk-dev] [PATCH v2] mempool: fix lack of free registration

2016-07-20 Thread Zoltan Kiss
The new mempool handler interface forgets to register the free() function
of the ops. Introduced in this patch:

Fixes: 449c49b93a6b ("mempool: support handler operations")

Signed-off-by: Zoltan Kiss 
Acked-by: Olivier Matz 
---

Notes:
v2: fix commit message

 lib/librte_mempool/rte_mempool_ops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_mempool/rte_mempool_ops.c b/lib/librte_mempool/rte_mempool_ops.c
index fd0b64c..5f24de2 100644
--- a/lib/librte_mempool/rte_mempool_ops.c
+++ b/lib/librte_mempool/rte_mempool_ops.c
@@ -81,6 +81,7 @@ rte_mempool_register_ops(const struct rte_mempool_ops *h)
ops = &rte_mempool_ops_table.ops[ops_index];
snprintf(ops->name, sizeof(ops->name), "%s", h->name);
ops->alloc = h->alloc;
+   ops->free = h->free;
ops->enqueue = h->enqueue;
ops->dequeue = h->dequeue;
ops->get_count = h->get_count;
-- 
1.9.1
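
For context, a handler registers its ops roughly like this (a sketch with
no-op callbacks; all names are illustrative). Before this fix the .free
callback was silently dropped at registration:

#include <rte_mempool.h>

static int my_alloc(struct rte_mempool *mp) { (void)mp; return 0; }
static void my_free(struct rte_mempool *mp) { (void)mp; }
static int my_enq(struct rte_mempool *mp, void * const *obj, unsigned n)
{ (void)mp; (void)obj; (void)n; return 0; }
static int my_deq(struct rte_mempool *mp, void **obj, unsigned n)
{ (void)mp; (void)obj; (void)n; return 0; }
static unsigned my_count(const struct rte_mempool *mp)
{ (void)mp; return 0; }

static const struct rte_mempool_ops my_ops = {
        .name = "my_handler",
        .alloc = my_alloc,
        .free = my_free,        /* lost without this fix */
        .enqueue = my_enq,
        .dequeue = my_deq,
        .get_count = my_count,
};

MEMPOOL_REGISTER_OPS(my_ops);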



[dpdk-dev] [PATCH v2] net/i40e: remove weak symbols

2016-07-20 Thread Zoltan Kiss
Using weak symbols has a few issues with static linking:

- normally the linker searches the .o files already linked, if your weak
  one is there, it won't check if there is a strong version
- unless the symbol is directly referred, but it's not the right thing to
  rely on
- or --whole-archive specified in the command line, which pulls in a lot
  of unwanted stuff
- whole-archive also makes libtool dropping the library from the command
  line, which is what should happen with dynamic linking, but not with
  static (observed on Ubuntu 14.04). This is an important bug if you
  build a static library depending on DPDK

This patch merges the two versions and makes the behaviour rely on the
config.

If we can agree on the concept, I can send a series to fix this for the
other weak functions.

Signed-off-by: Zoltan Kiss 
---

Notes:
v2: fix commit message

 drivers/net/i40e/i40e_rxtx.c | 36 +++-
 drivers/net/i40e/i40e_rxtx_vec.c | 36 
 2 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index d3cfb98..ad34d3a 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3278,10 +3278,44 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 }

 /* Stubs needed for linkage when CONFIG_RTE_I40E_INC_VECTOR is set to 'n' */
-int __attribute__((weak))
+int __attribute__((cold))
 i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev __rte_unused *dev)
 {
+#ifndef RTE_LIBRTE_I40E_INC_VECTOR
return -1;
+#else
+#ifndef RTE_LIBRTE_IEEE1588
+   struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
+   struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
+
+   /* need SSE4.1 support */
+   if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+   return -1;
+
+#ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
+   /* whithout rx ol_flags, no VP flag report */
+   if (rxmode->hw_vlan_strip != 0 ||
+   rxmode->hw_vlan_extend != 0)
+   return -1;
+#endif /* RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE */
+
+   /* no fdir support */
+   if (fconf->mode != RTE_FDIR_MODE_NONE)
+   return -1;
+
+/* - no csum error report support
+* - no header split support
+*/
+   if (rxmode->hw_ip_checksum == 1 ||
+   rxmode->header_split == 1)
+   return -1;
+
+   return 0;
+#else
+   RTE_SET_USED(dev);
+   return -1;
+#endif /* RTE_LIBRTE_IEEE1588 */
+#endif /* RTE_LIBRTE_I40E_INC_VECTOR */
 }

 uint16_t __attribute__((weak))
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 05cb415..983b2c0 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -723,39 +723,3 @@ i40e_txq_vec_setup(struct i40e_tx_queue __rte_unused *txq)
 {
return 0;
 }
-
-int __attribute__((cold))
-i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
-{
-#ifndef RTE_LIBRTE_IEEE1588
-   struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
-   struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
-
-   /* need SSE4.1 support */
-   if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-   return -1;
-
-#ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
-   /* whithout rx ol_flags, no VP flag report */
-   if (rxmode->hw_vlan_strip != 0 ||
-   rxmode->hw_vlan_extend != 0)
-   return -1;
-#endif
-
-   /* no fdir support */
-   if (fconf->mode != RTE_FDIR_MODE_NONE)
-   return -1;
-
-/* - no csum error report support
-* - no header split support
-*/
-   if (rxmode->hw_ip_checksum == 1 ||
-   rxmode->header_split == 1)
-   return -1;
-
-   return 0;
-#else
-   RTE_SET_USED(dev);
-   return -1;
-#endif
-}
-- 
1.9.1



[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-20 Thread Zoltan Kiss


On 19/07/16 17:17, Olivier Matz wrote:
> Hi Zoltan,
>
> On 07/19/2016 05:59 PM, Zoltan Kiss wrote:
>>
>>
>> On 19/07/16 16:37, Olivier Matz wrote:
>>> Hi Zoltan,
>>>
>>> On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
>>>> A recent fix brought up an issue about the size of the 'name' fields:
>>>>
>>>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>>>
>>>> These relations should be observed:
>>>>
>>>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>>>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE -
>>>> strlen(RTE_MEMPOOL_MZ_PREFIX)
>>>>
>>>> Setting all of them to 32 hides this restriction from the application.
>>>> This patch increases the memzone string size to accomodate for these
>>>> prefixes, and the same happens with the ring name string. The ABI
>>>> needs to
>>>> be broken to fix this API issue, this way doesn't break applications
>>>> previously not failing due to the truncating bug now fixed.
>>>>
>>>> Signed-off-by: Zoltan Kiss 
>>>
>>> I agree it is a problem for an application because it cannot know what
>>> is the maximum name length. On the other hand, breaking the ABI for this
>>> looks a bit overkill. Maybe we could reduce RTE_MEMPOOL_NAMESIZE and
>>> RTE_RING_NAMESIZE instead of increasing RTE_MEMZONE_NAMESIZE? That way,
>>> we could keep the ABI as is.
>>
>> But that would break the ABI too, wouldn't it? Unless you keep the array
>> the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
>
> Yes, that was the idea.
>
>> And even then, the API breaks anyway. There are applications - I have at
>> least some - which use all 32 bytes to store the name. Decrease that
>> would cause headache to change the naming scheme, because it's a 30
>> character long id, and chopping the last few chars would cause name
>> collisions and annoying bugs.
>
> Before my patch (85cf0079), long names were silently truncated when
> mempool created its ring and/or memzones. Now, it returns an error.

With 16.04 an application could operate as expected if the first 26 
characters were unique. Your patch revealed the problem that characters 
after these were left out of the name. Now applications fail where there 
has never been a bug, because their naming scheme guarantees uniqueness 
of the first 26 chars (or makes collisions very unlikely).
Where the first 26 are not unique, it failed earlier too, because at 
memzone creation it checks for duplicate names.

>
> I'm not getting why changing the struct to something like below would
> break the API, since it would already return an error today.
>
>#define RTE_MEMPOOL_NAMESIZE \

Wait, this would mean applications need to recompile to use the smaller 
value. AFAIK that's an ABI break too, right? At the moment I don't see a 
way to fix this without breaking the ABI

>(RTE_MEMZONE_NAMESIZE - sizeof(pool_prefix) - sizeof(ring prefix))
>struct rte_mempool {
>union {
>  char name[RTE_MEMPOOL_NAMESIZE];
>  char pad[32];
>};
>...
>}
>
> Anyway, it may not be the proper solution since it supposes that a
> mempool includes a ring based on a memzone, which is not always true now
> with mempool handlers.

Oh, as we dug deeper it gets better!
Indeed, we don't necessarily have this ring + memzone pair underneath, 
but the user is not aware of that, and I think we should keep it that 
way. It should only care that the string passed shouldn't be bigger than 
a certain amount.
Also, even though we don't necessarily have the ring, we still reserve 
memzones in rte_mempool_populate_default(). And their name has a 3 
letter prefix and a "_%d" postfix, where the %d could be as much as 
RTE_MAX_MEMZONE in the worst case (2560 by default). So actually:

RTE_MEMPOOL_NAMESIZE <= RTE_MEMZONE_NAMESIZE - 
strlen(RTE_MEMPOOL_MZ_PREFIX) - strlen("_2560")
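
These relations could even be pinned down at build time; a sketch using
C11 static asserts (not part of any posted patch, and "_2560" hardcodes
the default RTE_MAX_MEMZONE):

#include <rte_memzone.h>
#include <rte_ring.h>
#include <rte_mempool.h>

/* sizeof(prefix) counts the terminating NUL, hence the "+ 1" */
_Static_assert(RTE_RING_NAMESIZE <=
               RTE_MEMZONE_NAMESIZE - sizeof(RTE_RING_MZ_PREFIX) + 1,
               "ring name must fit in a memzone name");
_Static_assert(RTE_MEMPOOL_NAMESIZE <=
               RTE_RING_NAMESIZE - sizeof(RTE_MEMPOOL_MZ_PREFIX) + 1,
               "mempool name must fit in a ring name");
_Static_assert(RTE_MEMPOOL_NAMESIZE <=
               RTE_MEMZONE_NAMESIZE - sizeof(RTE_MEMPOOL_MZ_PREFIX) + 1
               - (sizeof("_2560") - 1),
               "mempool name plus worst-case postfix must fit in a memzone name");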


As a side note, there is another bug around here: rte_ring_create() 
doesn't check for name duplications. However the user of the library can 
look up a ring by name with rte_ring_lookup(), and it will return the 
first ring with that name.

>
>>> It would even be better to get rid of this static char[] for the
>>> structure names and replace it by an allocated const char *. I didn't
>>> check it's feasible for memzones. What do you think?
>>
>> It would work too, but I don't think it would help a lot. We would still
>> need max sizes for the names. Storing them somewhere else won't help us
>> in this problem.
>
> Why should we have a maximum length for the names?

What happens if an application loads DPDK and creates a mempool with a 
name string 2 million characters long? Maybe nothing we should worry 
about, but in general I think unlimited-length function parameters are 
problematic at the very least. At minimum the length should be passed in 
(which also creates a maximum, due to the size of the parameter). But I 
think it would be just easier to have these maximums set, observing the 
above constraints.

>
>
> Thanks,
> Olivier
>


[dpdk-dev] [PATCH] mempool: fix lack of free() registration

2016-07-19 Thread Zoltan Kiss


On 19/07/16 16:26, Olivier Matz wrote:
> Hi Zoltan,
>
> I ran ./scripts/check-git-log.sh on your patch, showing some minor
> styling issues:

Thanks, do you want me to resend it, or could Thomas fix them upon 
committing?

>
> On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
>> [PATCH] mempool: fix lack of free() registration
>
> "()" should be removed
>
>> The new mempool handler interface forgets to register the free() function
>> of the ops. Introduced in this patch:
>>
>> 449c49b9 mempool: support handler operations
>
> The format should be:
> Fixes: 449c49b93a6b ("mempool: support handler operations")
>
>
>>
>> Signed-off-by: Zoltan Kiss 
>> ---
>>   lib/librte_mempool/rte_mempool_ops.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/lib/librte_mempool/rte_mempool_ops.c b/lib/librte_mempool/rte_mempool_ops.c
>> index fd0b64c..5f24de2 100644
>> --- a/lib/librte_mempool/rte_mempool_ops.c
>> +++ b/lib/librte_mempool/rte_mempool_ops.c
>> @@ -81,6 +81,7 @@ rte_mempool_register_ops(const struct rte_mempool_ops *h)
>>  ops = &rte_mempool_ops_table.ops[ops_index];
>>  snprintf(ops->name, sizeof(ops->name), "%s", h->name);
>>  ops->alloc = h->alloc;
>> +ops->free = h->free;
>>  ops->enqueue = h->enqueue;
>>  ops->dequeue = h->dequeue;
>>  ops->get_count = h->get_count;
>>
>
> Apart from that:
> Acked-by: Olivier Matz 
>
> +CC Thomas, I think it should be included in 16.07.
>
> Thanks!
>


[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-19 Thread Zoltan Kiss


On 19/07/16 16:37, Olivier Matz wrote:
> Hi Zoltan,
>
> On 07/19/2016 04:37 PM, Zoltan Kiss wrote:
>> A recent fix brought up an issue about the size of the 'name' fields:
>>
>> 85cf0079 mem: avoid memzone/mempool/ring name truncation
>>
>> These relations should be observed:
>>
>> RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
>> RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)
>>
>> Setting all of them to 32 hides this restriction from the application.
>> This patch increases the memzone string size to accomodate for these
>> prefixes, and the same happens with the ring name string. The ABI needs to
>> be broken to fix this API issue, this way doesn't break applications
>> previously not failing due to the truncating bug now fixed.
>>
>> Signed-off-by: Zoltan Kiss 
>
> I agree it is a problem for an application because it cannot know what
> is the maximum name length. On the other hand, breaking the ABI for this
> looks a bit overkill. Maybe we could reduce RTE_MEMPOOL_NAMESIZE and
> RTE_RING_NAMESIZE instead of increasing RTE_MEMZONE_NAMESIZE? That way,
> we could keep the ABI as is.

But that would break the ABI too, wouldn't it? Unless you keep the array 
the same size (32 bytes) by using RTE_MEMZONE_NAMESIZE.
And even then, the API breaks anyway. There are applications - I have at 
least some - which use all 32 bytes to store the name. Decreasing that 
would cause headaches when changing the naming scheme, because it's a 30 
character long id, and chopping the last few chars would cause name 
collisions and annoying bugs.

>
> It would even be better to get rid of this static char[] for the
> structure names and replace it by an allocated const char *. I didn't
> check it's feasible for memzones. What do you think?

It would work too, but I don't think it would help a lot. We would still 
need max sizes for the names. Storing them somewhere else won't help us 
with this problem.

>
> In any case, I think it's a bit late for 16.07 for this kind of fix.
>
> Regards,
> Olivier
>


[dpdk-dev] [PATCH] mempool: adjust name string size in related data types

2016-07-19 Thread Zoltan Kiss

A recent fix brought up an issue about the size of the 'name' fields:

85cf0079 mem: avoid memzone/mempool/ring name truncation

These relations should be observed:

RTE_RING_NAMESIZE <= RTE_MEMZONE_NAMESIZE - strlen(RTE_RING_MZ_PREFIX)
RTE_MEMPOOL_NAMESIZE <= RTE_RING_NAMESIZE - strlen(RTE_MEMPOOL_MZ_PREFIX)

Setting all of them to 32 hides this restriction from the application.
This patch increases the memzone string size to accommodate these
prefixes, and the same happens with the ring name string. The ABI needs to
be broken to fix this API issue; this way we don't break applications
which previously were not failing due to the truncating bug, now fixed.

Signed-off-by: Zoltan Kiss 
---
 lib/librte_eal/common/include/rte_memzone.h | 2 +-
 lib/librte_mempool/rte_mempool.h| 4 +++-
 lib/librte_ring/rte_ring.h  | 5 -
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_memzone.h b/lib/librte_eal/common/include/rte_memzone.h
index f69b5a8..ba3a1f0 100644
--- a/lib/librte_eal/common/include/rte_memzone.h
+++ b/lib/librte_eal/common/include/rte_memzone.h
@@ -74,7 +74,7 @@ extern "C" {
  */
 struct rte_memzone {

-#define RTE_MEMZONE_NAMESIZE 32   /**< Maximum length of memory zone name.*/
+#define RTE_MEMZONE_NAMESIZE (32 + 6) /**< Maximum length of memory zone name.*/
char name[RTE_MEMZONE_NAMESIZE];  /**< Name of the memory zone. */

phys_addr_t phys_addr;/**< Start physical address. */
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 4a8fbb1..61e8d19 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -123,7 +123,9 @@ struct rte_mempool_objsz {
/**< Total size of an object (header + elt + trailer). */
 };

-#define RTE_MEMPOOL_NAMESIZE 32 /**< Maximum length of a memory pool. */
+/**< Maximum length of a memory pool's name. */
+#define RTE_MEMPOOL_NAMESIZE (RTE_RING_NAMESIZE - \
+ sizeof(RTE_MEMPOOL_MZ_PREFIX) + 1)
 #define RTE_MEMPOOL_MZ_PREFIX "MP_"

 /* "MP_" */
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index eb45e41..d6185de 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -100,6 +100,7 @@ extern "C" {
 #include 
 #include 
 #include 
+#include <rte_memzone.h>

 #define RTE_TAILQ_RING_NAME "RTE_RING"

@@ -126,8 +127,10 @@ struct rte_ring_debug_stats {
 } __rte_cache_aligned;
 #endif

-#define RTE_RING_NAMESIZE 32 /**< The maximum length of a ring name. */
 #define RTE_RING_MZ_PREFIX "RG_"
+/**< The maximum length of a ring name. */
+#define RTE_RING_NAMESIZE (RTE_MEMZONE_NAMESIZE - \
+  sizeof(RTE_RING_MZ_PREFIX) + 1)

 #ifndef RTE_RING_PAUSE_REP_COUNT
 #define RTE_RING_PAUSE_REP_COUNT 0 /**< Yield after pause num of times, no yield
-- 
1.9.1



[dpdk-dev] [PATCH] mempool: fix lack of free() registration

2016-07-19 Thread Zoltan Kiss
The new mempool handler interface forgets to register the free() function
of the ops. Introduced in this patch:

449c49b9 mempool: support handler operations

Signed-off-by: Zoltan Kiss 
---
 lib/librte_mempool/rte_mempool_ops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_mempool/rte_mempool_ops.c b/lib/librte_mempool/rte_mempool_ops.c
index fd0b64c..5f24de2 100644
--- a/lib/librte_mempool/rte_mempool_ops.c
+++ b/lib/librte_mempool/rte_mempool_ops.c
@@ -81,6 +81,7 @@ rte_mempool_register_ops(const struct rte_mempool_ops *h)
ops = &rte_mempool_ops_table.ops[ops_index];
snprintf(ops->name, sizeof(ops->name), "%s", h->name);
ops->alloc = h->alloc;
+   ops->free = h->free;
ops->enqueue = h->enqueue;
ops->dequeue = h->dequeue;
ops->get_count = h->get_count;
-- 
1.9.1



[dpdk-dev] [RFC PATCH] i40: remove weak version of i40e_rx_vec_dev_conf_condition_check()

2016-07-19 Thread Zoltan Kiss
Using weak symbols has a few issues with static linking:

- normally the linker searches the .o files already linked, if your weak
  one is there, it won't check if there is a strong version
- unless the symbol is directly referred, but it's not the right thing to
  rely on
- or --whole-archive specified in the command line, which pulls in a lot
  of unwanted stuff
- whole-archive also makes libtool dropping the library from the command
  line, which is what should happen with dynamic linking, but not with
  static (observed on Ubuntu 14.04). This is an important bug if you
  build a static library depending on DPDK

This patch merges the two versions and makes the behaviour rely on the
config.

If we can agree on the concept, I can send a series to fix this for the
other weak functions.

Signed-off-by: Zoltan Kiss 
---
 drivers/net/i40e/i40e_rxtx.c | 36 +++-
 drivers/net/i40e/i40e_rxtx_vec.c | 36 
 2 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index d3cfb98..ad34d3a 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3278,10 +3278,44 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 }

 /* Stubs needed for linkage when CONFIG_RTE_I40E_INC_VECTOR is set to 'n' */
-int __attribute__((weak))
+int __attribute__((cold))
 i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev __rte_unused *dev)
 {
+#ifndef RTE_LIBRTE_I40E_INC_VECTOR
return -1;
+#else
+#ifndef RTE_LIBRTE_IEEE1588
+   struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
+   struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
+
+   /* need SSE4.1 support */
+   if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+   return -1;
+
+#ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
+   /* whithout rx ol_flags, no VP flag report */
+   if (rxmode->hw_vlan_strip != 0 ||
+   rxmode->hw_vlan_extend != 0)
+   return -1;
+#endif /* RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE */
+
+   /* no fdir support */
+   if (fconf->mode != RTE_FDIR_MODE_NONE)
+   return -1;
+
+/* - no csum error report support
+* - no header split support
+*/
+   if (rxmode->hw_ip_checksum == 1 ||
+   rxmode->header_split == 1)
+   return -1;
+
+   return 0;
+#else
+   RTE_SET_USED(dev);
+   return -1;
+#endif /* RTE_LIBRTE_IEEE1588 */
+#endif /* RTE_LIBRTE_I40E_INC_VECTOR */
 }

 uint16_t __attribute__((weak))
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 05cb415..983b2c0 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -723,39 +723,3 @@ i40e_txq_vec_setup(struct i40e_tx_queue __rte_unused *txq)
 {
return 0;
 }
-
-int __attribute__((cold))
-i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
-{
-#ifndef RTE_LIBRTE_IEEE1588
-   struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
-   struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
-
-   /* need SSE4.1 support */
-   if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-   return -1;
-
-#ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
-   /* whithout rx ol_flags, no VP flag report */
-   if (rxmode->hw_vlan_strip != 0 ||
-   rxmode->hw_vlan_extend != 0)
-   return -1;
-#endif
-
-   /* no fdir support */
-   if (fconf->mode != RTE_FDIR_MODE_NONE)
-   return -1;
-
-/* - no csum error report support
-* - no header split support
-*/
-   if (rxmode->hw_ip_checksum == 1 ||
-   rxmode->header_split == 1)
-   return -1;
-
-   return 0;
-#else
-   RTE_SET_USED(dev);
-   return -1;
-#endif
-}
-- 
1.9.1



[dpdk-dev] weak functions in some drivers

2016-07-14 Thread Zoltan Kiss
Hi,

On 01/07/16 11:19, Sergio Gonzalez Monroy wrote:
> On 01/07/2016 11:16, Elo, Matias (Nokia - FI/Espoo) wrote:
>>> -Original Message-
>>> From: Sergio Gonzalez Monroy [mailto:sergio.gonzalez.monroy at intel.com]
>>> Sent: Friday, July 01, 2016 1:05 PM
>>> To: Elo, Matias (Nokia - FI/Espoo) ;
>>> dev at dpdk.org
>>> Cc: ferruh.yigit at intel.com; damarion at cisco.com
>>> Subject: Re: [dpdk-dev] weak functions in some drivers
>>>
>>> On 01/07/2016 10:42, Elo, Matias (Nokia - FI/Espoo) wrote:
> What is not clear to me is motivation to use weak here instead of simply
> using CONFIG_RTE_I40E_INC_VECTOR macro to exclude stubs in i40e_rxtx.c.
> It will make library smaller and avoid issues like this one which are
> quite hard to troubleshoot.
 Since this issue seen in fd.io, I didn't investigated more, but I don't
 want to clock your valid question, this is an attempt to resurrect the
 question ...
>>> Hi,
>>>
>>> We are having exactly the same problem. For us the aforementioned
>>> workaround doesn't seem to work and vector mode is always disabled with
>>> the i40e drivers. If I modify i40e_rxtx.c and exclude the stub functions
>>> using CONFIG_RTE_I40E_INC_VECTOR everything works as expected.
>>> We are building DPDK with the CONFIG_RTE_BUILD_COMBINE_LIBS option
>>> enabled and link DPDK library to our application.
>>> Any other ideas how this could be fixed?
>>>
>>> Regards,
>>> Matias
>>>
>> So you have tried to link a combined static lib with --whole-archive
>> -ldpdk --no-whole-archive and still get the wrong/weak function
>> definition?
>>
>> Sergio
> I actually just managed to fix the problem. In our case I had to add
> '-Wl,--whole-archive,-ldpdk,--no-whole-archive'  to the end of
> AM_LDFLAGS.
>
 It turned out that the problem actually wasn't fixed.

 DPDK is built with CONFIG_RTE_BUILD_COMBINE_LIBS=y and EXTRA_CFLAGS="-fPIC"
 What we are linking originally:
 -l:libdpdk.a

 This works otherwise but vector mode i40e is not enabled.

 When trying:
 -Wl,--whole-archive,-l:libdpdk.a,--no-whole-archive

 Linking fails with 'undefined reference' errors to several dpdk functions
 (rte_eal_init, rte_cpu_get_flag_enabled, rte_eth_stats_get...).
 Btw. there seems to be a Stack Overflow question related to this:
 http://stackoverflow.com/questions/38064021/dpdk-include-libraries-in-dpdk-application-compiled-as-shared-library

 -Matias
>>> What DPDK version are you using?
>> v16.04
>
> Ok. I was asking because there is no CONFIG_RTE_BUILD_COMBINE_LIBS in
> 16.04.
>
>
> Could you provide full link/compile command line? I'm not able to
> reproduce the issue so far

I've dug into this issue a bit, let me explain it with some example code:

main.c:
void bar(void);
void foo(void);

int main() {
bar();
//foo();
}
==
weak.c:
#include <stdio.h>

void __attribute__((weak)) foo(void) {
printf("Weak\n");
}

void bar(void) {
foo();
printf("No call\n");
}
==
strong.c:
#include <stdio.h>

void foo(void) {
printf("Strong\n");
}

Compile the second two into a library and link it with main:

gcc -o strong.o -c strong.c
gcc -o weak.o -c weak.c
ar rcs libbar.a strong.o weak.o
gcc main.c -L. -lbar -o main

Then look which foo were used:

objdump -x libbar.a | grep foo
 g F .text  0010 foo
  wF .text  0010 foo
0015 R_X86_64_PC32 foo-0x0004

objdump -x main | grep foo
00400538 w F .text 0010 foo

So it got the weak version, simply because bar() was referred, and that 
already pulled the other symbols from that .o file, so ld stopped looking 
for the other ones. If you uncomment the foo() call in main(), it will 
do a proper search and pull in the strong symbol.
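
For comparison, forcing the whole archive in pulls strong.o too, and the
strong symbol wins (same example as above):

gcc main.c -L. -Wl,--whole-archive -lbar -Wl,--no-whole-archive -o main
objdump -x main | grep foo

Now foo shows up in main as a global (strong) symbol.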

I think there are only workarounds here:
- use --whole-archive, but that pulls in a lot of stuff you don't 
necessarily need. And if you feed that stuff to libtool, it seems to 
misbehave when it sees the -Wl parameter. The Ubuntu 14.04 libtool 
definitely has a bug around that. Matias ran into this with ODP-DPDK: it 
fixed the weak symbol problem in the .so, but then libtool forgot to add 
-ldpdk when doing static linking
- use only dynamic linking: I guess that's not a viable restriction

The best would be to get rid of these weak functions, as they don't work 
the same with dynamic and static linking, especially the latter is 
problematic. I think Damjan suggested something like that in his 
original message.

I propose something like this:

@@ -3268,10 +3268,40 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
  }

  /* Stubs needed for linkage when CONFIG_RTE_I40E_INC_VECTOR is set to 

[dpdk-dev] [PATCH] ixgbe:enable configuration for old ptype behavior

2016-06-27 Thread Zoltan Kiss
Hi,

On 27/06/16 09:00, Ananyev, Konstantin wrote:
>> The default behavior is to NOT support the old ptype behavior,
>> but enabling the configuration option the old ptype style
>> can be supported.
>>
>> Add support for old behaviour until we have a cleaner solution using
>> a configuration option CONFIG_RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOUR,
>> which is defaulted to not set.
>
> I think it was explained why we had to disable ptype recognition for vRX, 
> please see here:
> http://dpdk.org/browse/dpdk/commit/drivers/net/ixgbe/ixgbe_rxtx_vec.c?id=d9a2009a81089093645fea2e04b51dd37edf3e6f
> I think that introducing a compile time option to re-enable incomplete
> and not fully correct functionality is a very bad approach.

Btw. I wanted to ask, is there a plan to fix vector RX to set ptype?



> So NACK.
> Konstantin
>
>>
>> Signed-off-by: Keith Wiles 
>> ---
>>   config/common_base |  2 ++
>>   drivers/net/ixgbe/ixgbe_ethdev.c   |  6 +
>>   drivers/net/ixgbe/ixgbe_rxtx_vec.c | 52 +++---
>>   3 files changed, 57 insertions(+), 3 deletions(-)
>>
>> diff --git a/config/common_base b/config/common_base
>> index bdde2e7..05e69bc 100644
>> --- a/config/common_base
>> +++ b/config/common_base
>> @@ -160,6 +160,8 @@ CONFIG_RTE_LIBRTE_IXGBE_DEBUG_DRIVER=n
>>   CONFIG_RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC=n
>>   CONFIG_RTE_IXGBE_INC_VECTOR=y
>>   CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y
>> +# Enable to restore old ptype behavior
>> +CONFIG_RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR=n
>>
>>   #
>>   # Compile burst-oriented I40E PMD driver
>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>> index e11a431..068b92b 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>> @@ -3076,7 +3076,13 @@ ixgbe_dev_supported_ptypes_get(struct rte_eth_dev *dev)
>>  if (dev->rx_pkt_burst == ixgbe_recv_pkts ||
>>  dev->rx_pkt_burst == ixgbe_recv_pkts_lro_single_alloc ||
>>  dev->rx_pkt_burst == ixgbe_recv_pkts_lro_bulk_alloc ||
>> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
>> +dev->rx_pkt_burst == ixgbe_recv_pkts_bulk_alloc ||
>> +dev->rx_pkt_burst == ixgbe_recv_pkts_vec ||
>> +dev->rx_pkt_burst == ixgbe_recv_scattered_pkts_vec)
>> +#else
>>  dev->rx_pkt_burst == ixgbe_recv_pkts_bulk_alloc)
>> +#endif
>>  return ptypes;
>>  return NULL;
>>   }
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> index 12190d2..2e0d50b 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> @@ -228,6 +228,10 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>>  );
>>  __m128i dd_check, eop_check;
>>  uint8_t vlan_flags;
>> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
>> +__m128i desc_mask = _mm_set_epi32(0x, 0x,
>> +  0x, 0x07F0);
>> +#endif
>>
>>  /* nb_pkts shall be less equal than RTE_IXGBE_MAX_RX_BURST */
>>  nb_pkts = RTE_MIN(nb_pkts, RTE_IXGBE_MAX_RX_BURST);
>> @@ -268,8 +272,14 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>>  13, 12,  /* octet 12~13, 16 bits data_len */
>>  0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
>>  13, 12,  /* octet 12~13, low 16 bits pkt_len */
>> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
>> +0xFF, 0xFF,  /* skip high 16 bits pkt_type */
>> +1,   /* octet 1, 8 bits pkt_type field */
>> +0/* octet 0, 4 bits offset 4 pkt_type field */
>> +#else
>>  0xFF, 0xFF,  /* skip 32 bit pkt_type */
>>  0xFF, 0xFF
>> +#endif
>>  );
>>
>>  /* Cache is empty -> need to scan the buffer rings, but first move
>> @@ -291,6 +301,9 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>>  for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
>>  pos += RTE_IXGBE_DESCS_PER_LOOP,
>>  rxdp += RTE_IXGBE_DESCS_PER_LOOP) {
>> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
>> +__m128i descs0[RTE_IXGBE_DESCS_PER_LOOP];
>> +#endif
>>  __m128i descs[RTE_IXGBE_DESCS_PER_LOOP];
>>  __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
>>  __m128i zero, staterr, sterr_tmp1, sterr_tmp2;
>> @@ -301,18 +314,30 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>>
>>  /* Read desc statuses backwards to avoid race condition */
>>  /* A.1 load 4 pkts desc */
>> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
>> +descs0[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
>> +#else
>>  descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
>> -
>> +#endif
>>  /* B.2 copy 2 mbuf point into rx_pkts  */
>>

[dpdk-dev] [PATCH] mbuf: make rearm_data address naturally aligned

2016-05-20 Thread Zoltan Kiss
Hi,

On 19/05/16 13:18, Ananyev, Konstantin wrote:
> I wonder does anyone really use mbuf port field?
> My though was - could we to drop it completely?

There are a few example codes which read the port field. Although they 
can retain this metadata in the private area of the mbuf right after 
receiving, it would cause them a minor perf drop to do it separately. 
I'm not sure which one is more important: this perf drop, or the gain 
everyone else has by relieving the drivers from doing it.

Zoli


[dpdk-dev] ixgbe TX function selection

2016-03-18 Thread Zoltan Kiss


On 18/03/16 00:45, Lu, Wenzhuo wrote:
> Hi Zoltan,
>
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Friday, March 18, 2016 1:11 AM
>> To: Wu, Jingjing; dev at dpdk.org
>> Subject: Re: [dpdk-dev] ixgbe TX function selection
>>
>>
>>
>> On 10/03/16 07:51, Wu, Jingjing wrote:
>>> Hi, Zoltan
>>>
>>>> -----Original Message-
>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>> Sent: Wednesday, March 2, 2016 3:19 AM
>>>> To: dev at dpdk.org
>>>> Subject: [dpdk-dev] ixgbe TX function selection
>>>>
>>>> Hi,
>>>>
>>>> I've noticed that ixgbe_set_tx_function() selects the non-SG function
>>>> even if (dev->data->scattered_rx == 1). That seems a bit dangerous,
>>>> as you can turn that on inadvertently when you don't set
>>>> max_rx_pkt_len and buffer size in certain ways. I've learnt it in the
>>>> hard way, as my segmented packets were leaking memory on the TX path,
>>>> which doesn't cries if you send out segmented packets.
>>>> How should this case be treated? Assert on the non-SG TX side for the
>>>> 'next' pointer? Or turning on SG if RX has it? It doesn't seem to be
>>>> a solid way as other interfaces still can have SG turned on.
>>>>
>>>
>>> If you look into the ixgbe_set_tx_function, you will find tx function
>>> selection is decided by the tx_flags on queue configure, which is
>>> passed by rte_eth_txconf. So even you set dev->data->scattered_rx to
>>> 1, if the tx_flags is ETH_TXQ_FLAGS_NOMULTSEGS, ixgbe_xmit_pkts_simple
>>> is still selected as tx function. So, you'd better to set tx_flags=0, and 
>>> have a try.
>>
>> You mean getting default_txconf from rte_eth_dev_info_get() and explicitly 
>> turn
>> ETH_TXQ_FLAGS_NOMULTSEGS to 0? (filling tx_flags with zeros doesn't work
>> very well) That's a way to solve it for me, but I'm rather talking about 
>> using
>> defaults which doesn't cause memory leak quite easily.
> Yes, ETH_TXQ_FLAGS_NOMULTSEGS only can be set to 1 when you know all your 
> packets will not be segmented.
> I think that means normally we should use full function path for TX, for we 
> have no knowledge about if the packets will be segmented or not.
> You don't need to set tx_flags to 0, only the ETH_TXQ_FLAGS_NOMULTSEGS bit 
> should be 0, the other bits can be 1 if needed.

So can we agree that the default settings should set 
ETH_TXQ_FLAGS_NOMULTSEGS to 0?
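
For reference, this is roughly what an application has to do today to get
the multi-segment capable TX path (a sketch; port_id, queue_id, nb_desc
and socket_id are assumed to be set up elsewhere):

struct rte_eth_dev_info dev_info;
struct rte_eth_txconf txconf;

rte_eth_dev_info_get(port_id, &dev_info);
txconf = dev_info.default_txconf;
/* allow multi-segment mbufs on this TX queue */
txconf.txq_flags &= ~ETH_TXQ_FLAGS_NOMULTSEGS;
rte_eth_tx_queue_setup(port_id, queue_id, nb_desc, socket_id, &txconf);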


>
>>
>>>
>>>> Regards,
>>>>
>>>> Zoltan


[dpdk-dev] Huge pages to be allocated based on number of mbufs

2016-03-17 Thread Zoltan Kiss


On 14/03/16 17:54, Saurabh Mishra wrote:
> Hi,
>
> We are planning to support virtio, vmxnet3, ixgbe, i40e, bxn2x and SR-IOV
> on some of them with DPDK.
>
> We have seen that even if we give correct number of mbufs given the number
> hugepages reserved, rte_eth_tx_queue_setup() may still fail with no enough
> memory (I saw this on i40evf but worked on virtio and vmxnet3).
>
> We like to know what's the recommended way to determine how many hugepages
> we should allocate given the number of mbufs such that queue setup APIs
> also don't fail.

I think you ran into a fragmentation problem. If you allocate the 
hugepages later on, after startup, chances are they are fragmented in 
memory. When you allocate a pool, DPDK needs a contiguous area of memory 
on the hugepages.
You should allocate them through the kernel boot params so they'll be as 
contiguous as possible.
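
For example, with 2 MB pages the reservation can go on the kernel command
line (the counts here are illustrative):

default_hugepagesz=2M hugepagesz=2M hugepages=1024

or, on hardware that supports them, 1 GB pages:

default_hugepagesz=1G hugepagesz=1G hugepages=4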


>
> Since we will be running on low-end systems too we need to be careful about
> reserving hugepages.
>
> Thanks,
> /Saurabh
>


[dpdk-dev] ixgbe TX function selection

2016-03-17 Thread Zoltan Kiss


On 10/03/16 07:51, Wu, Jingjing wrote:
> Hi, Zoltan
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Wednesday, March 2, 2016 3:19 AM
>> To: dev at dpdk.org
>> Subject: [dpdk-dev] ixgbe TX function selection
>>
>> Hi,
>>
>> I've noticed that ixgbe_set_tx_function() selects the non-SG function
>> even if (dev->data->scattered_rx == 1). That seems a bit dangerous, as
>> you can turn that on inadvertently when you don't set max_rx_pkt_len and
>> buffer size in certain ways. I've learnt it in the hard way, as my
>> segmented packets were leaking memory on the TX path, which doesn't
>> cries if you send out segmented packets.
>> How should this case be treated? Assert on the non-SG TX side for the
>> 'next' pointer? Or turning on SG if RX has it? It doesn't seem to be a
>> solid way as other interfaces still can have SG turned on.
>>
>
> If you look into the ixgbe_set_tx_function, you will find tx function
> selection is decided by the tx_flags on queue configure, which is
> passed by rte_eth_txconf. So even you set dev->data->scattered_rx to 1,
> if the tx_flags is ETH_TXQ_FLAGS_NOMULTSEGS, ixgbe_xmit_pkts_simple is
> still selected as tx function. So, you'd better to set tx_flags=0, and have a 
> try.

You mean getting default_txconf from rte_eth_dev_info_get() and 
explicitly turning ETH_TXQ_FLAGS_NOMULTSEGS to 0? (Filling tx_flags with 
zeros doesn't work very well.) That's a way to solve it for me, but I'm 
rather talking about having defaults which don't cause memory leaks 
quite so easily.

>
>> Regards,
>>
>> Zoltan


[dpdk-dev] rte_mbuf's packet_type field

2016-03-04 Thread Zoltan Kiss


On 04/03/16 10:58, Ananyev, Konstantin wrote:
> Hi,
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier MATZ
>> Sent: Friday, March 04, 2016 9:29 AM
>> To: Zoltan Kiss; dev at dpdk.org
>> Subject: Re: [dpdk-dev] rte_mbuf's packet_type field
>>
>> Hi Zoltan,
>>
>> On 03/01/2016 06:15 PM, Zoltan Kiss wrote:
>>> I have a quick question about this field: how do I know if the
>>> underlying PMD supports a particular protocol parsing. Let's say I want
>>> to check for SCTP packets, looking at this field tells me EITHER the
>>> packet is SCTP (or not), OR that the hardware has no idea about SCTP. Is
>>> there a way to figure that support out?
>>
>> I'm not aware of such a feature. I guess you want to avoid to re-check
>> all protocols in software if the hardware supports some of them and
>> did not recognize them? In that case it may be interesting, but it would
>> result in a lot of feature flags.
>
> Nothing in the mainline, but there is a patches in-flight:
> http://dpdk.org/dev/patchwork/patch/10921/

Thanks, this would do it!

Regards,

Zoltan

>
> Konstantin
>
>>
>> Regards,
>> Olivier
>
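
A sketch of what the check looks like with that API
(rte_eth_dev_get_supported_ptypes(), as merged for 16.04), testing for
L4 SCTP recognition; the helper name is illustrative:

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static int
port_parses_sctp(uint8_t port_id)
{
        uint32_t ptypes[16];
        int i, n;

        n = rte_eth_dev_get_supported_ptypes(port_id, RTE_PTYPE_L4_MASK,
                                             ptypes, RTE_DIM(ptypes));
        for (i = 0; i < n && i < (int)RTE_DIM(ptypes); i++)
                if (ptypes[i] == RTE_PTYPE_L4_SCTP)
                        return 1;
        return 0;
}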


[dpdk-dev] ixgbe TX function selection

2016-03-04 Thread Zoltan Kiss


On 04/03/16 01:47, Lu, Wenzhuo wrote:
> Hi Zoltan,
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Wednesday, March 2, 2016 3:19 AM
>> To: dev at dpdk.org
>> Subject: [dpdk-dev] ixgbe TX function selection
>>
>> Hi,
>>
>> I've noticed that ixgbe_set_tx_function() selects the non-SG function even if
> Thanks for let us know the problem. But I don't catch your point. Do you 
> really mean TX here? After a quick look at the code, I don?t find the 
> SG/non-SG functions for TX. Do I miss something?

The simple code path doesn't handle multisegmented packets. 
ixgbe_txq_vec_setup() and ixgbe_xmit_pkts_simple() don't even check 
the next pointer of the mbuf, they just put the first segment on the 
descriptor ring, and when TX completion happens the memory is leaked, 
because the code just sets ->next to NULL and calls rte_mempool_put[_bulk].
ixgbe_xmit_pkts() puts all the segments on the descriptor ring, 
therefore when the descriptors are released the segments are freed as well.
This is what these functions are supposed to do, but my point is it's very 
easy to send a multisegmented packet to the simple code path.


>
>> (dev->data->scattered_rx == 1). That seems a bit dangerous, as you can turn
>> that on inadvertently when you don't set max_rx_pkt_len and buffer size in
>> certain ways. I've learnt it in the hard way, as my segmented packets were
>> leaking memory on the TX path, which doesn't cries if you send out segmented
>> packets.
> Which one will cause problem? SG or non-SG packets? And where does the memory 
> leak happen?
>
>> How should this case be treated? Assert on the non-SG TX side for the 'next'
>> pointer? Or turning on SG if RX has it? It doesn't seem to be a solid way as 
>> other
>> interfaces still can have SG turned on.
>>
>> Regards,
>>
>> Zoltan


[dpdk-dev] ixgbe TX function selection

2016-03-01 Thread Zoltan Kiss
Hi,

I've noticed that ixgbe_set_tx_function() selects the non-SG function 
even if (dev->data->scattered_rx == 1). That seems a bit dangerous, as 
you can turn that on inadvertently when you don't set max_rx_pkt_len and 
buffer size in certain ways. I've learnt it the hard way, as my 
segmented packets were leaking memory on the TX path, which doesn't 
complain if you send out segmented packets.
How should this case be treated? Assert on the non-SG TX side for the 
'next' pointer? Or turning on SG if RX has it? That doesn't seem to be a 
solid way, as other interfaces can still have SG turned on.

Regards,

Zoltan


[dpdk-dev] rte_mbuf's packet_type field

2016-03-01 Thread Zoltan Kiss
Hi,

I have a quick question about this field: how do I know whether the 
underlying PMD supports parsing a particular protocol? Let's say I want 
to check for SCTP packets: looking at this field tells me EITHER that 
the packet is SCTP (or not), OR that the hardware has no idea about 
SCTP. Is there a way to figure that support out?

Zoli


[dpdk-dev] Sending and receiving on the same port at the same time, from different threads

2016-01-15 Thread Zoltan Kiss
Hi,

I've been asked this question, and I realized I'm not sure about the 
answer. In other words: can you call rte_eth_rx_burst() and 
rte_eth_tx_burst() on the same port at the same time from different 
threads? In theory it seems possible, as you still access different 
queues (an RX and a TX one), and at least taking a glance at the ixgbe 
vector functions, they don't seem to use common resources while doing RX 
or TX. But I'm not sure that it's generally true, although I've always 
assumed that it should be. Has anyone seen a device where it 
wasn't true?

Regards,

Zoltan


[dpdk-dev] Reshuffling of rte_mbuf structure.

2015-11-03 Thread Zoltan Kiss
Also, there could be places in the code where we change a set of 
contiguous fields in the mbuf. E.g. the ixgbe vector PMD receive function 
takes advantage of 128-bit vector registers and fills out 
rx_descriptor_fields1 with one instruction. But I guess there are other 
places too, and they are really hard to find with code analysis. A 
change in the mbuf structure would probably bring a plethora of nasty 
bugs due to this.
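
As a toy illustration of the pattern (the field names and layout only mimic
rx_descriptor_fields1; this is not the real struct rte_mbuf):

#include <stdint.h>
#include <emmintrin.h>

/* A 16-byte run of fields, filled by a single SSE store; reordering
 * the members would silently break the store without any compile error. */
struct toy_fields {
    uint32_t packet_type;
    uint32_t pkt_len;
    uint16_t data_len;
    uint16_t vlan_tci;
    uint32_t rss_hash;
};

int
main(void)
{
    struct toy_fields f;
    __m128i desc = _mm_set_epi32(0xdeadbeef, 0x01110040, 64, 0x0111);

    _mm_storeu_si128((__m128i *)&f, desc); /* one store, five fields */
    return (int)f.data_len;
}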

On 03/11/15 10:20, Bruce Richardson wrote:
> On Mon, Nov 02, 2015 at 07:21:17PM -0500, Matthew Hall wrote:
>> On Mon, Nov 02, 2015 at 11:51:23PM +0100, Thomas Monjalon wrote:
>>> But it is simpler to say that having an API depending on some options
>>> is a "no-design" which could seriously slow down the DPDK adoption.
>>
>> What about something similar to how Java JNI works? It needed to support
>> multiple Java JRE / JDK brands, implementations etc. Upon initialization, a
>> function pointer array is created, and specific slots are filled with 
>> pointers
>> to the real implementation of some native API functions you can call from
>> inside your library to perform operations.
>>
>> In the DPDK case, we need flexible data instead of flexible function
>> implementations.
>>
>> To do this there would be some pointer slots in the mbuf that are are filled
>> with pointers to metadata for required DPDK features. The data could be 
>> placed
>> in the following cachelines, using some reserved tailroom between the mbuf
>> control block and the packet data block. Then the prefetch could be set up to
>> prefetch only the used parts of the tailroom at any given point, to prevent
>> unwanted slowdowns.
>>
>> Matthew.
>
> The trouble is that a lot of the metadata comes from the receive descriptor on
> the RX code path, which is extremely sensitive to cache line usage. This is 
> why
> in the 1.8 changes to the mbuf, the data used by the RX code paths were all 
> put
> on the first cacheline.
>
> /Bruce
>


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-10-20 Thread Zoltan Kiss


On 19/10/15 19:57, Ananyev, Konstantin wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, October 19, 2015 5:30 PM
>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>> function
>>
>>
>>
>> On 15/10/15 16:43, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Thursday, October 15, 2015 11:33 AM
>>>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>>>> function
>>>>
>>>>
>>>>
>>>> On 15/10/15 00:23, Ananyev, Konstantin wrote:
>>>>>
>>>>>
>>>>>> -Original Message-
>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>> Sent: Wednesday, October 14, 2015 5:11 PM
>>>>>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD 
>>>>>> receive function
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 28/09/15 00:19, Ananyev, Konstantin wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>>>> Sent: Friday, September 25, 2015 7:29 PM
>>>>>>>> To: Richardson, Bruce; dev at dpdk.org
>>>>>>>> Cc: Ananyev, Konstantin
>>>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD 
>>>>>>>> receive function
>>>>>>>>
>>>>>>>> On 07/09/15 07:41, Richardson, Bruce wrote:
>>>>>>>>>
>>>>>>>>>> -Original Message-
>>>>>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>>>>>> Sent: Monday, September 7, 2015 3:15 PM
>>>>>>>>>> To: Richardson, Bruce; dev at dpdk.org
>>>>>>>>>> Cc: Ananyev, Konstantin
>>>>>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD 
>>>>>>>>>> receive
>>>>>>>>>> function
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 07/09/15 13:57, Richardson, Bruce wrote:
>>>>>>>>>>>
>>>>>>>>>>>> -Original Message-
>>>>>>>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>>>>>>>> Sent: Monday, September 7, 2015 1:26 PM
>>>>>>>>>>>> To: dev at dpdk.org
>>>>>>>>>>>> Cc: Ananyev, Konstantin; Richardson, Bruce
>>>>>>>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD
>>>>>>>>>>>> receive function
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I just realized I've missed the "[PATCH]" tag from the subject. Did
>>>>>>>>>>>> anyone have time to review this?
>>>>>>>>>>>>
>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>
>>>>>>>>>>> the big thing that concerns me with this is the addition of new
>>>>>>>>>>> instructions for each packet in the fast path. Ideally, this
>>>>>>>>>>> prefetching would be better handled in the application itself, as 
>>>>>>>>>>> for
>>>>>>>>>>> some apps, e.g. those using pipelining, the core doing the RX from 
>>>>>>>>>>> the
>>>>>>>>>>> NIC may not touch the packet data at all, and the prefetches will
>>>>>>>>>> instead cause a performance slowdown.
>>>>>>>

[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-10-19 Thread Zoltan Kiss


On 15/10/15 16:43, Ananyev, Konstantin wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Thursday, October 15, 2015 11:33 AM
>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>> function
>>
>>
>>
>> On 15/10/15 00:23, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Wednesday, October 14, 2015 5:11 PM
>>>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>>>> function
>>>>
>>>>
>>>>
>>>> On 28/09/15 00:19, Ananyev, Konstantin wrote:
>>>>>
>>>>>
>>>>>> -Original Message-
>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>> Sent: Friday, September 25, 2015 7:29 PM
>>>>>> To: Richardson, Bruce; dev at dpdk.org
>>>>>> Cc: Ananyev, Konstantin
>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD 
>>>>>> receive function
>>>>>>
>>>>>> On 07/09/15 07:41, Richardson, Bruce wrote:
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>>>> Sent: Monday, September 7, 2015 3:15 PM
>>>>>>>> To: Richardson, Bruce; dev at dpdk.org
>>>>>>>> Cc: Ananyev, Konstantin
>>>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD 
>>>>>>>> receive
>>>>>>>> function
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/09/15 13:57, Richardson, Bruce wrote:
>>>>>>>>>
>>>>>>>>>> -Original Message-
>>>>>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>>>>>> Sent: Monday, September 7, 2015 1:26 PM
>>>>>>>>>> To: dev at dpdk.org
>>>>>>>>>> Cc: Ananyev, Konstantin; Richardson, Bruce
>>>>>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD
>>>>>>>>>> receive function
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I just realized I've missed the "[PATCH]" tag from the subject. Did
>>>>>>>>>> anyone have time to review this?
>>>>>>>>>>
>>>>>>>>> Hi Zoltan,
>>>>>>>>>
>>>>>>>>> the big thing that concerns me with this is the addition of new
>>>>>>>>> instructions for each packet in the fast path. Ideally, this
>>>>>>>>> prefetching would be better handled in the application itself, as for
>>>>>>>>> some apps, e.g. those using pipelining, the core doing the RX from the
>>>>>>>>> NIC may not touch the packet data at all, and the prefetches will
>>>>>>>> instead cause a performance slowdown.
>>>>>>>>> Is it possible to get the same performance increase - or something
>>>>>>>>> close to it - by making changes in OVS?
>>>>>>>> OVS already does a prefetch when it's processing the previous packet, 
>>>>>>>> but
>>>>>>>> apparently it's not early enough. At least for my test scenario, where 
>>>>>>>> I'm
>>>>>>>> forwarding UDP packets with the least possible overhead. I guess in 
>>>>>>>> tests
>>>>>>>> where OVS does more complex processing it should be fine.
>>>>>>>> I'll try to move the prefetch earlier in OVS codebase, but I'm not 
>>>>>>>> sure if
>>>>>>>> it'll help.
>>>>>>> I would suggest trying to prefetch more than one packet ahead. 
>>>>>>> Prefetching 4 or
>>>>>>> 8 ahead might work better, depending on the processing being done.

[dpdk-dev] Calling rte_eth_rx_burst() multiple times

2015-10-15 Thread Zoltan Kiss


On 15/10/15 11:43, Younghwan Go wrote:
> Hi Zoltan,
>
> Thanks for the email.
>
> On 2015-10-15 7:23 PM, Zoltan Kiss wrote:
>>
>>
>> On 15/10/15 09:32, Younghwan Go wrote:
>>> Hi,
>>>
>>> I'm pretty new to playing with DPDK. I was trying to see if I can always
>>> receive MAX_BURST packets by calling rte_eth_rx_burst() multiple times
>>> on same <port, queue> pair (code shown below). I'm using DPDK-2.1.0 on 2
>>> dual-port Intel 82599ES 10Gbps NICs with Ubuntu 14.04.3 (kernel
>>> 3.13.0-63-generic).
>>>
>>> Since packet processing is slower (~10 Gbps) than pure RX speed (~40
>>> Gbps), I assumed rte_eth_rx_burst() would always receive some number of
>>> packets, eventually filling up MAX_BURST. But for multi-core case (4
>>> CPUs, 4 ports), rte_eth_rx_burst() starts to always return 0 after some
>>> time, causing all cores to be blocked forever. Analyzing the DPDK code
>>> (drivers/net/ixgbe/ixgbe_rxtx.c), I'm seeing that inside
>>> ixgbe_rx_scan_hw_ring() function, "rxdp->wb.upper.status.error" always
>>> returns 0 (where is this value set by the way?).
>>
>> I think it is set by the hardware.
>>
>>>
>>> I didn't see this problem for single-core case, in which it returned
>>> MAX_BURST packets at every rte_eth_rx_burst() call. Also, if I break out
>>> of while loop when I receive 0, I keep receiving packets in next <port,
>>> queue> pairs. Does anyone know why this block might happen? Or am I not
>>> allowed to call rte_eth_rx_burst() multiple times on same <port, queue>
>>> pair if I get 0? Any help will be great! Thank you!
>>
>> Although not mentioned in the documentation itself, as far as I know
>> rte_eth_rx_burst() is not thread-safe. If you look into the receive
>> functions, there is no locking anywhere. You should call it on
>> separate queues from different threads, and configure e.g. RSS so the
>> hardware distributes the traffic.
>
> I'm calling rte_eth_rx_burst() on separate queue ids for each thread.
> I'm actually using lcore_id (= 0, 1, 2, 3 for 4 threads pinned to each
> separate CPU core) as queue_id. I also made sure that this problem is
> not caused by threads conflicting by locking before calling
> rte_eth_rx_burst().
>
> For RSS, I configured with ETH_RSS_IP for load balancing traffic to each
> port and queue. But even if RSS wasn't set, shouldn't at least one core
> be receiving packets? What I'm seeing is all threads getting stuck at
> rte_eth_rx_burst() with return value of 0s indefinitely.
>
>>>
>>> 
>>> int cnt = MAX_BURST; // MAX_BURST = 32
>>> int off = 0;
>>> do {
>>>  ret = rte_eth_rx_burst(port_id, queue_id, &m_table[off], cnt);

Another thing which might cause your problem is that I don't see where 
you release the buffers after they are received into m_table. You need 
to call rte_pktmbuf_free() on them at some point, otherwise your pool 
can get depleted, and the receive function can't refill the descriptor rings.
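
A minimal consume-and-free loop for after the burst loop, assuming the
packets aren't forwarded anywhere:

#include <rte_mbuf.h>

/* Return every received mbuf to its pool so the PMD can refill the
 * RX descriptor ring; without this the pool eventually runs dry. */
static void
drain_burst(struct rte_mbuf **m_table, uint16_t nb)
{
    uint16_t i;

    for (i = 0; i < nb; i++) {
        /* ... inspect rte_pktmbuf_mtod(m_table[i], void *) here ... */
        rte_pktmbuf_free(m_table[i]);
    }
}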


>>>  if (ret == 0) {
>>>  // don't break out but continue
>>>  } else if (ret > 0) {
>>>  off += ret;
>>>  cnt -= ret;
>>>  }
>>> } while (cnt > 0);
>>> 
>>>
>>> Best,
>>> Younghwan
>
> Thanks,
> Younghwan


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-10-15 Thread Zoltan Kiss


On 15/10/15 00:23, Ananyev, Konstantin wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Wednesday, October 14, 2015 5:11 PM
>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>> function
>>
>>
>>
>> On 28/09/15 00:19, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Friday, September 25, 2015 7:29 PM
>>>> To: Richardson, Bruce; dev at dpdk.org
>>>> Cc: Ananyev, Konstantin
>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>>>> function
>>>>
>>>> On 07/09/15 07:41, Richardson, Bruce wrote:
>>>>>
>>>>>> -Original Message-
>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>> Sent: Monday, September 7, 2015 3:15 PM
>>>>>> To: Richardson, Bruce; dev at dpdk.org
>>>>>> Cc: Ananyev, Konstantin
>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
>>>>>> function
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 07/09/15 13:57, Richardson, Bruce wrote:
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>>>> Sent: Monday, September 7, 2015 1:26 PM
>>>>>>>> To: dev at dpdk.org
>>>>>>>> Cc: Ananyev, Konstantin; Richardson, Bruce
>>>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD
>>>>>>>> receive function
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I just realized I've missed the "[PATCH]" tag from the subject. Did
>>>>>>>> anyone have time to review this?
>>>>>>>>
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>> the big thing that concerns me with this is the addition of new
>>>>>>> instructions for each packet in the fast path. Ideally, this
>>>>>>> prefetching would be better handled in the application itself, as for
>>>>>>> some apps, e.g. those using pipelining, the core doing the RX from the
>>>>>>> NIC may not touch the packet data at all, and the prefetches will
>>>>>> instead cause a performance slowdown.
>>>>>>> Is it possible to get the same performance increase - or something
>>>>>>> close to it - by making changes in OVS?
>>>>>> OVS already does a prefetch when it's processing the previous packet, but
>>>>>> apparently it's not early enough. At least for my test scenario, where 
>>>>>> I'm
>>>>>> forwarding UDP packets with the least possible overhead. I guess in tests
>>>>>> where OVS does more complex processing it should be fine.
>>>>>> I'll try to move the prefetch earlier in OVS codebase, but I'm not sure 
>>>>>> if
>>>>>> it'll help.
>>>>> I would suggest trying to prefetch more than one packet ahead. 
>>>>> Prefetching 4 or
>>>>> 8 ahead might work better, depending on the processing being done.
>>>>
>>>> I've moved the prefetch earlier, and it seems to work:
>>>>
>>>> https://patchwork.ozlabs.org/patch/519017/
>>>>
>>>> However it raises the question: should we remove header prefetch from
>>>> all the other drivers, and the CONFIG_RTE_PMD_PACKET_PREFETCH option?
>>>
>>> My vote would be for that.
>>> Konstantin
>>
>> After some further thinking I would rather support the
>> rte_packet_prefetch() macro (prefetch the header in the driver, and
>> configure it through CONFIG_RTE_PMD_PACKET_PREFETCH)
>>
>> - the prefetch can happen earlier, so if an application needs the header
>> right away, this is the fastest
>> - if the application doesn't need header prefetch, it can turn it off
>> compile time. Same as if we wouldn't have this option.
>> - if the application has mixed needs (sometimes it needs the header
>> right away, sometimes it doesn't), it can turn it off and do what it needs.

[dpdk-dev] Calling rte_eth_rx_burst() multiple times

2015-10-15 Thread Zoltan Kiss


On 15/10/15 09:32, Younghwan Go wrote:
> Hi,
>
> I'm pretty new to playing with DPDK. I was trying to see if I can always
> receive MAX_BURST packets by calling rte_eth_rx_burst() multiple times
> on same <port, queue> pair (code shown below). I'm using DPDK-2.1.0 on 2
> dual-port Intel 82599ES 10Gbps NICs with Ubuntu 14.04.3 (kernel
> 3.13.0-63-generic).
>
> Since packet processing is slower (~10 Gbps) than pure RX speed (~40
> Gbps), I assumed rte_eth_rx_burst() would always receive some number of
> packets, eventually filling up MAX_BURST. But for multi-core case (4
> CPUs, 4 ports), rte_eth_rx_burst() starts to always return 0 after some
> time, causing all cores to be blocked forever. Analyzing the DPDK code
> (drivers/net/ixgbe/ixgbe_rxtx.c), I'm seeing that inside
> ixgbe_rx_scan_hw_ring() function, "rxdp->wb.upper.status.error" always
> returns 0 (where is this value set by the way?).

I think it is set by the hardware.
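
(For reference, the field is spelled status_error in the driver headers; a
sketch of the test the scan loop does, with the real IXGBE_RXDADV_STAT_DD
value but otherwise simplified:)

#include <stdint.h>

#define IXGBE_RXDADV_STAT_DD 0x01 /* value from ixgbe_type.h */

/* The NIC DMA-writes the descriptor back and sets the DD (descriptor
 * done) bit; until then this word reads 0 and the burst returns nothing. */
static int
desc_done(volatile const uint32_t *status_error)
{
    return (*status_error & IXGBE_RXDADV_STAT_DD) != 0;
}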

>
> I didn't see this problem for single-core case, in which it returned
> MAX_BURST packets at every rte_eth_rx_burst() call. Also, if I break out
> of the while loop when I receive 0, I keep receiving packets in next
> <port, queue> pairs. Does anyone know why this block might happen? Or am I not
> allowed to call rte_eth_rx_burst() multiple times on same <port, queue>
> pair if I get 0? Any help will be great! Thank you!

Although not mentioned in the documentation itself, as far as I know 
rte_eth_rx_burst() is not thread-safe. If you look in to receive 
functions, there are no locking anywhere. You should call it on separate 
queues from different threads, and configure e.g RSS to distribute the 
traffic by the hardware.
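
A sketch of the intended pattern, RSS config plus one RX queue per polling
lcore (queue numbering here is illustrative):

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST 32

/* Passed to rte_eth_dev_configure(): let the NIC hash IP flows
 * across the RX queues. */
static const struct rte_eth_conf port_conf = {
    .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
    .rx_adv_conf.rss_conf = { .rss_hf = ETH_RSS_IP },
};

/* Launched once per lcore; each thread owns exactly one RX queue,
 * so no locking is needed around rte_eth_rx_burst(). */
static int
rx_loop(void *arg)
{
    uint8_t port = *(uint8_t *)arg;
    uint16_t queue = rte_lcore_id(); /* assumes queue ids == lcore ids */
    struct rte_mbuf *burst[BURST];
    uint16_t n;

    for (;;) {
        n = rte_eth_rx_burst(port, queue, burst, BURST);
        while (n)
            rte_pktmbuf_free(burst[--n]); /* placeholder processing */
    }
    return 0;
}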

>
> 
> int cnt = MAX_BURST; // MAX_BURST = 32
> int off = 0;
> do {
>  ret = rte_eth_rx_burst(port_id, queue_id, &m_table[off], cnt);
>  if (ret == 0) {
>  // don't break out but continue
>  } else if (ret > 0) {
>  off += ret;
>  cnt -= ret;
>  }
> } while (cnt > 0);
> 
>
> Best,
> Younghwan


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-10-14 Thread Zoltan Kiss


On 28/09/15 00:19, Ananyev, Konstantin wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Friday, September 25, 2015 7:29 PM
>> To: Richardson, Bruce; dev at dpdk.org
>> Cc: Ananyev, Konstantin
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive 
>> function
>>
>> On 07/09/15 07:41, Richardson, Bruce wrote:
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Monday, September 7, 2015 3:15 PM
>>>> To: Richardson, Bruce; dev at dpdk.org
>>>> Cc: Ananyev, Konstantin
>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
>>>> function
>>>>
>>>>
>>>>
>>>> On 07/09/15 13:57, Richardson, Bruce wrote:
>>>>>
>>>>>> -Original Message-
>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>> Sent: Monday, September 7, 2015 1:26 PM
>>>>>> To: dev at dpdk.org
>>>>>> Cc: Ananyev, Konstantin; Richardson, Bruce
>>>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD
>>>>>> receive function
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I just realized I've missed the "[PATCH]" tag from the subject. Did
>>>>>> anyone have time to review this?
>>>>>>
>>>>> Hi Zoltan,
>>>>>
>>>>> the big thing that concerns me with this is the addition of new
>>>>> instructions for each packet in the fast path. Ideally, this
>>>>> prefetching would be better handled in the application itself, as for
>>>>> some apps, e.g. those using pipelining, the core doing the RX from the
>>>>> NIC may not touch the packet data at all, and the prefetches will
>>>> instead cause a performance slowdown.
>>>>> Is it possible to get the same performance increase - or something
>>>>> close to it - by making changes in OVS?
>>>> OVS already does a prefetch when it's processing the previous packet, but
>>>> apparently it's not early enough. At least for my test scenario, where I'm
>>>> forwarding UDP packets with the least possible overhead. I guess in tests
>>>> where OVS does more complex processing it should be fine.
>>>> I'll try to move the prefetch earlier in OVS codebase, but I'm not sure if
>>>> it'll help.
>>> I would suggest trying to prefetch more than one packet ahead. Prefetching 
>>> 4 or
>>> 8 ahead might work better, depending on the processing being done.
>>
>> I've moved the prefetch earlier, and it seems to work:
>>
>> https://patchwork.ozlabs.org/patch/519017/
>>
>> However it raises the question: should we remove header prefetch from
>> all the other drivers, and the CONFIG_RTE_PMD_PACKET_PREFETCH option?
>
> My vote would be for that.
> Konstantin

After some further thinking I would rather support the 
rte_packet_prefetch() macro (prefetch the header in the driver, and 
configure it through CONFIG_RTE_PMD_PACKET_PREFETCH).

- the prefetch can happen earlier, so if an application needs the header 
right away, this is the fastest
- if the application doesn't need header prefetch, it can turn it off 
compile time. Same as if we wouldn't have this option.
- if the application has mixed needs (sometimes it needs the header 
right away, sometimes it doesn't), it can turn it off and do what it 
needs. Same as if we wouldn't have this option.

A harder decision would be whether it should be turned on or off by 
default. Currently it's on, and half of the receive functions don't use it.
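
For reference, the definition the drivers already carry (e.g. in
ixgbe_rxtx.c) is essentially the following; the config option surfaces as
RTE_PMD_PACKET_PREFETCH and the macro compiles away entirely when it's off:

#ifdef RTE_PMD_PACKET_PREFETCH
#define rte_packet_prefetch(p)  rte_prefetch1(p)
#else
#define rte_packet_prefetch(p)  do { } while (0)
#endif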

>
>
>>
>>>
>>> /Bruce
>


[dpdk-dev] [PATCH] acl: fix target arch detection

2015-09-25 Thread Zoltan Kiss
This test selects AVX2 code even if the target architecture doesn't support it.

Signed-off-by: Zoltan Kiss 
---
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 46acc2b..17a9f96 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -57,7 +57,7 @@ CFLAGS_acl_run_sse.o += -msse4.1
 # then add support for AVX2 classify method.
 #

-CC_AVX2_SUPPORT=$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+CC_AVX2_SUPPORT=$(shell $(CC) $(MACHINE_FLAGS) -dM -E - </dev/null 2>&1 | \
 grep -q AVX2 && echo 1)

 ifeq ($(CC_AVX2_SUPPORT), 1)



[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-25 Thread Zoltan Kiss
On 07/09/15 07:41, Richardson, Bruce wrote:
>
>> -Original Message-
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, September 7, 2015 3:15 PM
>> To: Richardson, Bruce; dev at dpdk.org
>> Cc: Ananyev, Konstantin
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
>> function
>>
>>
>>
>> On 07/09/15 13:57, Richardson, Bruce wrote:
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Monday, September 7, 2015 1:26 PM
>>>> To: dev at dpdk.org
>>>> Cc: Ananyev, Konstantin; Richardson, Bruce
>>>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD
>>>> receive function
>>>>
>>>> Hi,
>>>>
>>>> I just realized I've missed the "[PATCH]" tag from the subject. Did
>>>> anyone have time to review this?
>>>>
>>> Hi Zoltan,
>>>
>>> the big thing that concerns me with this is the addition of new
>>> instructions for each packet in the fast path. Ideally, this
>>> prefetching would be better handled in the application itself, as for
>>> some apps, e.g. those using pipelining, the core doing the RX from the
>>> NIC may not touch the packet data at all, and the prefetches will
>> instead cause a performance slowdown.
>>> Is it possible to get the same performance increase - or something
>>> close to it - by making changes in OVS?
>> OVS already does a prefetch when it's processing the previous packet, but
>> apparently it's not early enough. At least for my test scenario, where I'm
>> forwarding UDP packets with the least possible overhead. I guess in tests
>> where OVS does more complex processing it should be fine.
>> I'll try to move the prefetch earlier in OVS codebase, but I'm not sure if
>> it'll help.
> I would suggest trying to prefetch more than one packet ahead. Prefetching 4 
> or
> 8 ahead might work better, depending on the processing being done.

I've moved the prefetch earlier, and it seems to work:

https://patchwork.ozlabs.org/patch/519017/

However it raises the question: should we remove header prefetch from 
all the other drivers, and the CONFIG_RTE_PMD_PACKET_PREFETCH option?

>
> /Bruce



[dpdk-dev] [PATCH v2] ixgbe: prefetch cacheline after pointer becomes valid

2015-09-25 Thread Zoltan Kiss
At the original point, the rx_pkts[pos + n] pointers are not yet initialized,
so the code is prefetching random data.

Signed-off-by: Zoltan Kiss 
---
v2: fixing tabs

diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
index 3c6d8c5..ccd93c7 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
@@ -284,13 +284,6 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
rte_mbuf **rx_pkts,
__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
__m128i mbp1, mbp2; /* two mbuf pointer in one XMM reg. */

-   if (split_packet) {
-   rte_prefetch0(&rx_pkts[pos]->cacheline1);
-   rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
-   rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
-   rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
-   }
-
/* B.1 load 1 mbuf point */
mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);

@@ -312,6 +305,13 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
rte_mbuf **rx_pkts,
/* B.2 copy 2 mbuf point into rx_pkts  */
_mm_storeu_si128((__m128i *)&rx_pkts[pos+2], mbp2);

+   if (split_packet) {
+   rte_prefetch0(&rx_pkts[pos]->cacheline1);
+   rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
+   rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
+   rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
+   }
+
/* A* mask out 0~3 bits RSS type */
descs[3] = _mm_and_si128(descs0[3], desc_mask);
descs[2] = _mm_and_si128(descs0[2], desc_mask);


[dpdk-dev] [PATCH] ixgbe: prefetch cacheline after pointer becomes valid

2015-09-25 Thread Zoltan Kiss
On 25/09/15 04:47, Mcnamara, John wrote:
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Friday, September 25, 2015 1:23 AM
>> To: dev at dpdk.org
>> Subject: [dpdk-dev] [PATCH] ixgbe: prefetch cacheline after pointer
>> becomes valid
>>
>> +   if (split_packet) {
>> +   rte_prefetch0(&rx_pkts[pos]->cacheline1);
>> +   rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
>> +   rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
>> +   rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
>> +   }
>> +
>>  /* A* mask out 0~3 bits RSS type */
>>  descs[3] = _mm_and_si128(descs0[3], desc_mask);
>>  descs[2] = _mm_and_si128(descs0[2], desc_mask);
> Hi,
>
> This patch doesn't apply cleanly. It looks like all the tabs have been 
> replaced with spaces.

Sorry, my bad, I'll resend
>
> John.
>
>



[dpdk-dev] [PATCH] ixgbe: prefetch cacheline after pointer becomes valid

2015-09-24 Thread Zoltan Kiss
At the original point, the rx_pkts[pos + n] pointers are not yet initialized,
so the code is prefetching random data.

Signed-off-by: Zoltan Kiss 

diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
index 3c6d8c5..ccd93c7 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
@@ -284,13 +284,6 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
rte_mbuf **rx_pkts,
__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
__m128i mbp1, mbp2; /* two mbuf pointer in one XMM reg. */

-   if (split_packet) {
-   rte_prefetch0(&rx_pkts[pos]->cacheline1);
-   rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
-   rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
-   rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
-   }
-
/* B.1 load 1 mbuf point */
mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);

@@ -312,6 +305,13 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
rte_mbuf **rx_pkts,
/* B.2 copy 2 mbuf point into rx_pkts  */
_mm_storeu_si128((__m128i *)&rx_pkts[pos+2], mbp2);

+   if (split_packet) {
+   rte_prefetch0(&rx_pkts[pos]->cacheline1);
+   rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
+   rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
+   rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
+   }
+
/* A* mask out 0~3 bits RSS type */
descs[3] = _mm_and_si128(descs0[3], desc_mask);
descs[2] = _mm_and_si128(descs0[2], desc_mask);



[dpdk-dev] ixgbe vPMD question

2015-09-17 Thread Zoltan Kiss
Hi,

The recv function does a prefetch on cacheline1, however it seems to me 
that rx_pkts[pos] should be an uninitialized pointer at that time:

http://dpdk.org/browse/dpdk/tree/drivers/net/ixgbe/ixgbe_rxtx_vec.c#n287

So I guess it only prefetches a random value. Or am I missing something?

Regards,

Zoltan


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-07 Thread Zoltan Kiss


On 07/09/15 13:57, Richardson, Bruce wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, September 7, 2015 1:26 PM
>> To: dev at dpdk.org
>> Cc: Ananyev, Konstantin; Richardson, Bruce
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
>> function
>>
>> Hi,
>>
>> I just realized I've missed the "[PATCH]" tag from the subject. Did anyone
>> have time to review this?
>>
>
> Hi Zoltan,
>
> the big thing that concerns me with this is the addition of new instructions 
> for
> each packet in the fast path. Ideally, this prefetching would be better 
> handled
> in the application itself, as for some apps, e.g. those using pipelining, the
> core doing the RX from the NIC may not touch the packet data at all, and the
> prefetches will instead cause a performance slowdown.
>
> Is it possible to get the same performance increase - or something close to 
> it -
> by making changes in OVS?

OVS already does a prefetch when it's processing the previous packet, 
but apparently it's not early enough. At least for my test scenario, 
where I'm forwarding UDP packets with the least possible overhead. I 
guess in tests where OVS does more complex processing it should be fine.
I'll try to move the prefetch earlier in OVS codebase, but I'm not sure 
if it'll help.
Also, I've checked the PMD receive functions, and generally it's quite 
mixed whether they prefetch the header or not. All the other 3 ixgbe 
receive functions do that for example, as well as the following drivers:

bnx2x
e1000
fm10k (scattered)
i40e
igb
virtio

While these drivers don't do that:

cxgbe
enic
fm10k (non-scattered)
mlx4

I think it would be better to add rte_packet_prefetch() everywhere, 
because then applications can turn that off with 
CONFIG_RTE_PMD_PACKET_PREFETCH.
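
The application-side alternative Bruce suggested, prefetching several
packets ahead of the one being processed, would look roughly like this
(process_packet() stands in for the app's real work):

#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_AHEAD 4 /* 4 or 8, depending on the processing done */

void process_packet(struct rte_mbuf *m); /* app-specific stand-in */

static void
handle_burst(struct rte_mbuf **pkts, uint16_t n)
{
    uint16_t i;

    for (i = 0; i < n; i++) {
        /* start pulling in a header several packets early */
        if (i + PREFETCH_AHEAD < n)
            rte_prefetch0(rte_pktmbuf_mtod(pkts[i + PREFETCH_AHEAD],
                                           void *));
        process_packet(pkts[i]);
    }
}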

>
> Regards,
> /Bruce
>
>> Regards,
>>
>> Zoltan
>>
>> On 01/09/15 20:17, Zoltan Kiss wrote:
>>> The lack of this prefetch causes a significant performance drop in
>>> OVS-DPDK: 13.3 Mpps instead of 14 when forwarding 64 byte packets.
>>> Even though OVS prefetches the next packet's header before it starts
>>> processing the current one, it doesn't get there fast enough. This
>>> aligns with the behaviour of other receive functions.
>>>
>>> Signed-off-by: Zoltan Kiss 
>>> ---
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> index cf25a53..51299fa 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> @@ -502,6 +502,15 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq,
>> struct rte_mbuf **rx_pkts,
>>>   _mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
>>>   pkt_mb1);
>>>
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos + 1]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos + 2]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos + 3]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +
>>>   /* C.4 calc avaialbe number of desc */
>>>   var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
>>>   nb_pkts_recd += var;
>>>


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-07 Thread Zoltan Kiss
Hi,

I just realized I've missed the "[PATCH]" tag from the subject. Did 
anyone have time to review this?

Regards,

Zoltan

On 01/09/15 20:17, Zoltan Kiss wrote:
> The lack of this prefetch causes a significant performance drop in
> OVS-DPDK: 13.3 Mpps instead of 14 when forwarding 64 byte packets. Even
> though OVS prefetches the next packet's header before it starts processing
> the current one, it doesn't get there fast enough. This aligns with the
> behaviour of other receive functions.
>
> Signed-off-by: Zoltan Kiss 
> ---
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index cf25a53..51299fa 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -502,6 +502,15 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>  _mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
>  pkt_mb1);
>
> +   rte_packet_prefetch((char*)(rx_pkts[pos]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +   rte_packet_prefetch((char*)(rx_pkts[pos + 1]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +   rte_packet_prefetch((char*)(rx_pkts[pos + 2]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +   rte_packet_prefetch((char*)(rx_pkts[pos + 3]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +
>  /* C.4 calc avaialbe number of desc */
>  var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
>  nb_pkts_recd += var;
>


[dpdk-dev] rte_eal_init() alternative?

2015-09-02 Thread Zoltan Kiss


On 02/09/15 15:08, Jay Rolette wrote:
> On Wed, Sep 2, 2015 at 7:56 AM, Bruce Richardson <bruce.richardson at intel.com> wrote:
>
>> On Wed, Sep 02, 2015 at 12:49:40PM +, Montorsi, Francesco wrote:
>>> Hi all,
>>>
>>> Currently it seems that the only way to initialize EAL is using
>> rte_eal_init() function, correct?
>>>
>>> I have the problem that rte_eal_init() will call rte_panic() whenever
>> something fails to initialize or in other cases it will call exit().
>>> In my application, I would rather like to attempt DPDK initialization.
>> If it fails I don't want to exit.
>>> Unfortunately I cannot even copy the rte_eal_init() code into my
>> application (removing rte_panic and exit calls) since it uses a lot of DPDK
>> internal private functions.
>>>
>>> I think that my requirement (avoid abort/exit calls when init fails) is
>>> a basic requirement... would you accept a patch that adds an alternative
>> rte_eal_init() function that just returns an error code upon failure,
>> instead of immediately exiting?
>>>
>>> Thanks for your hard work!
>>>
>>> Francesco Montorsi
>>>
>> I, for one, would welcome such a patch. I think the code is overly quick in
>> many places to panic or exit the app, when an error code would be more
>> appropriate.
>> Feel free to also look at other libraries in DPDK too, if you like :-)
>>
>> Regards,
>> /Bruce
>>
>
> +1
>

+1
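
For the record, the shape being asked for, init that reports failure instead
of aborting, would be along these lines (rte_eal_init_noabort() is
hypothetical; it does not exist):

#include <stdio.h>
#include <rte_eal.h>
#include <rte_errno.h>

int
main(int argc, char **argv)
{
    /* hypothetical variant: returns an error code instead of
     * calling rte_panic()/exit() internally */
    int ret = rte_eal_init_noabort(argc, argv);

    if (ret < 0) {
        fprintf(stderr, "EAL init failed: %s\n",
                rte_strerror(rte_errno));
        return 1; /* the application decides what happens next */
    }
    /* ... */
    return 0;
}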


[dpdk-dev] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-01 Thread Zoltan Kiss
The lack of this prefetch causes a significant performance drop in
OVS-DPDK: 13.3 Mpps instead of 14 when forwarding 64 byte packets. Even
though OVS prefetches the next packet's header before it starts processing
the current one, it doesn't get there fast enough. This aligns with the
behaviour of other receive functions.

Signed-off-by: Zoltan Kiss 
---
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
index cf25a53..51299fa 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
@@ -502,6 +502,15 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
rte_mbuf **rx_pkts,
_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
pkt_mb1);

+   rte_packet_prefetch((char*)(rx_pkts[pos]->buf_addr) +
+   RTE_PKTMBUF_HEADROOM);
+   rte_packet_prefetch((char*)(rx_pkts[pos + 1]->buf_addr) +
+   RTE_PKTMBUF_HEADROOM);
+   rte_packet_prefetch((char*)(rx_pkts[pos + 2]->buf_addr) +
+   RTE_PKTMBUF_HEADROOM);
+   rte_packet_prefetch((char*)(rx_pkts[pos + 3]->buf_addr) +
+   RTE_PKTMBUF_HEADROOM);
+
/* C.4 calc avaialbe number of desc */
var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
nb_pkts_recd += var;



[dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD

2015-08-26 Thread Zoltan Kiss
Hi,

I've checked it further; based on Stephen's suggestion I've tried perf 
top as well. The results were the same: it spends a lot of time in that 
part of the code, and there is a high number of branch mispredictions 
(BR_MISS_PRED_RETIRED) around there too.
I've also started to strip down miniflow_extract() to remove parts which 
are not relevant to this very simple testcase. I've removed the metadata 
checking branches and the "size < sizeof(struct eth_header)" check. I've 
removed the size check from emc_processing, and placed log messages in 
flow_extract and netdev_flow_key_from_flow, to make sure the excessive 
time spent in miniflow_extract is not because these two are somehow 
calling it.
That way I've closed out all of the branches preceding this instruction. 
Oddly the high sample count has now moved down a few instructions:
...
dp_packet_reset_offsets
   5113eb:   b8 ff ff ff ff          mov    $0xffffffff,%eax
   5113f0:   66 89 8f 86 00 00 00    mov    %cx,0x86(%rdi)
   5113f7:   c6 87 81 00 00 00 00    movb   $0x0,0x81(%rdi)
   5113fe:   66 89 87 82 00 00 00    mov    %ax,0x82(%rdi)
data_pull
   511405:   48 8d 4d 0c             lea    0xc(%rbp),%rcx
dp_packet_reset_offsets
   511409:   66 89 97 84 00 00 00    mov    %dx,0x84(%rdi)
memcpy
   511410:   48 8b 45 00             mov    0x0(%rbp),%rax
   511414:   48 89 46 18             mov    %rax,0x18(%rsi)

This last instruction moves the first 8 bytes of the MAC addresses 
(coming from 0x0(%rbp)) to 0x18(%rsi), which is basically the memory 
pointed to by the parameter "struct miniflow *dst". It is allocated on 
the stack by emc_processing.
I couldn't find any branch which could cause this miss, but then I've 
checked the PMD stats:

pmd thread numa_id 0 core_id 1:
emc hits:4395834176
megaflow hits:1
miss:1
lost:0
polling cycles:166083129380 (16.65%)
processing cycles:831536059972 (83.35%)
avg cycles per packet: 226.95 (997619189352/4395834178)
avg processing cycles per packet: 189.16 (831536059972/4395834178)

So everything hits the EMC; when I measured the change of that counter 
for 10 seconds, the result was around ~13.3 Mpps too. The cycle 
statistics show that it should be able to handle more than 15M packets 
per second (at the 3.6 GHz turbo frequency, 3.6e9 / 226.95 avg cycles 
per packet is ~15.9 Mpps), yet it doesn't receive that much, while with 
the non-vector PMD it can max out the link.
Any more suggestions?

Regards,

Zoltan


On 21/08/15 19:05, Zoltan Kiss wrote:
> Hi,
>
> I've set up a simple packet forwarding perf test on a dual-port 10G
> 82599ES: one port receives 64 byte UDP packets, the other sends it out,
> one core used. I've used latest OVS with DPDK 2.1, and the first result
> was only 13.2 Mpps, which was a bit far from the 13.9 I've seen last
> year with the same test. The first thing I've changed was to revert back
> to the old behaviour about this issue:
>
> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
>
> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
> That increased the performance to 13.5, but to figure out what's wrong
> I started to play with the receive functions. First I've disabled vector
> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
> then I've enabled scattered RX, and with
> ixgbe_recv_pkts_lro_bulk_alloc() I could manage to get 13.98 Mpps, which
> is I guess as close as possible to the 14.2 line rate (on my HW at
> least, with one core)
> Does anyone have a good explanation of why the vector PMD performs so
> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
> should be able to reach ~14 Mpps, SG and vector PMD shouldn't make a
> difference.
> I've tried to look into it with oprofile, but the results were quite
> strange: 35% of the samples were from miniflow_extract, the part where
> parse_vlan calls data_pull to jump after the MAC addresses. The oprofile
> snippet (1M samples):
>
>511454 19       0.0037  flow.c:511
>511458 149      0.0292  dp-packet.h:266
>51145f 4264     0.8357  dp-packet.h:267
>511466 18       0.0035  dp-packet.h:268
>51146d 43       0.0084  dp-packet.h:269
>511474 172      0.0337  flow.c:511
>51147a 4320     0.8467  string3.h:51
>51147e 358763  70.3176  flow.c:99
>511482 2       3.9e-04  string3.h:51
>511485 3060     0.5998  string3.h:51
>511488 1693     0.3318  string3.h:51
>51148c 2933     0.5749  flow.c:326
>511491 47       0.0092  flow.c:326
>
> And the corresponding disassembled code:
>
>511454:   49 83 f9 0d             cmp    r9,0xd
>511458:   c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>51145f:   66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>511466:   66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>511

[dpdk-dev] [ovs-dev] OVS-DPDK performance problem on ixgbe vector PMD

2015-08-26 Thread Zoltan Kiss
Hi,

On 24/08/15 12:43, Traynor, Kevin wrote:
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at openvswitch.org] On Behalf Of Zoltan Kiss
>> Sent: Friday, August 21, 2015 7:05 PM
>> To: dev at dpdk.org; dev at openvswitch.org
>> Cc: Richardson, Bruce; Ananyev, Konstantin
>> Subject: [ovs-dev] OVS-DPDK performance problem on ixgbe vector PMD
>>
>> Hi,
>>
>> I've set up a simple packet forwarding perf test on a dual-port 10G
>> 82599ES: one port receives 64 byte UDP packets, the other sends it out,
>> one core used. I've used latest OVS with DPDK 2.1, and the first result
>> was only 13.2 Mpps, which was a bit far from the 13.9 I've seen last
>> year with the same test. The first thing I've changed was to revert back
>> to the old behaviour about this issue:
>>
>> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
>>
>> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
>> That increased the performance to 13.5, but to figure out what's wrong
>> I started to play with the receive functions. First I've disabled vector
>> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
>> then I've enabled scattered RX, and with
>> ixgbe_recv_pkts_lro_bulk_alloc() I could manage to get 13.98 Mpps, which
>> is I guess as close as possible to the 14.2 line rate (on my HW at
>> least, with one core)
>> Does anyone have a good explanation of why the vector PMD performs so
>> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
>> should be able to reach ~14 Mpps, SG and vector PMD shouldn't make a
>> difference.
>
> I've previously turned on/off vectorisation and found that for tx it makes
> a significant difference. For Rx it didn't make much of a difference, but
> rx bulk allocation, which gets enabled with it, did improve performance.
>
> Is there something else also running on the current pmd core? Did you
> try moving it to another?
I've tied the pmd to the second core; as far as I can see from top and 
profiling outputs, hardly anything else runs there.

> Also, did you compile OVS with -O3/-Ofast? They
> tend to give a performance boost.
Yes

>
> Are you hitting 3.2 GHz for the core with the pmd? I think that is only
> with turbo boost, so it may not be achievable all the time.
The turbo boost freq is 3.6 GHz.

>
>> I've tried to look into it with oprofile, but the results were quite
>> strange: 35% of the samples were from miniflow_extract, the part where
>> parse_vlan calls data_pull to jump after the MAC addresses. The oprofile
>> snippet (1M samples):
>>
>> 511454 19       0.0037  flow.c:511
>> 511458 149      0.0292  dp-packet.h:266
>> 51145f 4264     0.8357  dp-packet.h:267
>> 511466 18       0.0035  dp-packet.h:268
>> 51146d 43       0.0084  dp-packet.h:269
>> 511474 172      0.0337  flow.c:511
>> 51147a 4320     0.8467  string3.h:51
>> 51147e 358763  70.3176  flow.c:99
>> 511482 2       3.9e-04  string3.h:51
>> 511485 3060     0.5998  string3.h:51
>> 511488 1693     0.3318  string3.h:51
>> 51148c 2933     0.5749  flow.c:326
>> 511491 47       0.0092  flow.c:326
>>
>> And the corresponding disassembled code:
>>
>> 511454:   49 83 f9 0d             cmp    r9,0xd
>> 511458:   c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>> 51145f:   66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>> 511466:   66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>> 51146d:   66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>> 511474:   0f 86 af 01 00 00       jbe    511629 <miniflow_extract+0x279>
>> 51147a:   48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>> 51147e:   4c 8d 5d 0c             lea    r11,[rbp+0xc]
>> 511482:   49 89 00                mov    QWORD PTR [r8],rax
>> 511485:   8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>> 511488:   41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>> 51148c:   44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>> 511491:   66 41 81 fa 81 00       cmp    r10w,0x81
>>
>> My only explanation to this so far is that I misunderstand something
>> about the oprofile results.
>>
>> Regards,
>>
>> Zoltan
>> ___
>> dev mailing list
>> dev at openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev


[dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD

2015-08-21 Thread Zoltan Kiss
Hi,

I've set up a simple packet forwarding perf test on a dual-port 10G 
82599ES: one port receives 64 byte UDP packets, the other sends it out, 
one core used. I've used latest OVS with DPDK 2.1, and the first result 
was only 13.2 Mpps, which was a bit far from the 13.9 I've seen last 
year with the same test. The first thing I've changed was to revert back 
to the old behaviour about this issue:

http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731

So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM. 
That increased the performance to 13.5, but to figure out what's wrong 
I started to play with the receive functions. First I've disabled vector 
PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So 
then I've enabled scattered RX, and with 
ixgbe_recv_pkts_lro_bulk_alloc() I could manage to get 13.98 Mpps, which 
is I guess as close as possible to the 14.2 line rate (on my HW at 
least, with one core)
Does anyone have a good explanation of why the vector PMD performs so 
significantly worse? I would expect that on a 3.2 GHz i5-4570 one core 
should be able to reach ~14 Mpps, SG and vector PMD shouldn't make a 
difference.
I've tried to look into it with oprofile, but the results were quite 
strange: 35% of the samples were from miniflow_extract, the part where 
parse_vlan calls data_pull to jump after the MAC addresses. The oprofile 
snippet (1M samples):

   511454 19       0.0037  flow.c:511
   511458 149      0.0292  dp-packet.h:266
   51145f 4264     0.8357  dp-packet.h:267
   511466 18       0.0035  dp-packet.h:268
   51146d 43       0.0084  dp-packet.h:269
   511474 172      0.0337  flow.c:511
   51147a 4320     0.8467  string3.h:51
   51147e 358763  70.3176  flow.c:99
   511482 2       3.9e-04  string3.h:51
   511485 3060     0.5998  string3.h:51
   511488 1693     0.3318  string3.h:51
   51148c 2933     0.5749  flow.c:326
   511491 47       0.0092  flow.c:326

And the corresponding disassembled code:

   511454:   49 83 f9 0d             cmp    r9,0xd
   511458:   c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
   51145f:   66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
   511466:   66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
   51146d:   66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
   511474:   0f 86 af 01 00 00       jbe    511629 <miniflow_extract+0x279>
   51147a:   48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
   51147e:   4c 8d 5d 0c             lea    r11,[rbp+0xc]
   511482:   49 89 00                mov    QWORD PTR [r8],rax
   511485:   8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
   511488:   41 89 40 08             mov    DWORD PTR [r8+0x8],eax
   51148c:   44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
   511491:   66 41 81 fa 81 00       cmp    r10w,0x81

My only explanation to this so far is that I misunderstand something 
about the oprofile results.

Regards,

Zoltan
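
For reference, the pool setup implied by the mbuf size change above would be
something like this (pool name, size and cache figures are illustrative):

#include <rte_lcore.h>
#include <rte_mbuf.h>

static struct rte_mempool *
make_pool(void)
{
    /* data room restored to 2048B + headroom instead of the new
     * OVS default mentioned above */
    return rte_pktmbuf_pool_create("ovs_mp", 262144, 256, 0,
                                   2048 + RTE_PKTMBUF_HEADROOM,
                                   rte_socket_id());
}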


[dpdk-dev] [PATCH v3] ixgbe: remove vector pmd burst size restriction

2015-08-05 Thread Zoltan Kiss


On 05/08/15 07:28, Liang, Cunming wrote:
> Hi Zoltan,
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Wednesday, August 05, 2015 12:26 AM
>> To: Liang, Cunming; dev at dpdk.org
>> Cc: Ananyev, Konstantin
>> Subject: Re: [PATCH v3] ixgbe: remove vector pmd burst size restriction
>>
>>
>>
>> On 04/08/15 12:47, Cunming Liang wrote:
>>> On the receive side, the burst size is now floor-aligned to a multiple of
>>> RTE_IXGBE_DESCS_PER_LOOP (a power of two). According to this rule, a burst
>>> size of less than 4 still won't receive anything.
>>> (Before this change, a burst size of less than 32 couldn't receive anything.)
>>> _recv_*_pkts_vec returns no more than 32 (RTE_IXGBE_RXQ_REARM_THRESH)
>>> packets.
>>>
>>> On the transmit side, the max burst size is no longer bound to a constant,
>>> however it still requires a check against crossing tx_rs_thresh.
>>>
>>> There's no obvious performance drop found on both recv_pkts_vec
>>> and recv_scattered_pkts_vec on burst size 32.
>>>
>>> Signed-off-by: Cunming Liang 
>>> ---
>>>v3 change:
>>> - reword the init print log
>>>
>>>v2 change:
>>> - keep max rx burst size in 32
>>> - reword some comments
>>>
>>>drivers/net/ixgbe/ixgbe_rxtx.c |  4 +++-
>>>drivers/net/ixgbe/ixgbe_rxtx.h |  5 ++---
>>>drivers/net/ixgbe/ixgbe_rxtx_vec.c | 39
>> +-
>>>3 files changed, 27 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> index 91023b9..03eb45d 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -4008,7 +4008,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
>>>  */
>>> } else if (adapter->rx_vec_allowed) {
>>>     PMD_INIT_LOG(DEBUG, "Vector rx enabled, please make sure RX "
>>> -  "burst size no less than 32.");
>>> +   "burst size no less than %d (port=%d).",
>>> +RTE_IXGBE_DESCS_PER_LOOP,
>>> +dev->data->port_id);
>>
>> A tab seems to be missing from the indentation, otherwise:
>>
>> Reviewed-by: Zoltan Kiss 
>>
> Thanks for the review. I double checked indentation again, it looks fine.
> 1st string line: 4x/tab indentation + space alignment; the other variable lines:
> 3x/tab indentation + space alignment.
> According to the 'coding_style.rst' Indentation section - 'As with all style 
> guideline, code should match style already in use in an existing file.'
> The style keeps the same as its following condition check. It passes 
> 'checkpatch.pl' checking as well.

Sorry for the noise, I was confused because the string is indented to 
the start of itself, while the other parameters are indented to the 
opening bracket.
>
> Thanks,
> /Steve
>>>
>>> dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
>>> } else if (adapter->rx_bulk_alloc_allowed) {
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
>>> index 113682a..b9eca67 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.h
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.h
>>> @@ -47,9 +47,8 @@
>>> (uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
>>>
>>>#ifdef RTE_IXGBE_INC_VECTOR
>>> -#define RTE_IXGBE_VPMD_RX_BURST 32
>>> -#define RTE_IXGBE_VPMD_TX_BURST 32
>>> -#define RTE_IXGBE_RXQ_REARM_THRESH  RTE_IXGBE_VPMD_RX_BURST
>>> +#define RTE_IXGBE_RXQ_REARM_THRESH  32
>>> +#define RTE_IXGBE_MAX_RX_BURST  RTE_IXGBE_RXQ_REARM_THRESH
>>>#define RTE_IXGBE_TX_MAX_FREE_BUF_SZ64
>>>#endif
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> index cf25a53..2ca0e4c 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> @@ -245,13 +245,13 @@ desc_to_olflags_v(__m128i descs[4], struct
>> rte_mbuf **rx_pkts)
>>>#endif
>>>
>>>/*
>>> - * vPMD receive routine, now only accept (nb_pkts ==
>> RTE_IXGBE_VPMD_RX_BURST)
>>> - * in one loop
>>> + * vPMD raw receive routine, only accept(nb_pkts >= RTE_IXGBE_DESCS_PER_LOOP)

[dpdk-dev] [PATCH v3] ixgbe: remove vector pmd burst size restriction

2015-08-04 Thread Zoltan Kiss


On 04/08/15 12:47, Cunming Liang wrote:
> On the receive side, the burst size is now floor-aligned to a multiple of
> RTE_IXGBE_DESCS_PER_LOOP (a power of two). According to this rule, a burst
> size of less than 4 still won't receive anything.
> (Before this change, a burst size of less than 32 couldn't receive anything.)
> _recv_*_pkts_vec returns no more than 32 (RTE_IXGBE_RXQ_REARM_THRESH) packets.
>
> On the transmit side, the max burst size is no longer bound to a constant,
> however it still requires a check against crossing tx_rs_thresh.
>
> There's no obvious performance drop found on both recv_pkts_vec
> and recv_scattered_pkts_vec on burst size 32.
>
> Signed-off-by: Cunming Liang 
> ---
>   v3 change:
>- reword the init print log
>
>   v2 change:
>- keep max rx burst size in 32
>- reword some comments
>
>   drivers/net/ixgbe/ixgbe_rxtx.c |  4 +++-
>   drivers/net/ixgbe/ixgbe_rxtx.h |  5 ++---
>   drivers/net/ixgbe/ixgbe_rxtx_vec.c | 39 
> +-
>   3 files changed, 27 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 91023b9..03eb45d 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -4008,7 +4008,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
>*/
>   } else if (adapter->rx_vec_allowed) {
>   PMD_INIT_LOG(DEBUG, "Vector rx enabled, please make sure RX "
> -"burst size no less than 32.");
> + "burst size no less than %d (port=%d).",
> +      RTE_IXGBE_DESCS_PER_LOOP,
> +  dev->data->port_id);

A tab seems to be missing from the indentation, otherwise:

Reviewed-by: Zoltan Kiss 

>
>   dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
>   } else if (adapter->rx_bulk_alloc_allowed) {
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
> index 113682a..b9eca67 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.h
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.h
> @@ -47,9 +47,8 @@
>   (uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
>
>   #ifdef RTE_IXGBE_INC_VECTOR
> -#define RTE_IXGBE_VPMD_RX_BURST 32
> -#define RTE_IXGBE_VPMD_TX_BURST 32
> -#define RTE_IXGBE_RXQ_REARM_THRESH  RTE_IXGBE_VPMD_RX_BURST
> +#define RTE_IXGBE_RXQ_REARM_THRESH  32
> +#define RTE_IXGBE_MAX_RX_BURST  RTE_IXGBE_RXQ_REARM_THRESH
>   #define RTE_IXGBE_TX_MAX_FREE_BUF_SZ64
>   #endif
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index cf25a53..2ca0e4c 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -245,13 +245,13 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf 
> **rx_pkts)
>   #endif
>
>   /*
> - * vPMD receive routine, now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
> - * in one loop
> + * vPMD raw receive routine, only accept(nb_pkts >= RTE_IXGBE_DESCS_PER_LOOP)
>*
>* Notice:
> - * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
> - * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scan RTE_IXGBE_VPMD_RX_BURST
> + * - nb_pkts < RTE_IXGBE_DESCS_PER_LOOP, just return no packet
> + * - nb_pkts > RTE_IXGBE_MAX_RX_BURST, only scan RTE_IXGBE_MAX_RX_BURST
>*   numbers of DD bit
> + * - floor align nb_pkts to a RTE_IXGBE_DESC_PER_LOOP power-of-two
>* - don't support ol_flags for rss and csum err
>*/
>   static inline uint16_t
> @@ -286,8 +286,11 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   __m128i dd_check, eop_check;
>   #endif /* RTE_NEXT_ABI */
>
> - if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
> - return 0;
> + /* nb_pkts shall be less equal than RTE_IXGBE_MAX_RX_BURST */
> + nb_pkts = RTE_MIN(nb_pkts, RTE_IXGBE_MAX_RX_BURST);
> +
> + /* nb_pkts has to be floor-aligned to RTE_IXGBE_DESCS_PER_LOOP */
> + nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_IXGBE_DESCS_PER_LOOP);
>
>   /* Just the act of getting into the function from the application is
>* going to cost about 7 cycles */
> @@ -356,7 +359,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>* D. fill info. from desc to mbuf
>*/
>   #endif /* RTE_NEXT_ABI */
> - for (pos = 0, nb_pkts_recd = 0; pos < RTE_IXGBE_VPMD_RX_BURST;
> + for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
>   pos += RTE_IXGBE_DESCS_PER_LOOP,
>   rxdp += RTE_IXGBE_DESCS_PER_LOOP

[dpdk-dev] [PATCH v2] ixgbe: remove vector pmd burst size restriction

2015-08-04 Thread Zoltan Kiss


On 04/08/15 08:32, Cunming Liang wrote:
> On the receive side, the burst size is now floor-aligned to a multiple of
> RTE_IXGBE_DESCS_PER_LOOP (a power of two). According to this rule, a burst
> size of less than 4 still won't receive anything.
> (Before this change, a burst size of less than 32 couldn't receive anything.)
> _recv_*_pkts_vec returns no more than 32 (RTE_IXGBE_RXQ_REARM_THRESH) packets.
>
> On the transmit side, the max burst size is no longer bound to a constant,
> however it still requires a check against crossing tx_rs_thresh.
>
> There's no obvious performance drop found on both recv_pkts_vec
> and recv_scattered_pkts_vec on burst size 32.
>
> Signed-off-by: Cunming Liang 
> ---
>   v2 change:
>- keep max rx burst size in 32
>- reword some comments
>
>   drivers/net/ixgbe/ixgbe_rxtx.c |  4 +++-
>   drivers/net/ixgbe/ixgbe_rxtx.h |  5 ++---
>   drivers/net/ixgbe/ixgbe_rxtx_vec.c | 39 
> +-
>   3 files changed, 27 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 91023b9..d06aaae 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -4008,7 +4008,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
>*/
>   } else if (adapter->rx_vec_allowed) {
>   PMD_INIT_LOG(DEBUG, "Vector rx enabled, please make sure RX "
> -"burst size no less than 32.");
> + "burst size no less than "
> + "RTE_IXGBE_DESCS_PER_LOOP(=4) (port=%d).",

I think you should write this line as:

"%d (port=%d)", RTE_IXGBE_DESCS_PER_LOOP,
> +  dev->data->port_id);
>
>   dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
>   } else if (adapter->rx_bulk_alloc_allowed) {
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
> index 113682a..b9eca67 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.h
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.h
> @@ -47,9 +47,8 @@
>   (uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
>
>   #ifdef RTE_IXGBE_INC_VECTOR
> -#define RTE_IXGBE_VPMD_RX_BURST 32
> -#define RTE_IXGBE_VPMD_TX_BURST 32
> -#define RTE_IXGBE_RXQ_REARM_THRESH  RTE_IXGBE_VPMD_RX_BURST
> +#define RTE_IXGBE_RXQ_REARM_THRESH  32
> +#define RTE_IXGBE_MAX_RX_BURST  RTE_IXGBE_RXQ_REARM_THRESH
>   #define RTE_IXGBE_TX_MAX_FREE_BUF_SZ64
>   #endif
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index cf25a53..2ca0e4c 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -245,13 +245,13 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf 
> **rx_pkts)
>   #endif
>
>   /*
> - * vPMD receive routine, now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
> - * in one loop
> + * vPMD raw receive routine, only accept(nb_pkts >= RTE_IXGBE_DESCS_PER_LOOP)
>*
>* Notice:
> - * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
> - * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scan RTE_IXGBE_VPMD_RX_BURST
> + * - nb_pkts < RTE_IXGBE_DESCS_PER_LOOP, just return no packet
> + * - nb_pkts > RTE_IXGBE_MAX_RX_BURST, only scan RTE_IXGBE_MAX_RX_BURST
>*   numbers of DD bit
> + * - floor align nb_pkts to a RTE_IXGBE_DESC_PER_LOOP power-of-two
>* - don't support ol_flags for rss and csum err
>*/
>   static inline uint16_t
> @@ -286,8 +286,11 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   __m128i dd_check, eop_check;
>   #endif /* RTE_NEXT_ABI */
>
> - if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
> - return 0;
> + /* nb_pkts shall be less equal than RTE_IXGBE_MAX_RX_BURST */
> + nb_pkts = RTE_MIN(nb_pkts, RTE_IXGBE_MAX_RX_BURST);
> +
> + /* nb_pkts has to be floor-aligned to RTE_IXGBE_DESCS_PER_LOOP */
> + nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_IXGBE_DESCS_PER_LOOP);
>
>   /* Just the act of getting into the function from the application is
>* going to cost about 7 cycles */
> @@ -356,7 +359,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>* D. fill info. from desc to mbuf
>*/
>   #endif /* RTE_NEXT_ABI */
> - for (pos = 0, nb_pkts_recd = 0; pos < RTE_IXGBE_VPMD_RX_BURST;
> + for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
>   pos += RTE_IXGBE_DESCS_PER_LOOP,
>   rxdp += RTE_IXGBE_DESCS_PER_LOOP) {
>   #ifdef RTE_NEXT_ABI
> @@ -518,13 +521,13 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   }
>
>   /*
> - * vPMD receive routine, now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
> - * in one loop
> + * vPMD receive routine, only accept(nb_pkts >= RTE_IXGBE_DESCS_PER_LOOP)
>*
>* Notice:
> - * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
> - * - nb_pkts > 

[dpdk-dev] [PATCH v1] ixgbe: remove vector pmd burst size restriction

2015-07-31 Thread Zoltan Kiss


On 31/07/15 12:57, Zoltan Kiss wrote:
>>
>> Another thing, that I just thought about:
>> Right now we invoke ixgbe_rxq_rearm() only at the start of
>> _recv_raw_pkts_vec().
>> Before it was OK, as _recv_raw_pkts_vec() would never try to read more
>> than 32 RXDs.
>> But what would happen if nb_pkts > rxq->nb_desc and rxq->rxrearm_nb == 0?
> Yes, that call would deplete the RX ring, the card wouldn't be able to
> receive more, so the receive function wouldn't be called again to rearm
> the ring.
>
Actually not, the problem is that the recv function would probably 
overrun the descriptor ring. But anyway, we should limit nb_pkts indeed.
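
For illustration, a minimal sketch of that limiting, using the macros from the 
v2 patch quoted above (RTE_IXGBE_MAX_RX_BURST == RTE_IXGBE_RXQ_REARM_THRESH == 32):

/* never scan more descriptors in one call than a rearm cycle covers ... */
nb_pkts = RTE_MIN(nb_pkts, RTE_IXGBE_MAX_RX_BURST);
/* ... and round down to the 4-descriptor granularity of the vector loop */
nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_IXGBE_DESCS_PER_LOOP);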


[dpdk-dev] [PATCH v1] ixgbe: remove vector pmd burst size restriction

2015-07-31 Thread Zoltan Kiss


On 31/07/15 11:21, Ananyev, Konstantin wrote:
>
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Friday, July 31, 2015 11:04 AM
>> To: Liang, Cunming; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe: remove vector pmd burst size 
>> restriction
>>
>>
>>
>> On 31/07/15 09:17, Cunming Liang wrote:
>>> The patch removes the restriction of burst size on a constant 32.
>>>
>>> On receive side, the burst size now floor-aligns to 
>>> RTE_IXGBE_DESCS_PER_LOOP (a power of 2).
>>> According to this rule, a burst size of less than 4 still won't receive 
>>> anything.
>>> (Before this change, a burst size of less than 32 couldn't receive anything.)
>>>
>>> On transmit side, the max burst size is no longer bound to a constant, 
>>> however it still
>>> requires checking for the cross-tx_rs_thresh violation.
>>>
>>> There's no obvious performance drop found on either recv_pkts_vec
>>> or recv_scattered_pkts_vec at burst size 32.
>>>
>>> Signed-off-by: Cunming Liang 
>>> ---
>>>drivers/net/ixgbe/ixgbe_rxtx.c |  3 ++-
>>>drivers/net/ixgbe/ixgbe_rxtx.h |  4 +---
>>>drivers/net/ixgbe/ixgbe_rxtx_vec.c | 35 
>>> ---
>>>3 files changed, 19 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> index 3f808b3..dbdb761 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -4008,7 +4008,8 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
>>>  */
>>> } else if (adapter->rx_vec_allowed) {
>>> PMD_INIT_LOG(DEBUG, "Vector rx enabled, please make sure RX "
>>> -  "burst size no less than 32.");
>>> +   "burst size no less than 4 (port=%d).",
>>> +dev->data->port_id);
>>
>> I think it would be better to use RTE_IXGBE_DESCS_PER_LOOP instead of a
>> constant 4.
>>>
>>> dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
>>> } else if (adapter->rx_bulk_alloc_allowed) {
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
>>> index 113682a..eb931fe 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.h
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.h
>>> @@ -47,9 +47,7 @@
>>> (uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
>>>
>>>#ifdef RTE_IXGBE_INC_VECTOR
>>> -#define RTE_IXGBE_VPMD_RX_BURST 32
>>> -#define RTE_IXGBE_VPMD_TX_BURST 32
>>> -#define RTE_IXGBE_RXQ_REARM_THRESH  RTE_IXGBE_VPMD_RX_BURST
>>> +#define RTE_IXGBE_RXQ_REARM_THRESH  32
>>>#define RTE_IXGBE_TX_MAX_FREE_BUF_SZ64
>>>#endif
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
>>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> index 1c16dec..b72f817 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> @@ -199,13 +199,11 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf 
>>> **rx_pkts)
>>>#endif
>>>
>>>/*
>>> - * vPMD receive routine, now only accept (nb_pkts == 
>>> RTE_IXGBE_VPMD_RX_BURST)
>>> - * in one loop
>>> + * vPMD raw receive routine
>> I would keep some warning there, like "(if nb_pkts <
>> RTE_IXGBE_DESCS_PER_LOOP, won't receive anything)"
>>
>>> *
>>> * Notice:
>>> - * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
>>> - * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scan RTE_IXGBE_VPMD_RX_BURST
>>> - *   numbers of DD bit
>>> + * - floor align nb_pkts to a RTE_IXGBE_DESC_PER_LOOP power-of-two
>>> + * - 'nb_pkts < 4' causes 0 packet receiving
>> Again, RTE_IXGBE_DESCS_PER_LOOP would be better than 4
>>
>>> * - don't support ol_flags for rss and csum err
>>> */
>>>static inline uint16_t
>>> @@ -240,8 +238,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
>>> rte_mbuf **rx_pkts,
>>> __m128i dd_check, eop_check;
>>>#endif /* RTE_NEXT_ABI */
>>>
>>> -   if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
>>> -   return 0;
>>> +   nb_pkts = R

[dpdk-dev] [PATCH v1] ixgbe: remove vector pmd burst size restriction

2015-07-31 Thread Zoltan Kiss


On 31/07/15 09:17, Cunming Liang wrote:
> The patch removes the restriction of burst size on a constant 32.
>
> On receive side, the burst size now floor-aligns to RTE_IXGBE_DESCS_PER_LOOP 
> (a power of 2).
> According to this rule, a burst size of less than 4 still won't receive 
> anything.
> (Before this change, a burst size of less than 32 couldn't receive anything.)
>
> On transmit side, the max burst size is no longer bound to a constant, however 
> it still
> requires checking for the cross-tx_rs_thresh violation.
>
> There's no obvious performance drop found on either recv_pkts_vec
> or recv_scattered_pkts_vec at burst size 32.
>
> Signed-off-by: Cunming Liang 
> ---
>   drivers/net/ixgbe/ixgbe_rxtx.c |  3 ++-
>   drivers/net/ixgbe/ixgbe_rxtx.h |  4 +---
>   drivers/net/ixgbe/ixgbe_rxtx_vec.c | 35 ---
>   3 files changed, 19 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 3f808b3..dbdb761 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -4008,7 +4008,8 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
>*/
>   } else if (adapter->rx_vec_allowed) {
>   PMD_INIT_LOG(DEBUG, "Vector rx enabled, please make sure RX "
> -"burst size no less than 32.");
> + "burst size no less than 4 (port=%d).",
> +  dev->data->port_id);

I think it would be better to use RTE_IXGBE_DESCS_PER_LOOP instead of a 
constant 4.
>
>   dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
>   } else if (adapter->rx_bulk_alloc_allowed) {
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
> index 113682a..eb931fe 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.h
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.h
> @@ -47,9 +47,7 @@
>   (uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
>
>   #ifdef RTE_IXGBE_INC_VECTOR
> -#define RTE_IXGBE_VPMD_RX_BURST 32
> -#define RTE_IXGBE_VPMD_TX_BURST 32
> -#define RTE_IXGBE_RXQ_REARM_THRESH  RTE_IXGBE_VPMD_RX_BURST
> +#define RTE_IXGBE_RXQ_REARM_THRESH  32
>   #define RTE_IXGBE_TX_MAX_FREE_BUF_SZ64
>   #endif
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index 1c16dec..b72f817 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -199,13 +199,11 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf 
> **rx_pkts)
>   #endif
>
>   /*
> - * vPMD receive routine, now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
> - * in one loop
> + * vPMD raw receive routine
I would keep some warning there, like "(if nb_pkts < 
RTE_IXGBE_DESCS_PER_LOOP, won't receive anything)"

>*
>* Notice:
> - * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
> - * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scan RTE_IXGBE_VPMD_RX_BURST
> - *   numbers of DD bit
> + * - floor align nb_pkts to a RTE_IXGBE_DESC_PER_LOOP power-of-two
> + * - 'nb_pkts < 4' causes 0 packet receiving
Again, RTE_IXGBE_DESCS_PER_LOOP would be better than 4

>* - don't support ol_flags for rss and csum err
>*/
>   static inline uint16_t
> @@ -240,8 +238,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   __m128i dd_check, eop_check;
>   #endif /* RTE_NEXT_ABI */
>
> - if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
> - return 0;
> + nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_IXGBE_DESCS_PER_LOOP);
>
>   /* Just the act of getting into the function from the application is
>* going to cost about 7 cycles */
> @@ -310,7 +307,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>* D. fill info. from desc to mbuf
>*/
>   #endif /* RTE_NEXT_ABI */
> - for (pos = 0, nb_pkts_recd = 0; pos < RTE_IXGBE_VPMD_RX_BURST;
> + for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
>   pos += RTE_IXGBE_DESCS_PER_LOOP,
>   rxdp += RTE_IXGBE_DESCS_PER_LOOP) {
>   __m128i descs[RTE_IXGBE_DESCS_PER_LOOP];
> @@ -450,13 +447,11 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   }
>
>   /*
> - * vPMD receive routine, now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
> - * in one loop
> + * vPMD receive routine
Same note here as above

>*
>* Notice:
> - * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
> - * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scan RTE_IXGBE_VPMD_RX_BURST
> - *   numbers of DD bit
> + * - floor align nb_pkts to a RTE_IXGBE_DESC_PER_LOOP power-of-two
> + * - 'nb_pkts < 4' causes 0 packet receiving
>* - don't support ol_flags for rss and csum err
>*/
>   uint16_t
> @@ -470,12 +465,11 @@ static inline uint16_t
>   reassemble_packets(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_bufs,
>   uint16_t 

[dpdk-dev] ixgbe vPMD RX functions and buffer number minimum requirement

2015-07-29 Thread Zoltan Kiss
Hi,

On 28/07/15 01:10, Ananyev, Konstantin wrote:
> Hi Zoltan,
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, July 27, 2015 12:38 PM
>> To: Ananyev, Konstantin; Richardson, Bruce; dev at dpdk.org
>> Subject: Re: [dpdk-dev] ixgbe vPMD RX functions and buffer number minimum 
>> requirement
>>
>> Hi Konstantin,
>>
>> Thanks! Another question I would have: why does _recv_raw_pkts_vec()
>> insist on (nb_pkts > RTE_IXGBE_VPMD_RX_BURST)? Looking at the code it
>> should be able to return packets when nb_pkts >=
>> RTE_IXGBE_DESCS_PER_LOOP.
>
> Yes, that seems pretty trivial modification.
> Don't know any good reason why it wasn't done that way.
>
>> The split_flags check in
>> ixgbe_recv_scattered_pkts_vec() would be a bit more complicated, and
>> therefore maybe would have a tiny performance overhead as well, but I
>> don't think it would be anything serious.
>
> I think, if the performance isn't affected, that would be a really useful 
> change.
> So it is definitely worth a try.
> Probably even _recv_raw_pkts_vec() for the first nb_pkts &  
> ~(RTE_IXGBE_VPMD_RX_BURST - 1),
> and then a sort of scalar analog for the remainder.

Ok, I'll give it a go.
Another question, regarding performance: what setup did you use to show a 
performance difference? I've tried to compare the vector function with 
the normal bulk alloc by receiving a 10 Gbps stream of 64-byte UDP 
packets (and forwarding them out on the other port), but both yielded ~14 
Mpps. I have an i5-4570 CPU @ 3.20GHz; maybe I should limit the clock speed?

Regards,

Zoltan

> Konstantin
>
>>
>> Regards,
>>
>> Zoltan
>>
>>
>> On 24/07/15 17:43, Ananyev, Konstantin wrote:
>>> Hi Zoltan,
>>>
>>>> -Original Message-
>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>> Sent: Friday, July 24, 2015 4:00 PM
>>>> To: Richardson, Bruce; dev at dpdk.org
>>>> Subject: [dpdk-dev] ixgbe vPMD RX functions and buffer number minimum 
>>>> requirement
>>>>
>>>> Hi,
>>>>
>>>> I was thinking how to handle the situation when you call
>>>> rte_eth_rx_burst with less than RTE_IXGBE_VPMD_RX_BURST buffers. In
>>>> ODP-DPDK unfortunately we can't force this requirement onto the calling
>>>> application.
>>>> One idea I had was to check in ixgbe_recv_pkts_vec() if nb_pkts <
>>>> RTE_IXGBE_VPMD_RX_BURST, and call ixgbe_recv_pkts_bulk_alloc in that
>>>> case. Accordingly, in ixgbe_recv_scattered_pkts_vec() we could call
>>>> ixgbe_recv_scattered_pkts() in this case. A branch predictor can easily
>>>> eliminate the performance penalty of this, and applications can use
>>>> whatever burst size is feasible for them.
>>>> The obvious problem could be whether you can mix the receive functions
>>>> this way. I have a feeling it wouldn't fly, but I wanted to ask first
>>>> before spending time investigating this option further.
>>>
>>> No, it is not possible to mix different RX functions, they setup/use 
>>> ixgbe_rx_queue
>>> fields in a different manner.
>>> Konstantin
>>>
>>>>
>>>> Regards,
>>>>
>>>> Zoltan


[dpdk-dev] ixgbe vPMD RX functions and buffer number minimum requirement

2015-07-27 Thread Zoltan Kiss
Hi Konstantin,

Thanks! Another question I would have: why does _recv_raw_pkts_vec() 
insist on (nb_pkts > RTE_IXGBE_VPMD_RX_BURST)? Looking at the code it 
should be able to return packets when nb_pkts >= 
RTE_IXGBE_DESCS_PER_LOOP. The split_flags check in 
ixgbe_recv_scattered_pkts_vec() would be a bit more complicated, and 
therefore maybe would have a tiny performance overhead as well, but I 
don't think it would be anything serious.

Regards,

Zoltan


On 24/07/15 17:43, Ananyev, Konstantin wrote:
> Hi Zoltan,
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Friday, July 24, 2015 4:00 PM
>> To: Richardson, Bruce; dev at dpdk.org
>> Subject: [dpdk-dev] ixgbe vPMD RX functions and buffer number minimum 
>> requirement
>>
>> Hi,
>>
>> I was thinking how to handle the situation when you call
>> rte_eth_rx_burst with less than RTE_IXGBE_VPMD_RX_BURST buffers. In
>> ODP-DPDK unfortunately we can't force this requirement onto the calling
>> application.
>> One idea I had was to check in ixgbe_recv_pkts_vec() if nb_pkts <
>> RTE_IXGBE_VPMD_RX_BURST, and call ixgbe_recv_pkts_bulk_alloc in that
>> case. Accordingly, in ixgbe_recv_scattered_pkts_vec() we could call
>> ixgbe_recv_scattered_pkts() in this case. A branch predictor can easily
>> eliminate the performance penalty of this, and applications can use
>> whatever burst size is feasible for them.
>> The obvious problem could be whether you can mix the receive functions
>> this way. I have a feeling it wouldn't fly, but I wanted to ask first
>> before spending time investigating this option further.
>
> No, it is not possible to mix different RX functions, they setup/use 
> ixgbe_rx_queue
> fields in a different manner.
> Konstantin
>
>>
>> Regards,
>>
>> Zoltan


[dpdk-dev] ixgbe vPMD RX functions and buffer number minimum requirement

2015-07-24 Thread Zoltan Kiss
Hi,

I was thinking how to handle the situation when you call 
rte_eth_rx_burst with less than RTE_IXGBE_VPMD_RX_BURST buffers. In 
ODP-DPDK unfortunately we can't force this requirement onto the calling 
application.
One idea I had was to check in ixgbe_recv_pkts_vec() if nb_pkts < 
RTE_IXGBE_VPMD_RX_BURST, and call ixgbe_recv_pkts_bulk_alloc in that 
case. Accordingly, in ixgbe_recv_scattered_pkts_vec() we could call 
ixgbe_recv_scattered_pkts() in this case. A branch predictor can easily 
eliminate the performance penalty of this, and applications can use 
whatever burst size is feasible for them.
The obvious problem could be whether you can mix the receive functions 
this way. I have a feeling it wouldn't fly, but I wanted to ask first 
before spending time investigating this option further.
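
For clarity, the dispatch idea would look roughly like this sketch (as it 
turned out in the replies quoted above, it can't work: the two paths set up 
the ixgbe_rx_queue fields differently):

uint16_t
ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
                    uint16_t nb_pkts)
{
        if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
                return ixgbe_recv_pkts_bulk_alloc(rx_queue, rx_pkts, nb_pkts);
        return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts);
}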

Regards,

Zoltan


[dpdk-dev] [RFC] l2fwd: trying to expose eth_igb_xmit_pkts unusual behavior

2015-07-24 Thread Zoltan Kiss
Hi,

On 24/07/15 11:13, Ciprian Barbu wrote:
> From: Ciprian Barbu 
>
> This tries to show an approximate behavior described in an earlier discussion
> called "can eth_igb_xmit_pkts called with len 0 affect transmission?"
>
> I'm using an Intel i350 dual port 1Gb card and I've tweaked the pool size to the
> minimum possible. But unmodified l2fwd still worked without a sweat. After
> applying this patch transmission stopped after maybe 30 seconds. Lowering the
> 'magic' number causes l2fwd to stop sending packets even earlier.
> Using gdb I could see eth_igb_xmit_pkts always returning because nb_tx was 0
> and E1000_TXD_STAT_DD was not set (approx igb_rxtx.c : 476)

I think the proper solution for this will be to introduce an explicit 
API which makes it possible for the application to call the actual PMD's 
callback function, which flushes out the completed buffers from the TX 
ring, if possible. Then in the case of igb it could be a no-op. See the 
discussion in this thread:

http://dpdk.org/ml/archives/dev/2015-June/018487.html

I plan to propose such an API, but it won't be in the very near future, 
as 2.1 will come out soon and no new features are accepted at the moment.
For the short term we can introduce a check in ODP-DPDK which avoids 
this rte_eth_tx_burst(..., 0) call if the PMD is igb. How about that?
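
Going back to the proposed API, a sketch of what it could look like (the 
name and signature are hypothetical, nothing like this exists in DPDK at 
the time of writing):

/* Hypothetical: ask the PMD to reclaim completed TX descriptors and free
 * the associated mbufs, without enqueueing any packets. */
int rte_eth_tx_done_cleanup(uint8_t port_id, uint16_t queue_id);

With that, the rte_eth_tx_burst(..., 0) flush calls could go away entirely.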


>
> Signed-off-by: Ciprian Barbu 
> ---
>   examples/l2fwd/main.c | 13 +++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/examples/l2fwd/main.c b/examples/l2fwd/main.c
> index 17621ee..623b836 100644
> --- a/examples/l2fwd/main.c
> +++ b/examples/l2fwd/main.c
> @@ -72,7 +72,7 @@
>   #define RTE_LOGTYPE_L2FWD RTE_LOGTYPE_USER1
>
>   #define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)
> -#define NB_MBUF   8192
> +#define NB_MBUF   1024
>
>   #define MAX_PKT_BURST 32
>   #define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */
> @@ -260,7 +260,7 @@ l2fwd_main_loop(void)
>   struct rte_mbuf *m;
>   unsigned lcore_id;
>   uint64_t prev_tsc, diff_tsc, cur_tsc, timer_tsc;
> - unsigned i, j, portid, nb_rx;
> + unsigned i, j, portid, nb_rx, magic = 0;
>   struct lcore_queue_conf *qconf;
>   const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S 
> * BURST_TX_DRAIN_US;
>
> @@ -285,6 +285,15 @@ l2fwd_main_loop(void)
>   }
>
>   while (1) {
> + if (++magic == 5) {
> + magic = 0;
> + l2fwd_send_burst(&lcore_queue_conf[lcore_id],
> +  0,
> +  (uint8_t) 0);
> + l2fwd_send_burst(&lcore_queue_conf[lcore_id],
> +  0,
> +  (uint8_t) 1);
> + }
>
>   cur_tsc = rte_rdtsc();
>
>


[dpdk-dev] ixgbe packet drops not accounted for

2015-07-23 Thread Zoltan Kiss
Never mind, I've dug a bit more into the sink's driver stats, and it turned 
out there were a lot of RX misses when it receives through the test box. 
Increasing the RX ring from 512 to 2048 solved the issue. I guess the 
change in the burst pattern caused this whole issue.

Regards,

Zoltan

On 23/07/15 12:15, Zoltan Kiss wrote:
> Hi,
>
> I've seen an odd behaviour in my test setup, which affected my test
> results, so I set up a much simpler scenario.
> I'm using netmap pktgen as a packet source; it creates a steady 14.2
> Mpps of 64-byte UDP packets over one port of an 82599ES dual port card.
> This traffic then goes to another similar machine with the same dual
> port NIC, where it gets forwarded out on the other port. The packet sink
> runs on the same machine as the generator, it's also netmap pktgen, and
> it tells me there is a big fluctuation of throughput between 13 and 14
> Mpps; the average comes out around 13.4 Mpps. After I've stripped down
> my test app to nothing but calling rx and tx functions in a loop (it
> doesn't even modify the MAC address as DPDK l2fwd does), I've started
> to check what rte_eth_tx_burst() tells us. I've added its return value
> to a counter, and a separate thread printed it out every second
> (sleep(1)), and I've found it reports a steady 14.05 Mpps output. I've
> checked with rte_eth_stats_get(), it gives me the same numbers, and no
> indication of any failure.
> When I've connected the generator to the sink directly, it was able to
> receive all the packets, so it's not that the sink is not able to count
> them all. I've even swapped the two cables to see if the
> one towards the sink drops some packets, but nothing changed.
> I had the impression that once rte_eth_tx_burst() managed to place the
> packets on the descriptor ring, they will go out in some finite time, or
> if the card itself drops them, that will appear in the stats at least, but
> the oerrors and q_errors values are always 0.
> Does anyone have an idea where those packets (avg 0.6 Mpps) could get
> dropped?
>
> Regards,
>
> Zoltan Kiss


[dpdk-dev] ixgbe packet drops not accounted for

2015-07-23 Thread Zoltan Kiss
Hi,

I've seen an odd behaviour in my test setup, which affected my test 
results, so I set up a much simpler scenario.
I'm using netmap pktgen as a packet source; it creates a steady 14.2 
Mpps of 64-byte UDP packets over one port of an 82599ES dual port card. 
This traffic then goes to another similar machine with the same dual 
port NIC, where it gets forwarded out on the other port. The packet sink 
runs on the same machine as the generator, it's also netmap pktgen, and 
it tells me there is a big fluctuation of throughput between 13 and 14 
Mpps; the average comes out around 13.4 Mpps. After I've stripped down 
my test app to nothing but calling rx and tx functions in a loop (it 
doesn't even modify the MAC address as DPDK l2fwd does), I've started 
to check what rte_eth_tx_burst() tells us. I've added its return value 
to a counter, and a separate thread printed it out every second 
(sleep(1)), and I've found it reports a steady 14.05 Mpps output. I've 
checked with rte_eth_stats_get(), it gives me the same numbers, and no 
indication of any failure.
When I've connected the generator to the sink directly, it was able to 
receive all the packets, so it's not that the sink is not able to count 
them all. I've even swapped the two cables to see if the 
one towards the sink drops some packets, but nothing changed.
I had the impression that once rte_eth_tx_burst() managed to place the 
packets on the descriptor ring, they will go out in some finite time, or 
if the card itself drops them, that will appear in the stats at least, but 
the oerrors and q_errors values are always 0.
Does anyone have an idea where those packets (avg 0.6 Mpps) could get 
dropped?
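
For reference, the measurement loop described above is essentially the 
following sketch (port and queue numbers are assumptions, error handling 
omitted):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static volatile uint64_t tx_total; /* printed by a separate thread */

static void fwd_loop(uint8_t rx_port, uint8_t tx_port)
{
        struct rte_mbuf *bufs[32];

        for (;;) {
                uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, 32);
                uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

                tx_total += nb_tx;
                while (nb_tx < nb_rx) /* drop whatever TX didn't take */
                        rte_pktmbuf_free(bufs[nb_tx++]);
        }
}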

Regards,

Zoltan Kiss


[dpdk-dev] [PATCH v2] ixgbe: fix check for split packets

2015-07-22 Thread Zoltan Kiss


On 22/07/15 14:19, Zoltan Kiss wrote:
> Btw. vPMD was a bit of a misleading abbreviation for me, it took me a while
> until I realized 'v' stands for 'vector', not 'virtualization' as in
> most cases nowadays.
>

Though that's mostly my fault for not checking the documentation :)


[dpdk-dev] [PATCH v2] ixgbe: fix check for split packets

2015-07-22 Thread Zoltan Kiss
Hi,

And what happens if someone changes RTE_IXGBE_VPMD_RX_BURST to something 
other than 32? I guess this bug was introduced when someone raised it 
from 16 to 32.
I think you are better off with a for loop which uses this value. Or at 
least add a comment near RTE_IXGBE_VPMD_RX_BURST noting that if you change 
that value, this check should be modified as well.
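
Something along these lines, so the check follows the macro automatically 
(an untested sketch):

const uint64_t *split_fl64 = (const uint64_t *)split_flags;
unsigned int i;
int all_zero = (rxq->pkt_first_seg == NULL);

for (i = 0; all_zero && i < RTE_IXGBE_VPMD_RX_BURST / sizeof(uint64_t); i++)
        if (split_fl64[i] != 0)
                all_zero = 0;
if (all_zero)
        return nb_bufs;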

Regards,

Zoltan

On 22/07/15 10:13, Bruce Richardson wrote:
> The check for split packets to be reassembled in the vector ixgbe PMD
> was incorrectly only checking the first 16 elements of the array instead
> of all 32. This is fixed by changing the uint32_t values to be uint64_t
> instead.
>
> Fixes: cf4b4708a88a ("ixgbe: improve slow-path perf with vector scattered Rx")
>
> Reported-by: Zoltan Kiss 
> Signed-off-by: Bruce Richardson 
>
> ---
> V2: Rename variable from split_fl32 to split_fl64 to match type change.
> ---
>   drivers/net/ixgbe/ixgbe_rxtx_vec.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index d3ac74a..f2052c6 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -549,10 +549,10 @@ ixgbe_recv_scattered_pkts_vec(void *rx_queue, struct 
> rte_mbuf **rx_pkts,
>   return 0;
>
>   /* happy day case, full burst + no packets to be joined */
> - const uint32_t *split_fl32 = (uint32_t *)split_flags;
> + const uint64_t *split_fl64 = (uint64_t *)split_flags;
>   if (rxq->pkt_first_seg == NULL &&
> - split_fl32[0] == 0 && split_fl32[1] == 0 &&
> - split_fl32[2] == 0 && split_fl32[3] == 0)
> + split_fl64[0] == 0 && split_fl64[1] == 0 &&
> + split_fl64[2] == 0 && split_fl64[3] == 0)
>   return nb_bufs;
>
>   /* reassemble any packets that need reassembly*/
>


[dpdk-dev] ixgbe_recv_scattered_pkts_vec split_flags question

2015-07-21 Thread Zoltan Kiss
Hi,

I have a question regarding split_flags in this function. It's defined 
as an array of 1-byte unsigned ints:

uint8_t split_flags[RTE_IXGBE_VPMD_RX_BURST] = {0};

RTE_IXGBE_VPMD_RX_BURST is 32, so it will be 32 bytes. Then we cast it 
into a pointer to 4-byte values, and check the first 4 elements of that 
array:

const uint32_t *split_fl32 = (uint32_t *)split_flags;
if (rxq->pkt_first_seg == NULL &&
split_fl32[0] == 0 && split_fl32[1] == 0 &&
split_fl32[2] == 0 && split_fl32[3] == 0)

So we only check the first half of this 32-byte array. But 
_recv_raw_pkts_vec() seems to use the whole array. Is this a bug or a 
feature? Or am I mistaken in the math somewhere?
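
Writing the arithmetic out (a standalone sketch, checkable with a C11 
static_assert):

#include <assert.h>
#include <stdint.h>

#define RTE_IXGBE_VPMD_RX_BURST 32

/* four uint32_t elements cover 4 * 4 = 16 bytes, i.e. only half */
static_assert(4 * sizeof(uint32_t) == RTE_IXGBE_VPMD_RX_BURST / 2, "half");
/* four uint64_t elements would cover 4 * 8 = 32 bytes, the whole array */
static_assert(4 * sizeof(uint64_t) == RTE_IXGBE_VPMD_RX_BURST, "whole");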

Regards,

Zoltan Kiss


[dpdk-dev] driver initialization in DPDK 2.0 built into a shared library.

2015-07-16 Thread Zoltan Kiss


On 14/07/15 19:21, Polevoy, Igor wrote:
> Hi,
> We are developing an application that uses DPDK PMD functionality.
> We are using a linux shared library which contains the network packets 
> processing code and it is statically linked with all the necessary DPDK libs.
> The .so is loaded by the main program.
> For the DPDK compilation we have added the -fPIC to the GCC options.
>
> While it all worked fine with DPDK 1.6, where we had the rte_pmd_init_all 
> method, in the 2.0 version the
> driver registration methods (PMD_REGISTER_DRIVER) are not called when the 
> shared library is loaded.
>
> Although I can go along the lines of rte_pmd_init_all and manually call 
> the driver registration, I'm concerned
> that DPDK has other driver initialization calls, and I don't actually know 
> which are needed or could be needed and when.
>
> Do you have any advice on that? What is the best way to resolve this issue?
>
> Thank you
> Igor.
>

I've seen a similar problem, but in a different setup. My app (OVS) 
links to ODP-DPDK statically, which then links to DPDK code statically. 
I found that the devinitfn_* functions are not called because OVS 
doesn't call directly into the DPDK library, only into ODP-DPDK. The 
workaround for me was to refer to those functions in the ODP-DPDK code:

#define PMD_EXT(drv)  extern void devinitfn_##drv(void);
PMD_EXT(bond_drv)
PMD_EXT(em_pmd_drv)
...

But there might be better solutions than that. This one is fragile: if 
you update DPDK, you have to remember to add these references for new PMDs.
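
For context, PMD_REGISTER_DRIVER in DPDK 2.0 expands to roughly the 
following (paraphrased from rte_dev.h, the exact attribute list may differ):

#define PMD_REGISTER_DRIVER(d) \
void devinitfn_ ##d(void); \
void __attribute__((constructor, used)) devinitfn_ ##d(void) \
{ \
        rte_eal_driver_register(&d); \
}

The constructor only runs if its object file is linked in at all; with static 
archives the linker drops unreferenced objects, which is why the explicit 
references above are needed.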

Zoli


[dpdk-dev] [PATCH v2] mempool: improve cache search

2015-07-07 Thread Zoltan Kiss


On 02/07/15 18:07, Ananyev, Konstantin wrote:
>
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Wednesday, July 01, 2015 10:04 AM
>> To: dev at dpdk.org
>> Subject: [dpdk-dev] [PATCH v2] mempool: improve cache search
>>
>> The current way has a few problems:
>>
>> - if cache->len < n, we copy our elements into the cache first, then
>>   into obj_table; that's unnecessary
>> - if n >= cache_size (or the backfill fails), and we can't fulfil the
>>   request from the ring alone, we don't try to combine with the cache
>> - if refill fails, we don't return anything, even if the ring has enough
>>   for our request
>>
>> This patch rewrites it substantially:
>> - in the first part of the function we only use the cache if cache->len >= n
>> - otherwise take our elements straight from the ring
>> - if that fails but we have something in the cache, try to combine them
>> - the refill happens at the end, and its failure doesn't modify our return
>>value
>>
>> Signed-off-by: Zoltan Kiss 
>> ---
>> v2:
>> - fix subject
>> - add unlikely for branch where request is fulfilled both from cache and ring
>>
>>   lib/librte_mempool/rte_mempool.h | 63 
>> +---
>>   1 file changed, 39 insertions(+), 24 deletions(-)
>>
>> diff --git a/lib/librte_mempool/rte_mempool.h 
>> b/lib/librte_mempool/rte_mempool.h
>> index 6d4ce9a..1e96f03 100644
>> --- a/lib/librte_mempool/rte_mempool.h
>> +++ b/lib/librte_mempool/rte_mempool.h
>> @@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void 
>> **obj_table,
>>  unsigned lcore_id = rte_lcore_id();
>>  uint32_t cache_size = mp->cache_size;
>>
>> -/* cache is not enabled or single consumer */
>> +cache = &mp->local_cache[lcore_id];
>> +/* cache is not enabled or single consumer or not enough */
>>  if (unlikely(cache_size == 0 || is_mc == 0 ||
>> - n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>> + cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>  goto ring_dequeue;
>>
>> -cache = &mp->local_cache[lcore_id];
>>  cache_objs = cache->objs;
>>
>> -/* Can this be satisfied from the cache? */
>> -if (cache->len < n) {
>> -/* No. Backfill the cache first, and then fill from it */
>> -uint32_t req = n + (cache_size - cache->len);
>> -
>> -/* How many do we require i.e. number to fill the cache + the 
>> request */
>> -ret = rte_ring_mc_dequeue_bulk(mp->ring, 
>> &cache->objs[cache->len], req);
>> -if (unlikely(ret < 0)) {
>> -/*
>> - * In the offchance that we are buffer constrained,
>> - * where we are not able to allocate cache + n, go to
>> - * the ring directly. If that fails, we are truly out of
>> - * buffers.
>> - */
>> -goto ring_dequeue;
>> -}
>> -
>> -cache->len += req;
>> -}
>> -
>>  /* Now fill in the response ... */
>>  for (index = 0, len = cache->len - 1; index < n; ++index, len--, 
>> obj_table++)
>>  *obj_table = cache_objs[len];
>> @@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void 
>> **obj_table,
>>
>>  __MEMPOOL_STAT_ADD(mp, get_success, n);
>>
>> -return 0;
>> +ret = 0;
>> +goto cache_refill;
>>
>>   ring_dequeue:
>>   #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>> @@ -994,11 +975,45 @@ ring_dequeue:
>>  else
>>  ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>
>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>> +if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
>> +uint32_t req = n - cache->len;
>> +
>> +ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>> +if (ret == 0) {
>> +cache_objs = cache->objs;
>> +obj_table += req;
>> +for (index = 0; index < cache->len;
>> + ++index, ++obj_table)
>> +*obj_table = cache_objs[index];
>> +cache->len = 0;
>> +   

[dpdk-dev] [PATCH v2] mempool: improve cache search

2015-07-01 Thread Zoltan Kiss
The current way has a few problems:

- if cache->len < n, we copy our elements into the cache first, then
  into obj_table; that's unnecessary
- if n >= cache_size (or the backfill fails), and we can't fulfil the
  request from the ring alone, we don't try to combine with the cache
- if refill fails, we don't return anything, even if the ring has enough
  for our request

This patch rewrites it substantially:
- in the first part of the function we only use the cache if cache->len >= n
- otherwise take our elements straight from the ring
- if that fails but we have something in the cache, try to combine them
- the refill happens at the end, and its failure doesn't modify our return
  value

Signed-off-by: Zoltan Kiss 
---
v2:
- fix subject
- add unlikely for branch where request is fulfilled both from cache and ring

 lib/librte_mempool/rte_mempool.h | 63 +---
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 6d4ce9a..1e96f03 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void 
**obj_table,
unsigned lcore_id = rte_lcore_id();
uint32_t cache_size = mp->cache_size;

-   /* cache is not enabled or single consumer */
+   cache = &mp->local_cache[lcore_id];
+   /* cache is not enabled or single consumer or not enough */
if (unlikely(cache_size == 0 || is_mc == 0 ||
-n >= cache_size || lcore_id >= RTE_MAX_LCORE))
+cache->len < n || lcore_id >= RTE_MAX_LCORE))
goto ring_dequeue;

-   cache = &mp->local_cache[lcore_id];
cache_objs = cache->objs;

-   /* Can this be satisfied from the cache? */
-   if (cache->len < n) {
-   /* No. Backfill the cache first, and then fill from it */
-   uint32_t req = n + (cache_size - cache->len);
-
-   /* How many do we require i.e. number to fill the cache + the 
request */
-   ret = rte_ring_mc_dequeue_bulk(mp->ring, 
&cache->objs[cache->len], req);
-   if (unlikely(ret < 0)) {
-   /*
-* In the offchance that we are buffer constrained,
-* where we are not able to allocate cache + n, go to
-* the ring directly. If that fails, we are truly out of
-* buffers.
-*/
-   goto ring_dequeue;
-   }
-
-   cache->len += req;
-   }
-
/* Now fill in the response ... */
for (index = 0, len = cache->len - 1; index < n; ++index, len--, 
obj_table++)
*obj_table = cache_objs[len];
@@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,

__MEMPOOL_STAT_ADD(mp, get_success, n);

-   return 0;
+   ret = 0;
+   goto cache_refill;

 ring_dequeue:
 #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
@@ -994,11 +975,45 @@ ring_dequeue:
else
ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);

+#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
+   if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
+   uint32_t req = n - cache->len;
+
+   ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
+   if (ret == 0) {
+   cache_objs = cache->objs;
+   obj_table += req;
+   for (index = 0; index < cache->len;
+++index, ++obj_table)
+   *obj_table = cache_objs[index];
+   cache->len = 0;
+   }
+   }
+#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
+
if (ret < 0)
__MEMPOOL_STAT_ADD(mp, get_fail, n);
else
__MEMPOOL_STAT_ADD(mp, get_success, n);

+#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
+cache_refill:
+   /* If previous dequeue was OK and we have less than n, start refill */
+   if (ret == 0 && cache_size > 0 && cache->len < n) {
+   uint32_t req = cache_size - cache->len;
+
+   cache_objs = cache->objs;
+   ret = rte_ring_mc_dequeue_bulk(mp->ring,
+  &cache->objs[cache->len],
+  req);
+   if (likely(ret == 0))
+   cache->len += req;
+   else
+   /* Don't spoil the return value */
+   ret = 0;
+   }
+#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
+
return ret;
 }

-- 
1.9.1



[dpdk-dev] [PATCH] mempool: improbe cache search

2015-06-30 Thread Zoltan Kiss


On 30/06/15 12:58, Olivier MATZ wrote:
> Hi Zoltan,
>
> On 06/25/2015 08:48 PM, Zoltan Kiss wrote:
>> The current way has a few problems:
>>
>> - if cache->len < n, we copy our elements into the cache first, then
>>   into obj_table; that's unnecessary
>> - if n >= cache_size (or the backfill fails), and we can't fulfil the
>>   request from the ring alone, we don't try to combine with the cache
>> - if refill fails, we don't return anything, even if the ring has enough
>>   for our request
>>
>> This patch rewrites it substantially:
>> - in the first part of the function we only use the cache if
>> cache->len >= n
>> - otherwise take our elements straight from the ring
>> - if that fails but we have something in the cache, try to combine them
>> - the refill happens at the end, and its failure doesn't modify our
>> return
>>value
>
> Indeed, it looks easier to read that way. I checked the performance with
> "mempool_perf_autotest" of app/test and it show that there is no
> regression (it is even slightly better in some test cases).
>
> There is a small typo in the title: s/improbe/improve

Yes, I'll fix that.


> Please see also a comment below.
>
>>
>> Signed-off-by: Zoltan Kiss 
>> ---
>>   lib/librte_mempool/rte_mempool.h | 63
>> +---
>>   1 file changed, 39 insertions(+), 24 deletions(-)
>>
>> diff --git a/lib/librte_mempool/rte_mempool.h
>> b/lib/librte_mempool/rte_mempool.h
>> index a8054e1..896946c 100644
>> --- a/lib/librte_mempool/rte_mempool.h
>> +++ b/lib/librte_mempool/rte_mempool.h
>> @@ -948,34 +948,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void
>> **obj_table,
>>   unsigned lcore_id = rte_lcore_id();
>>   uint32_t cache_size = mp->cache_size;
>>
>> -/* cache is not enabled or single consumer */
>> +cache = &mp->local_cache[lcore_id];
>> +/* cache is not enabled or single consumer or not enough */
>>   if (unlikely(cache_size == 0 || is_mc == 0 ||
>> - n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>> + cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>   goto ring_dequeue;
>>
>> -cache = &mp->local_cache[lcore_id];
>>   cache_objs = cache->objs;
>>
>> -/* Can this be satisfied from the cache? */
>> -if (cache->len < n) {
>> -/* No. Backfill the cache first, and then fill from it */
>> -uint32_t req = n + (cache_size - cache->len);
>> -
>> -/* How many do we require i.e. number to fill the cache + the
>> request */
>> -ret = rte_ring_mc_dequeue_bulk(mp->ring,
>> &cache->objs[cache->len], req);
>> -if (unlikely(ret < 0)) {
>> -/*
>> - * In the offchance that we are buffer constrained,
>> - * where we are not able to allocate cache + n, go to
>> - * the ring directly. If that fails, we are truly out of
>> - * buffers.
>> - */
>> -goto ring_dequeue;
>> -}
>> -
>> -cache->len += req;
>> -}
>> -
>>   /* Now fill in the response ... */
>>   for (index = 0, len = cache->len - 1; index < n; ++index, len--,
>> obj_table++)
>>   *obj_table = cache_objs[len];
>> @@ -984,7 +964,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void
>> **obj_table,
>>
>>   __MEMPOOL_STAT_ADD(mp, get_success, n);
>>
>> -return 0;
>> +ret = 0;
>> +goto cache_refill;
>>
>>   ring_dequeue:
>>   #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>> @@ -995,11 +976,45 @@ ring_dequeue:
>>   else
>>   ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>
>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>> +if (ret < 0 && is_mc == 1 && cache->len > 0) {
>
> if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0))  ?

Ok

>
>
>> +uint32_t req = n - cache->len;
>> +
>> +ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>> +if (ret == 0) {
>> +cache_objs = cache->objs;
>> +obj_table += req;
>> +for (index = 0; index < cache->len;
>> + ++index, ++obj_table)
>> +*obj_table = cache_objs[index];
>> +cache->len = 0;
>> +}
>> +}
>> +#endif /* RTE_M

[dpdk-dev] [PATCH] mempool: improbe cache search

2015-06-25 Thread Zoltan Kiss
The current way has a few problems:

- if cache->len < n, we copy our elements into the cache first, then
  into obj_table; that's unnecessary
- if n >= cache_size (or the backfill fails), and we can't fulfil the
  request from the ring alone, we don't try to combine with the cache
- if refill fails, we don't return anything, even if the ring has enough
  for our request

This patch rewrites it substantially:
- in the first part of the function we only use the cache if cache->len >= n
- otherwise take our elements straight from the ring
- if that fails but we have something in the cache, try to combine them
- the refill happens at the end, and its failure doesn't modify our return
  value

Signed-off-by: Zoltan Kiss 
---
 lib/librte_mempool/rte_mempool.h | 63 +---
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index a8054e1..896946c 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -948,34 +948,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void 
**obj_table,
unsigned lcore_id = rte_lcore_id();
uint32_t cache_size = mp->cache_size;

-   /* cache is not enabled or single consumer */
+   cache = &mp->local_cache[lcore_id];
+   /* cache is not enabled or single consumer or not enough */
if (unlikely(cache_size == 0 || is_mc == 0 ||
-n >= cache_size || lcore_id >= RTE_MAX_LCORE))
+cache->len < n || lcore_id >= RTE_MAX_LCORE))
goto ring_dequeue;

-   cache = &mp->local_cache[lcore_id];
cache_objs = cache->objs;

-   /* Can this be satisfied from the cache? */
-   if (cache->len < n) {
-   /* No. Backfill the cache first, and then fill from it */
-   uint32_t req = n + (cache_size - cache->len);
-
-   /* How many do we require i.e. number to fill the cache + the 
request */
-   ret = rte_ring_mc_dequeue_bulk(mp->ring, 
&cache->objs[cache->len], req);
-   if (unlikely(ret < 0)) {
-   /*
-* In the offchance that we are buffer constrained,
-* where we are not able to allocate cache + n, go to
-* the ring directly. If that fails, we are truly out of
-* buffers.
-*/
-   goto ring_dequeue;
-   }
-
-   cache->len += req;
-   }
-
/* Now fill in the response ... */
for (index = 0, len = cache->len - 1; index < n; ++index, len--, 
obj_table++)
*obj_table = cache_objs[len];
@@ -984,7 +964,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,

__MEMPOOL_STAT_ADD(mp, get_success, n);

-   return 0;
+   ret = 0;
+   goto cache_refill;

 ring_dequeue:
 #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
@@ -995,11 +976,45 @@ ring_dequeue:
else
ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);

+#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
+   if (ret < 0 && is_mc == 1 && cache->len > 0) {
+   uint32_t req = n - cache->len;
+
+   ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
+   if (ret == 0) {
+   cache_objs = cache->objs;
+   obj_table += req;
+   for (index = 0; index < cache->len;
+++index, ++obj_table)
+   *obj_table = cache_objs[index];
+   cache->len = 0;
+   }
+   }
+#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
+
if (ret < 0)
__MEMPOOL_STAT_ADD(mp, get_fail, n);
else
__MEMPOOL_STAT_ADD(mp, get_success, n);

+#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
+cache_refill:
+   /* If previous dequeue was OK and we have less than n, start refill */
+   if (ret == 0 && cache_size > 0 && cache->len < n) {
+   uint32_t req = cache_size - cache->len;
+
+   cache_objs = cache->objs;
+   ret = rte_ring_mc_dequeue_bulk(mp->ring,
+  &cache->objs[cache->len],
+  req);
+   if (likely(ret == 0))
+   cache->len += req;
+   else
+   /* Don't spoil the return value */
+   ret = 0;
+   }
+#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
+
return ret;
 }

-- 
1.9.1



[dpdk-dev] [PATCH] ethdev: fix checking for tx_free_thresh

2015-06-23 Thread Zoltan Kiss
This parameter is not consistent between the drivers: some use it as
rte_eth_tx_burst() requires, some release buffers when the number of free
descriptors drops below this value.
Let's use it as most fast-path code does, which is the latter, and update
comments throughout the code to reflect that.
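
In code, the two interpretations side by side (cleanup() stands in for the 
per-driver xmit cleanup routine):

/* fast-path semantics, which this patch standardizes on: clean up when
 * the number of FREE descriptors drops below the threshold */
if (txq->nb_tx_free < txq->tx_free_thresh)
        cleanup(txq);

/* the old rte_ethdev.h wording: clean up when the number of USED
 * descriptors exceeds the threshold */
if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
        cleanup(txq);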

Signed-off-by: Zoltan Kiss 
---
 drivers/net/e1000/em_rxtx.c  |  7 ---
 drivers/net/fm10k/fm10k.h|  1 -
 drivers/net/fm10k/fm10k_ethdev.c |  1 -
 drivers/net/fm10k/fm10k_rxtx.c   |  2 +-
 drivers/net/i40e/i40e_rxtx.c |  2 +-
 drivers/net/i40e/i40e_rxtx.h |  5 +++--
 drivers/net/ixgbe/ixgbe_rxtx.c   |  3 +--
 drivers/net/ixgbe/ixgbe_rxtx.h   |  4 +++-
 drivers/net/virtio/virtio_rxtx.c |  2 +-
 lib/librte_ether/rte_ethdev.h| 11 ++-
 10 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index fdc825f..38f5c3b 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -182,7 +182,9 @@ struct em_tx_queue {
volatile uint32_t  *tdt_reg_addr; /**< Address of TDT register. */
uint16_t   nb_tx_desc;/**< number of TX descriptors. */
uint16_t   tx_tail;  /**< Current value of TDT register. */
-   uint16_t   tx_free_thresh;/**< minimum TX before freeing. */
+   /**< Start freeing TX buffers if there are less free descriptors than
+this value. */
+   uint16_t   tx_free_thresh;
/**< Number of TX descriptors to use before RS bit is set. */
uint16_t   tx_rs_thresh;
/** Number of TX descriptors used since RS bit was set. */
@@ -418,9 +420,8 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
txe = &sw_ring[tx_id];

/* Determine if the descriptor ring needs to be cleaned. */
-   if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh) {
+if (txq->nb_tx_free < txq->tx_free_thresh)
em_xmit_cleanup(txq);
-   }

/* TX loop */
for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
diff --git a/drivers/net/fm10k/fm10k.h b/drivers/net/fm10k/fm10k.h
index 1b542c5..c089882 100644
--- a/drivers/net/fm10k/fm10k.h
+++ b/drivers/net/fm10k/fm10k.h
@@ -200,7 +200,6 @@ struct fm10k_tx_queue {
uint16_t next_free;
uint16_t nb_free;
uint16_t nb_used;
-   uint16_t free_trigger;
uint16_t free_thresh;
uint16_t rs_thresh;
volatile uint32_t *tail_ptr;
diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index 406c350..f031f8f 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -203,7 +203,6 @@ tx_queue_reset(struct fm10k_tx_queue *q)
q->next_free = 0;
q->nb_used = 0;
q->nb_free = q->nb_desc - 1;
-   q->free_trigger = q->nb_free - q->free_thresh;
fifo_reset(&q->rs_tracker, (q->nb_desc + 1) / q->rs_thresh);
FM10K_PCI_REG_WRITE(q->tail_ptr, 0);
 }
diff --git a/drivers/net/fm10k/fm10k_rxtx.c b/drivers/net/fm10k/fm10k_rxtx.c
index f5d1ad0..7d5e32c 100644
--- a/drivers/net/fm10k/fm10k_rxtx.c
+++ b/drivers/net/fm10k/fm10k_rxtx.c
@@ -450,7 +450,7 @@ fm10k_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
mb = tx_pkts[count];

/* running low on descriptors? try to free some... */
-   if (q->nb_free < q->free_trigger)
+   if (q->nb_free < q->free_thresh)
tx_free_descriptors(q);

/* make sure there are enough free descriptors to transmit the
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 2de0ac4..fcacd34 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1236,7 +1236,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, 
uint16_t nb_pkts)
txe = &sw_ring[tx_id];

/* Check if the descriptor ring needs to be cleaned. */
-   if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
+   if (txq->nb_tx_free < txq->tx_free_thresh)
i40e_xmit_cleanup(txq);

for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index decbc3d..87222e6 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -136,8 +136,9 @@ struct i40e_tx_queue {
uint16_t last_desc_cleaned;
/**< Total number of TX descriptors ready to be allocated. */
uint16_t nb_tx_free;
-   /**< Number of TX descriptors to use before RS bit is set. */
-   uint16_t tx_free_thresh; /**< minimum TX before freeing. */
+   /**< Start freeing TX buffers if there are less free descriptors than
+this value. */
+   uint16_t tx_free_thresh;
/** Number of TX des

[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-06-09 Thread Zoltan Kiss


On 09/06/15 16:44, Ananyev, Konstantin wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Tuesday, June 09, 2015 4:08 PM
>> To: Ananyev, Konstantin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>
>>
>>
>> On 09/06/15 12:18, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Wednesday, June 03, 2015 6:47 PM
>>>> To: Ananyev, Konstantin; dev at dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>>>
>>>>
>>>>
>>>> On 02/06/15 18:35, Ananyev, Konstantin wrote:
>>>>>
>>>>>
>>>>>> -Original Message-
>>>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>>>> Sent: Tuesday, June 02, 2015 4:08 PM
>>>>>> To: Ananyev, Konstantin; dev at dpdk.org
>>>>>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/06/15 14:31, Ananyev, Konstantin wrote:
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>>>>>> Sent: Monday, June 01, 2015 5:16 PM
>>>>>>>> To: dev at dpdk.org
>>>>>>>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Anyone would like to review this patch? Venky sent a NAK, but I've
>>>>>>>> explained to him why it is a bug.
>>>>>>>
>>>>>>>
>>>>>>> Well, I think Venky is right here.
>>>>>> I think the comments above rte_eth_tx_burst() definition are quite clear
>>>>>> about what tx_free_thresh means, e1000 and i40e use it that way, but not
>>>>>> ixgbe.
>>>>>>
>>>>>>> Indeed that fix will cause more frequent unsuccessful checks for DD bits 
>>>>>>> and might cause a
>>>>>>> slowdown for the TX fast-path.
>>>>>> Not if the applications set tx_free_thresh according to the definition
>>>>>> of this value. But we can change the default value from 32 to something
>>>>>> higher, e.g. I'm using nb_desc/2, and it works out well.
>>>>>
>>>>> Sure we can, as I said below, we can unify it one way or another.
>>>>> One way would be to make fast-path TX to free TXDs when number of 
>>>>> occupied TXDs raises above tx_free_thresh
>>>>> (what rte_ethdev.h comments say and what full-featured TX is doing).
>>>>> Though in that case we have to change default value for tx_free_thresh, 
>>>>> and all existing apps that
>>>>> using tx_free_thresh==32 and fast-path TX will probably experience a 
>>>>> slowdown.
>>>>
>>>> They are in trouble already, because i40e and e1000 use it as defined.
>>>
>>> In fact, i40e has exactly the same problem as ixgbe:
>>> fast-path and full-featured TX code treat tx_free_thresh in a different 
>>> way.
>>> igb just ignores the input tx_free_thresh, while em has only the full-featured path.
>>>
>>> What I am saying, existing app that uses TX fast-path and sets 
>>> tx_free_thresh=32
>>> (as we did in our examples in previous versions) will experience a slowdown,
>>> if we'll make all TX functions to behave like full-featured ones
>>> (txq->nb_tx_desc - txq->nb_tx_free > txq->tx_free_thresh).
>>>
>>>   From the other side, if an app uses full-featured TX and sets 
>>> tx_free_thresh=32,
>>> then it already has a possible slowdown, because of too-frequent TXD 
>>> checking.
>>> So, if we'll change the tx_free_thresh semantics to what the fast-path uses,
>>> it shouldn't see any slowdown; in fact it might see some improvement.
>>>
>>>> But I guess most apps are going with 0, which sets the driver's default.
>>>> Others have to change the value to nb_txd - curr_value to have the same
>>>> behaviour
>>>>
>>>>

[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-06-09 Thread Zoltan Kiss


On 09/06/15 12:18, Ananyev, Konstantin wrote:
>
>
>> -Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Wednesday, June 03, 2015 6:47 PM
>> To: Ananyev, Konstantin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>
>>
>>
>> On 02/06/15 18:35, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Tuesday, June 02, 2015 4:08 PM
>>>> To: Ananyev, Konstantin; dev at dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>>>
>>>>
>>>>
>>>> On 02/06/15 14:31, Ananyev, Konstantin wrote:
>>>>> Hi Zoltan,
>>>>>
>>>>>> -Original Message-
>>>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>>>> Sent: Monday, June 01, 2015 5:16 PM
>>>>>> To: dev at dpdk.org
>>>>>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Anyone would like to review this patch? Venky sent a NAK, but I've
>>>>>> explained to him why it is a bug.
>>>>>
>>>>>
>>>>> Well, I think Venky is right here.
>>>> I think the comments above rte_eth_tx_burst() definition are quite clear
>>>> about what tx_free_thresh means, e1000 and i40e use it that way, but not
>>>> ixgbe.
>>>>
>>>>> Indeed that fix will cause more frequent unsuccessful checks for DD bits 
>>>>> and might cause a
>>>>> slowdown for the TX fast-path.
>>>> Not if the applications set tx_free_thresh according to the definition
>>>> of this value. But we can change the default value from 32 to something
>>>> higher, e.g. I'm using nb_desc/2, and it works out well.
>>>
>>> Sure we can, as I said below, we can unify it one way or another.
>>> One way would be to make fast-path TX to free TXDs when number of occupied 
>>> TXDs raises above tx_free_thresh
>>> (what rte_ethdev.h comments say and what full-featured TX is doing).
>>> Though in that case we have to change default value for tx_free_thresh, and 
>>> all existing apps that
>>> using tx_free_thresh==32 and fast-path TX will probably experience a 
>>> slowdown.
>>
>> They are in trouble already, because i40e and e1000 use it as defined.
>
> In fact, i40e has exactly the same problem as ixgbe:
> fast-path and full-featured TX code treat tx_free_thresh in a different way.
> igb just ignores the input tx_free_thresh, while em has only the full-featured path.
>
> What I am saying, existing app that uses TX fast-path and sets 
> tx_free_thresh=32
> (as we did in our examples in previous versions) will experience a slowdown,
> if we'll make all TX functions to behave like full-featured ones
> (txq->nb_tx_desc - txq->nb_tx_free > txq->tx_free_thresh).
>
>  From other side, if app uses TX full-featured TX and sets tx_free_thresh=32,
> then it  already has a possible slowdown, because of too often TXDs checking.
> So, if we'll change tx_free_thresh semantics to wht fast-path uses,
> It shouldn't see any slowdown, in fact it might see some improvement.
>
>> But I guess most apps are going with 0, which sets the driver's default.
>> Others have to change the value to nb_txd - curr_value to have the same
>> behaviour
>>
>>> Another way would be to make all TX functions treat
>>> tx_conf->tx_free_thresh as fast-path TX functions do
>>> (free TXDs when the number of free TXDs drops below tx_free_thresh) and
>>> update rte_ethdev.h comments.
>> And the i40e and e1000e code as well. I don't see what difference it makes
>> which definition you use; what I care about is that it should be used
>> consistently.
>
> Yes, both ways are possible, the concern is - how to minimise the impact for 
> existing apps.
> That's why I am leaning to the fast-path way.

Makes sense to favour the fast-path way; I'll look into that and try to 
come up with a patch.
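
For reference, a minimal sketch of the two semantics in question; the
queue fields are the ones the drivers use, the surrounding wrapper is
illustrative only:

/* fast-path semantics: clean when the number of *free* descriptors
 * drops below the threshold (what ixgbe's tx_xmit_pkts() does) */
if (txq->nb_tx_free < txq->tx_free_thresh)
        ixgbe_tx_free_bufs(txq);

/* full-featured semantics: clean when the number of *in-flight*
 * descriptors rises above the threshold (what rte_ethdev.h documents) */
if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
        ixgbe_tx_free_bufs(txq);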

>
>>>
>>> Though, I am not sure that it is really worth all these changes.
>>>   From one side, whatever tx_free_thresh would be,
>>> the app should still assume that the worst case might happen,
>>> and up to nb_tx_desc mbufs can be consumed by the queue.
>>>   From other side, I think the d

[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-06-03 Thread Zoltan Kiss


On 02/06/15 18:35, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Tuesday, June 02, 2015 4:08 PM
>> To: Ananyev, Konstantin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>
>>
>>
>> On 02/06/15 14:31, Ananyev, Konstantin wrote:
>>> Hi Zoltan,
>>>
>>>> -----Original Message-----
>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>> Sent: Monday, June 01, 2015 5:16 PM
>>>> To: dev at dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>>>
>>>> Hi,
>>>>
>>>> Would anyone like to review this patch? Venky sent a NAK, but I've
>>>> explained to him why it is a bug.
>>>
>>>
>>> Well, I think Venky is right here.
>> I think the comments above rte_eth_tx_burst() definition are quite clear
>> about what tx_free_thresh means, e1000 and i40e use it that way, but not
>> ixgbe.
>>
>>> Indeed, that fix will cause more frequent unsuccessful checks for DD bits
>>> and might cause a
>>> slowdown for the TX fast-path.
>> Not if the applications set tx_free_thresh according to the definition
>> of this value. But we can change the default value from 32 to something
>> higher, e.g. I'm using nb_desc/2, and it works out well.
>
> Sure we can, as I said below, we can unify it one way or another.
> One way would be to make fast-path TX free TXDs when the number of occupied
> TXDs rises above tx_free_thresh
> (what rte_ethdev.h comments say and what full-featured TX is doing).
> Though in that case we have to change the default value for tx_free_thresh, and
> all existing apps that
> use tx_free_thresh==32 and fast-path TX will probably experience a slowdown.

They are in trouble already, because i40e and e1000 use it as defined. 
But I guess most apps are going with 0, which sets the driver's default. 
Others have to change the value to nb_txd - curr_value to have the same 
behaviour

> Another way would be to make all TX functions treat
> tx_conf->tx_free_thresh as fast-path TX functions do
> (free TXDs when the number of free TXDs drops below tx_free_thresh) and update
> rte_ethdev.h comments.
And the i40e and e1000e code as well. I don't see what difference it makes 
which definition you use; what I care about is that it should be used 
consistently.
>
> Though, I am not sure that it is really worth all these changes.
>  From one side, whatever tx_free_thresh would be,
> the app should still assume that the worst case might happen,
> and up to nb_tx_desc mbufs can be consumed by the queue.
>  From other side, I think the default value should work well for most cases.
> So I am still for graceful deprecation of that config parameter, see below.
>
>>
>>> Anyway, with current PMD implementation, you can't guarantee that at any 
>>> moment
>>> TX queue wouldn't use more than tx_free_thresh mbufs.
>>
>>
>>> There could be situations (low speed, or link is down for some short 
>>> period, etc), when
>>> much more than tx_free_thresh TXDs are in use and none of them could be 
>>> freed by HW right now.
>>> So your app better be prepared, that up to (nb_tx_desc * num_of_TX_queues) 
>>> could be in use
>>> by TX path at any given moment.
>>>
>>> Though yes, there is an inconsistency in how different ixgbe TX functions
>>> treat the tx_conf->tx_free_thresh parameter.
>>> That probably creates wrong expectations and confusion.
>> Yes, ixgbe_xmit_pkts() uses it the way it's defined; these two functions
>> don't.
>>
>>> We might try to unify its usage one way or another, but I personally don't
>>> see much point in it.
>>> After all, tx_free_thresh seems like a driver-internal choice (based on the
>>> nb_tx_desc and other parameters).
>>> So I think a better way would be:
>>> 1. Deprecate tx_conf->tx_free_thresh (and remove it in later releases) and
>>> make
>>> each driver use what it thinks would be the best value.
>> But how does the driver know what's best for the application's
>> traffic pattern? I think it's better to leave the possibility for the
>> app to fine tune it.
>
> My understanding is that for most cases the default value should do pretty 
> well.
> That default value shouldn't be too small, so we avoid unnecessary &
> unsuccessful checks,
> and probably shouldn't be too big, to prevent unnecessary mbuf consumption

[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-06-02 Thread Zoltan Kiss


On 02/06/15 14:31, Ananyev, Konstantin wrote:
> Hi Zoltan,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Monday, June 01, 2015 5:16 PM
>> To: dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh
>>
>> Hi,
>>
>> Would anyone like to review this patch? Venky sent a NAK, but I've
>> explained to him why it is a bug.
>
>
> Well, I think Venky is right here.
I think the comments above rte_eth_tx_burst() definition are quite clear 
about what tx_free_thresh means, e1000 and i40e use it that way, but not 
ixgbe.

> Indeed, that fix will cause more frequent unsuccessful checks for DD bits and
> might cause a
> slowdown for the TX fast-path.
Not if the applications set tx_free_thresh according to the definition 
of this value. But we can change the default value from 32 to something 
higher, e.g. I'm using nb_desc/2, and it works out well.

> Anyway, with current PMD implementation, you can't guarantee that at any 
> moment
> TX queue wouldn't use more than tx_free_thresh mbufs.


> There could be situations (low speed, or link is down for some short period, 
> etc), when
> much more than tx_free_thresh TXDs are in use and none of them could be freed 
> by HW right now.
> So your app better be prepared, that up to (nb_tx_desc * num_of_TX_queues) 
> could be in use
> by TX path at any given moment.
>
> Though yes, there is an inconsistency in how different ixgbe TX functions treat
> the tx_conf->tx_free_thresh parameter.
> That probably creates wrong expectations and confusion.
Yes, ixgbe_xmit_pkts() uses it the way it's defined; these two functions 
don't.

> We might try to unify its usage one way or another, but I personally don't
> see much point in it.
> After all, tx_free_thresh seems like a driver-internal choice (based on the
> nb_tx_desc and other parameters).
> So I think a better way would be:
> 1. Deprecate tx_conf->tx_free_thresh (and remove it in later releases) and
> make
> each driver use what it thinks would be the best value.
But how does the driver know what's best for the application's
traffic pattern? I think it's better to leave the possibility for the 
app to fine tune it.
In the meantime we can improve the default selection as well, as I 
suggested above.

> 2. As you suggested in another mail, introduce a new function:
> uint16_t rte_eth_tx_free_pkts(port_id, queue_id, nb_to_free).
> That would give the upper layer better control of memory usage, and might be
> called by the upper layer at idle time,
> so further tx_burst calls don't need to spend time on freeing TXDs/packets.
I agree.
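
For illustration, an application might use the proposed function roughly
like this; the name and signature are only what this thread suggests
(nothing like it existed at the time), and the watermark check is a
hypothetical application-side helper:

/* last resort at idle time: reclaim completed TX mbufs */
if (app_mbuf_pool_running_low(mbuf_pool))
        rte_eth_tx_free_pkts(port_id, queue_id, 64 /* nb_to_free */);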

>
> Konstantin
>
>
>>
>> Regards,
>>
>> Zoltan
>>
>> On 27/05/15 21:12, Zoltan Kiss wrote:
>>> This check doesn't do what's required by rte_eth_tx_burst:
>>> "When the number of previously sent packets reached the "minimum transmit
>>> packets to free" threshold"
>>>
>>> This can cause problems when txq->tx_free_thresh + [number of elements in 
>>> the
>>> pool] < txq->nb_tx_desc.
>>>
>>> Signed-off-by: Zoltan Kiss 
>>> ---
>>>drivers/net/ixgbe/ixgbe_rxtx.c | 4 ++--
>>>drivers/net/ixgbe/ixgbe_rxtx_vec.c | 2 +-
>>>2 files changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> index 4f9ab22..b70ed8c 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -250,10 +250,10 @@ tx_xmit_pkts(void *tx_queue, struct rte_mbuf 
>>> **tx_pkts,
>>>
>>> /*
>>>  * Begin scanning the H/W ring for done descriptors when the
>>> -* number of available descriptors drops below tx_free_thresh.  For
>>> +* number of in flight descriptors reaches tx_free_thresh. For
>>>  * each done descriptor, free the associated buffer.
>>>  */
>>> -   if (txq->nb_tx_free < txq->tx_free_thresh)
>>> +   if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
>>> ixgbe_tx_free_bufs(txq);
>>>
>>> /* Only use descriptors that are available */
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
>>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> index abd10f6..f91c698 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> @@ -598,7 +598,7 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf 
>>> **tx_pkts,
>>> if (unlikely(nb_pkts > RTE_IXGBE_VPMD_TX_BURST))
>>> nb_pkts = RTE_IXGBE_VPMD_TX_BURST;
>>>
>>> -   if (txq->nb_tx_free < txq->tx_free_thresh)
>>> +   if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
>>> ixgbe_tx_free_bufs(txq);
>>>
>>> nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
>>>


[dpdk-dev] Free up completed TX buffers

2015-06-01 Thread Zoltan Kiss


On 01/06/15 09:50, Andriy Berestovskyy wrote:
> Hi Zoltan,
>
> On Fri, May 29, 2015 at 7:00 PM, Zoltan Kiss  
> wrote:
>> The easy way is just to increase your buffer pool's size to make
>> sure that doesn't happen.
>
> Go for it!

I went for it; my question is whether it is a good and worthwhile idea 
to give applications a last-resort option for rainy days. It's a 
problem which probably won't occur very often, but when it does, I think 
it can take painfully long to figure out what's wrong.
>
>>   But there is no bulletproof way to calculate such
>> a number
>
> Yeah, there are many places for mbufs to stay :( I would try:
>
> Mempool size = sum(numbers of all TX descriptors)
>  + sum(rx_free_thresh)
>  + (mempool cache size * (number of lcores - 1))
>  + (burst size * number of lcores)

It heavily depends on what your application does, and I think it's easy 
to make a mistake in these calculations.
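
For illustration only, Andriy's formula above spelled out in code; every
input is a placeholder that the application would have to supply for its
own configuration:

/* worst-case mempool sizing sketch (all values are app-specific) */
unsigned pool_size = nb_tx_queues * nb_tx_desc       /* all TX rings       */
                   + nb_rx_queues * rx_free_thresh   /* RX free thresholds */
                   + cache_size * (nb_lcores - 1)    /* per-lcore caches   */
                   + burst_size * nb_lcores;         /* in-flight bursts   */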

>
>> I'm thinking about a foolproof way, which is exposing functions like
>> ixgbe_tx_free_bufs from the PMDs, so the application can call it as a last
>> resort to avoid deadlock.
>
> Have a look at rte_eth_dev_tx_queue_stop()/start(). Some NICs (i.e.
> ixgbe) do reset the queue and free all the mbufs.

That's a bit drastic; I just want to flush the finished TX buffers, even 
if tx_free_thresh has not been reached.
An easy option would be to use rte_eth_tx_burst(..., nb_pkts=0); I'm 
already using this to enforce TX completion when it's really needed. It 
checks for tx_free_thresh, like this:

/* Check if the descriptor ring needs to be cleaned. */
if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
i40e_xmit_cleanup(txq);

My idea is to extend this condition and add " || nb_pkts == 0", so you 
can force cleanup. But there might be others who use this same way to 
do manual TX completion, and they expect that it only happens when 
tx_free_thresh is reached.
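
The extension described above would look roughly like this (a sketch
only; it was never merged in this form):

/* clean if the threshold is reached, or if the caller explicitly
 * requested a flush by passing nb_pkts == 0 */
if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh ||
    nb_pkts == 0)
        i40e_xmit_cleanup(txq);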

>
> Regards,
> Andriy
>


[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-06-01 Thread Zoltan Kiss
Hi,

Would anyone like to review this patch? Venky sent a NAK, but I've 
explained to him why it is a bug.

Regards,

Zoltan

On 27/05/15 21:12, Zoltan Kiss wrote:
> This check doesn't do what's required by rte_eth_tx_burst:
> "When the number of previously sent packets reached the "minimum transmit
> packets to free" threshold"
>
> This can cause problems when txq->tx_free_thresh + [number of elements in the
> pool] < txq->nb_tx_desc.
>
> Signed-off-by: Zoltan Kiss 
> ---
>   drivers/net/ixgbe/ixgbe_rxtx.c | 4 ++--
>   drivers/net/ixgbe/ixgbe_rxtx_vec.c | 2 +-
>   2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 4f9ab22..b70ed8c 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -250,10 +250,10 @@ tx_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>
>   /*
>* Begin scanning the H/W ring for done descriptors when the
> -  * number of available descriptors drops below tx_free_thresh.  For
> +  * number of in flight descriptors reaches tx_free_thresh. For
>* each done descriptor, free the associated buffer.
>*/
> - if (txq->nb_tx_free < txq->tx_free_thresh)
> + if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
>   ixgbe_tx_free_bufs(txq);
>
>   /* Only use descriptors that are available */
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index abd10f6..f91c698 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -598,7 +598,7 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf 
> **tx_pkts,
>   if (unlikely(nb_pkts > RTE_IXGBE_VPMD_TX_BURST))
>   nb_pkts = RTE_IXGBE_VPMD_TX_BURST;
>
> - if (txq->nb_tx_free < txq->tx_free_thresh)
> + if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
>   ixgbe_tx_free_bufs(txq);
>
>   nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
>


[dpdk-dev] Free up completed TX buffers

2015-05-29 Thread Zoltan Kiss
Hi,

I've come across another problem while sorting out the one fixed by 
my patch "ixgbe: fix checking for tx_free_thresh". Even when the 
threshold check is correct, it can happen that the application runs out of 
free buffers, and the only solution would be to get back the ones from 
the TX rings. But if their number is still less than tx_free_thresh (per 
queue), currently there is no interface to achieve that.
The bad way is to set tx_free_thresh to 1, but it has a very bad 
performance penalty. The easy way is just to increase your buffer pool's 
size to make sure that doesn't happen. But there is no bulletproof way 
to calculate such a number, and based on my experience it's hard to 
debug when it causes a problem.
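
A concrete instance of that deadlock, written as a standalone check (all
numbers are hypothetical):

#include <stdio.h>

int main(void)
{
        /* example configuration that can hang with the buggy check */
        unsigned nb_tx_desc = 512, tx_free_thresh = 32, pool_size = 256;
        /* all 256 pool mbufs sit on the TX ring awaiting completion,
         * so 512 - 256 = 256 descriptors still count as free */
        unsigned nb_tx_free = nb_tx_desc - pool_size;

        /* buggy check: frees only when free descs drop below threshold */
        printf("cleanup runs: %s\n",
               nb_tx_free < tx_free_thresh ? "yes" : "never -> deadlock");
        return 0;
}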
I'm thinking about a foolproof way, which is exposing functions like 
ixgbe_tx_free_bufs from the PMDs, so the application can call it as a 
last resort to avoid deadlock. It probably causes worse 
performance, but at least fools like me will easily see that from e.g. 
oprofile.
How does that sound? Or is there a better way to solve this problem?

Regards,

Zoli


[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-05-28 Thread Zoltan Kiss
The requirements for rte_eth_tx_burst(), which calls a driver-specific 
function, are, in the case of ixgbe, these two:

"It is the responsibility of the rte_eth_tx_burst() function to 
transparently free the memory buffers of packets previously sent. This 
feature is driven by the *tx_free_thresh* value supplied to the 
rte_eth_dev_configure() function at device configuration time. When the 
number of previously sent packets reached the "minimum transmit packets 
to free" threshold, the rte_eth_tx_burst() function must [attempt to] 
free the *rte_mbuf*  buffers of those packets whose transmission was 
effectively completed."

Also rte_eth_tx_queue_setup() uses the same description for tx_free_thresh:

"The *tx_free_thresh* value indicates the [minimum] number of network 
buffers that must be pending in the transmit ring to trigger their 
[implicit] freeing by the driver transmit function."

And all the other poll mode drivers are using this formula. Plus I've 
described a possible hang situation in the commit message.

On 28/05/15 11:50, Venkatesan, Venky wrote:
> NAK. This causes more (unsuccessful) cleanup attempts on the descriptor ring. 
> What is motivating this change?
>
> Regards,
> Venky
>
>
>> On May 28, 2015, at 1:42 AM, Zoltan Kiss  wrote:
>>
>> This check doesn't do what's required by rte_eth_tx_burst:
>> "When the number of previously sent packets reached the "minimum transmit
>> packets to free" threshold"
>>
>> This can cause problems when txq->tx_free_thresh + [number of elements in the
>> pool] < txq->nb_tx_desc.
>>
>> Signed-off-by: Zoltan Kiss 
>> ---
>> drivers/net/ixgbe/ixgbe_rxtx.c | 4 ++--
>> drivers/net/ixgbe/ixgbe_rxtx_vec.c | 2 +-
>> 2 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index 4f9ab22..b70ed8c 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -250,10 +250,10 @@ tx_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>>
>> /*
>>  * Begin scanning the H/W ring for done descriptors when the
>> - * number of available descriptors drops below tx_free_thresh.  For
>> + * number of in flight descriptors reaches tx_free_thresh. For
>>  * each done descriptor, free the associated buffer.
>>  */
>> -if (txq->nb_tx_free < txq->tx_free_thresh)
>> +if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
>> ixgbe_tx_free_bufs(txq);
>>
>> /* Only use descriptors that are available */
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> index abd10f6..f91c698 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>> @@ -598,7 +598,7 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf 
>> **tx_pkts,
>> if (unlikely(nb_pkts > RTE_IXGBE_VPMD_TX_BURST))
>> nb_pkts = RTE_IXGBE_VPMD_TX_BURST;
>>
>> -if (txq->nb_tx_free < txq->tx_free_thresh)
>> +if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
>> ixgbe_tx_free_bufs(txq);
>>
>> nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
>> --
>> 1.9.1
>>


[dpdk-dev] [PATCH] ixgbe: fix checking for tx_free_thresh

2015-05-27 Thread Zoltan Kiss
This check doesn't do what's required by rte_eth_tx_burst:
"When the number of previously sent packets reached the "minimum transmit
packets to free" threshold"

This can cause problems when txq->tx_free_thresh + [number of elements in the
pool] < txq->nb_tx_desc.

Signed-off-by: Zoltan Kiss 
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 4 ++--
 drivers/net/ixgbe/ixgbe_rxtx_vec.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 4f9ab22..b70ed8c 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -250,10 +250,10 @@ tx_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,

/*
 * Begin scanning the H/W ring for done descriptors when the
-* number of available descriptors drops below tx_free_thresh.  For
+* number of in flight descriptors reaches tx_free_thresh. For
 * each done descriptor, free the associated buffer.
 */
-   if (txq->nb_tx_free < txq->tx_free_thresh)
+   if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
ixgbe_tx_free_bufs(txq);

/* Only use descriptors that are available */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
index abd10f6..f91c698 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
@@ -598,7 +598,7 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf 
**tx_pkts,
if (unlikely(nb_pkts > RTE_IXGBE_VPMD_TX_BURST))
nb_pkts = RTE_IXGBE_VPMD_TX_BURST;

-   if (txq->nb_tx_free < txq->tx_free_thresh)
+   if ((txq->nb_tx_desc - txq->nb_tx_free) > txq->tx_free_thresh)
ixgbe_tx_free_bufs(txq);

nb_commit = nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
-- 
1.9.1



[dpdk-dev] [PATCH] mempool: limit cache_size

2015-05-18 Thread Zoltan Kiss


On 18/05/15 15:13, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, May 18, 2015 2:31 PM
>> To: Ananyev, Konstantin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] mempool: limit cache_size
>>
>>
>>
>> On 18/05/15 14:14, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>>>> Sent: Monday, May 18, 2015 1:50 PM
>>>> To: Ananyev, Konstantin; dev at dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH] mempool: limit cache_size
>>>>
>>>>
>>>>
>>>> On 18/05/15 13:41, Ananyev, Konstantin wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>>>> Sent: Monday, May 18, 2015 1:28 PM
>>>>>> To: dev at dpdk.org
>>>>>> Subject: Re: [dpdk-dev] [PATCH] mempool: limit cache_size
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Any opinion on this patch?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> On 13/05/15 19:59, Zoltan Kiss wrote:
>>>>>>> Otherwise cache_flushthresh can be bigger than n, and
>>>>>>> a consumer can starve others by keeping every element
>>>>>>> either in use or in the cache.
>>>>>>>
>>>>>>> Signed-off-by: Zoltan Kiss 
>>>>>>> ---
>>>>>>>  lib/librte_mempool/rte_mempool.c | 3 ++-
>>>>>>>  lib/librte_mempool/rte_mempool.h | 2 +-
>>>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/lib/librte_mempool/rte_mempool.c 
>>>>>>> b/lib/librte_mempool/rte_mempool.c
>>>>>>> index cf7ed76..ca6cd9c 100644
>>>>>>> --- a/lib/librte_mempool/rte_mempool.c
>>>>>>> +++ b/lib/librte_mempool/rte_mempool.c
>>>>>>> @@ -440,7 +440,8 @@ rte_mempool_xmem_create(const char *name, unsigned 
>>>>>>> n, unsigned elt_size,
>>>>>>> mempool_list = RTE_TAILQ_CAST(rte_mempool_tailq.head, 
>>>>>>> rte_mempool_list);
>>>>>>>
>>>>>>> /* asked cache too big */
>>>>>>> -   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE) {
>>>>>>> +   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
>>>>>>> +   (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n) {
>>>>>>> rte_errno = EINVAL;
>>>>>>> return NULL;
>>>>>>> }
>>>>>
>>>>> Why not just 'cache_size > n' then?
>>>>
>>>> The commit message says: "Otherwise cache_flushthresh can be bigger than
>>>> n, and a consumer can starve others by keeping every element either in
>>>> use or in the cache."
>>>
>>> Ah yes, you're right - your condition is more restrictive, which is better.
>>> Though here you implicitly convert cache_size and n to floats and compare 2
>>> floats:
>>> (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n)
>>> Shouldn't it be:
>>> (uint32_t)(cache_size * CACHE_FLUSHTHRESH_MULTIPLIER) > n)
>>> So we do the conversion back to uint32_t and compare unsigned integers instead?
>>> Same as below:
>>> mp->cache_flushthresh = (uint32_t)
>>>   (cache_size * CACHE_FLUSHTHRESH_MULTIPLIER);
>>
>> To bring it further: how about ditching the whole cache_flushthresh
>> member of the mempool structure, and use this:
>>
>> #define CACHE_FLUSHTHRESH(mp) (uint32_t)((mp)->cache_size * 1.5)
>
> That's quite expensive and I think would slow down mempool_put() quite a lot.
> So I'd suggest we keep cache_flushthresh as it is.
Ok, I have posted a v2 based on your suggestion.
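
To make the cost argument concrete, the two variants of the put() hot
path would look roughly like this (a sketch; the helper and field names
are abbreviated for illustration):

/* v2 approach: threshold precomputed once at pool creation time,
 * so the hot path is a plain integer comparison */
if (cache->len >= mp->cache_flushthresh)
        flush_cache(mp, cache);

/* macro approach: an int-to-float conversion and a floating-point
 * multiply on every single put() */
if (cache->len >= (uint32_t)((mp)->cache_size * 1.5))
        flush_cache(mp, cache);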
>
>>
>> Furthermore, do we want to expose the flush threshold multiplier through
>> the config file?
>
> Hmm, my opinion is no - so far no one has asked for that,
> and as a general tendency we are trying to reduce the number of options in the
> config file.
> Do you have a

[dpdk-dev] [PATCH v2] mempool: limit cache_size

2015-05-18 Thread Zoltan Kiss
Otherwise cache_flushthresh can be bigger than n, and
a consumer can starve others by keeping every element
either in use or in the cache.

Signed-off-by: Zoltan Kiss 
---
v2: use macro for calculation, with proper casting

 lib/librte_mempool/rte_mempool.c | 8 +---
 lib/librte_mempool/rte_mempool.h | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index cf7ed76..5cfb96b 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -68,6 +68,8 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 EAL_REGISTER_TAILQ(rte_mempool_tailq)

 #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
+#define CALC_CACHE_FLUSHTHRESH(c)  \
+   ((typeof (c))((c) *  CACHE_FLUSHTHRESH_MULTIPLIER))

 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -440,7 +442,8 @@ rte_mempool_xmem_create(const char *name, unsigned n, 
unsigned elt_size,
mempool_list = RTE_TAILQ_CAST(rte_mempool_tailq.head, rte_mempool_list);

/* asked cache too big */
-   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE) {
+   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
+   CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
rte_errno = EINVAL;
return NULL;
}
@@ -565,8 +568,7 @@ rte_mempool_xmem_create(const char *name, unsigned n, 
unsigned elt_size,
mp->header_size = objsz.header_size;
mp->trailer_size = objsz.trailer_size;
mp->cache_size = cache_size;
-   mp->cache_flushthresh = (uint32_t)
-   (cache_size * CACHE_FLUSHTHRESH_MULTIPLIER);
+   mp->cache_flushthresh = CALC_CACHE_FLUSHTHRESH(cache_size);
mp->private_data_size = private_data_size;

/* calculate address of the first element for continuous mempool. */
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 9001312..a4a9610 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -468,7 +468,7 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, 
void *);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE. It is advised to choose
+ *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to choose
  *   cache_size to have "n modulo cache_size == 0": if this is
  *   not the case, some elements will always stay in the pool and will
  *   never be used. The access to the per-lcore table is of course
-- 
1.9.1



[dpdk-dev] [PATCH] mempool: limit cache_size

2015-05-18 Thread Zoltan Kiss


On 18/05/15 14:14, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, May 18, 2015 1:50 PM
>> To: Ananyev, Konstantin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] mempool: limit cache_size
>>
>>
>>
>> On 18/05/15 13:41, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>> Sent: Monday, May 18, 2015 1:28 PM
>>>> To: dev at dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH] mempool: limit cache_size
>>>>
>>>> Hi,
>>>>
>>>> Any opinion on this patch?
>>>>
>>>> Regards,
>>>>
>>>> Zoltan
>>>>
>>>> On 13/05/15 19:59, Zoltan Kiss wrote:
>>>>> Otherwise cache_flushthresh can be bigger than n, and
>>>>> a consumer can starve others by keeping every element
>>>>> either in use or in the cache.
>>>>>
>>>>> Signed-off-by: Zoltan Kiss 
>>>>> ---
>>>>> lib/librte_mempool/rte_mempool.c | 3 ++-
>>>>> lib/librte_mempool/rte_mempool.h | 2 +-
>>>>> 2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/lib/librte_mempool/rte_mempool.c 
>>>>> b/lib/librte_mempool/rte_mempool.c
>>>>> index cf7ed76..ca6cd9c 100644
>>>>> --- a/lib/librte_mempool/rte_mempool.c
>>>>> +++ b/lib/librte_mempool/rte_mempool.c
>>>>> @@ -440,7 +440,8 @@ rte_mempool_xmem_create(const char *name, unsigned n, 
>>>>> unsigned elt_size,
>>>>>   mempool_list = RTE_TAILQ_CAST(rte_mempool_tailq.head, 
>>>>> rte_mempool_list);
>>>>>
>>>>>   /* asked cache too big */
>>>>> - if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE) {
>>>>> + if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
>>>>> + (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n) {
>>>>>   rte_errno = EINVAL;
>>>>>   return NULL;
>>>>>   }
>>>
>>> Why not just 'cache_size > n' then?
>>
>> The commit message says: "Otherwise cache_flushthresh can be bigger than
>> n, and a consumer can starve others by keeping every element either in
>> use or in the cache."
>
> Ah yes, you're right - your condition is more restrictive, which is better.
> Though here you implicitly convert cache_size and n to floats and compare 2
> floats:
> (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n)
> Shouldn't it be:
> (uint32_t)(cache_size * CACHE_FLUSHTHRESH_MULTIPLIER) > n)
> So we do the conversion back to uint32_t and compare unsigned integers instead?
> Same as below:
> mp->cache_flushthresh = (uint32_t)
>  (cache_size * CACHE_FLUSHTHRESH_MULTIPLIER);

To bring it further: how about ditching the whole cache_flushthresh 
member of the mempool structure, and use this:

#define CACHE_FLUSHTHRESH(mp) (uint32_t)((mp)->cache_size * 1.5)

Furthermore, do we want to expose the flush threshold multiplier through 
the config file?

> ?
>
> In fact, as we use it more than once, it probably makes sense to create a 
> macro for it,
> something like:
> #define CALC_CACHE_FLUSHTHRESH(c) ((uint32_t)((c) *
> CACHE_FLUSHTHRESH_MULTIPLIER))
>
> Or even
>
> #define CALC_CACHE_FLUSHTHRESH(c) ((typeof (c))((c) *
> CACHE_FLUSHTHRESH_MULTIPLIER))
>
>
> Konstantin
>
>>
>>> Konstantin
>>>
>>>>> diff --git a/lib/librte_mempool/rte_mempool.h 
>>>>> b/lib/librte_mempool/rte_mempool.h
>>>>> index 9001312..a4a9610 100644
>>>>> --- a/lib/librte_mempool/rte_mempool.h
>>>>> +++ b/lib/librte_mempool/rte_mempool.h
>>>>> @@ -468,7 +468,7 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool 
>>>>> *, void *);
>>>>>  *   If cache_size is non-zero, the rte_mempool library will try to
>>>>>  *   limit the accesses to the common lockless pool, by maintaining a
>>>>>  *   per-lcore object cache. This argument must be lower or equal to
>>>>> - *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE. It is advised to choose
>>>>> + *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to 
>>>>> choose
>>>>>  *   cache_size to have "n modulo cache_size == 0": if this is
>>>>>  *   not the case, some elements will always stay in the pool and will
>>>>>  *   never be used. The access to the per-lcore table is of course
>>>>>


[dpdk-dev] [PATCH] mempool: limit cache_size

2015-05-18 Thread Zoltan Kiss


On 18/05/15 13:41, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Monday, May 18, 2015 1:28 PM
>> To: dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] mempool: limit cache_size
>>
>> Hi,
>>
>> Any opinion on this patch?
>>
>> Regards,
>>
>> Zoltan
>>
>> On 13/05/15 19:59, Zoltan Kiss wrote:
>>> Otherwise cache_flushthresh can be bigger than n, and
>>> a consumer can starve others by keeping every element
>>> either in use or in the cache.
>>>
>>> Signed-off-by: Zoltan Kiss 
>>> ---
>>>lib/librte_mempool/rte_mempool.c | 3 ++-
>>>lib/librte_mempool/rte_mempool.h | 2 +-
>>>2 files changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.c 
>>> b/lib/librte_mempool/rte_mempool.c
>>> index cf7ed76..ca6cd9c 100644
>>> --- a/lib/librte_mempool/rte_mempool.c
>>> +++ b/lib/librte_mempool/rte_mempool.c
>>> @@ -440,7 +440,8 @@ rte_mempool_xmem_create(const char *name, unsigned n, 
>>> unsigned elt_size,
>>> mempool_list = RTE_TAILQ_CAST(rte_mempool_tailq.head, rte_mempool_list);
>>>
>>> /* asked cache too big */
>>> -   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE) {
>>> +   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
>>> +   (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n) {
>>> rte_errno = EINVAL;
>>> return NULL;
>>> }
>
> Why not just 'cache_size > n' then?

The commit message says: "Otherwise cache_flushthresh can be bigger than 
n, and a consumer can starve others by keeping every element either in 
use or in the cache."
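
A worked example of that starvation; the numbers are hypothetical, but
they pass the old check, which only compared cache_size against
RTE_MEMPOOL_CACHE_MAX_SIZE:

#include <stdio.h>
#include <stdint.h>

#define CACHE_FLUSHTHRESH_MULTIPLIER 1.5

int main(void)
{
        unsigned n = 128;          /* total elements in the pool */
        unsigned cache_size = 100; /* requested per-lcore cache  */
        uint32_t flushthresh =
                (uint32_t)(cache_size * CACHE_FLUSHTHRESH_MULTIPLIER);

        /* flushthresh (150) > n (128): a single lcore's cache can absorb
         * every element and never flush back to the common pool */
        printf("flushthresh=%u n=%u starvation %s\n", flushthresh, n,
               flushthresh > n ? "possible" : "not possible");
        return 0;
}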

> Konstantin
>
>>> diff --git a/lib/librte_mempool/rte_mempool.h 
>>> b/lib/librte_mempool/rte_mempool.h
>>> index 9001312..a4a9610 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -468,7 +468,7 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, 
>>> void *);
>>> *   If cache_size is non-zero, the rte_mempool library will try to
>>> *   limit the accesses to the common lockless pool, by maintaining a
>>> *   per-lcore object cache. This argument must be lower or equal to
>>> - *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE. It is advised to choose
>>> + *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to choose
>>> *   cache_size to have "n modulo cache_size == 0": if this is
>>> *   not the case, some elements will always stay in the pool and will
>>> *   never be used. The access to the per-lcore table is of course
>>>


[dpdk-dev] [PATCH] mempool: limit cache_size

2015-05-18 Thread Zoltan Kiss
Hi,

Any opinion on this patch?

Regards,

Zoltan

On 13/05/15 19:59, Zoltan Kiss wrote:
> Otherwise cache_flushthresh can be bigger than n, and
> a consumer can starve others by keeping every element
> either in use or in the cache.
>
> Signed-off-by: Zoltan Kiss 
> ---
>   lib/librte_mempool/rte_mempool.c | 3 ++-
>   lib/librte_mempool/rte_mempool.h | 2 +-
>   2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/lib/librte_mempool/rte_mempool.c 
> b/lib/librte_mempool/rte_mempool.c
> index cf7ed76..ca6cd9c 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -440,7 +440,8 @@ rte_mempool_xmem_create(const char *name, unsigned n, 
> unsigned elt_size,
>   mempool_list = RTE_TAILQ_CAST(rte_mempool_tailq.head, rte_mempool_list);
>
>   /* asked cache too big */
> - if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE) {
> + if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
> + (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n) {
>   rte_errno = EINVAL;
>   return NULL;
>   }
> diff --git a/lib/librte_mempool/rte_mempool.h 
> b/lib/librte_mempool/rte_mempool.h
> index 9001312..a4a9610 100644
> --- a/lib/librte_mempool/rte_mempool.h
> +++ b/lib/librte_mempool/rte_mempool.h
> @@ -468,7 +468,7 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, 
> void *);
>*   If cache_size is non-zero, the rte_mempool library will try to
>*   limit the accesses to the common lockless pool, by maintaining a
>*   per-lcore object cache. This argument must be lower or equal to
> - *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE. It is advised to choose
> + *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to choose
>*   cache_size to have "n modulo cache_size == 0": if this is
>*   not the case, some elements will always stay in the pool and will
>*   never be used. The access to the per-lcore table is of course
>


[dpdk-dev] [PATCH] mempool: limit cache_size

2015-05-13 Thread Zoltan Kiss
Otherwise cache_flushthresh can be bigger than n, and
a consumer can starve others by keeping every element
either in use or in the cache.

Signed-off-by: Zoltan Kiss 
---
 lib/librte_mempool/rte_mempool.c | 3 ++-
 lib/librte_mempool/rte_mempool.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index cf7ed76..ca6cd9c 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -440,7 +440,8 @@ rte_mempool_xmem_create(const char *name, unsigned n, 
unsigned elt_size,
mempool_list = RTE_TAILQ_CAST(rte_mempool_tailq.head, rte_mempool_list);

/* asked cache too big */
-   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE) {
+   if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
+   (uint32_t) cache_size * CACHE_FLUSHTHRESH_MULTIPLIER > n) {
rte_errno = EINVAL;
return NULL;
}
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 9001312..a4a9610 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -468,7 +468,7 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, 
void *);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE. It is advised to choose
+ *   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to choose
  *   cache_size to have "n modulo cache_size == 0": if this is
  *   not the case, some elements will always stay in the pool and will
  *   never be used. The access to the per-lcore table is of course
-- 
1.9.1



[dpdk-dev] data copy in vhost-user

2015-04-28 Thread Zoltan Kiss


On 28/04/15 02:22, Xie, Huawei wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Tuesday, April 28, 2015 12:27 AM
>> To: Nikita Kalyazin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] data copy in vhost-user
>>
>>
>>
>> On 27/04/15 12:54, Nikita Kalyazin wrote:
>>> Hi,
>>>
>>>
>>> As far as I understand, DPDK vhost-user implementation requires data copy
>> for either RX or TX (rte_vhost_dequeue_burst() and
>> rte_vhost_enqueue_burst()). It means that two data copies are needed to
>> transfer a packet from one VM to another.
>>>
>> Why is it not possible to eliminate one of the copies (e.g.,
>> rte_vhost_enqueue_burst() might set up a reference at vring descriptor to
>> mbuf's data rather than copying the data)?
> This had been added to the to-do list. We could delay the copy until the real 
> copy is needed.
>>
>> I'm just guessing, but in the case of VM-to-VM traffic the receiving one
>> could hold onto the buffer indefinitely, preventing the sender from reusing
>> the buffer. That could lead to a DoS in some cases, and shutting down
>> the sender would also be tricky. At least in the case of Xen
>> netback/netfront that's the reason. A reasonable solution for this
>> problem is to make sure the buffer is swapped out with a copy after a
>> finite time.
> Do you mean we associate a timeout with the buffer?
Yes, I think xen-netback had such a version once, but it was removed. As 
far as I know the overhead and complexity of handling these timeouts 
were too severe.
I might be wrong about this; I don't know whether this problem applies here 
as well or not.

>>
>> Regards,
>>
>> Zoltan


[dpdk-dev] [PATCH v6 00/13] mbuf: enhancements of mbuf clones

2015-04-28 Thread Zoltan Kiss


On 27/04/15 18:38, Thomas Monjalon wrote:
> 2015-04-24 11:38, Zoltan Kiss:
>> On 22/04/15 12:59, Ananyev, Konstantin wrote:
>>> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
>>> Sent: Wednesday, April 22, 2015 10:57 AM
>>>> The first objective of this series is to fix the support of indirect
>>>> mbufs when the application reserves a private area in mbufs. It also
>>>> removes the limitation that rte_pktmbuf_clone() is only allowed on
>>>> direct (non-cloned) mbufs. The series also contains some enhancements
>>>> and fixes in the mbuf area that makes the implementation of the
>>>> last patches easier.
>>>
>>> Acked-by: Konstantin Ananyev 
>>
>> When does this series get merged?
>
> It was acked on April 22 and your question was 2 days later on April 24.
> Does it mean you are expecting it to be merged the day it is acked?

I was just curious about when to expect it, so I can plan to do some 
further work based on it, but nothing pressing.

Regards,

Zoltan

> Or do you fear the merging because of a local dev you are working on?
> Anyway, everybody seems happy with this version so it's going to be merged.
>


[dpdk-dev] data copy in vhost-user

2015-04-27 Thread Zoltan Kiss


On 27/04/15 12:54, Nikita Kalyazin wrote:
> Hi,
>
>
> As far as I understand, DPDK vhost-user implementation requires data copy for 
> either RX or TX (rte_vhost_dequeue_burst() and rte_vhost_enqueue_burst()). It 
> means that two data copies are needed to transfer a packet from one VM to 
> another.
>
> Why is it not possible to eliminate one of the copies (e.g., 
> rte_vhost_enqueue_burst() might set up a reference at vring descriptor to 
> mbuf's data rather than copying the data)?

I'm just guessing, but in the case of VM-to-VM traffic the receiving one 
could hold onto the buffer indefinitely, preventing the sender from reusing 
the buffer. That could lead to a DoS in some cases, and shutting down 
the sender would also be tricky. At least in the case of Xen 
netback/netfront that's the reason. A reasonable solution for this 
problem is to make sure the buffer is swapped out with a copy after a 
finite time.

Regards,

Zoltan


[dpdk-dev] [PATCH v6 00/13] mbuf: enhancements of mbuf clones

2015-04-24 Thread Zoltan Kiss
Hi,

On 22/04/15 12:59, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
>> Sent: Wednesday, April 22, 2015 10:57 AM
>> To: dev at dpdk.org
>> Cc: Ananyev, Konstantin; zoltan.kiss at linaro.org; Richardson, Bruce; 
>> nhorman at tuxdriver.com; olivier.matz at 6wind.com
>> Subject: [PATCH v6 00/13] mbuf: enhancements of mbuf clones
>>
>> The first objective of this series is to fix the support of indirect
>> mbufs when the application reserves a private area in mbufs. It also
>> removes the limitation that rte_pktmbuf_clone() is only allowed on
>> direct (non-cloned) mbufs. The series also contains some enhancements
>> and fixes in the mbuf area that makes the implementation of the
>> last patches easier.
>>
>
> Acked-by: Konstantin Ananyev 

When does this series get merged?

Regards,

Zoltan


[dpdk-dev] mempool deleting and cache_size

2015-04-17 Thread Zoltan Kiss
Hi,

On 15/04/15 20:15, Zoltan Kiss wrote:
> Hi,
>
> I have two questions regarding mempools:
>
> - the first is trivial: how do you delete them? Can you? I can't see a
> function to do that, and none of the examples do such a thing. When
> exactly it get deleted?
> - during creation, cache_size has one requirement: it has to be smaller
> than RTE_MEMPOOL_CACHE_MAX_SIZE. And one recommendation: "n modulo
> cache_size == 0". Is there any more guideline to determine that number?
> E.g. now I'm using the biggest number which fits the above two conditions.
>
> Regards,
>
> Zoltan


Thanks for all the answers to the first one, but does anyone have an 
idea for the second one?

Zoli


[dpdk-dev] [PATCH v3 1/5] mbuf: fix clone support when application uses private mbuf data

2015-04-02 Thread Zoltan Kiss


On 31/03/15 20:23, Olivier Matz wrote:
> From: Olivier Matz 
>
> Add a new private_size field in mbuf structure that should
> be initialized at mbuf pool creation. This field contains the
> size of the application private data in mbufs.
>
> Introduce new static inline functions rte_mbuf_from_indirect()
> and rte_mbuf_to_baddr() to replace the existing macros, which
> take the private size in account when attaching and detaching
> mbufs.
>
> Signed-off-by: Olivier Matz 
Reviewed-by: Zoltan Kiss 

I assume the rest of the series hasn't changed apart from occasional 
rebasing; I've reviewed them earlier.

> ---
>   app/test-pmd/testpmd.c |  1 +
>   examples/vhost/main.c  |  4 +--
>   lib/librte_mbuf/rte_mbuf.c |  1 +
>   lib/librte_mbuf/rte_mbuf.h | 77 
> +++---
>   4 files changed, 63 insertions(+), 20 deletions(-)
>
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 3057791..c5a195a 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -425,6 +425,7 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
>   mb->tx_offload   = 0;
>   mb->vlan_tci = 0;
>   mb->hash.rss = 0;
> + mb->priv_size= 0;
>   }
>
>   static void
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index c3fcb80..e44e82f 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -139,7 +139,7 @@
>   /* Number of descriptors per cacheline. */
>   #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
>
> -#define MBUF_EXT_MEM(mb)   (RTE_MBUF_FROM_BADDR((mb)->buf_addr) != (mb))
> +#define MBUF_EXT_MEM(mb)   (rte_mbuf_from_indirect(mb) != (mb))
>
>   /* mask of enabled ports */
>   static uint32_t enabled_port_mask = 0;
> @@ -1550,7 +1550,7 @@ attach_rxmbuf_zcp(struct virtio_net *dev)
>   static inline void pktmbuf_detach_zcp(struct rte_mbuf *m)
>   {
>   const struct rte_mempool *mp = m->pool;
> - void *buf = RTE_MBUF_TO_BADDR(m);
> + void *buf = rte_mbuf_to_baddr(m);
>   uint32_t buf_ofs;
>   uint32_t buf_len = mp->elt_size - sizeof(*m);
>   m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof(*m);
> diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
> index 526b18d..e095999 100644
> --- a/lib/librte_mbuf/rte_mbuf.c
> +++ b/lib/librte_mbuf/rte_mbuf.c
> @@ -125,6 +125,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
>   m->pool = mp;
>   m->nb_segs = 1;
>   m->port = 0xff;
> + m->priv_size = 0;
>   }
>
>   /* do some sanity checks on a mbuf: panic if it fails */
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 17ba791..932fe58 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -317,18 +317,51 @@ struct rte_mbuf {
>   /* uint64_t unused:8; */
>   };
>   };
> +
> + /** Size of the application private data. In case of an indirect
> +  * mbuf, it stores the direct mbuf private data size. */
> + uint16_t priv_size;
>   } __rte_cache_aligned;
>
>   /**
> - * Given the buf_addr returns the pointer to corresponding mbuf.
> + * Return the mbuf owning the data buffer address of an indirect mbuf.
> + *
> + * @param mi
> + *   The pointer to the indirect mbuf.
> + * @return
> + *   The address of the direct mbuf corresponding to buffer_addr.
>*/
> -#define RTE_MBUF_FROM_BADDR(ba) (((struct rte_mbuf *)(ba)) - 1)
> +static inline struct rte_mbuf *
> +rte_mbuf_from_indirect(struct rte_mbuf *mi)
> +{
> +   struct rte_mbuf *md;
> +
> +   /* mi->buf_addr and mi->priv_size correspond to buffer and
> + * private size of the direct mbuf */
> +   md = (struct rte_mbuf *)((char *)mi->buf_addr - sizeof(*mi) -
> +mi->priv_size);
> +   return md;
> +}
>
>   /**
> - * Given the pointer to mbuf returns an address where it's  buf_addr
> - * should point to.
> + * Return the buffer address embedded in the given mbuf.
> + *
> + * The user must ensure that m->priv_size corresponds to the
> + * private size of this mbuf, which is not the case for indirect
> + * mbufs.
> + *
> + * @param md
> + *   The pointer to the mbuf.
> + * @return
> + *   The address of the data buffer owned by the mbuf.
>*/
> -#define RTE_MBUF_TO_BADDR(mb)   (((struct rte_mbuf *)(mb)) + 1)
> +static inline char *
> +rte_mbuf_to_baddr(struct rte_mbuf *md)
> +{
> +   char *buffer_addr;
> +   buffer_addr = (char *)md + sizeof(*md) + md->priv_size;
> +   return buffer_addr;
> +}
>
>   /*

[dpdk-dev] [PATCH] mbuf: add comment explaining confusing code

2015-03-31 Thread Zoltan Kiss
Hi,

On 30/03/15 18:39, Don Provan wrote:
>   if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>   likely (rte_mbuf_refcnt_update(m, -1) == 0))
>
> In all the debate about atomics, I don't think anyone got around to pointing 
> out that in the unlikely case that the refcnt is not 1, then it's equally 
> unlikely that decrementing it will result in 0 despite the code's claim to 
> the contrary. That's the part that confused me. Would it make sense to fix 
> this while adding the comment?
> -don
> dprovan at bivio.net
>

I was thinking about that too: either remove it or turn it into 
"unlikely". Currently it suggests that "if there is more than one user, 
they are likely to release at the same time". If that's not true, we 
should remove it, but as Don said, it would hardly make a difference in 
real-world cases, as more than one user is not really a hot use case, AFAIK.

Regards,

Zoltan


[dpdk-dev] [PATCH v2 1/5] mbuf: fix clone support when application uses private mbuf data

2015-03-27 Thread Zoltan Kiss


On 27/03/15 15:17, Olivier MATZ wrote:
> Hi Konstantin,
>
> On 03/27/2015 03:25 PM, Ananyev, Konstantin wrote:
>> Hi Olivier,
>>
>>> -----Original Message-----
>>> From: Olivier MATZ [mailto:olivier.matz at 6wind.com]
>>> Sent: Friday, March 27, 2015 1:56 PM
>>> To: Ananyev, Konstantin; dev at dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH v2 1/5] mbuf: fix clone support when 
>>> application uses private mbuf data
>>>
>>> Hi Konstantin,
>>>
>>> On 03/27/2015 10:07 AM, Olivier MATZ wrote:
> I think that to support ability to setup priv_size on a mempool basis,
> and reserve private space between struct rte_mbuf and rte_mbuf. buf_addr,
> we need to:
>
> 1. Store priv_size both inside the mempool and inside the mbuf.
>
> 2. rte_pktmbuf_attach() should change the value of priv_size to the 
> priv_size of the direct mbuf we are going to attach to:
> rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md) {... 
> mi->priv_size = md->priv_size; ...}
>
> 3. rte_pktmbuf_detach() should restore original value of mbuf's priv_size:
> rte_pktmbuf_detach(struct rte_mbuf *m)
> {
>...
> m->priv_size = rte_mempool_get_privsize(m->pool);
> m->buf _addr= rte_mbuf_to_baddr(m);
> ...
> }
>
> Also I think we need to provide a way to specify priv_size for all mbufs 
> of the mempool at init time:
> - either force people to specify it at rte_mempool_create() time 
> (probably use init_arg for that),
> - or provide separate function that could be called straight after 
> rte_mempool_create() , that
> would setup priv_size for the  pool and for all its mbufs.
> - or some sort of combination of these 2 approaches - introduce a wrapper 
> function
> (rte_mbuf_pool_create() or something) that would take priv_size as 
> parameter,
> create a new mempool and then setup priv_size.
>>>
>>> I though a bit more about this solution, and I realized that doing
>>> mi->priv_size = md->priv_size in rte_pktmbuf_attach() is probably not
>>> a good idea, as there is no guarantee that the size of mi is large enough
>>> to store the priv of md.
>>>
>>> Having the same priv_size for mi and md is maybe a good constraint.
>>> I can add this in the API comments and assertions in the code to
>>> check this condition, what do you think?
>>
>> Probably we have different concepts in mind of what an mbuf's private
>> space is.
>>  From my point of view, even an indirect buffer should use its own private
>> space and leave the contents of the direct mbuf it is attached to unmodified.
>> After attach() operation, only contents of the buffer are shared between 
>> mbufs,
>> but not the mbuf's metadata.
>
> Sorry if it was not clear in my previous messages, but I agree
> with your description. When attaching a mbuf, only data, not
> metadata should be shared.
>
> In the solution you are suggesting (quoted above), you say we need
> to set mi->priv_size to md->priv_size in rte_pktmbuf_attach(). I felt
> this was not possible, but it depends on the meaning we give to
> priv_size:
>
> 1. If the meaning is "the size of the private data embedded in this
> mbuf", which is the most logical meaning, we cannot do this
> assignment
>
> 2. If the meaning is "the size of the private data embedded in the
> mbuf the buf_addr is pointing to" (which is harder to get), the
> assignment makes sense.
>
>  From what I understand, you feel we should use 2/ as priv_size
> definition. Is it correct?
>
> In my previous message, the definition of m->priv_size was 1/,
> so that's why I felt assigning mi->priv_size to md->priv_size was
> not possible.
>
> I agree 2/ is probably a good choice, as it would allow to attach
> to a mbuf with a different priv_size. It may require some additional
> comments above the field in the structure to explain that.
I think we need to document very well in the comments that you can 
attach mbufs to each other with different private area sizes, and 
applications should care about this difference. And we should provide a 
macro to get the private area size, which would read rte_mbuf.mp->priv_size.
Actually we should give a better name to rte_mbuf.priv_size; it's a 
bit misleading now. Maybe direct_priv_size?
>
>
>> Otherwise on detach(), you'll have to copy contents of private space back, 
>> from direct to indirect mbuf?
>> Again how to deal with the case, when 2 or more mbufs will attach to the 
>> same direct one?
>>
>> So let say, if we'll have a macro:
>>
>> #define RTE_MBUF_PRIV_PTR(mb) ((void *)(((struct rte_mbuf *)(mb)) + 1))
>>
>> No matter whether mb is a direct or an indirect mbuf.
>> Do you have something else in mind here?
>
> I completely agree with this macro. We should consider the private data
> as an extension of the mbuf structure.
>
>
 Introducing rte_mbuf_pool_create() seems a good idea to me, it
 would hide 'rte_pktmbuf_pool_private' from the user and force
 to initialize all the 

[dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free

2015-03-26 Thread Zoltan Kiss
The current way is not the most efficient: if m->refcnt is 1, the second
condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
condition fails again, although the code suggests otherwise to branch
prediction. Instead we should keep the second condition only, and remove the
duplicate set to zero.

Signed-off-by: Zoltan Kiss 
---
 lib/librte_mbuf/rte_mbuf.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 17ba791..3ec4024 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
__rte_mbuf_sanity_check(m, 0);

-   if (likely (rte_mbuf_refcnt_read(m) == 1) ||
-   likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
-
-   rte_mbuf_refcnt_set(m, 0);
+   if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {

/* if this is an indirect mbuf, then
 *  - detach mbuf
-- 
1.9.1



[dpdk-dev] [PATCH] examples/vhost: use library routines instead of local copies

2015-03-26 Thread Zoltan Kiss


On 26/03/15 17:34, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Thursday, March 26, 2015 4:46 PM
>> To: Ananyev, Konstantin; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] examples/vhost: use library routines instead 
>> of local copies
>>
>>
>>
>> On 26/03/15 01:20, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>>> Sent: Wednesday, March 25, 2015 6:43 PM
>>>> To: dev at dpdk.org
>>>> Cc: Zoltan Kiss
>>>> Subject: [dpdk-dev] [PATCH] examples/vhost: use library routines instead 
>>>> of local copies
>>>>
>>>> This macro and function were copies from the mbuf library, no reason to 
>>>> keep
>>>> them.
>>>
>>> NACK
>>> You can't use RTE_MBUF_INDIRECT macro here.
>>> If you look at the vhost code carefully, you'll realise that we don't use
>>> the standard rte_pktmbuf_attach() here,
>>> as we attach the mbuf not to another mbuf but to an external memory buffer
>>> passed to us by the virtio device.
>>> Look at attach_rxmbuf_zcp().
>> Yes, I think the proper fix is to set the flag in attach_rxmbuf_zcp()
>> and virtio_tx_route_zcp(), then you can use the library macro here.
>
> No, it is not.
> The IND_ATTACHED_MBUF flag indicates that the mbuf is attached to another mbuf,
> and __rte_pktmbuf_prefree_seg()
> would try to do an mbuf detach.
> We definitely don't want to set IND_ATTACHED_MBUF here.
I see. Quite confusing how vhost reuses some library code to do something 
slightly different.

> I think there is no need to fix anything here.
>
> Konstantin
>
>>
>>> Though I suppose we can replace pktmbuf_detach_zcp() with
>>> rte_pktmbuf_detach() - they are doing identical things.
>> Yes, the only difference is that the latter does "m->ol_flags = 0" as well.
>>
>>> BTW, I wonder, did you ever test your patch?
>> Indeed I did not, shame on me. I don't have a KVM setup at hand. This
>> fix was born as a side effect of the cleanup in the library,
>> and I'm afraid I don't have the time right now to create a KVM setup.
>> Could anyone who has it at hand help out to run a quick test? (for the
>> v2 of this patch, which I'll send in shortly)
>
>
>>
>> Regards,
>>
>> Zoltan
>>
>>> My guess is it would cause vhost with '--zero-copy' to crash or corrupt the
>>> packets straight away.
>>>
>>> Konstantin
>>>
>>>>
>>>> Signed-off-by: Zoltan Kiss 
>>>> ---
>>>>examples/vhost/main.c | 38 +-
>>>>1 file changed, 5 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
>>>> index c3fcb80..1c998a5 100644
>>>> --- a/examples/vhost/main.c
>>>> +++ b/examples/vhost/main.c
>>>> @@ -139,8 +139,6 @@
>>>>/* Number of descriptors per cacheline. */
>>>>#define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct 
>>>> vring_desc))
>>>>
>>>> -#define MBUF_EXT_MEM(mb)   (RTE_MBUF_FROM_BADDR((mb)->buf_addr) != (mb))
>>>> -
>>>>/* mask of enabled ports */
>>>>static uint32_t enabled_port_mask = 0;
>>>>
>>>> @@ -1538,32 +1536,6 @@ attach_rxmbuf_zcp(struct virtio_net *dev)
>>>>return;
>>>>}
>>>>
>>>> -/*
>>>> - * Detach an attched packet mbuf -
>>>> - *  - restore original mbuf address and length values.
>>>> - *  - reset pktmbuf data and data_len to their default values.
>>>> - *  All other fields of the given packet mbuf will be left intact.
>>>> - *
>>>> - * @param m
>>>> - *   The attached packet mbuf.
>>>> - */
>>>> -static inline void pktmbuf_detach_zcp(struct rte_mbuf *m)
>>>> -{
>>>> -  const struct rte_mempool *mp = m->pool;
>>>> -  void *buf = RTE_MBUF_TO_BADDR(m);
>>>> -  uint32_t buf_ofs;
>>>> -  uint32_t buf_len = mp->elt_size - sizeof(*m);
>>>> -  m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof(*m);
>>>> -
>>>> -  m->buf_addr = buf;
>>>> -  m->buf_len = (uint16_t)buf_len;
>>>> -
>>>> -  b

[dpdk-dev] [PATCH v2 1/5] mbuf: fix clone support when application uses private mbuf data

2015-03-26 Thread Zoltan Kiss


On 26/03/15 15:59, Olivier Matz wrote:
> Add a new private_size field in the mbuf structure that should
> be initialized at mbuf pool creation. This field contains the
> size of the application private data in mbufs.
>
> Introduce new static inline functions rte_mbuf_from_indirect()
> and rte_mbuf_to_baddr() to replace the existing macros, which
> take the private size into account when attaching and detaching
> mbufs.
>
> Signed-off-by: Olivier Matz 
> ---
>   app/test-pmd/testpmd.c |  1 +
>   examples/vhost/main.c  |  2 +-
>   lib/librte_mbuf/rte_mbuf.c |  1 +
>   lib/librte_mbuf/rte_mbuf.h | 44 ++--
>   4 files changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 3057791..c5a195a 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -425,6 +425,7 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
>   mb->tx_offload   = 0;
>   mb->vlan_tci = 0;
>   mb->hash.rss = 0;
> + mb->priv_size= 0;
>   }
>
>   static void
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index c3fcb80..d542461 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -139,7 +139,7 @@
>   /* Number of descriptors per cacheline. */
>   #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
>
> -#define MBUF_EXT_MEM(mb)   (RTE_MBUF_FROM_BADDR((mb)->buf_addr) != (mb))
> +#define MBUF_EXT_MEM(mb)   (rte_mbuf_from_indirect(mb) != (mb))
>
>   /* mask of enabled ports */
>   static uint32_t enabled_port_mask = 0;
You would need to fix pktmbuf_detach_zcp as well, but see my reply to 
Konstantin.

> diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
> index 526b18d..e095999 100644
> --- a/lib/librte_mbuf/rte_mbuf.c
> +++ b/lib/librte_mbuf/rte_mbuf.c
> @@ -125,6 +125,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
>   m->pool = mp;
>   m->nb_segs = 1;
>   m->port = 0xff;
> + m->priv_size = 0;
>   }
>
>   /* do some sanity checks on a mbuf: panic if it fails */
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 17ba791..45ac948 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -317,18 +317,42 @@ struct rte_mbuf {
>   /* uint64_t unused:8; */
>   };
>   };
> +
> + uint16_t priv_size;   /**< size of the application private data */
>   } __rte_cache_aligned;
>
>   /**
> - * Given the buf_addr returns the pointer to corresponding mbuf.
> + * Return the mbuf owning the data buffer address of an indirect mbuf.
> + *
> + * @param mi
> + *   The pointer to the indirect mbuf.
> + * @return
> + *   The address of the direct mbuf corresponding to buffer_addr.
>*/
> -#define RTE_MBUF_FROM_BADDR(ba) (((struct rte_mbuf *)(ba)) - 1)
> +static inline struct rte_mbuf *
> +rte_mbuf_from_indirect(struct rte_mbuf *mi)
> +{
> +   struct rte_mbuf *md;
> +   md = (struct rte_mbuf *)((char *)mi->buf_addr - sizeof(*mi) -
> +mi->priv_size);
> +   return md;
> +}
>
>   /**
> - * Given the pointer to mbuf returns an address where it's  buf_addr
> - * should point to.
> + * Return the buffer address embedded in the given mbuf.
> + *
> + * @param md
> + *   The pointer to the mbuf.
> + * @return
> + *   The address of the data buffer owned by the mbuf.
>*/
> -#define RTE_MBUF_TO_BADDR(mb)   (((struct rte_mbuf *)(mb)) + 1)
> +static inline char *
> +rte_mbuf_to_baddr(struct rte_mbuf *md)
> +{
> +   char *buffer_addr;
> +   buffer_addr = (char *)md + sizeof(*md) + md->priv_size;
> +   return buffer_addr;
> +}
>
>   /**
>* Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> @@ -744,12 +768,12 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf 
> *mi, struct rte_mbuf *md)
>   static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>   {
>   const struct rte_mempool *mp = m->pool;
> - void *buf = RTE_MBUF_TO_BADDR(m);
> - uint32_t buf_len = mp->elt_size - sizeof(*m);
> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof (*m);
> + void *buf = rte_mbuf_to_baddr(m);
> + unsigned mhdr_size = (char *)buf - (char *)m;
Minor nit: I think "sizeof(*m) + m->priv_size" would be much clearer,
like you did above. In fact, this might be worth its own inline function,
and then you could substitute mhdr_size below.
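
Something like this, perhaps (an untested sketch; the name rte_mbuf_hdr_size()
is my suggestion, not part of the series):

	/* Size of the mbuf header: struct rte_mbuf itself plus the
	 * application private data that immediately follows it. */
	static inline unsigned
	rte_mbuf_hdr_size(const struct rte_mbuf *m)
	{
		return sizeof(*m) + m->priv_size;
	}

Then both calculations below could use it:

	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + rte_mbuf_hdr_size(m);
	m->buf_len = (uint16_t)(mp->elt_size - rte_mbuf_hdr_size(m));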

>
> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + mhdr_size;
>   m->buf_addr = buf;
> - m->buf_len = (uint16_t)buf_len;
> + m->buf_len = (uint16_t)(mp->elt_size - mhdr_size);
>
>   m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
>   RTE_PKTMBUF_HEADROOM : m->buf_len;
> @@ -774,7 +798,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
>*  - free attached mbuf segment
>*/
>   if (RTE_MBUF_INDIRECT(m)) {
> - struct rte_mbuf *md = 

[dpdk-dev] [PATCH] examples/vhost: use library routines instead of local copies

2015-03-26 Thread Zoltan Kiss


On 26/03/15 01:20, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Wednesday, March 25, 2015 6:43 PM
>> To: dev at dpdk.org
>> Cc: Zoltan Kiss
>> Subject: [dpdk-dev] [PATCH] examples/vhost: use library routines instead of 
>> local copies
>>
>> This macro and function were copied from the mbuf library; there is no
>> reason to keep them.
>
> NACK
> You can't use RTE_MBUF_INDIRECT macro here.
> If you look at the vhost code carefully, you'll realise that we don't use
> the standard rte_pktmbuf_attach() here:
> we attach the mbuf not to another mbuf but to an external memory buffer
> passed to us by the virtio device.
> Look at attach_rxmbuf_zcp().
Yes, I think the proper fix is to set the flag in attach_rxmbuf_zcp() 
and virtio_tx_route_zcp(), then you can use the library macro here.

> Though I suppose we can replace pktmbuf_detach_zcp() with
> rte_pktmbuf_detach() - they do identical things.
Yes, the only difference is that the latter does "m->ol_flags = 0" as well.

> BTW, I wonder, did you ever test your patch?
Indeed I did not, shame on me. I don't have a KVM setup at hand. This
fix was born as a side effect of the cleanup in the library,
and I'm afraid I don't have the time right now to create a KVM setup.
Could anyone who has one at hand help out by running a quick test? (for the
v2 of this patch, which I'll send in shortly)

Regards,

Zoltan

> My guess is it would cause vhost with '--zero-copy' to crash or corrupt the
> packets straight away.
>
> Konstantin
>
>>
>> Signed-off-by: Zoltan Kiss 
>> ---
>>   examples/vhost/main.c | 38 +-
>>   1 file changed, 5 insertions(+), 33 deletions(-)
>>
>> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
>> index c3fcb80..1c998a5 100644
>> --- a/examples/vhost/main.c
>> +++ b/examples/vhost/main.c
>> @@ -139,8 +139,6 @@
>>   /* Number of descriptors per cacheline. */
>>   #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct 
>> vring_desc))
>>
>> -#define MBUF_EXT_MEM(mb)   (RTE_MBUF_FROM_BADDR((mb)->buf_addr) != (mb))
>> -
>>   /* mask of enabled ports */
>>   static uint32_t enabled_port_mask = 0;
>>
>> @@ -1538,32 +1536,6 @@ attach_rxmbuf_zcp(struct virtio_net *dev)
>>  return;
>>   }
>>
>> -/*
>> - * Detach an attched packet mbuf -
>> - *  - restore original mbuf address and length values.
>> - *  - reset pktmbuf data and data_len to their default values.
>> - *  All other fields of the given packet mbuf will be left intact.
>> - *
>> - * @param m
>> - *   The attached packet mbuf.
>> - */
>> -static inline void pktmbuf_detach_zcp(struct rte_mbuf *m)
>> -{
>> -const struct rte_mempool *mp = m->pool;
>> -void *buf = RTE_MBUF_TO_BADDR(m);
>> -uint32_t buf_ofs;
>> -uint32_t buf_len = mp->elt_size - sizeof(*m);
>> -m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof(*m);
>> -
>> -m->buf_addr = buf;
>> -m->buf_len = (uint16_t)buf_len;
>> -
>> -buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
>> -RTE_PKTMBUF_HEADROOM : m->buf_len;
>> -m->data_off = buf_ofs;
>> -
>> -m->data_len = 0;
>> -}
>>
>>   /*
>>* This function is called after packets have been transimited. It fetchs 
>> mbuf
>> @@ -1590,8 +1562,8 @@ txmbuf_clean_zcp(struct virtio_net *dev, struct vpool 
>> *vpool)
>>
>>  for (index = 0; index < mbuf_count; index++) {
>>  mbuf = __rte_mbuf_raw_alloc(vpool->pool);
>> -if (likely(MBUF_EXT_MEM(mbuf)))
>> -pktmbuf_detach_zcp(mbuf);
>> +if (likely(RTE_MBUF_INDIRECT(mbuf)))
>> +rte_pktmbuf_detach(mbuf);
>>  rte_ring_sp_enqueue(vpool->ring, mbuf);
>>
>>  /* Update used index buffer information. */
>> @@ -1653,8 +1625,8 @@ static void mbuf_destroy_zcp(struct vpool *vpool)
>>  for (index = 0; index < mbuf_count; index++) {
>>  mbuf = __rte_mbuf_raw_alloc(vpool->pool);
>>  if (likely(mbuf != NULL)) {
>> -if (likely(MBUF_EXT_MEM(mbuf)))
>> -pktmbuf_detach_zcp(mbuf);
>> +if (likely(RTE_MBUF_INDIRECT(mbuf)))
>> +rte_pktmbuf_detach(mbuf);
>>  rte_ring_sp_enqueue(vpool->ring, (void *)mbuf);
>>  

[dpdk-dev] ovs-dpdk: ofpbuf reinitialization

2015-03-25 Thread Zoltan Kiss
Hi,

Looking around in the OVS-DPDK code I've found that it only initializes the
packet metadata (which contains the struct ofpbuf belonging to the
packet) during setup, in the packet initializer passed to rte_mempool_create().
That means that every time a packet buffer is released back by OVS to
the buffer pool, it retains its ofpbuf state, and that state doesn't change
when the poll mode driver uses the buffer again to store a new packet. The
"source" and "allocated" members of ofpbuf shouldn't change, but "frame",
"l2_pad_size" and the offsets do at various places. Even though I
couldn't establish an error scenario yet, I think it's quite dangerous
to let a packet inherit the previous packet's ofpbuf.
Or am I missing some place where this piece is reinitialized?
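
For illustration, the lifecycle I mean (a sketch; the callback name
ovs_dp_packet_init is made up, the real code installs its own initializer):

	/* The per-object constructor runs exactly once per buffer, at pool
	 * creation time; rte_mempool_get()/put() never call it again, so a
	 * recycled buffer keeps whatever ofpbuf state the last packet left. */
	mp = rte_mempool_create("ovs_mp", nb_mbuf, mbuf_size, 32,
				sizeof(struct rte_pktmbuf_pool_private),
				rte_pktmbuf_pool_init, NULL,
				ovs_dp_packet_init, NULL,	/* runs once */
				socket_id, 0);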

Regards,

Zoltan Kiss


[dpdk-dev] ovs-dpdk: placing the metadata

2015-03-25 Thread Zoltan Kiss
Hi Olivier,

On 25/03/15 17:04, Olivier MATZ wrote:
> Hi Zoltan,
>
> On 03/24/2015 06:42 PM, Zoltan Kiss wrote:
>> Hi,
>>
>> I've noticed in lib/netdev-dpdk.c that __rte_pktmbuf_init() stores the
>> packet metadata right after "struct rte_mbuf", and before the buffer
>> data:
>>
>>  /* start of buffer is just after mbuf structure */
>>  m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>
>> (struct dp_packet has the rte_mbuf as its first member when DPDK is enabled)
>>
>> However, lib/librte_mbuf/rte_mbuf.h seems to codify that the buffer
>> should start right after the rte_mbuf:
>>
>> /**
>>   * Given the buf_addr returns the pointer to corresponding mbuf.
>>   */
>> #define RTE_MBUF_FROM_BADDR(ba) (((struct rte_mbuf *)(ba)) - 1)
>>
>> /**
>>   * Given the pointer to mbuf returns an address where it's  buf_addr
>>   * should point to.
>>   */
>> #define RTE_MBUF_TO_BADDR(mb)   (((struct rte_mbuf *)(mb)) + 1)
>>
>> These macros are used for attaching/detaching mbufs to each other. This
>> is the way the code retrieves the direct buffer from an indirect one,
>> and vice versa. I think if we want to keep the metadata feature (which I
>> guess is quite important), we need to add a pointer to rte_mbuf, which
>> helps the direct and indirect structs to find each other. Something like:
>>
>>  struct rte_mbuf *attach;/**< Points to the other buffer if this
>> one
>>   is (in)direct. Otherwise NULL.  */
>>
>> What do you think?
>
> I've just sent a patch that should fix this issue.
> http://dpdk.org/ml/archives/dev/2015-March/015722.html
>
> Let me know if you have any comment on it.

I have some comments for the first patch:

> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index c3fcb80..050f3ac 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
I've sent a separate patch for this file; I think it's just easier to
ditch the old copy-pasted code - see "[PATCH] examples/vhost: use library
routines instead of local copies".

> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 17ba791..4ced6d3 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -268,7 +268,7 @@ struct rte_mbuf {
>   uint16_t data_len;/**< Amount of data in segment buffer. */
>   uint32_t pkt_len; /**< Total pkt len: sum of all segments. */
>   uint16_t vlan_tci;/**< VLAN Tag Control Identifier (CPU order) 
> */
> - uint16_t reserved;
> + uint16_t priv_size;   /**< size of the application private data */
>   union {
>   uint32_t rss; /**< RSS hash result if RSS enabled */
>   struct {
> @@ -320,15 +320,38 @@ struct rte_mbuf {
>  } __rte_cache_aligned;
>
>  /**
> - * Given the buf_addr returns the pointer to corresponding mbuf.
> + * Return the mbuf owning the given data buffer address.
> + *
> + * @param mi
> + *   The pointer to the indirect mbuf.
> + * @param buffer_addr
> + *   The address of the data buffer of the direct mbuf.
You don't need this parameter, it's mi->buf_addr.

> @@ -744,9 +767,11 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf 
> *mi, struct rte_mbuf *md)
>  static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  {
>   const struct rte_mempool *mp = m->pool;
> - void *buf = RTE_MBUF_TO_BADDR(m);
> + void *buf = rte_mbuf_to_baddr(m);
>   uint32_t buf_len = mp->elt_size - sizeof(*m);
I don't see any reason to keep buf and buf_len; just assign straight to
m->buf_addr and m->buf_len.
Besides that, you need to subtract m->priv_size from buf_len.

> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof (*m);
> +
> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof (*m) +
> + m->priv_size;
>
>   m->buf_addr = buf;
>   m->buf_len = (uint16_t)buf_len;
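
Folding the two comments above in, the detach could read (an untested sketch
on top of this version, not the final v2):

	static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
	{
		const struct rte_mempool *mp = m->pool;
		unsigned mhdr_size = sizeof(*m) + m->priv_size;

		m->buf_physaddr = rte_mempool_virt2phy(mp, m) + mhdr_size;
		m->buf_addr = (char *)m + mhdr_size;
		m->buf_len = (uint16_t)(mp->elt_size - mhdr_size);
		/* data_off/data_len reset unchanged */
	}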

The rest of the series looks good,

Reviewed-by: Zoltan Kiss 


[dpdk-dev] [PATCH] examples/vhost: use library routines instead of local copies

2015-03-25 Thread Zoltan Kiss
This macro and function were copied from the mbuf library; there is no reason
to keep them.

Signed-off-by: Zoltan Kiss 
---
 examples/vhost/main.c | 38 +-
 1 file changed, 5 insertions(+), 33 deletions(-)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index c3fcb80..1c998a5 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -139,8 +139,6 @@
 /* Number of descriptors per cacheline. */
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))

-#define MBUF_EXT_MEM(mb)   (RTE_MBUF_FROM_BADDR((mb)->buf_addr) != (mb))
-
 /* mask of enabled ports */
 static uint32_t enabled_port_mask = 0;

@@ -1538,32 +1536,6 @@ attach_rxmbuf_zcp(struct virtio_net *dev)
return;
 }

-/*
- * Detach an attched packet mbuf -
- *  - restore original mbuf address and length values.
- *  - reset pktmbuf data and data_len to their default values.
- *  All other fields of the given packet mbuf will be left intact.
- *
- * @param m
- *   The attached packet mbuf.
- */
-static inline void pktmbuf_detach_zcp(struct rte_mbuf *m)
-{
-   const struct rte_mempool *mp = m->pool;
-   void *buf = RTE_MBUF_TO_BADDR(m);
-   uint32_t buf_ofs;
-   uint32_t buf_len = mp->elt_size - sizeof(*m);
-   m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof(*m);
-
-   m->buf_addr = buf;
-   m->buf_len = (uint16_t)buf_len;
-
-   buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
-   RTE_PKTMBUF_HEADROOM : m->buf_len;
-   m->data_off = buf_ofs;
-
-   m->data_len = 0;
-}

 /*
  * This function is called after packets have been transimited. It fetchs mbuf
@@ -1590,8 +1562,8 @@ txmbuf_clean_zcp(struct virtio_net *dev, struct vpool 
*vpool)

for (index = 0; index < mbuf_count; index++) {
mbuf = __rte_mbuf_raw_alloc(vpool->pool);
-   if (likely(MBUF_EXT_MEM(mbuf)))
-   pktmbuf_detach_zcp(mbuf);
+   if (likely(RTE_MBUF_INDIRECT(mbuf)))
+   rte_pktmbuf_detach(mbuf);
rte_ring_sp_enqueue(vpool->ring, mbuf);

/* Update used index buffer information. */
@@ -1653,8 +1625,8 @@ static void mbuf_destroy_zcp(struct vpool *vpool)
for (index = 0; index < mbuf_count; index++) {
mbuf = __rte_mbuf_raw_alloc(vpool->pool);
if (likely(mbuf != NULL)) {
-   if (likely(MBUF_EXT_MEM(mbuf)))
-   pktmbuf_detach_zcp(mbuf);
+   if (likely(RTE_MBUF_INDIRECT(mbuf)))
+   rte_pktmbuf_detach(mbuf);
rte_ring_sp_enqueue(vpool->ring, (void *)mbuf);
}
}
@@ -2149,7 +2121,7 @@ switch_worker_zcp(__attribute__((unused)) void *arg)
}
while (likely(rx_count)) {
rx_count--;
-   pktmbuf_detach_zcp(
+   rte_pktmbuf_detach(
pkts_burst[rx_count]);
rte_ring_sp_enqueue(
vpool_array[index].ring,
-- 
1.9.1



[dpdk-dev] ovs-dpdk: placing the metadata

2015-03-24 Thread Zoltan Kiss
Hi,

I've noticed in lib/netdev-dpdk.c that __rte_pktmbuf_init() stores the 
packet metadata right after "struct rte_mbuf", and before the buffer data:

 /* start of buffer is just after mbuf structure */
 m->buf_addr = (char *)m + sizeof(struct dp_packet);

(struct dp_packet has the rte_mbuf as its first member when DPDK is enabled)

However, lib/librte_mbuf/rte_mbuf.h seems to codify that the buffer 
should start right after the rte_mbuf:

/**
  * Given the buf_addr returns the pointer to corresponding mbuf.
  */
#define RTE_MBUF_FROM_BADDR(ba) (((struct rte_mbuf *)(ba)) - 1)

/**
  * Given the pointer to mbuf returns an address where it's  buf_addr
  * should point to.
  */
#define RTE_MBUF_TO_BADDR(mb)   (((struct rte_mbuf *)(mb)) + 1)

These macros are used for attaching/detaching mbufs to each other. This
is the way the code retrieves the direct buffer from an indirect one,
and vice versa. I think if we want to keep the metadata feature (which I
guess is quite important), we need to add a pointer to rte_mbuf, which 
helps the direct and indirect structs to find each other. Something like:

struct rte_mbuf *attach;/**< Points to the other buffer if this one
 is (in)direct. Otherwise NULL.  */
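
For illustration, the layout conflict the pointer would bridge (sizes not to
scale; the diagram only restates the code above):

	/*   +-----------------+--------------+------------------+
	 *   | struct rte_mbuf | OVS metadata | packet data ...  |
	 *   +-----------------+--------------+------------------+
	 *   ^ m                              ^ m->buf_addr
	 *
	 * RTE_MBUF_FROM_BADDR(m->buf_addr) computes
	 * ((struct rte_mbuf *)m->buf_addr) - 1, which lands inside the
	 * metadata area instead of back at m, so attach/detach break. */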

What do you think?

Regards,

Zoltan Kiss