[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-29 Thread Thomas Monjalon
2016-02-29 12:51, Panu Matilainen:
> On 02/24/2016 03:23 PM, Ananyev, Konstantin wrote:
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Panu Matilainen
> >> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
> >>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>  On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> > On 01/27/2016 02:56 PM, Panu Matilainen wrote:
> >> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
> >> the library ABI and should not be listed in the version map.
> >>
> >> I assume its inline for performance reasons, but then you lose the
> >> benefits of dynamic linking such as ability to fix bugs and/or improve
> >> itby just updating the library. Since the point of having a bulk API is
> >> to improve performance by reducing the number of calls required, does 
> >> it
> >> really have to be inline? As in, have you actually measured the
> >> difference between inline and non-inline and decided its worth all the
> >> downsides?
> > Agree with Panu. It would be interesting to compare the performance
> > between inline and non inline to decide whether inlining it or not.
>  Will update after i gathered more data. inline could show obvious
>  performance difference in some cases.
> >>>
> >>> Panu and Oliver:
> >>> I write a simple benchmark. This benchmark run 10M rounds, in each round
> >>> 8 mbufs are allocated through bulk API, and then freed.
> >>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
> >>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
> >>> Btw, i have removed some exceptional data, the frequency of which is
> >>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
> >>> what happened.
> >>>
> >>> With 8 mbufs allocated, there is about 6% performance increase using 
> >>> inline.
> >> [...]
> >>>
> >>> With 16 mbufs allocated, we could still observe obvious performance
> >>> difference, though only 1%-2%
> >>>
> >> [...]
> >>>
> >>> With 32/64 mbufs allocated, the deviation of the data itself would hide
> >>> the performance difference.
> >>> So we prefer using inline for performance.
> >>
> >> At least I was more after real-world performance in a real-world
> >> use-case rather than CPU cycles in a microbenchmark, we know function
> >> calls have a cost but the benefits tend to outweight the cons.
> >>
> >> Inline functions have their place and they're far less evil in project
> >> internal use, but in library public API they are BAD and should be ...
> >> well, not banned because there are exceptions to every rule, but highly
> >> discouraged.
> >
> > Why is that?
> 
> For all the reasons static linking is bad, and what's worse it forces 
> the static linking badness into dynamically linked builds.
> 
> If there's a bug (security or otherwise) in a library, a distro wants to 
> supply an updated package which fixes that bug and be done with it. But 
> if that bug is in an inlined code, supplying an update is not enough, 
> you also need to recompile everything using that code, and somehow 
> inform customers possibly using that code that they need to not only 
> update the library but to recompile their apps as well. That is 
> precisely the reason distros go to great lenghts to avoid *any* 
> statically linked apps and libs in the distro, completely regardless of 
> the performance overhead.
> 
> In addition, inlined code complicates ABI compatibility issues because 
> some of the code is one the "wrong" side, and worse, it bypasses all the 
> other ABI compatibility safeguards like soname and symbol versioning.
> 
> Like said, inlined code is fine for internal consumption, but incredibly 
> bad for public interfaces. And of course, the more complicated a 
> function is, greater the potential of needing bugfixes.
> 
> Mind you, none of this is magically specific to this particular 
> function. Except in the sense that bulk operations offer a better way of 
> performance improvements than just inlining everything.
> 
> > As you can see right now we have all mbuf alloc/free routines as static 
> > inline.
> > And I think we would like to keep it like that.
> > So why that particular function should be different?
> 
> Because there's much less need to have it inlined since the function 
> call overhead is "amortized" by the fact its doing bulk operations. "We 
> always did it that way" is not a very good reason :)
> 
> > After all that function is nothing more than a wrapper
> > around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
> > So unless mempool get/put API would change, I can hardly see there could be 
> > any ABI
> > breakages in future.
> > About 'real world' performance gain - it was a 'real world' performance 
> > problem,
> > that we tried to solve by introducing that function:
> > http://dpdk.org/ml/archives/dev/2015-May/017633.html
> >
> > And according to the user feedback, it does help:
> > 

[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-29 Thread Panu Matilainen
On 02/24/2016 03:23 PM, Ananyev, Konstantin wrote:
> Hi Panu,
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Panu Matilainen
>> Sent: Wednesday, February 24, 2016 12:12 PM
>> To: Xie, Huawei; Olivier MATZ; dev at dpdk.org
>> Cc: dprovan at bivio.net
>> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk 
>> API
>>
>> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
>>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>>>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>>>> Hi,
>>>>>
>>>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>>>>> the library ABI and should not be listed in the version map.
>>>>>>
>>>>>> I assume its inline for performance reasons, but then you lose the
>>>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>>>>> itby just updating the library. Since the point of having a bulk API is
>>>>>> to improve performance by reducing the number of calls required, does it
>>>>>> really have to be inline? As in, have you actually measured the
>>>>>> difference between inline and non-inline and decided its worth all the
>>>>>> downsides?
>>>>> Agree with Panu. It would be interesting to compare the performance
>>>>> between inline and non inline to decide whether inlining it or not.
>>>> Will update after i gathered more data. inline could show obvious
>>>> performance difference in some cases.
>>>
>>> Panu and Oliver:
>>> I write a simple benchmark. This benchmark run 10M rounds, in each round
>>> 8 mbufs are allocated through bulk API, and then freed.
>>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
>>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
>>> Btw, i have removed some exceptional data, the frequency of which is
>>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
>>> what happened.
>>>
>>> With 8 mbufs allocated, there is about 6% performance increase using inline.
>> [...]
>>>
>>> With 16 mbufs allocated, we could still observe obvious performance
>>> difference, though only 1%-2%
>>>
>> [...]
>>>
>>> With 32/64 mbufs allocated, the deviation of the data itself would hide
>>> the performance difference.
>>> So we prefer using inline for performance.
>>
>> At least I was more after real-world performance in a real-world
>> use-case rather than CPU cycles in a microbenchmark, we know function
>> calls have a cost but the benefits tend to outweight the cons.
>>
>> Inline functions have their place and they're far less evil in project
>> internal use, but in library public API they are BAD and should be ...
>> well, not banned because there are exceptions to every rule, but highly
>> discouraged.
>
> Why is that?

For all the reasons static linking is bad, and what's worse it forces 
the static linking badness into dynamically linked builds.

If there's a bug (security or otherwise) in a library, a distro wants to 
supply an updated package which fixes that bug and be done with it. But 
if that bug is in an inlined code, supplying an update is not enough, 
you also need to recompile everything using that code, and somehow 
inform customers possibly using that code that they need to not only 
update the library but to recompile their apps as well. That is 
precisely the reason distros go to great lenghts to avoid *any* 
statically linked apps and libs in the distro, completely regardless of 
the performance overhead.

In addition, inlined code complicates ABI compatibility issues because 
some of the code is one the "wrong" side, and worse, it bypasses all the 
other ABI compatibility safeguards like soname and symbol versioning.

Like said, inlined code is fine for internal consumption, but incredibly 
bad for public interfaces. And of course, the more complicated a 
function is, greater the potential of needing bugfixes.

Mind you, none of this is magically specific to this particular 
function. Except in the sense that bulk operations offer a better way of 
performance improvements than just inlining everything.

> As you can see right now we have all mbuf alloc/free routines as static 
> inline.
> And I think we would like to keep it like that.
> So why that particular function should be different?

Because there's much less need to have it inlined

[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Olivier MATZ
Hi Huawei,

On 02/26/2016 10:07 AM, Xie, Huawei wrote:
> On 2/26/2016 4:56 PM, Olivier MATZ wrote:
>> test_one_pktmbuf(struct rte_mbuf *m)
>> {
>>  /* same as before without the allocation/free */
>> }
>>
>> test_pkt_mbuf(void)
>> {
>>  m = rte_pktmbuf_alloc(pool);
>>  test_one_pktmbuf(m);
>>  rte_pktmbuf_free(m);
>>
>>  ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
>>  for (i = 0; i < BULK_CNT; i++) {
>>  m = mtab[i];
>>  test_one_pktmbuf(m);
>>  rte_pktmbuf_free(m);
>>  }
>> }
> 
> This is to test the functionality.
> Let us also have the case like the following?
> cycles_start = rte_get_timer_cycles();
> while(rounds--) {
> 
>   ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
>   for (i = 0; i < BULK_CNT; i++) {
>   m = mtab[i];
>   /* some work if needed */
>   rte_pktmbuf_free(m);
>   }
> }
>   cycles_end = rte_get_timer_cycles();
> 
> to compare with
>cycles_start = rte_get_timer_cycles();
>while(rounds--) {
> for (i = 0; i < BULK_CNT; i++)
> mtab[i] = rte_pktmbuf_alloc(...);
> 
>   ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
>   for (i = 0; i < BULK_CNT; i++) {
>   m = mtab[i];
>   /* some work if needed */
>   rte_pktmbuf_free(m);
>   }
> }
>   cycles_end = rte_get_timer_cycles();

In my opinion, it's already quite obvious that the bulk allocation
will be faster than the non-bulk (and we already have some mempool
benchmarks showing it). So I would say that functional testing is
enough.

On the other hand, it would be good to see if some examples
applications could be updated to take advantage of the new API (as
you did for the librte_vhost).

What do you think?


[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Olivier MATZ


On 02/23/2016 06:35 AM, Xie, Huawei wrote:
>>> Also, it would be nice to have a simple test function in
>>> app/test/test_mbuf.c. For instance, you could update
>>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>>> the mbuf allocation from the function. Then it could be called with
>>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>>> all the mbufs of rte_pktmbuf_alloc_bulk().
> 
> Don't quite get you. Is it that we write two cases, one case allocate
> mbuf through rte_pktmbuf_alloc_bulk and one use rte_pktmbuf_alloc? It is
> good to have. 

Yes, something like:

test_one_pktmbuf(struct rte_mbuf *m)
{
/* same as before without the allocation/free */
}

test_pkt_mbuf(void)
{
m = rte_pktmbuf_alloc(pool);
test_one_pktmbuf(m);
rte_pktmbuf_free(m);

ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
for (i = 0; i < BULK_CNT; i++) {
m = mtab[i];
test_one_pktmbuf(m);
rte_pktmbuf_free(m);
}
}

> I could do this after this patch.

Yes, please.


Thanks,
Olivier


[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Olivier MATZ


On 02/26/2016 08:39 AM, Xie, Huawei wrote:
 With 8 mbufs allocated, there is about 6% performance increase using 
 inline.
 With 16 mbufs allocated, we could still observe obvious performance
 difference, though only 1%-2%
> 

> On 2/24/2016 9:23 PM, Ananyev, Konstantin wrote:
>> As you can see right now we have all mbuf alloc/free routines as static 
>> inline.
>> And I think we would like to keep it like that.
>> So why that particular function should be different?
>> After all that function is nothing more than a wrapper 
>> around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
>> So unless mempool get/put API would change, I can hardly see there could be 
>> any ABI
>> breakages in future. 
>> About 'real world' performance gain - it was a 'real world' performance 
>> problem,
>> that we tried to solve by introducing that function:
>> http://dpdk.org/ml/archives/dev/2015-May/017633.html
>>
>> And according to the user feedback, it does help:  
>> http://dpdk.org/ml/archives/dev/2016-February/033203.html

For me, there's no doubt this function will help in real world use
cases. That's also true that today most (oh no, all) datapath mbuf
functions are inline. Although I understand Panu's point of view
about the use of inline functions, trying to de-inline some functions
of the mbuf API (and others APIs like mempool or ring) would require
a deep analysis first to check the performance impact. And I think there
would be an impact for most of them.

In this particular case, as the function does bulk allocations, it
probably tempers the cost of the function call, and that's why I
was curious of any comparison with/without inlining. But I'm not
sure having this only function as non-inline makes a lot of sense.

So:
Acked-by: Olivier Matz 



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Xie, Huawei
On 2/26/2016 4:56 PM, Olivier MATZ wrote:
>
> On 02/23/2016 06:35 AM, Xie, Huawei wrote:
 Also, it would be nice to have a simple test function in
 app/test/test_mbuf.c. For instance, you could update
 test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
 the mbuf allocation from the function. Then it could be called with
 a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
 all the mbufs of rte_pktmbuf_alloc_bulk().
>> Don't quite get you. Is it that we write two cases, one case allocate
>> mbuf through rte_pktmbuf_alloc_bulk and one use rte_pktmbuf_alloc? It is
>> good to have. 
> Yes, something like:
>
> test_one_pktmbuf(struct rte_mbuf *m)
> {
>   /* same as before without the allocation/free */
> }
>
> test_pkt_mbuf(void)
> {
>   m = rte_pktmbuf_alloc(pool);
>   test_one_pktmbuf(m);
>   rte_pktmbuf_free(m);
>
>   ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
>   for (i = 0; i < BULK_CNT; i++) {
>   m = mtab[i];
>   test_one_pktmbuf(m);
>   rte_pktmbuf_free(m);
>   }
> }

This is to test the functionality.
Let us also have the case like the following?
cycles_start = rte_get_timer_cycles();
while(rounds--) {

ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
for (i = 0; i < BULK_CNT; i++) {
m = mtab[i];
/* some work if needed */
rte_pktmbuf_free(m);
}
}
cycles_end = rte_get_timer_cycles();

to compare with
   cycles_start = rte_get_timer_cycles();
   while(rounds--) {
for (i = 0; i < BULK_CNT; i++)
mtab[i] = rte_pktmbuf_alloc(...);

ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
for (i = 0; i < BULK_CNT; i++) {
m = mtab[i];
/* some work if needed */
rte_pktmbuf_free(m);
}
}
cycles_end = rte_get_timer_cycles();


>> I could do this after this patch.
> Yes, please.
>
>
> Thanks,
> Olivier
>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Xie, Huawei
On 2/24/2016 9:23 PM, Ananyev, Konstantin wrote:
> Hi Panu,
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Panu Matilainen
>> Sent: Wednesday, February 24, 2016 12:12 PM
>> To: Xie, Huawei; Olivier MATZ; dev at dpdk.org
>> Cc: dprovan at bivio.net
>> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk 
>> API
>>
>> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
>>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>>>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>>>> Hi,
>>>>>
>>>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>>>>> the library ABI and should not be listed in the version map.
>>>>>>
>>>>>> I assume its inline for performance reasons, but then you lose the
>>>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>>>>> itby just updating the library. Since the point of having a bulk API is
>>>>>> to improve performance by reducing the number of calls required, does it
>>>>>> really have to be inline? As in, have you actually measured the
>>>>>> difference between inline and non-inline and decided its worth all the
>>>>>> downsides?
>>>>> Agree with Panu. It would be interesting to compare the performance
>>>>> between inline and non inline to decide whether inlining it or not.
>>>> Will update after i gathered more data. inline could show obvious
>>>> performance difference in some cases.
>>> Panu and Oliver:
>>> I write a simple benchmark. This benchmark run 10M rounds, in each round
>>> 8 mbufs are allocated through bulk API, and then freed.
>>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
>>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
>>> Btw, i have removed some exceptional data, the frequency of which is
>>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
>>> what happened.
>>>
>>> With 8 mbufs allocated, there is about 6% performance increase using inline.
>> [...]
>>> With 16 mbufs allocated, we could still observe obvious performance
>>> difference, though only 1%-2%
>>>
>> [...]
>>> With 32/64 mbufs allocated, the deviation of the data itself would hide
>>> the performance difference.
>>> So we prefer using inline for performance.
>> At least I was more after real-world performance in a real-world
>> use-case rather than CPU cycles in a microbenchmark, we know function
>> calls have a cost but the benefits tend to outweight the cons.

It depends on what could be called the real world case. It could be
argued. I think the case Konstantin mentioned could be called a real
world one.
If your opinion on whether use benchmark or real-world use case is not
specific to this bulk API, then i have different opinion. For example,
for kernel virtio optimization, people use vring bench. We couldn't
guarantee each small optimization could bring obvious performance gain
in some big workload. The gain could be hided if bottleneck is
elsewhere, so i also plan to build such kind of virtio bench in DPDK.

Finally, i am open to inline or not, but currently priority better goes
with performance. If we make it an API now, we couldn't easily step back
in future; But we could change otherwise, after we have more confidence.
We could even check every inline "API", whether it should be inline or
be in the lib.

>>
>> Inline functions have their place and they're far less evil in project
>> internal use, but in library public API they are BAD and should be ...
>> well, not banned because there are exceptions to every rule, but highly
>> discouraged.
> Why is that?
> As you can see right now we have all mbuf alloc/free routines as static 
> inline.
> And I think we would like to keep it like that.
> So why that particular function should be different?
> After all that function is nothing more than a wrapper 
> around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
> So unless mempool get/put API would change, I can hardly see there could be 
> any ABI
> breakages in future. 
> About 'real world' performance gain - it was a 'real world' performance 
> problem,
> that we tried to solve by introducing that function:
> http://dpdk.org/ml/archives/dev/2015-May/017633.html
>
> And according to the user feedback, it does help:  
> http://dpdk.org/ml/archives/dev/2016-February/033203.html
>
> Konstantin
>
>>  - Panu -
>>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-24 Thread Panu Matilainen
On 02/23/2016 07:35 AM, Xie, Huawei wrote:
> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>> Hi,
>>>
>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
 Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
 the library ABI and should not be listed in the version map.

 I assume its inline for performance reasons, but then you lose the
 benefits of dynamic linking such as ability to fix bugs and/or improve
 itby just updating the library. Since the point of having a bulk API is
 to improve performance by reducing the number of calls required, does it
 really have to be inline? As in, have you actually measured the
 difference between inline and non-inline and decided its worth all the
 downsides?
>>> Agree with Panu. It would be interesting to compare the performance
>>> between inline and non inline to decide whether inlining it or not.
>> Will update after i gathered more data. inline could show obvious
>> performance difference in some cases.
>
> Panu and Oliver:
> I write a simple benchmark. This benchmark run 10M rounds, in each round
> 8 mbufs are allocated through bulk API, and then freed.
> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
> Btw, i have removed some exceptional data, the frequency of which is
> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
> what happened.
>
> With 8 mbufs allocated, there is about 6% performance increase using inline.
[...]
>
> With 16 mbufs allocated, we could still observe obvious performance
> difference, though only 1%-2%
>
[...]
>
> With 32/64 mbufs allocated, the deviation of the data itself would hide
> the performance difference.
> So we prefer using inline for performance.

At least I was more after real-world performance in a real-world 
use-case rather than CPU cycles in a microbenchmark, we know function 
calls have a cost but the benefits tend to outweight the cons.

Inline functions have their place and they're far less evil in project 
internal use, but in library public API they are BAD and should be ... 
well, not banned because there are exceptions to every rule, but highly 
discouraged.

- Panu -




[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-24 Thread Ananyev, Konstantin
Hi Panu,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Panu Matilainen
> Sent: Wednesday, February 24, 2016 12:12 PM
> To: Xie, Huawei; Olivier MATZ; dev at dpdk.org
> Cc: dprovan at bivio.net
> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk 
> API
> 
> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
> > On 2/22/2016 10:52 PM, Xie, Huawei wrote:
> >> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> >>> Hi,
> >>>
> >>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
> >>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
> >>>> the library ABI and should not be listed in the version map.
> >>>>
> >>>> I assume its inline for performance reasons, but then you lose the
> >>>> benefits of dynamic linking such as ability to fix bugs and/or improve
> >>>> itby just updating the library. Since the point of having a bulk API is
> >>>> to improve performance by reducing the number of calls required, does it
> >>>> really have to be inline? As in, have you actually measured the
> >>>> difference between inline and non-inline and decided its worth all the
> >>>> downsides?
> >>> Agree with Panu. It would be interesting to compare the performance
> >>> between inline and non inline to decide whether inlining it or not.
> >> Will update after i gathered more data. inline could show obvious
> >> performance difference in some cases.
> >
> > Panu and Oliver:
> > I write a simple benchmark. This benchmark run 10M rounds, in each round
> > 8 mbufs are allocated through bulk API, and then freed.
> > These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
> > 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
> > Btw, i have removed some exceptional data, the frequency of which is
> > like 1/10. Sometimes observed user usage suddenly disappeared, no clue
> > what happened.
> >
> > With 8 mbufs allocated, there is about 6% performance increase using inline.
> [...]
> >
> > With 16 mbufs allocated, we could still observe obvious performance
> > difference, though only 1%-2%
> >
> [...]
> >
> > With 32/64 mbufs allocated, the deviation of the data itself would hide
> > the performance difference.
> > So we prefer using inline for performance.
> 
> At least I was more after real-world performance in a real-world
> use-case rather than CPU cycles in a microbenchmark, we know function
> calls have a cost but the benefits tend to outweight the cons.
> 
> Inline functions have their place and they're far less evil in project
> internal use, but in library public API they are BAD and should be ...
> well, not banned because there are exceptions to every rule, but highly
> discouraged.

Why is that?
As you can see right now we have all mbuf alloc/free routines as static inline.
And I think we would like to keep it like that.
So why that particular function should be different?
After all that function is nothing more than a wrapper 
around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
So unless mempool get/put API would change, I can hardly see there could be any 
ABI
breakages in future. 
About 'real world' performance gain - it was a 'real world' performance problem,
that we tried to solve by introducing that function:
http://dpdk.org/ml/archives/dev/2015-May/017633.html

And according to the user feedback, it does help:  
http://dpdk.org/ml/archives/dev/2016-February/033203.html

Konstantin

> 
>   - Panu -
> 



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-23 Thread Xie, Huawei
On 2/22/2016 10:52 PM, Xie, Huawei wrote:
> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>> Hi,
>>
>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>> the library ABI and should not be listed in the version map.
>>>
>>> I assume its inline for performance reasons, but then you lose the
>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>> itby just updating the library. Since the point of having a bulk API is
>>> to improve performance by reducing the number of calls required, does it
>>> really have to be inline? As in, have you actually measured the
>>> difference between inline and non-inline and decided its worth all the
>>> downsides?
>> Agree with Panu. It would be interesting to compare the performance
>> between inline and non inline to decide whether inlining it or not.
> Will update after i gathered more data. inline could show obvious
> performance difference in some cases.

Panu and Oliver:
I write a simple benchmark. This benchmark run 10M rounds, in each round
8 mbufs are allocated through bulk API, and then freed.
These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
Btw, i have removed some exceptional data, the frequency of which is
like 1/10. Sometimes observed user usage suddenly disappeared, no clue
what happened.

With 8 mbufs allocated, there is about 6% performance increase using inline.
inlinenon-inline
2780732950309416
28348536962951378072
28230153202954500888
28250600322958939912
28244998042898938284
28108597202944892796
28522294203014273296
27873085002956809852
27933372602958674900
2834762954346352
27854551842925719136
28215286242937380416
28229221362974978604
27766459202947666548
28159525722952316900
28010487402947366984
28514626722946469004

With 16 mbufs allocated, we could still observe obvious performance
difference, though only 1%-2%

inlinenon-inline
55199870845669902680
55384160965737646840
55789340645590165532
55481319725767926840
56255856965831345628
55582828765662223764
54455877685641003924
55590963205775258444
56564379885743969272
54409394045664882412
54988759685785138532
55616528085737123940
55152117165627775604
55505671405630790628
56659642805589568164
55912959005702697308

With 32/64 mbufs allocated, the deviation of the data itself would hide
the performance difference.

So we prefer using inline for performance.
>> Also, it would be nice to have a simple test function in
>> app/test/test_mbuf.c. For instance, you could update
>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>> the mbuf allocation from the function. Then it could be called with
>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>> all the mbufs of rte_pktmbuf_alloc_bulk().

Don't quite get you. Is it that we write two cases, one case allocate
mbuf through rte_pktmbuf_alloc_bulk and one use rte_pktmbuf_alloc? It is
good to have. I could do this after this patch.
>>
>> Regards,
>> Olivier
>>
>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-22 Thread Xie, Huawei
On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> Hi,
>
> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>
>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>> the library ABI and should not be listed in the version map.
>>
>> I assume its inline for performance reasons, but then you lose the
>> benefits of dynamic linking such as ability to fix bugs and/or improve
>> itby just updating the library. Since the point of having a bulk API is
>> to improve performance by reducing the number of calls required, does it
>> really have to be inline? As in, have you actually measured the
>> difference between inline and non-inline and decided its worth all the
>> downsides?
>
> Agree with Panu. It would be interesting to compare the performance
> between inline and non inline to decide whether inlining it or not.

Will update after i gathered more data. inline could show obvious
performance difference in some cases.

>
> Also, it would be nice to have a simple test function in
> app/test/test_mbuf.c. For instance, you could update
> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
> the mbuf allocation from the function. Then it could be called with
> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
> all the mbufs of rte_pktmbuf_alloc_bulk().
>
> Regards,
> Olivier
>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-03 Thread Olivier MATZ
Hi,

On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>
> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
> the library ABI and should not be listed in the version map.
>
> I assume its inline for performance reasons, but then you lose the
> benefits of dynamic linking such as ability to fix bugs and/or improve
> itby just updating the library. Since the point of having a bulk API is
> to improve performance by reducing the number of calls required, does it
> really have to be inline? As in, have you actually measured the
> difference between inline and non-inline and decided its worth all the
> downsides?

Agree with Panu. It would be interesting to compare the performance
between inline and non inline to decide whether inlining it or not.

Also, it would be nice to have a simple test function in
app/test/test_mbuf.c. For instance, you could update
test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
the mbuf allocation from the function. Then it could be called with
a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
all the mbufs of rte_pktmbuf_alloc_bulk().

Regards,
Olivier


[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-01-27 Thread Panu Matilainen
On 01/26/2016 07:03 PM, Huawei Xie wrote:
> v6 changes:
>   reflect the changes in release notes and library version map file
>   revise our duff's code style a bit to make it more readable
>
> v5 changes:
>   add comment about duff's device and our variant implementation
>
> v3 changes:
>   move while after case 0
>   add context about duff's device and why we use while loop in the commit
> message
>
> v2 changes:
>   unroll the loop a bit to help the performance
>
> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>
> There is related thread about this bulk API.
> http://dpdk.org/dev/patchwork/patch/4718/
> Thanks to Konstantin's loop unrolling.
>
> Attached the wiki page about duff's device. It explains the performance
> optimization through loop unwinding, and also the most dramatic use of
> case label fall-through.
> https://en.wikipedia.org/wiki/Duff%27s_device
>
> In our implementation, we use while() loop rather than do{} while() loop
> because we could not assume count is strictly positive. Using while()
> loop saves one line of check if count is zero.
>
> Signed-off-by: Gerald Rogers 
> Signed-off-by: Huawei Xie 
> Acked-by: Konstantin Ananyev 
> ---
>   doc/guides/rel_notes/release_2_3.rst |  3 ++
>   lib/librte_mbuf/rte_mbuf.h   | 55 
> 
>   lib/librte_mbuf/rte_mbuf_version.map |  7 +
>   3 files changed, 65 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_2_3.rst 
> b/doc/guides/rel_notes/release_2_3.rst
> index 99de186..a52cba3 100644
> --- a/doc/guides/rel_notes/release_2_3.rst
> +++ b/doc/guides/rel_notes/release_2_3.rst
> @@ -4,6 +4,9 @@ DPDK Release 2.3
>   New Features
>   
>
> +* **Enable bulk allocation of mbufs. **
> +  A new function ``rte_pktmbuf_alloc_bulk()`` has been added to allow the 
> user
> +  to allocate a bulk of mbufs.
>
>   Resolved Issues
>   ---
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index f234ac9..b2ed479 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1336,6 +1336,61 @@ static inline struct rte_mbuf 
> *rte_pktmbuf_alloc(struct rte_mempool *mp)
>   }
>
>   /**
> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to 
> default
> + * values.
> + *
> + *  @param pool
> + *The mempool from which mbufs are allocated.
> + *  @param mbufs
> + *Array of pointers to mbufs
> + *  @param count
> + *Array size
> + *  @return
> + *   - 0: Success
> + */
> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> +  struct rte_mbuf **mbufs, unsigned count)
> +{
> + unsigned idx = 0;
> + int rc;
> +
> + rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> + if (unlikely(rc))
> + return rc;
> +
> + /* To understand duff's device on loop unwinding optimization, see
> +  * https://en.wikipedia.org/wiki/Duff's_device.
> +  * Here while() loop is used rather than do() while{} to avoid extra
> +  * check if count is zero.
> +  */
> + switch (count % 4) {
> + case 0:
> + while (idx != count) {
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + case 3:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + case 2:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + case 1:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + }
> + }
> + return 0;
> +}
> +
> +/**
>* Attach packet mbuf to another packet mbuf.
>*
>* After attachment we refer the mbuf we attached as 'indirect',
> diff --git a/lib/librte_mbuf/rte_mbuf_version.map 
> b/lib/librte_mbuf/rte_mbuf_version.map
> index e10f6bd..257c65a 100644
> --- a/lib/librte_mbuf/rte_mbuf_version.map
> +++ b/lib/librte_mbuf/rte_mbuf_version.map
> @@ -18,3 +18,10 @@ DPDK_2.1 {
>   rte_pktmbuf_pool_create;
>
>   } DPDK_2.0;
> +
> +DPDK_2.3 {
> + global:
> +
> + rte_pktmbuf_alloc_bulk;
> +
> +} DPDK_2.1;
>

Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of 
the library ABI and should not be listed in the version map.

I assume its inline for performance reasons, but then you lose the 
benefits of dynamic linking such as ability to fix bugs and/or improve 
itby just updating the library. Since the point of 

[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-01-27 Thread Huawei Xie
v6 changes:
 reflect the changes in release notes and library version map file
 revise our duff's code style a bit to make it more readable

v5 changes:
 add comment about duff's device and our variant implementation

v3 changes:
 move while after case 0
 add context about duff's device and why we use while loop in the commit
message

v2 changes:
 unroll the loop a bit to help the performance

rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

There is related thread about this bulk API.
http://dpdk.org/dev/patchwork/patch/4718/
Thanks to Konstantin's loop unrolling.

Attached the wiki page about duff's device. It explains the performance
optimization through loop unwinding, and also the most dramatic use of
case label fall-through.
https://en.wikipedia.org/wiki/Duff%27s_device

In our implementation, we use while() loop rather than do{} while() loop
because we could not assume count is strictly positive. Using while()
loop saves one line of check if count is zero.

Signed-off-by: Gerald Rogers 
Signed-off-by: Huawei Xie 
Acked-by: Konstantin Ananyev 
---
 doc/guides/rel_notes/release_2_3.rst |  3 ++
 lib/librte_mbuf/rte_mbuf.h   | 55 
 lib/librte_mbuf/rte_mbuf_version.map |  7 +
 3 files changed, 65 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_3.rst 
b/doc/guides/rel_notes/release_2_3.rst
index 99de186..a52cba3 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -4,6 +4,9 @@ DPDK Release 2.3
 New Features
 

+* **Enable bulk allocation of mbufs. **
+  A new function ``rte_pktmbuf_alloc_bulk()`` has been added to allow the user
+  to allocate a bulk of mbufs.

 Resolved Issues
 ---
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f234ac9..b2ed479 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1336,6 +1336,61 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct 
rte_mempool *mp)
 }

 /**
+ * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
+ * values.
+ *
+ *  @param pool
+ *The mempool from which mbufs are allocated.
+ *  @param mbufs
+ *Array of pointers to mbufs
+ *  @param count
+ *Array size
+ *  @return
+ *   - 0: Success
+ */
+static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
+struct rte_mbuf **mbufs, unsigned count)
+{
+   unsigned idx = 0;
+   int rc;
+
+   rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
+   if (unlikely(rc))
+   return rc;
+
+   /* To understand duff's device on loop unwinding optimization, see
+* https://en.wikipedia.org/wiki/Duff's_device.
+* Here while() loop is used rather than do() while{} to avoid extra
+* check if count is zero.
+*/
+   switch (count % 4) {
+   case 0:
+   while (idx != count) {
+   RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+   rte_mbuf_refcnt_set(mbufs[idx], 1);
+   rte_pktmbuf_reset(mbufs[idx]);
+   idx++;
+   case 3:
+   RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+   rte_mbuf_refcnt_set(mbufs[idx], 1);
+   rte_pktmbuf_reset(mbufs[idx]);
+   idx++;
+   case 2:
+   RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+   rte_mbuf_refcnt_set(mbufs[idx], 1);
+   rte_pktmbuf_reset(mbufs[idx]);
+   idx++;
+   case 1:
+   RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
+   rte_mbuf_refcnt_set(mbufs[idx], 1);
+   rte_pktmbuf_reset(mbufs[idx]);
+   idx++;
+   }
+   }
+   return 0;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
diff --git a/lib/librte_mbuf/rte_mbuf_version.map 
b/lib/librte_mbuf/rte_mbuf_version.map
index e10f6bd..257c65a 100644
--- a/lib/librte_mbuf/rte_mbuf_version.map
+++ b/lib/librte_mbuf/rte_mbuf_version.map
@@ -18,3 +18,10 @@ DPDK_2.1 {
rte_pktmbuf_pool_create;

 } DPDK_2.0;
+
+DPDK_2.3 {
+   global:
+
+   rte_pktmbuf_alloc_bulk;
+
+} DPDK_2.1;
-- 
1.8.1.4