On 04/06/2018 04:51 PM, Ilya Maximets wrote:
>>> DPDK mempools are freed when they are no longer needed.
>>> This can happen when a port is removed or a port's mtu is reconfigured so
>>> that a new mempool is used.
>>>
>>> It is possible for an mbuf from a NIC Tx queue to be returned to a mempool
>>> that has already been freed, and this can lead to a segfault.
>>>
>>> In order to prevent this, only free mempools when they are not needed and
>>> have no in-use mbufs. As this might not be possible immediately, sweep the
>>> mempools anytime a port tries to get a mempool.
>>>
>>
>> Thanks for working on this Kevin. Just a query below. From a testing POV I
>> didn't come across any issues.
>>
>>> Fixes: 8d38823bdf8b ("netdev-dpdk: fix memory leak")
>>> Cc: mark.b.kavanagh81 at gmail.com
>>> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
>>> ---
>>>
>>> Found on OVS 2.6, but the code is operationally similar on the recent
>>> branch-2.*'s. If the patch is acceptable I can backport to older branches.
>
> This issue was actually raised back in January while discussing the mempool
> issue.
>
Hmmm, thanks for pointing it out. It seems the last mails in that thread
coincided with when I was traveling, and I didn't read them back then.
>>
>> Sounds good to me.
>>>
>>> lib/netdev-dpdk.c | 40 +++++++++++++++++++++++++++-------------
>>> 1 file changed, 27 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index c19cedc..9b8e732 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -590,8 +590,24 @@ dpdk_mp_create(int socket_id, int mtu)
>>>  }
>>>
>>> +/* Free unused mempools. */
>>> +static void
>>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>> +{
>>> +    struct dpdk_mp *dmp, *next;
>>> +
>>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>>> +        if (!dmp->refcount && rte_mempool_full(dmp->mp)) {
>>
>> I hadn't looked at rte_mempool_full() before. I noticed the documentation
>> warns not to use it as part of the datapath, but only for debug purposes.
>
> Yes, rte_mempool_full() is racy and could return false positives, but it's
> the best and only useful function in the DPDK API. In practice we can never
> be sure if someone still uses the memory pool. I also used this function in
> my solution for branch 2.8, posted here:
>
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046100.html
>
The similarities and differences are interesting!
> Memory handling in DPDK is awful. In fact, no one is able to use
> rte_mempool_free(), because there is no correct way to check if a memory
> pool is still in use.
>
> There is one dirty hack to check if a mempool is really full. It is based
> on the internal mempool implementation and the fact that we're not
> allocating any mbufs from the pool:
> 1. Check rte_mempool_ops_get_count(). <-- this is exposed externally for
>    some reason
> 2. Check rte_mempool_full().
> 3. Check rte_mempool_ops_get_count() again.
> 4. If rte_mempool_full() was 'true' and both calls to
>    rte_mempool_ops_get_count() returned the same value ----> the mempool
>    is really full.
> 5. Here we could safely call rte_mempool_free().
>
Yes, that would work.
The ring count is taken before the cache counts in rte_mempool_full(), and
as we are not allocating mbufs, I think the only possibility is mbufs
moving from a cache to the ring. In that case we may get a false negative,
but we wouldn't get a false positive from rte_mempool_full().
Obviously the downside is that it is implementation specific - but maybe
it could be argued the implementation will not change on already
released DPDK versions. What do you think?
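For illustration, a minimal sketch of that double-check could look like the
following, assuming no mbufs are allocated from the pool while it runs
(dpdk_mp_full_quiesced() is just a placeholder name, not part of the patch):

    #include <stdbool.h>
    #include <rte_mempool.h>

    /* Sketch only: returns true if 'mp' appears genuinely full, under the
     * assumption that nothing allocates from 'mp' concurrently. */
    static bool
    dpdk_mp_full_quiesced(const struct rte_mempool *mp)
    {
        unsigned int ring_before = rte_mempool_ops_get_count(mp);
        bool full = rte_mempool_full(mp);
        unsigned int ring_after = rte_mempool_ops_get_count(mp);

        /* If the ring count did not change around the full check, no mbufs
         * moved between the per-lcore caches and the ring, so the
         * rte_mempool_full() result can be trusted. */
        return full && ring_before == ring_after;
    }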
>>
>> Does its use add to the downtime in a noticeable way while we reconfigure?
>> I'm thinking of a deployment where the number of lcores is high and we
>> reconfigure often. When the cache is enabled, it has to read the cache
>> length of every lcore. There may not be an issue here, but I was curious.
>
I think it should be insignificant compared to actually
allocating/freeing a mempool. I will try to check it.
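For reference, that cost comes from rte_mempool_avail_count(), which
rte_mempool_full() is built on. Roughly, and simplified from the DPDK
sources rather than quoted from them, it does something like:

    #include <rte_mempool.h>

    /* Simplified sketch of rte_mempool_avail_count(); not the exact DPDK
     * code. The per-lcore loop is why the cost grows with RTE_MAX_LCORE
     * when caching is enabled. */
    static unsigned int
    mp_avail_count_sketch(const struct rte_mempool *mp)
    {
        unsigned int count = rte_mempool_ops_get_count(mp);

        if (mp->cache_size == 0) {
            return count;
        }

        for (unsigned int lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
            count += mp->local_cache[lcore_id].len;
        }

        /* The sum can transiently exceed the pool size due to races. */
        return count > mp->size ? mp->size : count;
    }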
> That is the interesting question. In my version (referenced above) I used a
> watchdog thread to periodically free mempools with zero refcount.
> Not sure which solution is better.
>
The advantage I see of being on the watchdog timer thread is that
probably no one cares how long it takes :-) The advantage of being in
the mempool code is that it runs (and is guaranteed to run) just before a
new mempool is requested, so there is a smaller risk of additional memory
requirements.
In the short term I think we should use either a fix like the patches,
which don't cover every possible corner case, or the get_count() approach,
which isn't the most elegant. Longer term we can think about what we would
want to add to DPDK to make this easier.
thanks,
Kevin.
>>
>> Did you do any testing around this?
>>
>> Thanks
>> Ian
>>
>>> +            ovs_list_remove(&dmp->list_node);
>>> +            rte_mempool_free(dmp->mp);
>>> +            rte_free(dmp);
>>> +        }
>>> +    }
>>> +}
>>> +
>>>  static struct dpdk_mp *
>>>  dpdk_mp_get(int socket_id, int mtu)
>>>  {
>>>      struct dpdk_mp *dmp;
>>> +    bool reuse = false;
>>>
>>>      ovs_mutex_lock(&dpdk_mp_mutex);
>>> @@ -599,15 +615,18 @@ dpdk_mp_get(int socket_id, int mtu)
>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>              dmp->refcount++;
>>> -            goto out;
>>> +            reuse = true;
>>> +            break;
>>>          }
>>>      }
>>> +    /* Sweep mempools after reuse or before create. */
>>> +    dpdk_mp_sweep();
>>>
>>> -    dmp = dpdk_mp_create(socket_id, mtu);
>>> -    if (dmp) {
>>> -        ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> +    if (!reuse) {
>>> +        dmp = dpdk_mp_create(socket_id, mtu);
>>> +        if (dmp) {
>>> +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> +        }
>>>      }
>>>
>>> -out:
>>> -
>>>      ovs_mutex_unlock(&dpdk_mp_mutex);
>>>
>>> @@ -615,5 +634,5 @@ out:
>>>  }
>>>
>>> -/* Release an existing mempool. */
>>> +/* Decrement reference to a mempool. */
>>>  static void
>>>  dpdk_mp_put(struct dpdk_mp *dmp)
>>> @@ -625,10 +644,5 @@ dpdk_mp_put(struct dpdk_mp *dmp)
>>>      ovs_mutex_lock(&dpdk_mp_mutex);
>>>      ovs_assert(dmp->refcount);
>>> -
>>> -    if (!--dmp->refcount) {
>>> -        ovs_list_remove(&dmp->list_node);
>>> -        rte_mempool_free(dmp->mp);
>>> -        rte_free(dmp);
>>> -    }
>>> +    dmp->refcount--;
>>>      ovs_mutex_unlock(&dpdk_mp_mutex);
>>>  }
>>> --
>>> 1.8.3.1
>>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev