On 01/26/2018 05:27 PM, Stokes, Ian wrote:
>> -----Original Message-----
>> From: Kevin Traynor [mailto:[email protected]]
>> Sent: Friday, January 26, 2018 3:48 PM
>> To: Ilya Maximets <[email protected]>; Stokes, Ian
>> <[email protected]>; [email protected]
>> Cc: Flavio Leitner <[email protected]>; Loftus, Ciara
>> <[email protected]>; Kavanagh, Mark B <[email protected]>;
>> Jan Scheurich ([email protected]) <[email protected]>;
>> Ben Pfaff ([email protected]) <[email protected]>; [email protected]; Venkatesan
>> Pradeep <[email protected]>
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
>>>> Hi All,
>>>>
>>>> Recently an issue was raised regarding the move from a single shared
>> mempool model that was in place up to OVS 2.8, to a mempool per port model
>> introduced in 2.9.
>>>>
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
>>>>
>>>> The per port mempool model was introduced on September 5th with commit
>>>> d555d9bded to allow fine-grained control of memory usage on a per port
>>>> basis.
>>>>
>>>> In the 'common/shared mempool' model, ports sharing a similar MTU would
>>>> all share the same buffer mempool (for example, all ports are by default
>>>> created with a 1500B MTU and so share the same mbuf mempool).
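As a rough illustration of how the shared model worked (a simplified sketch,
not the exact OVS code; shared_mp_list, shared_mp_create() and struct
shared_mp are placeholder names): pools lived on a global list keyed by
socket and MTU, were reference counted, and a new port with a matching MTU
on the same NUMA node simply reused the existing pool.

    #include <rte_mempool.h>
    #include "openvswitch/list.h"

    /* Simplified sketch of the pre-2.9 shared mempool lookup. */
    struct shared_mp {
        struct rte_mempool *mp;     /* The DPDK pool itself. */
        int socket_id;              /* NUMA socket the pool lives on. */
        int mtu;                    /* MTU the pool was sized for. */
        int refcount;               /* Number of ports using the pool. */
        struct ovs_list list_node;
    };

    static struct ovs_list shared_mp_list
        = OVS_LIST_INITIALIZER(&shared_mp_list);

    static struct shared_mp *shared_mp_create(int socket_id, int mtu);

    static struct shared_mp *
    shared_mp_get(int socket_id, int mtu)
    {
        struct shared_mp *smp;

        LIST_FOR_EACH (smp, list_node, &shared_mp_list) {
            if (smp->socket_id == socket_id && smp->mtu == mtu) {
                smp->refcount++;            /* Reuse the existing pool. */
                return smp;
            }
        }
        return shared_mp_create(socket_id, mtu);  /* First user allocates. */
    }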
>>>>
>>>> This approach had some drawbacks, however. For example, with the shared
>>>> memory pool model a user could exhaust the shared memory pool (for
>>>> instance by requesting a large number of RXQs for a port); this would
>>>> cause the vSwitch to crash, as any remaining ports would not have the
>>>> required memory to function. This bug was discovered and reported to the
>>>> community in late 2016:
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
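(For a sense of scale, assuming the default 2048-descriptor rx rings: a port
configured with 16 rxqs keeps roughly 16 * 2048 = 32,768 mbufs pinned in its
rx descriptor rings alone, which can drain a shared pool that was sized for
default configurations.)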
>>>>
>>>> The per port mempool patch aimed to avoid such issues by allocating a
>> separate buffer mempool to each port.
>>>>
>>>> An issue has been flagged on ovs-discuss whereby the memory dimensioned
>>>> for a given number of ports on OvS 2.6-2.8 may be insufficient to support
>>>> the same number of ports in OvS 2.9 under the per-port mempool model,
>>>> unless extra memory is provisioned. The effect of this is use case
>>>> dependent (number of ports, RXQs, MTU settings, number of PMDs etc.). The
>>>> previous 'common-pool' model was rudimentary in estimating the mempool
>>>> size and was marked as something to be improved upon. The memory
>>>> allocation calculation for the per port model was modified to take the
>>>> configuration factors mentioned above into account.
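For reference, the estimate that the per-port model uses in dpdk_mp_create()
is roughly the shape below (paraphrased; the tail of the same expression is
visible as context in Ilya's diff further down):

    /* Rough per-port mbuf estimate (paraphrased from lib/netdev-dpdk.c):
     *   packets that can sit in every rx ring
     * + packets that can sit in every tx ring
     * + one NETDEV_MAX_BURST batch in flight per rx queue
     * + MIN_NB_MBUF (16384) of extra headroom. */
    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
              + dev->requested_n_txq * dev->requested_txq_size
              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
              + MIN_NB_MBUF;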
>>>>
>>>> It's unfortunate that this came to light so close to the release code
>>>> freeze, but better late than never, as it is a valid problem to be
>>>> resolved.
>>>>
>>>> I wanted to highlight some options to the community as I don't think
>> the next steps should be taken in isolation due to the impact this feature
>> has.
>>>>
>>>> There are a number of possibilities for the 2.9 release.
>>>>
>>>> (i) Revert the mempool per port patches and return to the shared
>>>> mempool model. There are a number of features and refactorings in place
>>>> on top of the change, so this will not be a simple revert. I'm currently
>>>> investigating exactly what is involved in this.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own
>>> issues and hide the real memory usage of each port. Also, reverting
>>> seems almost impossible; we'd have to re-implement it from scratch.
>
> I would agree; reverting isn't as straightforward an option due to the
> number of commits that were introduced in relation to the per port mempool
> feature over time (listed below for completeness).
>
> netdev-dpdk: Create separate memory pool for each port: d555d9b
> netdev-dpdk: fix management of pre-existing mempools: b6b26021d
> netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> netdev-dpdk: skip init for existing mempools: 837c176
> netdev-dpdk: manage failure in mempool name creation: 65056fd
> netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> netdev-dpdk: Fix mempool creation with large MTU: af5b0da
> netdev-dpdk: Add debug appctl to get mempool information: be48173
>
> Although a lot of these are fixes/formatting, we would have to introduce a
> new series and re-introduce the shared model by removing the per port
> specifics. This will also require extensive testing.
>
> Functionality such as the mempool debug tool would also be impacted (I'm
> not sure how valuable it would be for a single shared mempool).
>
> I can look at putting an RFC for this together so as to keep the option on
> the table if it's preferred within the 2.9 time window.
>
>>>> (ii) Leave the per port mempool implementation as is, and flag to users
>>>> that memory requirements have increased. Extra memory may have to be
>>>> provided on a per use case basis.
>>>
>>> That's a good point. I prefer this option if possible.
>>
>> With 4 pmds and 4 rxqs per port a user would only be able to add 7 ports
>> per socket now, or 9 ports with 1 rxq. Of course a user may have some
>> headroom in the memory they have allocated, but that can't be assumed. It
>> will surely mean some users' setups will not work after upgrade, and then
>> they will have to debug why.
>>
>> If we just reduce the MIN_NB_MBUF to 8192, it would make it 10 ports/4
>> rxqs, or 13 ports/1 rxq without additional memory. It still gives 24K/18K
>> buffers per port.
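For a rough sense of where those 24K/18K figures come from (assuming the
default 2048-descriptor rx/tx rings, 4 tx queues for the 4 PMDs, and
32-packet bursts), with MIN_NB_MBUF at 8192 the per-port estimate works out
to roughly:

    4 rxqs: 4*2048 + 4*2048 + 4*32 + 8192 = 24,704  (~24K mbufs)
    1 rxq:  1*2048 + 4*2048 + 1*32 + 8192 = 18,464  (~18K mbufs)

How many ports then fit per socket depends on the hugepage memory that was
dimensioned for the previous releases.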
>>
>> What do you think?
>
> This is similar to the RFC you submitted, I think.
The RFC tried to make the calculation of the number of buffers required more
accurate, reduce the number of additional buffers by 75%, and allow it to go
to zero. The suggestion above just reduces the additional buffers by 50%.
Ilya has shown a common case where it would not fix the issue or be a
feasible workaround.
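Put in numbers (MIN_NB_MBUF is currently 16384): the RFC would leave roughly
4096 extra mbufs per port and let that fall to zero where it isn't needed,
whereas halving MIN_NB_MBUF leaves a flat 8192 extra mbufs on every port.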
> I take it we would still hit the issues flagged by Pradeep, but it increases
> the number of ports at least.
>
> The 24K/18K is in and around the region Ilya flagged below, so it might be
> OK. It would be great if we had common config/deployment info to help
> confirm this.
>
>>
>>>
>>>> (iii) Reduce the amount of memory allocated per mempool per port. An
>>>> RFC to this effect was submitted by Kevin, but on follow-up the feeling
>>>> is that it does not resolve the issue adequately.
>>>
>>
>> Right, we discussed it in the other thread. It fixes some of the missed
>> factors for the worst case, but Pradeep pointed out there are more. To
>> account for them all would be difficult and would mean a lot of extra
>> mbufs per port.
>>
>>> IMHO, we should not reduce the mempool size by hardcoding a small value.
>>> Different installations have their own requirements because of different
>>> numbers of HW NICs and CPU cores. I believe that 20-30K is a normal
>>> per-port size for now.
>>>
>>
>> We would not be too far off that range with the suggested change above.
>>
>>> How about adding an upper limit for mempool size, configurable at boot
>>> time?
>>
>> I don't think an OVS user would know how to calculate the number of mbufs.
>> Even if they did, setting it would amount to them accepting that they may
>> run out of mbufs on a port.
>
> I think the shared memory model as it was before also risks running out of
> mbufs, as per the original bug where it was killing the switch.
>
> The code below helps with the debug side for users. Ignoring for the moment
> whether they have the ability to calculate the required memory (I agree BTW,
> I don't think some users would know how to do this accurately), would the
> new parameter still be an issue for backwards compatibility?
Yes, it would.
>
>>
>>> See below code for example (not tested):
>>> -------------------------------------------------------------------------------
>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>> index 3602180..ed54e06 100644
>>> --- a/lib/dpdk-stub.c
>>> +++ b/lib/dpdk-stub.c
>>> @@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)
>>> {
>>>     return false;
>>> }
>>> +
>>> +uint32_t
>>> +dpdk_mempool_size_limit(void)
>>> +{
>>> +    return UINT32_MAX;
>>> +}
>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>> index 6710d10..219e412 100644
>>> --- a/lib/dpdk.c
>>> +++ b/lib/dpdk.c
>>> @@ -43,6 +43,9 @@ static FILE *log_stream = NULL;  /* Stream for DPDK log redirection */
>>> static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>>> static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>>>
>>> +/* Maximum number of mbufs in a single memory pool. */
>>> +static uint32_t per_port_mempool_size_limit = UINT32_MAX;
>>> +
>>> static int
>>> process_vhost_flags(char *flag, const char *default_val, int size,
>>>                     const struct smap *ovs_other_config,
>>> @@ -313,6 +316,7 @@ dpdk_init__(const struct smap *ovs_other_config)
>>> int err = 0;
>>> cpu_set_t cpuset;
>>> char *sock_dir_subcomponent;
>>> +    uint64_t size_u64;
>>>
>>> log_stream = fopencookie(NULL, "w+", dpdk_log_func);
>>> if (log_stream == NULL) {
>>> @@ -351,6 +355,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>> VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>> vhost_iommu_enabled ? "enabled" : "disabled");
>>>
>>> +    size_u64 = smap_get_ullong(ovs_other_config, "per-port-mempool-size-limit",
>>> +                               UINT32_MAX);
>>> +    if (size_u64 < UINT32_MAX) {
>>> +        per_port_mempool_size_limit = size_u64;
>>> +        VLOG_INFO("Mempool size for a single port limited to %"PRIu32" mbufs.",
>>> +                  per_port_mempool_size_limit);
>>> + }
>>> +
>>> argv = grow_argv(&argv, 0, 1);
>>> argc = 1;
>>>     argv[0] = xstrdup(ovs_get_program_name());
>>> @@ -494,6 +506,12 @@ dpdk_vhost_iommu_enabled(void)
>>> return vhost_iommu_enabled;
>>> }
>>>
>>> +uint32_t
>>> +dpdk_mempool_size_limit(void)
>>> +{
>>> +    return per_port_mempool_size_limit;
>>> +}
>>> +
>>> void
>>> dpdk_set_lcore_id(unsigned cpu)
>>> {
>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>> index dc58d96..4d78895 100644
>>> --- a/lib/dpdk.h
>>> +++ b/lib/dpdk.h
>>> @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config);
>>> void dpdk_set_lcore_id(unsigned cpu);
>>> const char *dpdk_get_vhost_sock_dir(void);
>>> bool dpdk_vhost_iommu_enabled(void);
>>> +uint32_t dpdk_mempool_size_limit(void);
>>>
>>> #endif /* dpdk.h */
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index e834c7e..b2f1072 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -518,7 +518,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>> char mp_name[RTE_MEMPOOL_NAMESIZE];
>>> const char *netdev_name = netdev_get_name(&dev->up);
>>> int socket_id = dev->requested_socket_id;
>>> -    uint32_t n_mbufs;
>>> +    uint32_t n_mbufs, max_mbufs;
>>> uint32_t hash = hash_string(netdev_name, 0);
>>> struct rte_mempool *mp = NULL;
>>>
>>> @@ -534,6 +534,14 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>               + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>>>               + MIN_NB_MBUF;
>>>
>>> +    max_mbufs = dpdk_mempool_size_limit();
>>> +    if (max_mbufs < n_mbufs) {
>>> +        VLOG_WARN("Mempool size for port %s limited to %"PRIu32
>>> +                  " mbufs. Original estimation was %"PRIu32" mbufs.",
>>> +                  netdev_name, max_mbufs, n_mbufs);
>>> +        n_mbufs = max_mbufs;    /* Apply the configured limit. */
>>> +    }
>>> +
>>> ovs_mutex_lock(&dpdk_mp_mutex);
>>> do {
>>> /* Full DPDK memory pool name must be unique and cannot be
>>>
>>> -------------------------------------------------------------------------------
>>>
>>> I think that the above approach is a good enough workaround, because only
>>> the user knows how much memory he/she is able to allocate for OVS and how
>>> many ports are expected.
>>> I'd like the limit to be in megabytes, but I didn't find a good way to
>>> estimate the total size of a mempool.
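If something along these lines went in, the cap would presumably be set
through other_config before the datapath is started, e.g.:

    ovs-vsctl set Open_vSwitch . other_config:per-port-mempool-size-limit=24576

(key name taken from the smap_get_ullong() call in the sketch above; the
value here is just an example number of mbufs per port).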
>>>
>>>> (iv) Introduce a feature to allow users to configure the mempool as
>>>> shared or on a per port basis. This would be the best of both worlds,
>>>> but given the proximity to the 2.9 freeze I don't think it's feasible by
>>>> the end of January.
>>>
>>> This may solve some issues, but yes, it's definitely not for 2.9.
>>> A good implementation and extensive testing will be required.
>>
>> +1. There may be other solutions too; Pradeep had a different suggestion,
>> but it's not for now.
>
> OK, we can agree to set this option aside for the moment, at least within
> the period of the OVS code freeze. It will definitely be looked at for 2.10,
> however.
>
> Thanks
> Ian
>>
>> thanks,
>> Kevin.
>>
>>>
>>>>
>>>> I'd appreciate people's input on this ASAP so we can reach a consensus
>>>> on the next steps.
>>>>
>>>> Thanks
>>>> Ian
>>
>
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss