On 01/26/2018 05:27 PM, Stokes, Ian wrote:
>> -----Original Message-----
>> From: Kevin Traynor [mailto:ktray...@redhat.com]
>> Sent: Friday, January 26, 2018 3:48 PM
>> To: Ilya Maximets <i.maxim...@samsung.com>; Stokes, Ian
>> <ian.sto...@intel.com>; ovs-discuss@openvswitch.org
>> Cc: Flavio Leitner <f...@redhat.com>; Loftus, Ciara
>> <ciara.lof...@intel.com>; Kavanagh, Mark B <mark.b.kavan...@intel.com>;
>> Jan Scheurich (jan.scheur...@ericsson.com) <jan.scheur...@ericsson.com>;
>> Ben Pfaff (b...@ovn.org) <b...@ovn.org>; acon...@redhat.com; Venkatesan
>> Pradeep <venkatesan.prad...@ericsson.com>
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
>>>> Hi All,
>>>>
>>>> Recently an issue was raised regarding the move from a single shared
>> mempool model that was in place up to OVS 2.8, to a mempool per port model
>> introduced in 2.9.
>>>>
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
>>>>
>>>> The per port mempool model was introduced on September 5th with commit
>> d555d9bded to allow fine-grained control of memory usage on a per-port
>> basis.
>>>>
>>>> In the 'common/shared mempool' model, ports sharing a similar MTU would
>> all share the same buffer mempool (e.g. the most common example of this
>> being that all ports are by default created with a 1500B MTU, and as such
>> share the same mbuf mempool).
>>>>
>>>> This approach had some drawbacks however. For example, with the shared
>> memory pool model a user could exhaust the shared memory pool (for
>> instance by requesting a large number of RXQs for a port); this would
>> cause the vSwitch to crash as any remaining ports would not have the
>> required memory to function. This bug was discovered and reported to the
>> community in late 2016:
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>>>
>>>> The per port mempool patch aimed to avoid such issues by allocating a
>> separate buffer mempool to each port.
>>>>
>>>> An issue has been flagged on ovs-discuss, whereby memory dimensions
>> provided for a given number of ports on OvS 2.6-2.8 may be insufficient to
>> support the same number of ports in OvS 2.9 under the per-port mempool
>> model, unless extra memory is provisioned. The effect of this is use case
>> dependent (number of ports, RXQs, MTU settings, number of PMDs, etc.). The
>> previous 'common-pool' model was rudimentary in estimating the mempool
>> size and was marked as something that was to be improved upon. The memory
>> allocation calculation for the per-port model was modified to take the
>> configuration factors mentioned above into account.
>>>>
>>>> It's unfortunate that this came to light so close to the release code
>> freeze - but better late than never as it is a valid problem to be
>> resolved.
>>>>
>>>> I wanted to highlight some options to the community as I don't think
>> the next steps should be taken in isolation due to the impact this feature
>> has.
>>>>
>>>> There are a number of possibilities for the 2.9 release.
>>>>
>>>> (i) Revert the mempool per port patches and return to the shared
>> mempool model. There are a number of features and refactoring in place on
>> top of the change so this will not be a simple revert. I'm investigating
>> what exactly is involved with this currently.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own
>>> issues and hide the real memory usage of each port. Also, reverting
>>> seems almost impossible; we'll have to re-implement it from scratch.
> 
> I would agree, reverting isn’t as straightforward an option due to the
> number of commits that were introduced in relation to the per port mempool
> feature over time (listed below for completeness).
> 
> netdev-dpdk: Create separate memory pool for each port: d555d9b
> netdev-dpdk: fix management of pre-existing mempools: b6b26021d
> netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> netdev-dpdk: skip init for existing mempools: 837c176
> netdev-dpdk: manage failure in mempool name creation: 65056fd 
> netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> netdev-dpdk: Fix mempool creation with large MTU: af5b0da
> netdev-dpdk: Add debug appctl to get mempool information: be48173
> 
> Although a lot of these are fixes/formatting, we would have to introduce a 
> new series and re-introduce the shared model by removing the per port 
> specifics. This will also require extensive testing.
> 
> Functionality such as the mempool debug tool would also be impacted 
> (I'm not sure how valuable it would be for a single shared mempool).
> 
> I can look at putting an RFC for this together so as to keep the option on 
> the table if it's preferred within the 2.9 time window.
> 
>>>> (ii) Leave the per port mempool implementation as is, and flag to users
>> that memory requirements have increased. Extra memory may have to be
>> provided on a per use case basis.
>>>
>>> That's a good point. I prefer this option if possible.
>>
>> With 4 pmds and 4 rxqs per port a user would only be able to add 7 ports
>> per socket now; 9 ports if 1 rxq. Of course a user may have some headroom
>> in the memory they have allocated but that can't be assumed. It will
>> surely mean some users' setups will not work after upgrade - then they
>> will have to debug why.
>>
>> If we just reduce the MIN_NB_MBUF to 8192, it would make it 10 ports/4
>> rxqs, or 13 ports/1 rxqs without additional memory. It still gives 24K/18K
>> buffers per port.
>>
>> What do you think?
> 
> This is similar to the RFC you submitted I think.

The RFC tried to make the calculation of the number of required buffers
more accurate, reduce the number of additional buffers by 75% and allow it
to go to zero. The suggestion above just reduces the additional buffers by
50%. Ilya has shown a common case where it would not fix the issue or be a
feasible workaround.
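
To make the arithmetic above easier to play with, here is a rough
standalone sketch (not OVS code): it mirrors the per-port estimate as I
understand it (the MIN()/MIN_NB_MBUF tail of it is visible in the hunk
quoted further down). The 2048-descriptor queues, one txq per PMD and the
~3 KB per-mbuf footprint are assumptions, so treat the byte figures as
indicative only.

/* Back-of-envelope only: NOT OVS code.  All constants are assumptions. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NETDEV_MAX_BURST 32u      /* burst size used by OVS-DPDK */
#define ASSUMED_MAX_LCORE 128u    /* stand-in for RTE_MAX_LCORE */

/* Mirrors the per-port estimate: rx + tx descriptors, one burst in
 * flight per rxq, plus the fixed MIN_NB_MBUF headroom. */
static uint32_t
estimate_n_mbufs(uint32_t n_rxq, uint32_t n_txq, uint32_t q_size,
                 uint32_t min_nb_mbuf)
{
    uint32_t rxqs = n_rxq < ASSUMED_MAX_LCORE ? n_rxq : ASSUMED_MAX_LCORE;

    return n_rxq * q_size + n_txq * q_size
           + rxqs * NETDEV_MAX_BURST + min_nb_mbuf;
}

int
main(void)
{
    const uint32_t q_size = 2048;      /* assumed default descriptor count */
    const uint32_t mbuf_bytes = 3072;  /* assumed footprint for 1500B MTU  */
    const uint32_t min_nb[] = { 16384, 8192 };

    for (int i = 0; i < 2; i++) {
        /* 4 PMDs assumed, so 4 txqs; compare 4 rxqs vs 1 rxq. */
        uint32_t n4 = estimate_n_mbufs(4, 4, q_size, min_nb[i]);
        uint32_t n1 = estimate_n_mbufs(1, 4, q_size, min_nb[i]);

        printf("MIN_NB_MBUF=%-5"PRIu32" 4rxq: %"PRIu32" mbufs (~%"PRIu32
               " MiB), 1rxq: %"PRIu32" mbufs (~%"PRIu32" MiB)\n",
               min_nb[i],
               n4, (uint32_t) ((uint64_t) n4 * mbuf_bytes / (1024 * 1024)),
               n1, (uint32_t) ((uint64_t) n1 * mbuf_bytes / (1024 * 1024)));
    }
    return 0;
}

Dividing a per-socket hugepage budget by the MiB-per-port figure gives a
feel for how many ports fit; the real footprint depends on the MTU and the
mempool overheads, so the numbers will not match every setup.
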
> I take it we still hit the issues flagged by Pradeep but it increases the 
> number of ports at least.
> 
> The 24K/18K is in and around the region Ilya had flagged below so it might be 
> ok. It would be great if we had common config/deployment info to help confirm 
> this.
> 
>>
>>>
>>>> (iii) Reduce the amount of memory allocated per mempool per port. An
>> RFC to this effect was submitted by Kevin but on follow up the feeling is
>> that it does not resolve the issue adequately.
>>>
>>
>> Right, we discussed it in the other thread. It fixes some of the missed
>> factors for worst case, but Pradeep pointed out there are more. To account
>> for them all would be difficult and mean a lot of extra mbufs per port.
>>
>>> IMHO, we should not reduce the mempool size by hardcoding a small value.
>>> Different installations have their own requirements because of
>>> different numbers of HW NICs and CPU cores. I believe that 20-30K is a
>> normal per-port size for now.
>>>
>>
>> It would not be too far off that range with the suggested change above.
>>
>>> How about adding an upper limit for the mempool size, configurable at
>> boot time?
>>
>> I don't think an OVS user would know how to calculate the number of mbufs.
>> Even if they did, setting it would amount to them accepting that they may
>> run out of mbufs on a port.
> 
> I think the shared memory model as it was before also risks running out of 
> mbufs, as per the original bug where it was killing the switch.
> 
> The code below helps with the debug side for users. Ignoring for the moment 
> whether they have the ability to calculate the required memory (I agree 
> BTW, I don’t think some users would know how to do this accurately), would 
> the new parameter still be an issue for backwards compatibility?

Yes, it would.

> 
>>
>>> See the code below for an example (not tested):
>>> -------------------------------------------------------------------------------
>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>> index 3602180..ed54e06 100644
>>> --- a/lib/dpdk-stub.c
>>> +++ b/lib/dpdk-stub.c
>>> @@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)
>>>  {
>>>      return false;
>>>  }
>>> +
>>> +uint32_t
>>> +dpdk_mempool_size_limit(void)
>>> +{
>>> +    return UINT32_MAX;
>>> +}
>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>> index 6710d10..219e412 100644
>>> --- a/lib/dpdk.c
>>> +++ b/lib/dpdk.c
>>> @@ -43,6 +43,9 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
>>>  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>>>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>>>
>>> +/* Maximum number of mbufs in a single memory pool. */
>>> +static uint32_t per_port_mempool_size_limit = UINT32_MAX;
>>> +
>>>  static int
>>>  process_vhost_flags(char *flag, const char *default_val, int size,
>>>                      const struct smap *ovs_other_config,
>>> @@ -313,6 +316,7 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>      int err = 0;
>>>      cpu_set_t cpuset;
>>>      char *sock_dir_subcomponent;
>>> +    uint64_t size_u64;
>>>
>>>      log_stream = fopencookie(NULL, "w+", dpdk_log_func);
>>>      if (log_stream == NULL) {
>>> @@ -351,6 +355,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>>                 vhost_iommu_enabled ? "enabled" : "disabled");
>>>
>>> +    size_u64 = smap_get_ullong(ovs_other_config, "per-port-mempool-size-limit",
>>> +                               UINT32_MAX);
>>> +    if (size_u64 < UINT32_MAX) {
>>> +        per_port_mempool_size_limit = size_u64;
>>> +        VLOG_INFO("Mempool size for a single port limited to %"PRIu32" mbufs.",
>>> +                  per_port_mempool_size_limit);
>>> +    }
>>> +
>>>      argv = grow_argv(&argv, 0, 1);
>>>      argc = 1;
>>>      argv[0] = xstrdup(ovs_get_program_name());
>>> @@ -494,6 +506,12 @@ dpdk_vhost_iommu_enabled(void)
>>>      return vhost_iommu_enabled;
>>>  }
>>>
>>> +uint32_t
>>> +dpdk_mempool_size_limit(void)
>>> +{
>>> +    return per_port_mempool_size_limit;
>>> +}
>>> +
>>>  void
>>>  dpdk_set_lcore_id(unsigned cpu)
>>>  {
>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>> index dc58d96..4d78895 100644
>>> --- a/lib/dpdk.h
>>> +++ b/lib/dpdk.h
>>> @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config);
>>>  void dpdk_set_lcore_id(unsigned cpu);
>>>  const char *dpdk_get_vhost_sock_dir(void);
>>>  bool dpdk_vhost_iommu_enabled(void);
>>> +uint32_t dpdk_mempool_size_limit(void);
>>>
>>>  #endif /* dpdk.h */
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index e834c7e..b2f1072 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -518,7 +518,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>      const char *netdev_name = netdev_get_name(&dev->up);
>>>      int socket_id = dev->requested_socket_id;
>>> -    uint32_t n_mbufs;
>>> +    uint32_t n_mbufs, max_mbufs;
>>>      uint32_t hash = hash_string(netdev_name, 0);
>>>      struct rte_mempool *mp = NULL;
>>>
>>> @@ -534,6 +534,13 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>                + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>>>                + MIN_NB_MBUF;
>>>
>>> +    max_mbufs = dpdk_mempool_size_limit();
>>> +    if (max_mbufs < n_mbufs) {
>>> +        VLOG_WARN("Mempool size for port %s limited to %"PRIu32
>>> +                  " mbufs. Original estimation was %"PRIu32" mbufs.",
>>> +                  netdev_name, max_mbufs, n_mbufs);
>>> +    }
>>> +
>>>      ovs_mutex_lock(&dpdk_mp_mutex);
>>>      do {
>>>          /* Full DPDK memory pool name must be unique and cannot be
>>> -------------------------------------------------------------------------------
>>>
>>> I think that the above approach is a good enough workaround because only
>>> the user knows how much memory he/she is able to allocate for OVS and
>>> how many ports are expected.
>>> I'd like the limit to be in megabytes, but I didn't find a good way to
>>> estimate the total size of a mempool.
>>>
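If we did eventually want the knob in megabytes, one rough (untested) way
to translate a byte budget into an mbuf count would be to ask DPDK for the
per-object footprint and divide. The element size below is only an
assumption about what dpdk_mp_create() would pass for a 1500B MTU, and the
ring/per-chunk overheads are ignored, so it slightly over-estimates what
fits.

/* Untested sketch: translate a per-port byte budget into an mbuf count. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#include <rte_config.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static uint32_t
mbufs_for_budget(uint64_t budget_bytes, uint32_t elt_size)
{
    struct rte_mempool_objsz objsz;
    /* Per-object footprint = mempool header + element + trailer. */
    uint32_t obj_size = rte_mempool_calc_obj_size(elt_size, 0, &objsz);

    return budget_bytes / obj_size;
}

int
main(void)
{
    /* Assumed element size: mbuf struct + headroom + 2KB of data room. */
    uint32_t elt_size = sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM + 2048;
    uint64_t budget = 64ULL * 1024 * 1024;    /* e.g. 64 MB for one port */

    printf("~%"PRIu32" mbufs fit in %"PRIu64" MiB\n",
           mbufs_for_budget(budget, elt_size), budget / (1024 * 1024));
    return 0;
}

The same calculation could be done the other way around inside
dpdk_mp_create() to turn a byte limit into the max_mbufs cap.
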
>>>> (iv) Introduce a feature to allow users to configure mempools as shared
>> or on a per port basis: this would be the best of both worlds, but given
>> the proximity to the 2.9 freeze I don't think it's feasible by the end of
>> January.
>>>
>>> This may well solve some issues, but yes, it's definitely not for 2.9.
>>> A good implementation and extensive testing will be required.
>>
>> +1. There may be other solutions too; Pradeep had a different
>> suggestion, but it's not for now.
> 
> OK, we can agree to remove this option for the moment, at least within the 
> period of the OVS code freeze. It will definitely be looked at for 2.10, 
> however.
> 
> Thanks
> Ian
>>
>> thanks,
>> Kevin.
>>
>>>
>>>>
>>>> I'd appreciate people's input on this ASAP so we can reach a consensus
>> on the next steps.
>>>>
>>>> Thanks
>>>> Ian
>>
> 
