On 01/26/2018 05:27 PM, Stokes, Ian wrote:
>> -----Original Message-----
>> From: Kevin Traynor [mailto:ktray...@redhat.com]
>> Sent: Friday, January 26, 2018 3:48 PM
>> To: Ilya Maximets <i.maxim...@samsung.com>; Stokes, Ian
>> <ian.sto...@intel.com>; ovs-discuss@openvswitch.org
>> Cc: Flavio Leitner <f...@redhat.com>; Loftus, Ciara
>> <ciara.lof...@intel.com>; Kavanagh, Mark B <mark.b.kavan...@intel.com>;
>> Jan Scheurich (jan.scheur...@ericsson.com) <jan.scheur...@ericsson.com>;
>> Ben Pfaff (b...@ovn.org) <b...@ovn.org>; acon...@redhat.com; Venkatesan
>> Pradeep <venkatesan.prad...@ericsson.com>
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
>>>> Hi All,
>>>>
>>>> Recently an issue was raised regarding the move from the single
>>>> shared mempool model that was in place up to OVS 2.8 to the mempool
>>>> per port model introduced in 2.9.
>>>>
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
>>>>
>>>> The per port mempool model was introduced on September 5th with
>>>> commit d555d9bded to allow fine-grained control of memory usage on a
>>>> per port basis.
>>>>
>>>> In the 'common/shared mempool' model, ports sharing a similar MTU
>>>> would all share the same buffer mempool (e.g. the most common example
>>>> of this being that all ports are by default created with a 1500B MTU,
>>>> and as such share the same mbuf mempool).
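
To recap what that sharing meant mechanically: a port asking for a given
NUMA socket and MTU class either took a reference on an existing pool or
created a new one. A rough standalone sketch - the names and structure
are mine for illustration, not the actual pre-2.9 netdev-dpdk code:

---------------------------------------------------------------------------
/* Illustrative sketch of the pre-2.9 shared-mempool lookup: ports on the
 * same NUMA socket with the same MTU class share one pool, tracked by a
 * refcount.  Not the tree's code. */
#include <stddef.h>

struct shared_mp {
    void *mp;          /* Would be a struct rte_mempool * in real code. */
    int socket_id;     /* NUMA socket the pool lives on. */
    int mtu;           /* MTU class this pool serves. */
    int refcount;      /* Number of ports using this pool. */
};

#define MAX_POOLS 64
static struct shared_mp pools[MAX_POOLS];
static size_t n_pools;

static struct shared_mp *
shared_mp_get(int socket_id, int mtu)
{
    for (size_t i = 0; i < n_pools; i++) {
        if (pools[i].socket_id == socket_id && pools[i].mtu == mtu) {
            pools[i].refcount++;   /* Existing pool: just take a ref. */
            return &pools[i];
        }
    }
    if (n_pools == MAX_POOLS) {
        return NULL;               /* Table full in this toy version. */
    }
    /* No match: create a new pool class; real code would call
     * rte_pktmbuf_pool_create() here and fail if memory is exhausted. */
    pools[n_pools] = (struct shared_mp) { .socket_id = socket_id,
                                          .mtu = mtu, .refcount = 1 };
    return &pools[n_pools++];
}
---------------------------------------------------------------------------

That sharing is what made the 2016 crash below possible: every port of
the same MTU class draws mbufs from the one pool, so a single port with
many rxqs could leave nothing for the others.
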
>>>>
>>>> This approach had some drawbacks however. For example, with the
>>>> shared memory pool model a user could exhaust the shared memory pool
>>>> (for instance by requesting a large number of RXQs for a port); this
>>>> would cause the vSwitch to crash, as any remaining ports would not
>>>> have the required memory to function. This bug was discovered and
>>>> reported to the community in late 2016:
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>>>
>>>> The per port mempool patch aimed to avoid such issues by allocating a
>>>> separate buffer mempool to each port.
>>>>
>>>> An issue has been flagged on ovs-discuss whereby memory dimensions
>>>> provided for a given number of ports on OvS 2.6-2.8 may be
>>>> insufficient to support the same number of ports in OvS 2.9, on
>>>> account of the per-port mempool model, unless extra memory is
>>>> re-dimensioned. The effect of this is use case dependent (number of
>>>> ports, RXQs, MTU settings, number of PMDs, etc.). The previous
>>>> 'common-pool' model was rudimentary in estimating the mempool size
>>>> and was marked as something to be improved upon. The memory
>>>> allocation calculation for the per port model was modified to take
>>>> the possible configuration factors mentioned into account.
>>>>
>>>> It's unfortunate that this came to light so close to the release code
>>>> freeze - but better late than never, as it is a valid problem to be
>>>> resolved.
>>>>
>>>> I wanted to highlight some options to the community, as I don't think
>>>> the next steps should be taken in isolation due to the impact this
>>>> feature has.
>>>>
>>>> There are a number of possibilities for the 2.9 release.
>>>>
>>>> (i) Revert the mempool per port patches and return to the shared
>>>> mempool model. There are a number of features and refactorings in
>>>> place on top of the change, so this will not be a simple revert. I'm
>>>> currently investigating what exactly is involved.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own
>>> issues and hide the real memory usage by each port. Also, reverting
>>> seems almost impossible; we'll have to re-implement it from scratch.
>
> I would agree, reverting isn't as straightforward an option due to the
> number of commits that were introduced in relation to the per port
> mempool feature over time (listed below for completeness).
>
> netdev-dpdk: Create separate memory pool for each port: d555d9b
> netdev-dpdk: fix management of pre-existing mempools: b6b26021d
> netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> netdev-dpdk: skip init for existing mempools: 837c176
> netdev-dpdk: manage failure in mempool name creation: 65056fd
> netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> netdev-dpdk: Fix mempool creation with large MTU: af5b0da
> netdev-dpdk: Add debug appctl to get mempool information: be48173
>
> Although a lot of these are fixes/formatting, we would have to introduce
> a new series and re-introduce the shared model by removing the per port
> specifics. This would also require extensive testing.
>
> Functionality such as the mempool debug tool would be impacted as well
> (I'm not sure how valuable it would be for 1 shared mempool).
>
> I can look at putting an RFC for this together so as to keep the option
> on the table if it's preferred within the 2.9 time window.
>
>>>> (ii) Leave the per port mempool implementation as is, and flag to
>>>> users that memory requirements have increased. Extra memory may have
>>>> to be provided on a per use case basis.
>>>
>>> That's a good point. I prefer this option if possible.
>>
>> With 4 pmds and 4 rxqs per port, a user would only be able to add 7
>> ports per socket now; 9 ports with 1 rxq. Of course a user may have
>> some headroom in the memory they have allocated, but that can't be
>> assumed. It will surely mean some users' setups will not work after
>> upgrade - and then they will have to debug why.
>>
>> If we just reduce MIN_NB_MBUF to 8192, it would make it 10 ports/4
>> rxqs, or 13 ports/1 rxq, without additional memory. It still gives
>> 24K/18K buffers per port.
>>
>> What do you think?
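
To put concrete numbers on that suggestion: with 2048-descriptor rx/tx
queues, NETDEV_MAX_BURST of 32, and one txq per PMD, the per-port
estimate works out as in the standalone sketch below. It mirrors the
shape of the dpdk_mp_create() calculation (visible in the diff quoted
further down, where the burst term is additionally capped at
RTE_MAX_LCORE rxqs), but it is illustrative, not the tree's code:

---------------------------------------------------------------------------
/* Standalone back-of-envelope estimate of the 2.9 per-port mempool
 * size.  Illustrative only. */
#include <stdio.h>

#define NETDEV_MAX_BURST 32     /* Mbufs reserved per rxq for a burst. */

static unsigned int
estimate_n_mbufs(unsigned int n_rxq, unsigned int n_txq,
                 unsigned int q_size, unsigned int min_nb_mbuf)
{
    return n_rxq * q_size               /* rx descriptors */
           + n_txq * q_size             /* tx descriptors */
           + n_rxq * NETDEV_MAX_BURST   /* per-rxq burst headroom */
           + min_nb_mbuf;               /* fixed extra buffers */
}

int
main(void)
{
    /* 4 PMDs (so assume 4 txqs), 2048-entry queues, and MIN_NB_MBUF
     * reduced to 8192 as suggested above: */
    printf("4 rxqs: %u mbufs\n", estimate_n_mbufs(4, 4, 2048, 8192));
    printf("1 rxq:  %u mbufs\n", estimate_n_mbufs(1, 4, 2048, 8192));
    return 0;
}
---------------------------------------------------------------------------

That prints 24704 and 18464 - the 24K/18K figures above.
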
>
> This is similar to the RFC you submitted, I think.

The RFC tried to make the calculation of the number of buffers required
more accurate, reduce the number of additional buffers by 75%, and allow
it to go to zero. The suggestion above just reduces the additional
buffers by 50%. Ilya has shown a common case where it would not fix the
issue or be a feasible workaround.

> I take it we still hit the issues flagged by Pradeep, but it increases
> the number of ports at least.
>
> The 24K/18K is in and around the region Ilya had flagged below, so it
> might be ok. It would be great if we had common config/deployment info
> to help confirm this.
>
>>
>>>
>>>> (iii) Reduce the amount of memory allocated per mempool per port. An
>>>> RFC to this effect was submitted by Kevin, but on follow-up the
>>>> feeling is that it does not resolve the issue adequately.
>>>
>>
>> Right, we discussed it in the other thread. It fixes some of the missed
>> factors for the worst case, but Pradeep pointed out there are more. To
>> account for them all would be difficult and mean a lot of extra mbufs
>> per port.
>>
>>> IMHO, we should not reduce the mempool size by hardcoding a small
>>> value. Different installations have their own requirements because of
>>> different numbers of HW NICs and CPU cores. I believe that 20-30K is a
>>> normal size per-port for now.
>>>
>>
>> We would not be too far off that range with the suggested change above.
>>
>>> How about adding an upper limit for the mempool size, configurable at
>>> boot time?
>>
>> I don't think an OVS user would know how to calculate the number of
>> mbufs. Even if they did, setting it would constitute them okaying that
>> they may run out of mbufs on a port.
>
> I think the shared memory model as it was before risks running out of
> mbufs too, as per the original bug where it was killing the switch.
>
> Below helps with the debug side for users. Ignoring whether they have
> the ability to calculate the required memory for the moment (I agree
> BTW, I don't think some users would know how to accurately do this),
> would the new parameter still be an issue for backwards compatibility?
> Yes, it would.
>
>>
>>> See below code for example (not tested):
>>> ---------------------------------------------------------------------------
>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>> index 3602180..ed54e06 100644
>>> --- a/lib/dpdk-stub.c
>>> +++ b/lib/dpdk-stub.c
>>> @@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)
>>>  {
>>>      return false;
>>>  }
>>> +
>>> +uint32_t
>>> +dpdk_mempool_size_limit(void)
>>> +{
>>> +    return UINT32_MAX;
>>> +}
>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>> index 6710d10..219e412 100644
>>> --- a/lib/dpdk.c
>>> +++ b/lib/dpdk.c
>>> @@ -43,6 +43,9 @@ static FILE *log_stream = NULL;  /* Stream for DPDK log redirection */
>>>  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>>>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>>>
>>> +/* Maximum number of mbufs in a single memory pool. */
>>> +static uint32_t per_port_mempool_size_limit = UINT32_MAX;
>>> +
>>>  static int
>>>  process_vhost_flags(char *flag, const char *default_val, int size,
>>>                      const struct smap *ovs_other_config,
>>> @@ -313,6 +316,7 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>      int err = 0;
>>>      cpu_set_t cpuset;
>>>      char *sock_dir_subcomponent;
>>> +    uint64_t size_u64;
>>>
>>>      log_stream = fopencookie(NULL, "w+", dpdk_log_func);
>>>      if (log_stream == NULL) {
>>> @@ -351,6 +355,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>>                vhost_iommu_enabled ? "enabled" : "disabled");
>>>
>>> +    size_u64 = smap_get_ullong(ovs_other_config, "per-port-mempool-size-limit",
>>> +                               UINT32_MAX);
>>> +    if (size_u64 < UINT32_MAX) {
>>> +        per_port_mempool_size_limit = size_u64;
>>> +        VLOG_INFO("Mempool size for a single port limited to %"PRIu32" mbufs.",
>>> +                  per_port_mempool_size_limit);
>>> +    }
>>> +
>>>      argv = grow_argv(&argv, 0, 1);
>>>      argc = 1;
>>>      argv[0] = xstrdup(ovs_get_program_name());
>>> @@ -494,6 +506,12 @@ dpdk_vhost_iommu_enabled(void)
>>>      return vhost_iommu_enabled;
>>>  }
>>>
>>> +uint32_t
>>> +dpdk_mempool_size_limit(void)
>>> +{
>>> +    return per_port_mempool_size_limit;
>>> +}
>>> +
>>>  void
>>>  dpdk_set_lcore_id(unsigned cpu)
>>>  {
>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>> index dc58d96..4d78895 100644
>>> --- a/lib/dpdk.h
>>> +++ b/lib/dpdk.h
>>> @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config);
>>>  void dpdk_set_lcore_id(unsigned cpu);
>>>  const char *dpdk_get_vhost_sock_dir(void);
>>>  bool dpdk_vhost_iommu_enabled(void);
>>> +uint32_t dpdk_mempool_size_limit(void);
>>>
>>>  #endif /* dpdk.h */
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index e834c7e..b2f1072 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -518,7 +518,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>      const char *netdev_name = netdev_get_name(&dev->up);
>>>      int socket_id = dev->requested_socket_id;
>>> -    uint32_t n_mbufs;
>>> +    uint32_t n_mbufs, max_mbufs;
>>>      uint32_t hash = hash_string(netdev_name, 0);
>>>      struct rte_mempool *mp = NULL;
>>>
>>> @@ -534,6 +534,13 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>                + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>>>                + MIN_NB_MBUF;
>>>
>>> +    max_mbufs = dpdk_mempool_size_limit();
>>> +    if (max_mbufs < n_mbufs) {
>>> +        VLOG_WARN("Mempool size for port %s limited to %"PRIu32
>>> +                  " mbufs. Original estimation was %"PRIu32" mbufs.",
>>> +                  netdev_name, max_mbufs, n_mbufs);
>>> +    }
>>> +
>>>      ovs_mutex_lock(&dpdk_mp_mutex);
>>>      do {
>>>          /* Full DPDK memory pool name must be unique and cannot be
>>> ---------------------------------------------------------------------------
>>>
>>> I think that the above approach is a good enough workaround, because
>>> only the user knows how much memory he/she is able to allocate for OVS
>>> and how many ports are expected.
>>> I'd like the limit to be in megabytes, but I didn't find a good way to
>>> estimate the total size of a mempool.
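
On the megabytes point: even without an exact per-object size from DPDK,
a ballpark conversion from a byte budget to an mbuf limit is possible. A
standalone sketch, assuming ~2880B per mbuf for a 1500B-MTU pool (2048B
data room + 128B headroom + mbuf and mempool metadata); an exact figure
would have to come from DPDK itself, e.g. via something like
rte_mempool_calc_obj_size():

---------------------------------------------------------------------------
/* Illustrative only: convert a megabyte budget into an mbuf-count limit.
 * The per-mbuf cost is a rough guess, not exact DPDK accounting. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t
mbufs_for_budget(uint32_t budget_mb, uint32_t approx_mbuf_bytes)
{
    uint64_t budget = (uint64_t) budget_mb * 1024 * 1024;
    return (uint32_t) (budget / approx_mbuf_bytes);
}

int
main(void)
{
    /* A 64MB per-port budget comes out around 23K mbufs, inside the
     * 20-30K per-port range mentioned above. */
    printf("%" PRIu32 " mbufs\n", mbufs_for_budget(64, 2880));
    return 0;
}
---------------------------------------------------------------------------

Either way, the knob in the sketch would presumably be a one-off
other_config setting at startup, along the lines of:

  ovs-vsctl set Open_vSwitch . other_config:per-port-mempool-size-limit=24000

(key name taken from the sketch above - it's a proposal, not an existing
option).
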
>>>
>>>> (iv) Introduce a feature to allow users to configure the mempool as
>>>> shared or on a per port basis: This would be the best of both worlds,
>>>> but given the proximity to the 2.9 freeze I don't think it's feasible
>>>> by the end of January.
>>>
>>> This, probably, may solve some issues, but yes, it's definitely not
>>> for 2.9. Good implementation and extensive testing will be required.
>>
>> +1. There may be other solutions too, Pradeep had a different
>> suggestion, but it's not for now.
>
> OK, we can agree to remove this option for the moment, at least within
> the period for the OVS code freeze. It will definitely be looked at for
> 2.10, however.
>
> Thanks
> Ian
>>
>> thanks,
>> Kevin.
>>
>>>
>>>>
>>>> I'd appreciate people's input on this ASAP so we can reach a
>>>> consensus on the next steps.
>>>>
>>>> Thanks
>>>> Ian
>>
>
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss