Re: [ovs-dev] [PATCH v5 1/2] handlers: Create additional handler threads when using CPU isolation

Aaron Conole Wed, 06 Jul 2022 07:06:21 -0700

Michael Santana <[email protected]> writes:

> On 7/4/22 09:46, Ilya Maximets wrote:
>> Hi, Michael.  Thanks for the new version!
> Hi Ilya, thanks for the response
>> On 6/6/22 20:59, Michael Santana wrote:
>>> Additional threads are required to service upcalls when we have CPU
>>> isolation (in per-cpu dispatch mode). The reason additional threads
>>> are required is because it creates a more fair distribution.
>> I think, the description above lacks the information on why we need
>> to create additional threads, and why we choose the number to be a
>> prime number.  Yes, you said that it creates a more fair distribution,
>> but we probably need to describe that more with an example.
>> 
>>> With more
>>> threads we decrease the load of each thread as more threads would
>>> decrease the number of cores each threads is assigned.
>> While that is true, it's not obvious why increasing the number of
>> threads beyond the number of available cores is good for us.
>> The key factor is more fair load distribution among threads, so all
>> cores are utilized, but that is not highlighted.
> Yes agreed.
>
> We want additional threads because we want to minimize the number of
> cores assigned to individual threads. 75 handler threads servicing 100 
> cores is more optimal than 50 threads servicing 100 cores. This is
> because on the 75 handler case, each thread would have to service on 
> average 1.33 cores where as the 50 handler case each thread would have
> to service 2 cores. Less cores assigned to individual threads less to 
> less work that thread has to do. This example ignores the number of
> actual cores available to use for OVS userspace.
>
> On the flip side, we do not wish to create too many threads as we fear
> we end up in a case where we have too many threads and not enough
> active cores OVS can use
>
> Which brings me to my last point. The prime schema is arbitrary (or
> good enough for now). We really just want more threads and there is no
> reason why prime schema is better than the next (at least not without
> knowing how many threads we can get away with adding before kernel
> overhead hurts performance). I think you had mentioned it several
> times but I think we need to do testing to figure out exactly how many
> threads we can add before kernel overhead hurts performance. I
> speculate the prime schema is on the low end and I think that we can
> add more threads without hurting performance than the prime schema
> will allow. We can look at how the upcalls/s rate changes based on the
> number of handlers and the number of active cores. The prime schema is
> an easy solution but it is a little driving blindly without knowing
> more stats.


Let's also not forget that a user tells OVS to be bound only to certain
cores, and will be slightly confused when there are more handler threads
than the number of isolated CPUs.  This will be confusing no matter
what schema is chosen - so it will be important to have some good
documentation, and perhaps an INFO log that tells something about this.

> On the other hand, stats might be different from one system to
> another. What might be good on my test system doesn't necessarily
> translate to another system. LMK what you think about this, if this is
> worth the effort
>
> The prime schema is sufficient as is. I just think that we can improve
> it a little bit. But we dont have to go down that road if need be
>
>
>
>
>> 
>>> Adding as many
>>> threads are there are cores could have a performance hit when the number
>>> of active cores (which all threads have to share) is very low. For this
>>> reason we avoid creating as many threads as there are cores and instead
>>> meet somewhere in the middle.
>>>
>>> The formula used to calculate the number of handler threads to create
>>> is as follows:
>>>
>>> handlers_n = min(next_prime(active_cores+1), total_cores)
>>>
>>> Where next_prime is a function that returns the next numeric prime
>>> number that is strictly greater than active_cores
>>>
>>> Assume default behavior when total_cores <= 2, that is do not create
>>> additional threads when we have less than 2 total cores on the system
>>>
>>> Fixes: b1e517bd2f81 ("dpif-netlink: Introduce per-cpu upcall dispatch.")
>>> Signed-off-by: Michael Santana <[email protected]>
>>> ---
>>>   lib/dpif-netlink.c | 86 ++++++++++++++++++++++++++++++++++++++++++++--
>>>   lib/ovs-thread.c   | 16 +++++++++
>>>   lib/ovs-thread.h   |  1 +
>>>   3 files changed, 101 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>>> index 06e1e8ca0..e948a829b 100644
>>> --- a/lib/dpif-netlink.c
>>> +++ b/lib/dpif-netlink.c
>>> @@ -31,6 +31,7 @@
>>>   #include <sys/epoll.h>
>>>   #include <sys/stat.h>
>>>   #include <unistd.h>
>>> +#include <math.h>
>>>     #include "bitmap.h"
>>>   #include "dpif-netlink-rtnl.h"
>>> @@ -2506,6 +2507,87 @@ dpif_netlink_handler_uninit(struct dpif_handler 
>>> *handler)
>>>   }
>>>   #endif
>>>   +/*
>>> + * Returns 1 if num is a prime number,
>>> + * Otherwise return 0
>>> + */
>>> +static uint32_t
>> This should just return bool.
> ACK, thanks
>> 
>>> +is_prime(uint32_t num)
>>> +{
>>> +    if ((num < 2) || ((num % 2) == 0)) {
>> There is no need for so many parenthesis.
>> And it might make sense to just write 3 if blocks
>> here to make the code simpler, i.e. check for
>> < 2, == 2 and % 2.
> ACK, thanks
>> 
>>> +        return num == 2;
>>> +    }
>>> +
>>> +    uint32_t i;
>>> +    uint32_t limit = sqrt((float) num);
>>> +    for (i = 3; i <= limit; i += 2) {
>> for (uint64_t i = 3; i * i <= num; i += 2) {
>> There is no real need for sqrt.  I don't think we should be
>> concerned
>> about 'i * i' performance here.
> ACK, I think I got carried way with optimization
>> 
>>> +        if (num % i == 0) {
>>> +            return 0;
>>> +        }
>>> +    }
>>> +
>>> +    return 1;
>>> +}
>>> +
>>> +/*
>>> + * Returns num if num is a prime number. Otherwise returns the next
>>> + * prime greater than num. Search is limited by UINT32_MAX.
>> 2 spaces between sentences, please.
> What do you mean? you mean like this?
> s/number. Otherwise/number.  Otherwise/
> s/num. Search/num.  Search/
>
>> 
>>> + *
>>> + * Returns 0 if no prime has been found between num and UINT32_MAX
>>> + */
>>> +static uint32_t
>>> +next_prime(uint32_t num)
>>> +{
>>> +    if (num < 2) {
>>> +        return 2;
>>> +    }
>>> +    if (num == 2) {
>>> +        return 3;
>> This contradicts the description of the function.  So, this check
>> should probably be done in dpif_netlink_calculate_handlers_num.
> ACK, Yes, you are right
>> 
>>> +    }
>>> +    if (num % 2 == 0) {
>>> +        num++;
>>> +    }
>>> +
>>> +    uint32_t i;
>>> +    for (i = num; i < UINT32_MAX; i += 2) {
>> We may just start this loop from 'num' and increment by 1 instead
>> of 2.  is_prime will return false for even numbers right away, so
>> there is no real need for +=2 optimization here.  And we can also
>> remove 2 out of 3 'if's at the beginning of this function (except
>> the incorrect one).
> I was trying to make the two functions independent of each other, but
> I guess there is no need
>
> ACK
>> 
>>> +        if (is_prime(i)) {
>>> +            return i;
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +/*
>>> + * Calcuales and returns the number of handler threads needed based
>> s/Calcuales/Calculates/
> ACK, thanks
>> 
>>> + * the following formula:
>>> + *
>>> + * handlers_n = min(next_prime(active_cores+1), total_cores)
>>> + *
>>> + * Where next_prime is a function that returns the next numeric prime
>>> + * number that is strictly greater than active_cores
>>> + */
>>> +static uint32_t
>>> +dpif_netlink_calculate_handlers_num(void)
>> Maybe dpif_netlink_calculate_n_handlers() ?
> ACK, thanks
>> 
>>> +{
>>> +    uint32_t next_prime_num;
>>> +    uint32_t n_handlers = count_cpu_cores();
>>> +    uint32_t total_cores = count_total_cores();
>> Reverse x-mass tree, please.
> ACK, Thanks
>> 
>>> +
>>> +    /*
>>> +     * If we have isolated cores, add additional handler threads to
>>> +     * service inactive cores in the unlikely event that traffic goes
>>> +     * through inactive cores
>> These core are not really inactive, they are just not available to
>> OVS process.  And it can be a very likely event that traffic will
>> go through these cores, in some typical configurations.
> Ageed, I think this comment carried over from before I understood we
> have no way of predicting the likelihood of traffic on cores. Sorry
> this is taking so long...
>
> ACK
>> 
>>> +     */
>>> +    if (n_handlers < total_cores && total_cores > 2) {
>>> +        next_prime_num = next_prime(n_handlers + 1);
>>> +        n_handlers = next_prime_num >= total_cores ?
>>> +                                            total_cores : next_prime_num;
>> Please, breka the line before the '?' and indent it at the
>> beginning of the right-hand expression, i.e. at the start of
>> 'next_prime_num'.
> What do you mean? I dont follow. You mean like this?
> n_handlers = next_prime_num >=
>              total_cores ? total_cores : next_prime_num;
>
>
>> 
>>> +    }
>>> +
>>> +    return n_handlers;
>>> +}
>>> +
>>>   static int
>>>   dpif_netlink_refresh_handlers_cpu_dispatch(struct dpif_netlink *dpif)
>>>       OVS_REQ_WRLOCK(dpif->upcall_lock)
>>> @@ -2515,7 +2597,7 @@ dpif_netlink_refresh_handlers_cpu_dispatch(struct 
>>> dpif_netlink *dpif)
>>>       uint32_t n_handlers;
>>>       uint32_t *upcall_pids;
>>>   -    n_handlers = count_cpu_cores();
>>> +    n_handlers = dpif_netlink_calculate_handlers_num();
>>>       if (dpif->n_handlers != n_handlers) {
>>>           VLOG_DBG("Dispatch mode(per-cpu): initializing %d handlers",
>>>                      n_handlers);
>>> @@ -2755,7 +2837,7 @@ dpif_netlink_number_handlers_required(struct dpif 
>>> *dpif_, uint32_t *n_handlers)
>>>       struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
>>>         if (dpif_netlink_upcall_per_cpu(dpif)) {
>>> -        *n_handlers = count_cpu_cores();
>>> +        *n_handlers = dpif_netlink_calculate_handlers_num();
>>>           return true;
>>>       }
>>>   diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
>>> index 805cba622..2172b3d3f 100644
>>> --- a/lib/ovs-thread.c
>>> +++ b/lib/ovs-thread.c
>>> @@ -663,6 +663,22 @@ count_cpu_cores(void)
>>>       return n_cores > 0 ? n_cores : 0;
>>>   }
>>>   +/* Returns the total number of cores on the system, or 0 if the
>>> + * number cannot be determined. */
>>> +int
>>> +count_total_cores(void) {
>> The '{' should be on the next line.
> Not sure how I didnt catch this
>
> ACK
>> 
>>> +    long int n_cores;
>>> +
>>> +#ifndef _WIN32
>>> +    n_cores = sysconf(_SC_NPROCESSORS_CONF);
>>> +#else
>>> +    n_cores = 0;
>>> +    errno = ENOTSUP;
>>> +#endif
>>> +
>>> +    return n_cores > 0 ? n_cores : 0;
>>> +}
>>> +
>>>   /* Returns 'true' if current thread is PMD thread. */
>>>   bool
>>>   thread_is_pmd(void)
>>> diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
>>> index 3b444ccdc..aac5e19c9 100644
>>> --- a/lib/ovs-thread.h
>>> +++ b/lib/ovs-thread.h
>>> @@ -522,6 +522,7 @@ bool may_fork(void);
>>>   /* Useful functions related to threading. */
>>>     int count_cpu_cores(void);
>>> +int count_total_cores(void);
>>>   bool thread_is_pmd(void);
>>>     #endif /* ovs-thread.h */
>> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v5 1/2] handlers: Create additional handler threads when using CPU isolation

Reply via email to