> On Mar 30, 2023, at 11:57 AM, Jon Kohler <[email protected]> wrote:
>
>
>
>> On Mar 28, 2023, at 1:32 PM, Mike Pattrick <[email protected]> wrote:
>>
>> On Mon, Mar 27, 2023 at 7:25 AM Eelco Chaudron <[email protected]> wrote:
>>>
>>> Make the read of the current seq->value atomic, i.e., not needing to
>>> acquire the global mutex when reading it. On 64-bit systems, this
>>> incurs no overhead, and it will avoid the mutex and potentially
>>> a system call.
>>>
>>> For incrementing the value followed by waking up the threads, we are
>>> still taking the mutex, so the current behavior is not changing. The
>>> seq_read() behavior is already defined as, "Returns seq's current
>>> sequence number (which could change immediately)". So the change
>>> should not impact the current behavior.
>>>
>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>> ---
>>> lib/ovs-rcu.c | 2 +-
>>> lib/seq.c | 37 ++++++++++++++++---------------------
>>> lib/seq.h | 1 -
>>> 3 files changed, 17 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
>>> index 946aa04d1..9e07d9bab 100644
>>> --- a/lib/ovs-rcu.c
>>> +++ b/lib/ovs-rcu.c
>>> @@ -170,7 +170,7 @@ ovsrcu_try_quiesce(void)
>>>      ovs_assert(!single_threaded());
>>>      perthread = ovsrcu_perthread_get();
>>>      if (!seq_try_lock()) {
>>> -        perthread->seqno = seq_read_protected(global_seqno);
>>> +        perthread->seqno = seq_read(global_seqno);
>>>          if (perthread->cbset) {
>>>              ovsrcu_flush_cbset__(perthread, true);
>>>          }
>>> diff --git a/lib/seq.c b/lib/seq.c
>>> index 99e5bf8bd..22646f3f8 100644
>>> --- a/lib/seq.c
>>> +++ b/lib/seq.c
>>> @@ -32,7 +32,7 @@ COVERAGE_DEFINE(seq_change);
>>>
>>>  /* A sequence number object. */
>>>  struct seq {
>>> -    uint64_t value OVS_GUARDED;
>>> +    atomic_uint64_t value;
>>>      struct hmap waiters OVS_GUARDED; /* Contains 'struct seq_waiter's. */
>>>  };
>>>
>>> @@ -72,6 +72,7 @@ static void seq_wake_waiters(struct seq *) OVS_REQUIRES(seq_mutex);
>>>  struct seq * OVS_EXCLUDED(seq_mutex)
>>>  seq_create(void)
>>>  {
>>> +    uint64_t seq_value;
>>>      struct seq *seq;
>>>
>>>      seq_init();
>>> @@ -81,7 +82,8 @@ seq_create(void)
>>>      COVERAGE_INC(seq_change);
>>>
>>>      ovs_mutex_lock(&seq_mutex);
>>
>> I think it's also possible to get rid of the mutex here. Operations
>> like modifying a bridge or interface can result in calling seq_create()
>> dozens of times, which could lead to contention in cases where a lot
>> of interfaces are constantly added or removed. I'm mainly thinking
>> about Kubernetes here.
>>
>> Other than that, I found this patch reduces locking on seq_mutex
>> pretty significantly with a userspace workload:
>>
>> Before: 468727.2 lock operations/sec
>> After:  229265.8 lock operations/sec
>>
>> The rest of this patch looks good to me, but what do you think about:
>>
>> diff --git a/lib/seq.c b/lib/seq.c
>> index 22646f3f8..09fdccce5 100644
>> --- a/lib/seq.c
>> +++ b/lib/seq.c
>> @@ -57,7 +57,7 @@ struct seq_thread {
>>
>>  static struct ovs_mutex seq_mutex = OVS_MUTEX_INITIALIZER;
>>
>> -static uint64_t seq_next OVS_GUARDED_BY(seq_mutex) = 1;
>> +static atomic_uint64_t seq_next = 1;
>>
>>  static pthread_key_t seq_thread_key;
>>
>> @@ -81,11 +81,9 @@ seq_create(void)
>>
>>      COVERAGE_INC(seq_change);
>>
>> -    ovs_mutex_lock(&seq_mutex);
>> -    seq_value = seq_next++;
>> +    seq_value = atomic_count_inc64(&seq_next);
>>      atomic_store_relaxed(&seq->value, seq_value);
>>      hmap_init(&seq->waiters);
>> -    ovs_mutex_unlock(&seq_mutex);
>>
>>      return seq;
>>  }
>> @@ -128,7 +126,7 @@ void
>>  seq_change_protected(struct seq *seq)
>>      OVS_REQUIRES(seq_mutex)
>>  {
>> -    uint64_t seq_value = seq_next++;
>> +    uint64_t seq_value = atomic_count_inc64(&seq_next);
>>
>>      COVERAGE_INC(seq_change);
>
> Mike - Seems like a good idea at first blush?
>
> Eelco,
> This patch has perfect timing, as we're currently chasing down some
> overhead on large systems, e.g. quad-socket machines or newer chips
> with many dozens or hundreds of processors per system. On these
> systems, we see a lot of overhead from the revalidator threads coming
> online and constantly competing for the mutex in seq_read(),
> seq_woke(), and seq_change(). This leads to ovs-vswitchd being on-CPU
> constantly at somewhere between 20-80% of one core, per top. Most of
> that time is spent contending for the lock and then futex'ing in and
> out of sleep.
>
> Along these same lines, would it be possible to get rid of this mutex
> in seq_change() too? Perhaps we could have an atomic value that
> indicates whether there is a waiter: if so, take the mutex to deal
> with the wakeup, but if not, skip the mutex entirely.
>
> That's just trying to optimize further; even with this patch there are
> gains from easing contention on the mutex from all of the read calls,
> so great work!
>
> While I'm on the thought: the next point of contention in the
> revalidator case that I'm tracking is under nl_dump_next(), where we
> contend on dump->mutex just to do a short-lived update to
> dump->status. Perhaps that could be a spinlock? I'll start a separate
> thread after I try that out.
>
> Jon
Your change inspired me to dig a bit deeper. The revalidator_sweep() call
that the revalidator threads make ends up calling ovsrcu_quiesce() in a
loop, and we see quite a bit of contention there, with a lot of futex
sleep/wake activity.

With your change we get rid of the potential futex under seq_read(), but
we still take the mutex in seq_change() as well as, potentially, in
ovsrcu_flush_cbset(). So your change takes the potential locking from 3x
to 2x per ovsrcu_quiesce(), and the following change would take it down
to 1x.

Thoughts?

Happy to send this out as a separate change on top of yours if you'd like.
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -155,10 +155,12 @@ ovsrcu_quiesce(void)
 
     perthread = ovsrcu_perthread_get();
     perthread->seqno = seq_read(global_seqno);
+    seq_lock();
     if (perthread->cbset) {
-        ovsrcu_flush_cbset(perthread);
+        ovsrcu_flush_cbset__(perthread, true);
     }
-    seq_change(global_seqno);
+    seq_change_protected(global_seqno);
+    seq_unlock();
 
     ovsrcu_quiesced();
 }
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev