Re: [ovs-dev] [PATCH 2/3] classifier: Fix race for prefix tree configuration.

Eelco Chaudron via dev Mon, 26 May 2025 07:35:32 -0700


On 26 May 2025, at 16:09, Ilya Maximets wrote:


> On 5/21/25 9:27 PM, Eelco Chaudron wrote:
>>
>>
>> On 16 May 2025, at 23:25, Ilya Maximets wrote:
>>
>>> The thread fence in the classifier is supposed to ensure that when the
>>> subtable->trie_plen is updated, the actual prefix tree is ready to be
>>> used.  On the write side in trie_init(), the fence is between the
>>> tree configuration and the 'trie_plen' update.  On the reader's side
>>> however, the fence is at the beginning of the classifier_lookup__(),
>>> and both reads of the 'trie_plen' and the accesses to the tree itself
>>> are happening afterwards.  And since both types of the reads are on
>>> the same side of the fence, the fence is kind of pointless and doesn't
>>> guarantee any memory ordering.  So, readers can be accessing partially
>>> initialized prefix trees.
>>
>> Thanks for this patch, and for the excellent description and code comments, 
>> they really made the review a lot easier, especially since I'm not too 
>> familiar with the classification/trie code. Some of the code got me puzzled, 
>> but it was all clear in the end (guess I should have started with Espresso 
>> sooner). I didn’t do any real testing, as the issue is hard to reproduce. 
>> Hopefully, Numan has some time to look into that.
>>
>> One small question, and a nit below. Other than that
>>
>> Acked-by: Eelco Chaudron <echau...@redhat.com>
>>
>>> Another problem with the configuration is that cls->n_tries is updated
>>> without any synchronization as well.  The comment on the fence says
>>> that it also synchronizes for the cls->n_tries, but that doesn't make
>>> a lot of sense.  In practice, cls->n_tries is read multiple times
>>> throughout the classifier_lookup__() and each of these reads may give
>>> a different value if there is a concurrent update, causing the reader
>>> to access trees that are not initialized or in the middle of being
>>> destroyed, leading to OVS crashes while the user updates the flow
>>> table prefixes.
>>>
>>> First thing that needs to be fixed here is to only read cls->n_tries
>>> once to avoid obvious crashes with access to uninitialized trie_ctx[]
>>> entries.
>>>
>>> The second thing is that we need a proper memory synchronization that
>>> will guarantee that our prefix trees are fully initialized when
>>> readers access them.  In the current logic we would need to issue
>>> a thread fence after every read of a subtable->trie_plen value, i.e.,
>>> we'd need a fence per subtable lookup.  This would be very expensive
>>> and wasteful, considering the prefix tree configuration normally
>>> happens only once somewhere at startup.
>>>
>>> What we can do instead is to convert cls->n_tries into atomic and use
>>> it as a synchronization point:
>>>
>>>   Writer (classifier_set_prefix_fields):
>>>
>>>   1. Before making any changes, set cls->n_tries to zero.  Relaxed
>>>      memory order can be used here, because we'll have a full memory
>>>      barrier at the next step.
>>>   2. ovsrcu_synchronize() to wait for all threads to stop using tries.
>>>   3. Update tries while nobody is using them.
>>>   4. Set cls->n_tries to a new value with memory_order_release.
>>>
>>>   Reader (classifier_lookup):
>>>
>>>   1. Read the cls->n_tries with the memory_order_acquire.
>>>   2. Use that once read value throughout.
>>>
>>> RCU in this scenario will ensure that every thread no longer uses the
>>> prefix trees when we're about to change them.  The acquire-release
>>> semantics on the cls->n_tries just saves us from calling the
>>> ovsrcu_synchronize() the second time once we're done with the whole
>>> reconfiguration.  We're just updating the number and making all the
>>> previous changes visible on CPUs that acquire it.
>>>
>>> Alternative solution might be to go full RCU and make the array of
>>> trees itself RCU-protected.  This way we would not need to do any
>>> extra RCU synchronization or managing the memory ordering.  However,
>>> that would mean having multiple layers of RCU with trees and rules
>>> in them potentially surviving multiple grace periods, which I would
>>> like to avoid, if possible.
>>>
>>> Previous code was also trying to be smart and not disable prefix tree
>>> lookups for prefixes that are not changing.  We're sacrificing this
>>> functionality in the name of simpler code.  Attempt to make that work
>>> would either require a full conversion to RCU or a per-subtable
>>> synchronization.  Lookups can be done without the prefix match
>>> optimizations for a brief period of time.  This doesn't affect
>>> correctness of the resulted datapath flows.
>>
>> Just out of curiosity, will this take effect immediately during the next
>> revalidator run triggered by the table change? Or only when a full
>> revalidation is required?
>
> Only on the next full revalidation.  We have this code today that doesn't
> actually act on changes:
>
> void
> ofproto_configure_table(struct ofproto *ofproto, int table_id,
>                         const struct ofproto_table_settings *s)
> {
>     ...
>
>     if (classifier_set_prefix_fields(&table->cls,
>                                      s->prefix_fields, s->n_prefix_fields)) {
>         /* XXX: Trigger revalidation. */
>     }
>
>     ...
> }
>
> But that's not a big concern, as this is just an optimization and doesn't
> affect correctness of these flows.

Thanks for the clarification. I think we are fine (as we have been for a while 
;).


>>
>>> In the actual implementation instead of dropping cls->n_tries to zero
>>> at step one, we can keep the access to the first N tries that are not
>>
>> I’d suggest dropping the word “can” before “keep”, it made me think this
>> wasn’t actually implemented, even though it is.
>
> Makes sense.  I'll remove the word before applying.
>
> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH 2/3] classifier: Fix race for prefix tree configuration.

Reply via email to