On 5/21/25 9:27 PM, Eelco Chaudron wrote:
> 
> 
> On 16 May 2025, at 23:25, Ilya Maximets wrote:
> 
>> The thread fence in the classifier is supposed to ensure that when the
>> subtable->trie_plen is updated, the actual prefix tree is ready to be
>> used.  On the write side in trie_init(), the fence is between the
>> tree configuration and the 'trie_plen' update.  On the reader's side
>> however, the fence is at the beginning of the classifier_lookup__(),
>> and both reads of the 'trie_plen' and the accesses to the tree itself
>> are happening afterwards.  And since both types of the reads are on
>> the same side of the fence, the fence is kind of pointless and doesn't
>> guarantee any memory ordering.  So, readers can be accessing partially
>> initialized prefix trees.
> 
> Thanks for this patch, and for the excellent description and code comments, 
> they really made the review a lot easier, especially since I'm not too 
> familiar with the classification/trie code. Some of the code got me puzzled, 
> but it was all clear in the end (guess I should have started with Espresso 
> sooner). I didn’t do any real testing, as the issue is hard to reproduce. 
> Hopefully, Numan has some time to look into that.
> 
> One small question, and a nit below. Other than that
> 
> Acked-by: Eelco Chaudron <echau...@redhat.com>
> 
>> Another problem with the configuration is that cls->n_tries is updated
>> without any synchronization as well.  The comment on the fence says
>> that it also synchronizes for the cls->n_tries, but that doesn't make
>> a lot of sense.  In practice, cls->n_tries is read multiple times
>> throughout the classifier_lookup__() and each of these reads may give
>> a different value if there is a concurrent update, causing the reader
>> to access trees that are not initialized or in the middle of being
>> destroyed, leading to OVS crashes while the user updates the flow
>> table prefixes.
>>
>> First thing that needs to be fixed here is to only read cls->n_tries
>> once to avoid obvious crashes with access to uninitialized trie_ctx[]
>> entries.
>>
>> The second thing is that we need a proper memory synchronization that
>> will guarantee that our prefix trees are fully initialized when
>> readers access them.  In the current logic we would need to issue
>> a thread fence after every read of a subtable->trie_plen value, i.e.,
>> we'd need a fence per subtable lookup.  This would be very expensive
>> and wasteful, considering the prefix tree configuration normally
>> happens only once somewhere at startup.
>>
>> What we can do instead is to convert cls->n_tries into atomic and use
>> it as a synchronization point:
>>
>>   Writer (classifier_set_prefix_fields):
>>
>>   1. Before making any changes, set cls->n_tries to zero.  Relaxed
>>      memory order can be used here, because we'll have a full memory
>>      barrier at the next step.
>>   2. ovsrcu_synchronize() to wait for all threads to stop using tries.
>>   3. Update tries while nobody is using them.
>>   4. Set cls->n_tries to a new value with memory_order_release.
>>
>>   Reader (classifier_lookup):
>>
>>   1. Read the cls->n_tries with the memory_order_acquire.
>>   2. Use that once read value throughout.
>>
>> RCU in this scenario will ensure that every thread no longer uses the
>> prefix trees when we're about to change them.  The acquire-release
>> semantics on the cls->n_tries just saves us from calling the
>> ovsrcu_synchronize() the second time once we're done with the whole
>> reconfiguration.  We're just updating the number and making all the
>> previous changes visible on CPUs that acquire it.
>>
>> Alternative solution might be to go full RCU and make the array of
>> trees itself RCU-protected.  This way we would not need to do any
>> extra RCU synchronization or managing the memory ordering.  However,
>> that would mean having multiple layers of RCU with trees and rules
>> in them potentially surviving multiple grace periods, which I would
>> like to avoid, if possible.
>>
>> Previous code was also trying to be smart and not disable prefix tree
>> lookups for prefixes that are not changing.  We're sacrificing this
>> functionality in the name of simpler code.  Attempt to make that work
>> would either require a full conversion to RCU or a per-subtable
>> synchronization.  Lookups can be done without the prefix match
>> optimizations for a brief period of time.  This doesn't affect
>> correctness of the resulted datapath flows.
> 
> Just out of curiosity, will this take effect immediately during the next
> revalidator run triggered by the table change? Or only when a full
> revalidation is required?

Only on the next full revalidation.  We have this code today that doesn't
actually act on changes:

void
ofproto_configure_table(struct ofproto *ofproto, int table_id,
                        const struct ofproto_table_settings *s)
{
    ...

    if (classifier_set_prefix_fields(&table->cls,
                                     s->prefix_fields, s->n_prefix_fields)) {
        /* XXX: Trigger revalidation. */
    }

    ...
}

But that's not a big concern, as this is just an optimization and doesn't
affect correctness of these flows.

> 
>> In the actual implementation instead of dropping cls->n_tries to zero
>> at step one, we can keep the access to the first N tries that are not
> 
> I’d suggest dropping the word “can” before “keep”, it made me think this
> wasn’t actually implemented, even though it is.

Makes sense.  I'll remove the word before applying.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to