On 5/21/25 9:27 PM, Eelco Chaudron wrote: > > > On 16 May 2025, at 23:25, Ilya Maximets wrote: > >> The thread fence in the classifier is supposed to ensure that when the >> subtable->trie_plen is updated, the actual prefix tree is ready to be >> used. On the write side in trie_init(), the fence is between the >> tree configuration and the 'trie_plen' update. On the reader's side >> however, the fence is at the beginning of the classifier_lookup__(), >> and both reads of the 'trie_plen' and the accesses to the tree itself >> are happening afterwards. And since both types of the reads are on >> the same side of the fence, the fence is kind of pointless and doesn't >> guarantee any memory ordering. So, readers can be accessing partially >> initialized prefix trees. > > Thanks for this patch, and for the excellent description and code comments, > they really made the review a lot easier, especially since I'm not too > familiar with the classification/trie code. Some of the code got me puzzled, > but it was all clear in the end (guess I should have started with Espresso > sooner). I didn’t do any real testing, as the issue is hard to reproduce. > Hopefully, Numan has some time to look into that. > > One small question, and a nit below. Other than that > > Acked-by: Eelco Chaudron <echau...@redhat.com> > >> Another problem with the configuration is that cls->n_tries is updated >> without any synchronization as well. The comment on the fence says >> that it also synchronizes for the cls->n_tries, but that doesn't make >> a lot of sense. In practice, cls->n_tries is read multiple times >> throughout the classifier_lookup__() and each of these reads may give >> a different value if there is a concurrent update, causing the reader >> to access trees that are not initialized or in the middle of being >> destroyed, leading to OVS crashes while the user updates the flow >> table prefixes. >> >> First thing that needs to be fixed here is to only read cls->n_tries >> once to avoid obvious crashes with access to uninitialized trie_ctx[] >> entries. >> >> The second thing is that we need a proper memory synchronization that >> will guarantee that our prefix trees are fully initialized when >> readers access them. In the current logic we would need to issue >> a thread fence after every read of a subtable->trie_plen value, i.e., >> we'd need a fence per subtable lookup. This would be very expensive >> and wasteful, considering the prefix tree configuration normally >> happens only once somewhere at startup. >> >> What we can do instead is to convert cls->n_tries into atomic and use >> it as a synchronization point: >> >> Writer (classifier_set_prefix_fields): >> >> 1. Before making any changes, set cls->n_tries to zero. Relaxed >> memory order can be used here, because we'll have a full memory >> barrier at the next step. >> 2. ovsrcu_synchronize() to wait for all threads to stop using tries. >> 3. Update tries while nobody is using them. >> 4. Set cls->n_tries to a new value with memory_order_release. >> >> Reader (classifier_lookup): >> >> 1. Read the cls->n_tries with the memory_order_acquire. >> 2. Use that once read value throughout. >> >> RCU in this scenario will ensure that every thread no longer uses the >> prefix trees when we're about to change them. The acquire-release >> semantics on the cls->n_tries just saves us from calling the >> ovsrcu_synchronize() the second time once we're done with the whole >> reconfiguration. We're just updating the number and making all the >> previous changes visible on CPUs that acquire it. >> >> Alternative solution might be to go full RCU and make the array of >> trees itself RCU-protected. This way we would not need to do any >> extra RCU synchronization or managing the memory ordering. However, >> that would mean having multiple layers of RCU with trees and rules >> in them potentially surviving multiple grace periods, which I would >> like to avoid, if possible. >> >> Previous code was also trying to be smart and not disable prefix tree >> lookups for prefixes that are not changing. We're sacrificing this >> functionality in the name of simpler code. Attempt to make that work >> would either require a full conversion to RCU or a per-subtable >> synchronization. Lookups can be done without the prefix match >> optimizations for a brief period of time. This doesn't affect >> correctness of the resulted datapath flows. > > Just out of curiosity, will this take effect immediately during the next > revalidator run triggered by the table change? Or only when a full > revalidation is required?
Only on the next full revalidation. We have this code today that doesn't actually act on changes: void ofproto_configure_table(struct ofproto *ofproto, int table_id, const struct ofproto_table_settings *s) { ... if (classifier_set_prefix_fields(&table->cls, s->prefix_fields, s->n_prefix_fields)) { /* XXX: Trigger revalidation. */ } ... } But that's not a big concern, as this is just an optimization and doesn't affect correctness of these flows. > >> In the actual implementation instead of dropping cls->n_tries to zero >> at step one, we can keep the access to the first N tries that are not > > I’d suggest dropping the word “can” before “keep”, it made me think this > wasn’t actually implemented, even though it is. Makes sense. I'll remove the word before applying. Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev