On 26 May 2025, at 16:09, Ilya Maximets wrote:
> On 5/21/25 9:27 PM, Eelco Chaudron wrote: >> >> >> On 16 May 2025, at 23:25, Ilya Maximets wrote: >> >>> The thread fence in the classifier is supposed to ensure that when the >>> subtable->trie_plen is updated, the actual prefix tree is ready to be >>> used. On the write side in trie_init(), the fence is between the >>> tree configuration and the 'trie_plen' update. On the reader's side >>> however, the fence is at the beginning of the classifier_lookup__(), >>> and both reads of the 'trie_plen' and the accesses to the tree itself >>> are happening afterwards. And since both types of the reads are on >>> the same side of the fence, the fence is kind of pointless and doesn't >>> guarantee any memory ordering. So, readers can be accessing partially >>> initialized prefix trees. >> >> Thanks for this patch, and for the excellent description and code comments, >> they really made the review a lot easier, especially since I'm not too >> familiar with the classification/trie code. Some of the code got me puzzled, >> but it was all clear in the end (guess I should have started with Espresso >> sooner). I didn’t do any real testing, as the issue is hard to reproduce. >> Hopefully, Numan has some time to look into that. >> >> One small question, and a nit below. Other than that >> >> Acked-by: Eelco Chaudron <echau...@redhat.com> >> >>> Another problem with the configuration is that cls->n_tries is updated >>> without any synchronization as well. The comment on the fence says >>> that it also synchronizes for the cls->n_tries, but that doesn't make >>> a lot of sense. In practice, cls->n_tries is read multiple times >>> throughout the classifier_lookup__() and each of these reads may give >>> a different value if there is a concurrent update, causing the reader >>> to access trees that are not initialized or in the middle of being >>> destroyed, leading to OVS crashes while the user updates the flow >>> table prefixes. >>> >>> First thing that needs to be fixed here is to only read cls->n_tries >>> once to avoid obvious crashes with access to uninitialized trie_ctx[] >>> entries. >>> >>> The second thing is that we need a proper memory synchronization that >>> will guarantee that our prefix trees are fully initialized when >>> readers access them. In the current logic we would need to issue >>> a thread fence after every read of a subtable->trie_plen value, i.e., >>> we'd need a fence per subtable lookup. This would be very expensive >>> and wasteful, considering the prefix tree configuration normally >>> happens only once somewhere at startup. >>> >>> What we can do instead is to convert cls->n_tries into atomic and use >>> it as a synchronization point: >>> >>> Writer (classifier_set_prefix_fields): >>> >>> 1. Before making any changes, set cls->n_tries to zero. Relaxed >>> memory order can be used here, because we'll have a full memory >>> barrier at the next step. >>> 2. ovsrcu_synchronize() to wait for all threads to stop using tries. >>> 3. Update tries while nobody is using them. >>> 4. Set cls->n_tries to a new value with memory_order_release. >>> >>> Reader (classifier_lookup): >>> >>> 1. Read the cls->n_tries with the memory_order_acquire. >>> 2. Use that once read value throughout. >>> >>> RCU in this scenario will ensure that every thread no longer uses the >>> prefix trees when we're about to change them. The acquire-release >>> semantics on the cls->n_tries just saves us from calling the >>> ovsrcu_synchronize() the second time once we're done with the whole >>> reconfiguration. We're just updating the number and making all the >>> previous changes visible on CPUs that acquire it. >>> >>> Alternative solution might be to go full RCU and make the array of >>> trees itself RCU-protected. This way we would not need to do any >>> extra RCU synchronization or managing the memory ordering. However, >>> that would mean having multiple layers of RCU with trees and rules >>> in them potentially surviving multiple grace periods, which I would >>> like to avoid, if possible. >>> >>> Previous code was also trying to be smart and not disable prefix tree >>> lookups for prefixes that are not changing. We're sacrificing this >>> functionality in the name of simpler code. Attempt to make that work >>> would either require a full conversion to RCU or a per-subtable >>> synchronization. Lookups can be done without the prefix match >>> optimizations for a brief period of time. This doesn't affect >>> correctness of the resulted datapath flows. >> >> Just out of curiosity, will this take effect immediately during the next >> revalidator run triggered by the table change? Or only when a full >> revalidation is required? > > Only on the next full revalidation. We have this code today that doesn't > actually act on changes: > > void > ofproto_configure_table(struct ofproto *ofproto, int table_id, > const struct ofproto_table_settings *s) > { > ... > > if (classifier_set_prefix_fields(&table->cls, > s->prefix_fields, s->n_prefix_fields)) { > /* XXX: Trigger revalidation. */ > } > > ... > } > > But that's not a big concern, as this is just an optimization and doesn't > affect correctness of these flows. Thanks for the clarification. I think we are fine (as we have been for a while ;). >> >>> In the actual implementation instead of dropping cls->n_tries to zero >>> at step one, we can keep the access to the first N tries that are not >> >> I’d suggest dropping the word “can” before “keep”, it made me think this >> wasn’t actually implemented, even though it is. > > Makes sense. I'll remove the word before applying. > > Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev