On 2/4/26 10:37 AM, Dumitru Ceara wrote:
> On 2/3/26 8:15 PM, Tiago Matos Carvalho Reis wrote:
>> On Tue, Feb 3, 2026 at 15:14, Ilya Maximets <[email protected]> wrote:
>>>
>>> On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>> Hi Dumitru,
>>>> re-sending this email because I forgot to CC ovs-discuss, sorry.
>>>>
>>>> On Tue, Feb 3, 2026 at 07:16, Dumitru Ceara <[email protected]> wrote:
>>>>>
>>>>> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
>>>>>> Hi Dumitru, Mairtin,
>>>>>> sorry for the delayed response.
>>>>>>
>>>>>
>>>>> Hi Tiago,
>>>>>
>>>>>> On Thu, Jan 29, 2026 at 10:38, Dumitru Ceara <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Tiago, Mairtin,
>>>>>>>
>>>>>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>>>>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis
>>>>>>>>> wrote:
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I have been working on implementing incremental processing in
>>>>>>>>>> OVN-IC and encountered a design issue with how OVN-IC handles
>>>>>>>>>> multi-AZ writes.
>>>>>>>>>>
>>>>>>>>>> The issue:
>>>>>>>>>> In a deployment where multiple AZs are connected via OVN-IC,
>>>>>>>>>> certain events trigger all AZs to attempt writing the same data to
>>>>>>>>>> the ISB/INB simultaneously.  This race leads to a constraint
>>>>>>>>>> violation, which causes the transaction to fail and forces a full
>>>>>>>>>> recompute.
>>>>>>>>>>
>>>>>>>>>> Example:
>>>>>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run():
>>>>>>>>>>
>>>>>>>>>>     if (ctx->ovnisb_txn) {
>>>>>>>>>>         /* Create ISB Datapath_Binding */
>>>>>>>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>>>>>>             const struct icsbrec_datapath_binding *isb_dp =
>>>>>>>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>>>>>>             if (!isb_dp) {
>>>>>>>>>>                 /* Allocate tunnel key */
>>>>>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>>>>                                                  "transit switch datapath");
>>>>>>>>>>                 if (!dp_key) {
>>>>>>>>>>                     continue;
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>>>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp, ts->name);
>>>>>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>>>>             } else if (dp_key_refresh) {
>>>>>>>>>>                 /* Refresh tunnel key since encap mode has changed. */
>>>>>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>>>>                                                  "transit switch datapath");
>>>>>>>>>>                 if (dp_key) {
>>>>>>>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             if (!isb_dp->type) {
>>>>>>>>>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             if (!isb_dp->nb_ic_uuid) {
>>>>>>>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>>>>>>                                                         &ts->header_.uuid, 1);
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         struct shash_node *node;
>>>>>>>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>>>>>>             icsbrec_datapath_binding_delete(node->data);
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> When a new transit switch is created, every AZ attempts to create
>>>>>>>>>> the same datapath_binding in the ISB.  Only one request succeeds;
>>>>>>>>>> the others fail with a "constraint violation".
>>>>>>>>>>
>>>>>>>>>> Impact:
>>>>>>>>>> This behavior negates the performance benefits of implementing
>>>>>>>>>> incremental processing, as the system falls back to a full
>>>>>>>>>> recompute upon these failures.
>>>>>>>>>>
>>>>>>>>>> For development purposes I am currently ignoring these errors, but
>>>>>>>>>> the ideal fix is a mechanism where only a single AZ handles the
>>>>>>>>>> writes.  That, however, would require implementing some consensus
>>>>>>>>>> protocol.
>>>>>>>>>>
>>>>>>>>>> Does anyone have any advice on how we can fix this issue?
>>>>>>>>>
>>>>>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
>>>>>>>>> enumerate_datapaths function and then attempts to add the missing
>>>>>>>>> ones.  Since multiple AZs will attempt to add the same missing
>>>>>>>>> entry, all but the first will fail, causing transaction errors.
>>>>>>>>> Currently, ovn-ic will enumerate the ISB datapaths again, see the
>>>>>>>>> entry that succeeded, and continue to create the NB records in the
>>>>>>>>> local AZ.  This does cause a transaction error in all but one AZ
>>>>>>>>> whenever a transit router is added, but we currently don't have a
>>>>>>>>> mechanism to manage this gracefully across multiple AZs.
>>>>>>>>
>>>>>>>> Hi Mairtin, thanks for the reply.
>>>>>>>>
>>>>>>>> Since there is no mechanism to decide which AZ should insert the
>>>>>>>> data, the only good solution I came up with, short of implementing a
>>>>>>>> full-fledged consensus algorithm like Raft to elect a leader AZ, is
>>>>>>>> to simply set an option in IC_NB_Global to manually configure a
>>>>>>>> specific AZ as the leader, and check in the code whether the local
>>>>>>>> AZ is the leader.
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>>>>>>
>>>>>>>> In the code:
>>>>>>>>
>>>>>>>>     const struct icnbrec_ic_nb_global *icnb_global =
>>>>>>>>         icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>>>>>>
>>>>>>>>     const struct nbrec_nb_global *nb_global =
>>>>>>>>         nbrec_nb_global_table_first(nb_global_table);
>>>>>>>>
>>>>>>>>     const char *leader = smap_get(&icnb_global->options, "leader");
>>>>>>>>     if (leader && !strcmp(leader, nb_global->name)) {
>>>>>>>>         /* Insert logic here. */
>>>>>>>>     }
>>>>>>>>
>>>>>>>> Do you have any opinion on this approach?
>>>>>>>>
>>>>>>>
>>>>>>> I was thinking of something a bit different (not too different though).
>>>>>>>
>>>>>>> The hierarchy is:
>>>>>>>
>>>>>>>                          IC-NB
>>>>>>>                            |
>>>>>>>     ovn-ic (AZ1)    ovn-ic (AZ2)    ...    ovn-ic (AZN)
>>>>>>>                            |
>>>>>>>                          IC-SB
>>>>>>>
>>>>>>> Conceptually this is similar to the intra-AZ hierarchy:
>>>>>>>
>>>>>>>                            NB
>>>>>>>                            |
>>>>>>>     ovn-northd (active)   ovn-northd (backup)   ...   ovn-northd (backup)
>>>>>>>                            |
>>>>>>>                            SB
>>>>>>>
>>>>>>> The way the instances synchronize is by taking the (single) SB
>>>>>>> database lock.  Only one northd succeeds, so that one becomes the
>>>>>>> "active".
>>>>>>>
>>>>>>> What if we do the same for ovn-ic?
>>>>>>>
>>>>>
>>>>> After a closer look at the ovn-ic code I realized this won't really
>>>>> work out of the box.  We really need all of the ovn-ic daemons to
>>>>> write "something" (either the actual ICSB.Port_Binding or a request
>>>>> towards the "leader") in the ICSB.  That's because ovn-ic running in
>>>>> AZ2 doesn't have access to the NB database running in AZ3.
>>>>>
>>>>>>> Make all ovn-ic instances try to take the IC-SB lock.  Only the one
>>>>>>> that succeeds becomes "active" and may write to the IC-SB.
>>>>>>>
>>>>>>> That has one implication though: the active instance (it can be any
>>>>>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>>>>>>> datapaths of the other AZs are up to date.  Today it only takes care
>>>>>>> of the resources for its own AZ.
>>>>>>
>>>>>> I think implementing a lock is the right path forward for handling
>>>>>> these issues.  My main question is about the "leader AZ" logic:
>>>>>> specifically, how it should handle IC-SB datapaths and port bindings
>>>>>> originating from other AZs.  One idea I'm considering is having
>>>>>> non-leader AZs proxy their requests to the leader for IC-SB
>>>>>> insertion, but I am unsure how I would implement this.
>>>>>>
>>>>>
>>>>> I think proxy-ing commands might be complicated to get right.  What
>>>>> about the following approach?
>>>>>
>>>>> Can we partition the different types of writes ovn-ic needs to do into
>>>>> two categories?
>>>>>
>>>>> a. writes that must be performed only by the "active" ovn-ic (the one
>>>>> holding the lock), e.g., maintaining the ICSB.Datapath_Binding table.
>>>>>
>>>>> b. writes that can be performed by multiple ovn-ic instances
>>>>> simultaneously (without checking the DB lock).  E.g., for
>>>>> ICSB.Port_Binding, an ovn-ic instance should never be (over)writing
>>>>> records created by other ovn-ic instances, so in theory they can all
>>>>> just manage their own without any locking.
>>>>
>>>> This is what I was planning initially with the proposal to set an AZ
>>>> as leader in IC_NB_Global:options.  Events which require only one AZ
>>>> to insert data into the ICSB (e.g., creating a new datapath for a
>>>> transit switch/router) would check whether the AZ is the leader before
>>>> inserting, while events where a non-leader AZ needs to insert data
>>>> into the ICSB (e.g., creating a new port binding) would work without
>>>> any change.
>>>>
>>>> I will be doing a small proof of concept, but I think this will work.
>>>
>>> I think the idea from Dumitru was to still use database locking, but
>>> only for some parts.  That would require having a separate connection
>>> to the database: one with the lock and the other without.
>>> So, the instance that has the lock can do the common operations
>>> through the locked connection, and all the instances can do their own
>>> specific operations using a separate connection without a lock.
>>>
>
> That's a good summary, with one note: I didn't suggest having separate
> connections to the database.  The OVSDB lock is just a way to guarantee
> that a single "owner" exists for that lock, but it doesn't restrict the
> semantics of what that lock is.  From the OVSDB RFC itself:
>
> https://datatracker.ietf.org/doc/html/rfc7047#section-4.1.8
>
>   "The database server supports an arbitrary number of locks, each of
>    which is identified by a client-defined ID.  At any given time, each
>    lock may have at most one owner.  The precise usage of a lock is
>    determined by the client.  For example, a set of clients may agree
>    that a certain table can only be written by the owner of a certain
>    lock.  OVSDB itself does not enforce any restrictions on how locks
>    are used -- it simply ensures that a lock has at most one owner."
>
> So I thought that we could use the lock to determine whether "the ovn-ic
> process is <the leader> responsible for syncing common data, e.g.,
> Datapath_Binding".
>
> Regardless of being "leader" or not, each ovn-ic would still sync its
> other, non-common data (e.g., port bindings) to the IC SB (through the
> same connection).
>
> But now, looking at the IDL/ovsdb-cs code, I think we're being too
> restrictive there and we don't let IDL/cs clients be as flexible as the
> RFC actually allows them to be:
>
>     /* Returns true if 'cs' can be sent a transaction now, false otherwise.
>      * This is useful for optimization: there is no point in composing and
>      * sending a transaction if it returns false. */
>     bool
>     ovsdb_cs_may_send_transaction(const struct ovsdb_cs *cs)
>     {
>         return (cs->session != NULL
>                 && cs->state == CS_S_MONITORING
>                 && (!cs->data.lock_name || ovsdb_cs_has_lock(cs)));
>     }
>
> Ilya, is this something to improve in the ovsdb-cs layer?
IDL doesn't allow for this use case today, that's true.  It only handles
one lock and acts defensively if this lock is not held when the
application attempts to send a transaction.  It should be possible to
allow transactions even if we do not have the configured lock, but it
should be an opt-in behavior, to avoid breaking applications that do not
check the locking state before attempting their transactions.

>
> Until then, we do need separate connections to the IC SB: one locked,
> one not.
>
>>> So, yes, it's similar to the IC_NB_Global:options proposal, but the
>>> leader is chosen automatically via the ovsdb locking mechanism, which
>>> avoids the problem of a leader going down.
>>>
>
> Exactly.
>
>>> Best regards, Ilya Maximets.
>>
>> Hi Ilya,
>> Thanks for the clear explanation, I understand it now.  I'm going to
>> put together a PoC to test this out and will follow up with the
>> results.
>>
>
> Ack, looking forward to seeing how that goes.
>
> Thanks,
> Dumitru
>
>> Regards,
>> Tiago.
>>
>>>
>>>>
>>>> Regards,
>>>> Tiago.
>>>>
>>>>>
>>>>>> I will be reading northd's source code to see if there is anything
>>>>>> similar to this already implemented, but I would like to know if you
>>>>>> have any clue as to how I could implement this.
>>>>>>
>>>>>> Regards,
>>>>>> Tiago.
>>>>>>
>>>>>
>>>>> Regards,
>>>>> Dumitru
>>>>>
>>>>>>> Each ovn-ic instance, both active and backup, is still responsible
>>>>>>> for writing to the per-AZ OVN NB database based on the contents of
>>>>>>> the IC-NB and IC-SB centralized databases.
>>>>>>>
>>>>>>> I didn't check the code for this in too much detail though, so there
>>>>>>> might be other things to consider.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Dumitru
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Tiago Matos
>>>>>>>>>
>>>>>>>>> Hi Tiago,
>>>>>>>>>
>>>>>>>>> I ran into similar issues when adding transit router support and
>>>>>>>>> have added a comment above.  I have also been working on OVN-IC
>>>>>>>>> related features, so if you would like to discuss the above issue
>>>>>>>>> further, or other OVN-IC work, I would like to help.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mairtin
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tiago Matos
>
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
