On 2/3/26 8:15 PM, Tiago Matos Carvalho Reis wrote:
> On Tue, Feb 3, 2026 at 15:14, Ilya Maximets
> <[email protected]> wrote:
>>
>> On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>> Hi Dumitru,
>>> re-sending this email because I forgot to CC ovs-discuss, sorry.
>>>
>>> On Tue, Feb 3, 2026 at 07:16, Dumitru Ceara <[email protected]>
>>> wrote:
>>>>
>>>> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
>>>>> Hi Dumitru, Mairtin,
>>>>> sorry for the delayed response.
>>>>>
>>>>
>>>> Hi Tiago,
>>>>
>>>>> On Thu, Jan 29, 2026 at 10:38, Dumitru Ceara
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Tiago, Mairtin,
>>>>>>
>>>>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>>>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis wrote:
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I have been working on implementing incremental processing in OVN-IC
>>>>>>>>> and encountered a design issue regarding how OVN-IC handles multi-AZ
>>>>>>>>> writes.
>>>>>>>>>
>>>>>>>>> The Issue
>>>>>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain
>>>>>>>>> events trigger all AZs to attempt writing the same data to the
>>>>>>>>> ISB/INB simultaneously. This race condition leads to a constraint
>>>>>>>>> violation, which causes the transaction to fail and forces a full
>>>>>>>>> recompute.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
>>>>>>>>>
>>>>>>>>> if (ctx->ovnisb_txn) {
>>>>>>>>>     /* Create ISB Datapath_Binding */
>>>>>>>>>     ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>>>>>         const struct icsbrec_datapath_binding *isb_dp =
>>>>>>>>>             shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>>>>>         if (!isb_dp) {
>>>>>>>>>             /* Allocate tunnel key */
>>>>>>>>>             int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>>>                                              "transit switch datapath");
>>>>>>>>>             if (!dp_key) {
>>>>>>>>>                 continue;
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>>             isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>>>>>             icsbrec_datapath_binding_set_transit_switch(isb_dp, ts->name);
>>>>>>>>>             icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>>>         } else if (dp_key_refresh) {
>>>>>>>>>             /* Refresh tunnel key since encap mode has changed. */
>>>>>>>>>             int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>>>                                              "transit switch datapath");
>>>>>>>>>             if (dp_key) {
>>>>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         if (!isb_dp->type) {
>>>>>>>>>             icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         if (!isb_dp->nb_ic_uuid) {
>>>>>>>>>             icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>>>>>                                                     &ts->header_.uuid, 1);
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     struct shash_node *node;
>>>>>>>>>     SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>>>>>         icsbrec_datapath_binding_delete(node->data);
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> When a new transit switch is created, every AZ attempts to create the
>>>>>>>>> same datapath_binding on the ISB. Only one request succeeds; the
>>>>>>>>> others fail with a "constraint-violation" error.
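>>>>>>>>>
>>>>>>>>> At the IDL level the losing AZs just see a failed commit. Roughly
>>>>>>>>> (this is only a sketch assuming the usual ovsdb_idl_txn flow, with
>>>>>>>>> 'ctx->ovnisb_txn' as in ts_run above):
>>>>>>>>>
>>>>>>>>> enum ovsdb_idl_txn_status status =
>>>>>>>>>     ovsdb_idl_txn_commit(ctx->ovnisb_txn);
>>>>>>>>> if (status == TXN_ERROR) {
>>>>>>>>>     /* The server rejected the duplicate Datapath_Binding insert
>>>>>>>>>      * ("constraint violation"); today this ends up forcing a
>>>>>>>>>      * full recompute. */
>>>>>>>>> }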
>>>>>>>>>
>>>>>>>>> Impact:
>>>>>>>>> This behavior negates the performance benefits of implementing
>>>>>>>>> incremental processing, as the system falls back to a full recompute
>>>>>>>>> upon these failures.
>>>>>>>>>
>>>>>>>>> For development purposes, I am currently ignoring these errors, but
>>>>>>>>> the ideal way of fixing this issue is to have a mechanism where only
>>>>>>>>> a single AZ handles the writes. However, this would require
>>>>>>>>> implementing some consensus protocol.
>>>>>>>>>
>>>>>>>>> Does anyone have any advice on how we can fix this issue?
>>>>>>>>
>>>>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
>>>>>>>> enumerate_datapaths function, then attempts to add the missing
>>>>>>>> datapaths. Since multiple AZs will attempt to add the same missing
>>>>>>>> entry, all but the first will fail, causing transaction errors.
>>>>>>>> Currently, ovn-ic will enumerate the ISB datapaths again, see the
>>>>>>>> entry that succeeded, and continue to create the NB entry in the
>>>>>>>> local AZ. This solution does cause a transaction error on all but one
>>>>>>>> AZ whenever a transit router is added, but we currently don't have a
>>>>>>>> mechanism to manage this gracefully across multiple AZs.
>>>>>>>
>>>>>>> Hi Mairtin, thanks for the reply.
>>>>>>>
>>>>>>> Since there is no mechanism to manage which AZ should insert the data,
>>>>>>> the only good solution I came up with, besides implementing a
>>>>>>> full-fledged consensus algorithm like Raft to select a leader AZ, is
>>>>>>> to simply set an option in IC_NB_Global to manually configure a
>>>>>>> specific AZ as the leader, and to check in the code whether the AZ is
>>>>>>> the leader or not.
>>>>>>>
>>>>>>> Example:
>>>>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>>>>>
>>>>>>> In the code:
>>>>>>>
>>>>>>> const struct icnbrec_ic_nb_global *icnb_global =
>>>>>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>>>>>
>>>>>>> const struct nbrec_nb_global *nb_global =
>>>>>>>     nbrec_nb_global_table_first(nb_global_table);
>>>>>>>
>>>>>>> const char *leader = smap_get(&icnb_global->options, "leader");
>>>>>>> if (leader && !strcmp(leader, nb_global->name)) {
>>>>>>>     /* Insert logic here. */
>>>>>>> }
>>>>>>>
>>>>>>> Do you have any opinion on this approach?
>>>>>>>
>>>>>>
>>>>>> I was thinking of something a bit different (not too different though).
>>>>>>
>>>>>> The hierarchy is:
>>>>>>
>>>>>>                        IC-NB
>>>>>>                          |
>>>>>>   ovn-ic (AZ1)    ovn-ic (AZ2)   ...   ovn-ic (AZN)
>>>>>>                          |
>>>>>>                        IC-SB
>>>>>>
>>>>>> Conceptually this is similar to the intra-AZ hierarchy:
>>>>>>
>>>>>>                          NB
>>>>>>                          |
>>>>>>   ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
>>>>>>                          |
>>>>>>                          SB
>>>>>>
>>>>>> The way the instances synchronize is by taking the (single) SB database
>>>>>> lock. Only one northd succeeds, so that one becomes the "active" one.
>>>>>>
>>>>>> What if we do the same for ovn-ic?
>>>>>>
>>>>
>>>> After a closer look at the ovn-ic code I realized this won't really work
>>>> out of the box. We really need all of the ovn-ic daemons to write
>>>> "something" (either the actual ICSB.Port_Binding or a request towards
>>>> the "leader") in the ICSB. That's because the ovn-ic running in AZ2
>>>> doesn't have access to the NB database running in AZ3.
>>>>
>>>>>> Make all ovn-ic instances try to take the IC-SB lock. Only the one
>>>>>> that succeeds becomes "active" and may write to the IC-SB.
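>>>>>>
>>>>>> That's roughly the same pattern ovn-northd already uses with the SB
>>>>>> lock. A minimal sketch (the lock name "ovn_ic" and the 'ovnisb_idl'
>>>>>> variable are placeholders, not the actual ovn-ic names):
>>>>>>
>>>>>> /* At startup, ask the IDL to compete for a named OVSDB lock on the
>>>>>>  * IC-SB connection; the server grants it to at most one client. */
>>>>>> ovsdb_idl_set_lock(ovnisb_idl, "ovn_ic");
>>>>>>
>>>>>> /* In the main loop, only the lock holder acts as "active". */
>>>>>> if (ovsdb_idl_has_lock(ovnisb_idl)) {
>>>>>>     /* Active instance: maintain the shared IC-SB state. */
>>>>>> } else {
>>>>>>     /* Backup: skip the shared writes.  If the current owner
>>>>>>      * disconnects, the server hands the lock to another waiter. */
>>>>>> }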
>>>>>>
>>>>>> That has one implication though: the active instance (it can be any
>>>>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>>>>>> datapaths for other AZs are up to date. Today it only takes care of
>>>>>> the resources for its own AZ.
>>>>>
>>>>> I think implementing a lock is the right path forward for handling
>>>>> these issues. My main question is around the "leader AZ" logic:
>>>>> specifically, how it should handle IC-SB datapaths and port bindings
>>>>> originating from other AZs. One idea I'm considering is having
>>>>> non-leader AZs proxy their requests to the leader for IC-SB insertion,
>>>>> but I am unsure how I would implement this.
>>>>>
>>>>
>>>> I think proxying commands might be complicated to get right. What about
>>>> the following approach?
>>>>
>>>> Can we partition the different types of writes ovn-ic needs to do into
>>>> two categories?
>>>>
>>>> a. Writes that must be performed only by the "active" ovn-ic (the one
>>>>    holding the lock), e.g., maintaining the ICSB.Datapath_Binding table.
>>>>
>>>> b. Writes that can be performed by multiple ovn-ic instances
>>>>    simultaneously (without checking the DB lock). E.g., for
>>>>    ICSB.Port_Binding, an ovn-ic instance should never be (over)writing
>>>>    records created by other ovn-ic instances, so in theory they can all
>>>>    manage their own records without any locking just fine.
>>>
>>> This is what I was planning initially with the proposal to set an AZ as
>>> leader in IC_NB_Global:options. Events which require only one AZ to
>>> insert data into the ICSB (e.g., creating a new datapath for a transit
>>> switch/router) would check whether the AZ is the leader before
>>> inserting, while events where a non-leader AZ needs to insert data into
>>> the ICSB (e.g., creating a new port binding) would work without any
>>> change.
>>>
>>> I will be doing a small proof of concept, but I think this will work.
>>
>> I think the idea from Dumitru was to still use database locking, but
>> only for some parts. That would require having a separate connection to
>> the database: one with the lock and the other without. So, the instance
>> that has the lock can do the common operations through the locked
>> connection, and all the instances can do their own specific operations
>> using a separate connection without a lock.
>>
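>> Something along these lines; just a sketch, where the lock name is a
>> placeholder and the ovsdb_idl_create() arguments mirror what ovn-ic
>> already does for its single IC SB connection:
>>
>> /* Locked connection: used only for the common/shared IC-SB state
>>  * (e.g., Datapath_Binding); written only while holding the lock. */
>> struct ovsdb_idl *isb_locked =
>>     ovsdb_idl_create(ovnisb_db, &icsbrec_idl_class, true, true);
>> ovsdb_idl_set_lock(isb_locked, "ovn_ic");
>>
>> /* Lock-free connection: used by every instance for its own per-AZ
>>  * records (e.g., this AZ's Port_Binding rows). */
>> struct ovsdb_idl *isb_own =
>>     ovsdb_idl_create(ovnisb_db, &icsbrec_idl_class, true, true);
>>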

That's a good summary, with one note: I didn't suggest having separate
connections to the database. The OVSDB lock is just a way to guarantee
that a single "owner" exists for that lock, but it doesn't restrict
what that lock is used for. From the OVSDB RFC itself:

https://datatracker.ietf.org/doc/html/rfc7047#section-4.1.8

"The database server supports an arbitrary number of locks, each of
which is identified by a client-defined ID. At any given time, each
lock may have at most one owner. The precise usage of a lock is
determined by the client. For example, a set of clients may agree that
a certain table can only be written by the owner of a certain lock.
OVSDB itself does not enforce any restrictions on how locks are used --
it simply ensures that a lock has at most one owner."

So I thought that we could use the lock to determine whether "this
ovn-ic process is <the leader> responsible for syncing common data,
e.g., datapath_binding". Regardless of being "leader" or not, each
ovn-ic would still sync its other, non-common data (e.g., port
bindings) to the IC SB through the same connection.

But now, looking at the IDL/ovsdb-cs code, I think we're being too
restrictive there and we don't let IDL/cs clients be as flexible as the
RFC actually allows them to be:

/* Returns true if 'cs' can be sent a transaction now, false otherwise.  This
 * is useful for optimization: there is no point in composing and sending a
 * transaction if it returns false. */
bool
ovsdb_cs_may_send_transaction(const struct ovsdb_cs *cs)
{
    return (cs->session != NULL
            && cs->state == CS_S_MONITORING
            && (!cs->data.lock_name || ovsdb_cs_has_lock(cs)));
}

Ilya, is this something to improve in the ovsdb-cs layer? Until then,
we do need separate connections to the IC SB: one locked, one not.

>> So, yes, it's similar to the IC_NB_Global:options proposal, but the
>> leader is chosen automatically via the ovsdb locking mechanism, which
>> avoids the problem of the leader going down.
>>

Exactly.

>> Best regards, Ilya Maximets.
>
> Hi Ilya,
> Thanks for the clear explanation, I understand it now. I'm going to
> put together a PoC to test this out and will follow up with the
> results.
>

Ack, looking forward to seeing how that goes.

Thanks,
Dumitru

> Regards,
> Tiago.
>
>>
>>>
>>> Regards,
>>> Tiago.
>>>
>>>>
>>>>> I will be reading northd's source code to see if there is anything
>>>>> similar to this already implemented, but I would like to know if you
>>>>> have any clue as to how I could implement this.
>>>>>
>>>>> Regards,
>>>>> Tiago.
>>>>>
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>>>> Each ovn-ic instance, both active and backup, is still responsible
>>>>>> for writing to the per-AZ OVN NB database based on the contents of
>>>>>> the IC-NB and IC-SB centralized databases.
>>>>>>
>>>>>> I didn't check the code in too much detail though, so there might be
>>>>>> other things to consider.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Regards,
>>>>>> Dumitru
>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Tiago Matos
>>>>>>>>
>>>>>>>> Hi Tiago,
>>>>>>>>
>>>>>>>> I ran into similar issues when adding transit router support and
>>>>>>>> have added a comment above. I have also been working on OVN-IC
>>>>>>>> related features, so if you would like to discuss the above issue
>>>>>>>> further, or other OVN-IC work, I would be happy to help.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mairtin
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tiago Matos
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
