On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
> Hi Dumitru,
> re-sending this email because I forgot to CC ovs-discuss, sorry.
>
> On Tue, Feb 3, 2026 at 07:16, Dumitru Ceara <[email protected]> wrote:
>>
>> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
>>> Hi Dumitru, Mairtin,
>>> sorry for the delayed response.
>>>
>>
>> Hi Tiago,
>>
>>> On Thu, Jan 29, 2026 at 10:38, Dumitru Ceara <[email protected]> wrote:
>>>>
>>>> Hi Tiago, Mairtin,
>>>>
>>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis
>>>>>> wrote:
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I have been working on implementing incremental processing in OVN-IC
>>>>>>> and encountered a design issue regarding how OVN-IC handles multi-AZ
>>>>>>> writes.
>>>>>>>
>>>>>>> The Issue
>>>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain
>>>>>>> events trigger all AZs to attempt writing the same data to the
>>>>>>> ISB/INB simultaneously. This race condition leads to a constraint
>>>>>>> violation, which causes the transaction to fail and forces a full
>>>>>>> recompute.
>>>>>>>
>>>>>>> Example:
>>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
>>>>>>>
>>>>>>> if (ctx->ovnisb_txn) {
>>>>>>>     /* Create ISB Datapath_Binding */
>>>>>>>     ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>>>         const struct icsbrec_datapath_binding *isb_dp =
>>>>>>>             shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>>>         if (!isb_dp) {
>>>>>>>             /* Allocate tunnel key */
>>>>>>>             int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>                                              "transit switch datapath");
>>>>>>>             if (!dp_key) {
>>>>>>>                 continue;
>>>>>>>             }
>>>>>>>
>>>>>>>             isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>>>             icsbrec_datapath_binding_set_transit_switch(isb_dp,
>>>>>>>                                                         ts->name);
>>>>>>>             icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>         } else if (dp_key_refresh) {
>>>>>>>             /* Refresh tunnel key since encap mode has changed. */
>>>>>>>             int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>                                              "transit switch datapath");
>>>>>>>             if (dp_key) {
>>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>             }
>>>>>>>         }
>>>>>>>
>>>>>>>         if (!isb_dp->type) {
>>>>>>>             icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
>>>>>>>         }
>>>>>>>
>>>>>>>         if (!isb_dp->nb_ic_uuid) {
>>>>>>>             icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>>>                                                     &ts->header_.uuid,
>>>>>>>                                                     1);
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>>     struct shash_node *node;
>>>>>>>     SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>>>         icsbrec_datapath_binding_delete(node->data);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> When a new transit switch is created, every AZ attempts to create
>>>>>>> the same datapath_binding on the ISB. Only one request succeeds;
>>>>>>> the others fail with a "constraint violation".
>>>>>>>
>>>>>>> Impact:
>>>>>>> This behavior negates the performance benefits of implementing
>>>>>>> incremental processing, as the system falls back to a full recompute
>>>>>>> upon these failures.
>>>>>>>
>>>>>>> For development purposes, I am currently ignoring these errors, but
>>>>>>> the ideal fix is a mechanism where only a single AZ handles the
>>>>>>> writes; however, this would require implementing some consensus
>>>>>>> protocol.
>>>>>>>
>>>>>>> Does anyone have any advice on how we can fix this issue?
>>>>>>
>>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
>>>>>> enumerate_datapaths function, then attempts to add the missing
>>>>>> datapaths. Since multiple AZs will attempt to add the same missing
>>>>>> entry, all but the first will fail, causing transaction errors.
>>>>>> Currently, ovn-ic will enumerate the ISB datapaths again, see the
>>>>>> entry that succeeded, and continue to create the NB in the local AZ.
>>>>>> This does cause a transaction error in all but one AZ whenever a
>>>>>> transit router is added, but we currently don't have a mechanism to
>>>>>> manage this gracefully across multiple AZs.
>>>>>
>>>>> Hi Mairtin, thanks for the reply.
>>>>>
>>>>> Since there is no mechanism to manage which AZ should insert the data,
>>>>> the only good solution I came up with, besides implementing a
>>>>> full-fledged consensus algorithm like Raft to select a leader AZ, is
>>>>> to simply set an option in IC_NB_Global to manually configure a
>>>>> specific AZ as the leader, and check in the code whether the AZ is
>>>>> the leader or not.
>>>>>
>>>>> Example:
>>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>>>
>>>>> In the code:
>>>>>
>>>>> const struct icnbrec_ic_nb_global *icnb_global =
>>>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>>>
>>>>> const struct nbrec_nb_global *nb_global =
>>>>>     nbrec_nb_global_table_first(nb_global_table);
>>>>>
>>>>> const char *leader = smap_get(&icnb_global->options, "leader");
>>>>> if (leader && !strcmp(leader, nb_global->name)) {
>>>>>     /* Insert logic here. */
>>>>> }
>>>>>
>>>>> Do you have any opinion on this approach?
>>>>>
>>>>
>>>> I was thinking of something a bit different (not too different though).
>>>>
>>>> The hierarchy is:
>>>>
>>>>                        IC-NB
>>>>                          |
>>>>   ovn-ic (AZ1)   ovn-ic (AZ2)   ...   ovn-ic (AZN)
>>>>                          |
>>>>                        IC-SB
>>>>
>>>> Conceptually this is similar to the intra-AZ hierarchy:
>>>>
>>>>                          NB
>>>>                          |
>>>>   ovn-northd (active)   ovn-northd (backup)   ...   ovn-northd (backup)
>>>>                          |
>>>>                          SB
>>>>
>>>> The way the instances synchronize is by taking the (single) SB database
>>>> lock. Only one northd succeeds, so that one becomes the "active".
>>>>
>>>> What if we do the same for ovn-ic?
>>>>
>>
>> After a closer look at the ovn-ic code I realized this won't really work
>> out of the box. We really need all of the ovn-ic daemons to write
>> "something" (either the actual ICSB.Port_Binding or a request towards
>> the "leader") in the ICSB. That's because ovn-ic running in AZ2 doesn't
>> have access to the NB database running in AZ3.
>>
>>>> Make all ovn-ic instances try to take the IC-SB lock. Only the one
>>>> that succeeds becomes "active" and may write to the IC-SB.
>>>>
>>>> That has one implication though: the active instance (it can be any
>>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>>>> datapaths for other AZs are up to date. Today it only takes care of
>>>> the resources for its own AZ.
>>>
>>> I think implementing a lock is the right path forward for handling
>>> these issues. My main question is around the "leader AZ" logic:
>>> specifically, how it should handle IC-SB datapaths and port bindings
>>> originating from other AZs. One idea I'm considering is having
>>> non-leader AZs proxy their requests to the leader for IC-SB insertion,
>>> but I am unsure how I would implement this.
>>>
>>
>> I think proxying commands might be complicated to get right. What about
>> the following approach?
>>
>> Can we partition the different types of writes ovn-ic needs to do into
>> two categories?
>>
>> a. Writes that must be performed only by the "active" ovn-ic (the one
>> holding the lock). E.g., maintaining the ICSB.Datapath_Binding table.
>>
>> b. Writes that can be performed by multiple ovn-ic instances
>> simultaneously (without checking the DB lock). E.g., for
>> ICSB.Port_Binding, an ovn-ic instance should never be (over)writing
>> records created by other ovn-ic instances, so in theory they can all
>> manage their own without any locking just fine.
>
> This is what I was planning initially with the proposal to set an AZ
> as leader in IC_NB_Global:options. Events which require only one AZ to
> insert data into the ICSB (e.g., creating a new datapath for a transit
> switch/router) would check whether the AZ is the leader and then insert,
> while other events where a non-leader AZ needs to insert data into the
> ICSB (e.g., creating a new port binding) would work without any change.
>
> I will be doing a small proof of concept, but I think this will work.
I think the idea from Dumitru was to still use database locking, but only
for some parts. That would require having a separate connection to the
database: one with the lock and the other without. The instance that
holds the lock can do the common operations through the locked
connection, and all the instances can do their own specific operations
using a separate connection without a lock.

So, yes, it's similar to the IC_NB_Global:options proposal, but the
leader is chosen automatically via the ovsdb locking mechanism, which
avoids the problem of a leader going down.

Best regards,
Ilya Maximets.

>
> Regards,
> Tiago.
>
>>
>>> I will be reading northd's source code to see if there is anything
>>> similar to this already implemented, but I would like to know if you
>>> have any clue as to how I could implement this.
>>>
>>> Regards,
>>> Tiago.
>>>
>>
>> Regards,
>> Dumitru
>>
>>>> Each ovn-ic, both active and backup, is still responsible for writing
>>>> to the per-AZ OVN NB database based on the contents of the IC-NB and
>>>> IC-SB centralized databases.
>>>>
>>>> I didn't check the code in too much detail though, so there might be
>>>> other things to consider.
>>>>
>>>> What do you think?
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tiago Matos
>>>>>>>
>>>>>>
>>>>>> Hi Tiago,
>>>>>>
>>>>>> I ran into similar issues when adding transit router support and
>>>>>> have added a comment above. I have also been working on OVN-IC
>>>>>> related features, so if you would like to discuss the above issue
>>>>>> further, or other OVN-IC work, I would be glad to help.
>>>>>>
>>>>>> Regards,
>>>>>> Mairtin
>>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Tiago Matos
>>>>>
>>>>
>>>
>>
>
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
