Hi Dumitru, re-sending this email because I forgot to CC ovs-discuss, sorry.
Em ter., 3 de fev. de 2026 às 07:16, Dumitru Ceara <[email protected]> escreveu: > > On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote: > > Hi Dumitru, Mairtin, > > sorry for delayed response. > > > > Hi Tiago, > > > Em qui., 29 de jan. de 2026 às 10:38, Dumitru Ceara > > <[email protected]> escreveu: > >> > >> Hi Tiago, Mairtin, > >> > >> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote: > >>> Em qui., 29 de jan. de 2026 às 09:11, Mairtin O'Loingsigh > >>> <[email protected]> escreveu: > >>>> > >>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis > >>>> wrote: > >>>>> Hi everyone, > >>>>> > >>>>> I have been working on implementing incremental processing in OVN-IC and > >>>>> encountered a design issue regarding how OVN-IC handles multi-AZ writes. > >>>>> > >>>>> The Issue > >>>>> In a scenario where multiple AZs are connected via OVN-IC, certain > >>>>> events > >>>>> trigger all AZs to attempt writing the same data to the ISB/INB > >>>>> simultaneously. This race condition leads to a constraint violation, > >>>>> which > >>>>> causes the transaction to fail and forces a full recompute. > >>>>> > >>>>> Example: > >>>>> A clear example of this can be seen in ovn-ic.c:ts_run: > >>>>> > >>>>> if (ctx->ovnisb_txn) { > >>>>> /* Create ISB Datapath_Binding */ > >>>>> ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) { > >>>>> const struct icsbrec_datapath_binding *isb_dp = > >>>>> shash_find_and_delete(isb_ts_dps, ts->name); > >>>>> if (!isb_dp) { > >>>>> /* Allocate tunnel key */ > >>>>> int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode, > >>>>> "transit switch > >>>>> datapath"); > >>>>> if (!dp_key) { > >>>>> continue; > >>>>> } > >>>>> > >>>>> isb_dp = > >>>>> icsbrec_datapath_binding_insert(ctx->ovnisb_txn); > >>>>> icsbrec_datapath_binding_set_transit_switch(isb_dp, > >>>>> ts->name); > >>>>> icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key); > >>>>> } else if (dp_key_refresh) { > >>>>> /* Refresh tunnel key since encap mode has changed. */ > >>>>> int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode, > >>>>> "transit switch > >>>>> datapath"); > >>>>> if (dp_key) { > >>>>> icsbrec_datapath_binding_set_tunnel_key(isb_dp, > >>>>> dp_key); > >>>>> } > >>>>> } > >>>>> > >>>>> if (!isb_dp->type) { > >>>>> icsbrec_datapath_binding_set_type(isb_dp, > >>>>> "transit-switch"); > >>>>> } > >>>>> > >>>>> if (!isb_dp->nb_ic_uuid) { > >>>>> icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp, > >>>>> > >>>>> &ts->header_.uuid, > >>>>> 1); > >>>>> } > >>>>> } > >>>>> > >>>>> struct shash_node *node; > >>>>> SHASH_FOR_EACH (node, isb_ts_dps) { > >>>>> icsbrec_datapath_binding_delete(node->data); > >>>>> } > >>>>> } > >>>>> > >>>>> When a new transit-switch is created, every AZ attempts to create the > >>>>> same > >>>>> datapath_binding on the ISB. Only one request succeeds; the others fail > >>>>> with a "constraint-violation." > >>>>> > >>>>> Impact: > >>>>> This behavior negates the performance benefits of implementing > >>>>> incremental > >>>>> processing, as the system falls back to a full recompute upon these > >>>>> failures. > >>>>> > >>>>> For development purposes, I am currently ignoring these errors, but the > >>>>> ideal way of fixing this issue is to have a mechanism where only a > >>>>> single > >>>>> AZ handles the writes but this would require implementing some consensus > >>>>> protocol. > >>>>> > >>>>> Does anyone have any advice on how we can fix this issue? > >>>> ovn-ic in each AZ enumerates all existing ISB datapaths in > >>>> enumerate_datapaths > >>>> function, then will attempt to add missing datapaths. Since multilpe AZs > >>>> will attempt to add the same missing entry, all but the first will fail > >>>> causing transaction errors. Currently, ovn-ic will enumerate the ISB > >>>> datapath again, see the entry that succeeded and continue to create NB > >>>> in local AZ. This solution does cause a transaction error on all but 1 > >>>> AZ whenever a Transit router is added, but we currently dont have a > >>>> mechanism to manage this gracefully across multiple AZs. > >>> > >>> Hi Mairtin, thanks for the reply. > >>> > >>> Since there is no mechanism to manage which AZ should insert the data, > >>> the only good solution besides implementing a full-fledge consensus > >>> algorithm > >>> like Raft to select a leader AZ, that I came up with is to simply set an > >>> option > >>> in IC_NB_Global to manually configure a specific AZ as a leader, and in > >>> the > >>> code check if the AZ is the leader or not. > >>> > >>> Example: > >>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1 > >>> > >>> In the code: > >>> > >>> const struct icnbrec_ic_nb_global *icnb_global = > >>> icnbrec_ic_nb_global_table_first(ic_nb_global_table); > >>> > >>> const struct nbrec_nb_global *nb_global = > >>> nbrec_nb_global_table_first(nb_global_table); > >>> > >>> const char *leader = smap_get(&icnb_global->options, "leader") > >>> if (!strcmp(leader, nb_global->name)) { > >>> // Insert logic here > >>> } > >>> > >>> Do you have any opinion on this approach? > >>> > >> > >> I was thinking of something a bit different (not too different though). > >> > >> The hierarchy is: > >> > >> IC-NB > >> | > >> ovn-ic (AZ1) ovn-ic (AZ2) ... ovn-ic (AZN) > >> | > >> IC-SB > >> > >> Conceptually this is similar to the intra-az hierarchy: > >> > >> NB > >> | > >> ovn-northd (active) ovn-northd (backup) ... ovn-northd (backup) > >> | > >> SB > >> > >> The way the instances synchronize is by taking the (single) SB database > >> lock. Only one northd succeeds, so that one becomes the "active". > >> > >> What if we do the same for ovn-ic? > >> > > After a closer look at the ovn-ic code I realized this won't really work > out of the box. We really need all of the ovn-ic daemons to write > "something" (either the actual ICSB.port_binding or a request towards > the "leader") in the ICSB. That's because ovn-ic running in AZ2 doesn't > have access to the NB database running in AZ3. > > >> Make all ovn-ic try to take the IC-SB lock. Only the one that succeeds > >> becomes "active" and may write to the IC-SB. > >> > >> That has one implication though: the active instance (it can be any > >> ovn-ic in any AZ) must also make sure the IC-SB port bindings and > >> datapaths for other AZs are up to date. Today it only takes care of the > >> resources for its own AZ. > > > > I think implementing a lock is the right path forward for handling these > > issues. > > My main question is around the "leader AZ" logic: specifically, how it > > should > > handle IC-SB datapaths and port_bindings originating from other AZs. > > One idea I’m considering is having non-leader AZs proxy their requests > > to the leader for IC-SB insertion, but I am unsure on how I would > > implement this. > > > > I think proxy-ing commands might be complicated to get right. What > about the following approach? > > Can we partition the different types of writes ovn-ic needs to do in two > categories? > > a. writes that must be performed only by the "active" ovn-ic (the one > holding the lock). E.g.: maintaining the ICSB.Datapah_Binding table. > > b. writes that can be performed by multiple ovn-ic simultaneously > (without checking the DB lock). E.g., for ICSB.Port_Binding, an ovn-ic > instance should never be (over)writing records created by other ovn-ic. > So in theory they can just all manage their own without any locking just > fine. This is what I was planning initially with the proposal to set an AZ as leader in IC_NB_Global:options. Events which require only one AZ to insert data on ICSB (ex: creating a new datapath for a transit switch/router) would check if the AZ is the leader and then insert, while other events where a non-AZ leader would need to insert data on ICSB (ex: creating a new port_binding) would work without any change. I will be doing a small proof of concept, but I think this will work. Regards, Tiago. > > > I will be reading northd's source code to see if there is anything > > similar to this > > already implemented, but I would like to know if you have any clue as to how > > I could implement this. > > > > Regards, > > Tiago. > > > > Regards, > Dumitru > > >> Each ovn-ic, both active and backup are still responsible for writing to > >> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB > >> centralized databases. > >> > >> I didn't check the code for this into too many details though so there > >> might be other things to consider. > >> > >> What do you think? > >> > >> Regards, > >> Dumitru > >> > >>>>> > >>>>> Thanks, > >>>>> Tiago Matos > >>>>> > >>>>> -- > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> _?Esta mensagem ? direcionada apenas para os endere?os constantes no > >>>>> cabe?alho inicial. Se voc? n?o est? listado nos endere?os constantes no > >>>>> cabe?alho, pedimos-lhe que desconsidere completamente o conte?do dessa > >>>>> mensagem e cuja c?pia, encaminhamento e/ou execu??o das a??es citadas > >>>>> est?o > >>>>> imediatamente anuladas e proibidas?._ > >>>>> > >>>>> > >>>>> *?**?Apesar do Magazine Luiza tomar > >>>>> todas as precau??es razo?veis para assegurar que nenhum v?rus esteja > >>>>> presente nesse e-mail, a empresa n?o poder? aceitar a responsabilidade > >>>>> por > >>>>> quaisquer perdas ou danos causados por esse e-mail ou por seus anexos?.* > >>>>> > >>>>> > >>>>> > >>>>> -------------- next part -------------- > >>>>> An HTML attachment was scrubbed... > >>>>> URL: > >>>>> <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20260128/90a7463f/attachment.htm> > >>>> > >>>> > >>>> Hi Tiago, > >>>> > >>>> I ran into similar issues when adding transit router support and have > >>>> added a comment above. I also have been working on OVN-IC related > >>>> features, so if you would like to discuss above issue further or other > >>>> OVN-IC work I would like to help. > >>>> > >>>> Regards, > >>>> Mairtin > >>>> > >>> > >>> > >>> Regards, > >>> Tiago Matos > >>> > >> > > > -- _‘Esta mensagem é direcionada apenas para os endereços constantes no cabeçalho inicial. Se você não está listado nos endereços constantes no cabeçalho, pedimos-lhe que desconsidere completamente o conteúdo dessa mensagem e cuja cópia, encaminhamento e/ou execução das ações citadas estão imediatamente anuladas e proibidas’._ * **‘Apesar do Magazine Luiza tomar todas as precauções razoáveis para assegurar que nenhum vírus esteja presente nesse e-mail, a empresa não poderá aceitar a responsabilidade por quaisquer perdas ou danos causados por esse e-mail ou por seus anexos’.* _______________________________________________ discuss mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
