On Tue, Feb 3, 2026 at 15:14, Ilya Maximets <[email protected]> wrote:
>
> On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
> > Hi Dumitru,
> > re-sending this email because I forgot to CC ovs-discuss, sorry.
> >
> > On Tue, Feb 3, 2026 at 07:16, Dumitru Ceara <[email protected]> wrote:
> >>
> >> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
> >>> Hi Dumitru, Mairtin,
> >>> sorry for the delayed response.
> >>>
> >>
> >> Hi Tiago,
> >>
> >>> On Thu, Jan 29, 2026 at 10:38, Dumitru Ceara <[email protected]> wrote:
> >>>>
> >>>> Hi Tiago, Mairtin,
> >>>>
> >>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
> >>>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis wrote:
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> I have been working on implementing incremental processing in OVN-IC and
> >>>>>>> encountered a design issue regarding how OVN-IC handles multi-AZ writes.
> >>>>>>>
> >>>>>>> The Issue
> >>>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain events
> >>>>>>> trigger all AZs to attempt writing the same data to the ISB/INB
> >>>>>>> simultaneously. This race condition leads to a constraint violation, which
> >>>>>>> causes the transaction to fail and forces a full recompute.
> >>>>>>>
> >>>>>>> Example:
> >>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
> >>>>>>>
> >>>>>>> if (ctx->ovnisb_txn) {
> >>>>>>>     /* Create ISB Datapath_Binding */
> >>>>>>>     ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
> >>>>>>>         const struct icsbrec_datapath_binding *isb_dp =
> >>>>>>>             shash_find_and_delete(isb_ts_dps, ts->name);
> >>>>>>>         if (!isb_dp) {
> >>>>>>>             /* Allocate tunnel key */
> >>>>>>>             int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>>>>>                                              "transit switch datapath");
> >>>>>>>             if (!dp_key) {
> >>>>>>>                 continue;
> >>>>>>>             }
> >>>>>>>
> >>>>>>>             isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
> >>>>>>>             icsbrec_datapath_binding_set_transit_switch(isb_dp, ts->name);
> >>>>>>>             icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>>>>>         } else if (dp_key_refresh) {
> >>>>>>>             /* Refresh tunnel key since encap mode has changed. */
> >>>>>>>             int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>>>>>                                              "transit switch datapath");
> >>>>>>>             if (dp_key) {
> >>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>>>>>             }
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         if (!isb_dp->type) {
> >>>>>>>             icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         if (!isb_dp->nb_ic_uuid) {
> >>>>>>>             icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
> >>>>>>>                                                     &ts->header_.uuid, 1);
> >>>>>>>         }
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     struct shash_node *node;
> >>>>>>>     SHASH_FOR_EACH (node, isb_ts_dps) {
> >>>>>>>         icsbrec_datapath_binding_delete(node->data);
> >>>>>>>     }
> >>>>>>> }
> >>>>>>>
> >>>>>>> When a new transit-switch is created, every AZ attempts to create the same
> >>>>>>> datapath_binding on the ISB. Only one request succeeds; the others fail
> >>>>>>> with a "constraint-violation."
> >>>>>>>
> >>>>>>> Impact:
> >>>>>>> This behavior negates the performance benefits of implementing incremental
> >>>>>>> processing, as the system falls back to a full recompute upon these
> >>>>>>> failures.
> >>>>>>>
> >>>>>>> For development purposes, I am currently ignoring these errors, but the
> >>>>>>> ideal way of fixing this issue is to have a mechanism where only a single
> >>>>>>> AZ handles the writes, but this would require implementing some consensus
> >>>>>>> protocol.
> >>>>>>>
> >>>>>>> Does anyone have any advice on how we can fix this issue?
> >>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
> >>>>>> enumerate_datapaths function, then attempts to add missing datapaths.
> >>>>>> Since multiple AZs will attempt to add the same missing entry, all but
> >>>>>> the first will fail, causing transaction errors. Currently, ovn-ic will
> >>>>>> enumerate the ISB datapaths again, see the entry that succeeded and
> >>>>>> continue to create NB in the local AZ. This solution does cause a
> >>>>>> transaction error on all but one AZ whenever a transit router is added,
> >>>>>> but we currently don't have a mechanism to manage this gracefully across
> >>>>>> multiple AZs.
> >>>>>
> >>>>> Hi Mairtin, thanks for the reply.
> >>>>>
> >>>>> Since there is no mechanism to manage which AZ should insert the data,
> >>>>> the only good solution I came up with, besides implementing a
> >>>>> full-fledged consensus algorithm like Raft to select a leader AZ, is to
> >>>>> simply set an option in IC_NB_Global to manually configure a specific AZ
> >>>>> as the leader, and check in the code whether the AZ is the leader or not.
> >>>>>
> >>>>> Example:
> >>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
> >>>>>
> >>>>> In the code:
> >>>>>
> >>>>> const struct icnbrec_ic_nb_global *icnb_global =
> >>>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
> >>>>>
> >>>>> const struct nbrec_nb_global *nb_global =
> >>>>>     nbrec_nb_global_table_first(nb_global_table);
> >>>>>
> >>>>> /* smap_get() returns NULL when the option is not set. */
> >>>>> const char *leader = smap_get(&icnb_global->options, "leader");
> >>>>> if (leader && !strcmp(leader, nb_global->name)) {
> >>>>>     /* Insert logic here. */
> >>>>> }
> >>>>>
> >>>>> Do you have any opinion on this approach?
> >>>>>
> >>>>
> >>>> I was thinking of something a bit different (not too different though).
> >>>>
> >>>> The hierarchy is:
> >>>>
> >>>>                          IC-NB
> >>>>                            |
> >>>>   ovn-ic (AZ1)      ovn-ic (AZ2)     ...     ovn-ic (AZN)
> >>>>                            |
> >>>>                          IC-SB
> >>>>
> >>>> Conceptually this is similar to the intra-AZ hierarchy:
> >>>>
> >>>>                                  NB
> >>>>                                  |
> >>>>   ovn-northd (active)    ovn-northd (backup)   ...   ovn-northd (backup)
> >>>>                                  |
> >>>>                                  SB
> >>>>
> >>>> The way the instances synchronize is by taking the (single) SB database
> >>>> lock. Only one northd succeeds, so that one becomes the "active".
> >>>>
> >>>> What if we do the same for ovn-ic?
> >>>>
> >>
> >> After a closer look at the ovn-ic code I realized this won't really work
> >> out of the box. We really need all of the ovn-ic daemons to write
> >> "something" (either the actual ICSB.port_binding or a request towards
> >> the "leader") in the ICSB. That's because ovn-ic running in AZ2 doesn't
> >> have access to the NB database running in AZ3.
> >>
> >>>> Make all ovn-ic try to take the IC-SB lock. Only the one that succeeds
> >>>> becomes "active" and may write to the IC-SB.
> >>>>
> >>>> That has one implication though: the active instance (it can be any
> >>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
> >>>> datapaths for other AZs are up to date. Today it only takes care of the
> >>>> resources for its own AZ.
> >>>
> >>> I think implementing a lock is the right path forward for handling these
> >>> issues.
> >>> My main question is around the "leader AZ" logic: specifically, how it
> >>> should handle IC-SB datapaths and port_bindings originating from other AZs.
> >>> One idea I'm considering is having non-leader AZs proxy their requests
> >>> to the leader for IC-SB insertion, but I am unsure how I would
> >>> implement this.
> >>>
> >>
> >> I think proxying commands might be complicated to get right. What
> >> about the following approach?
> >>
> >> Can we partition the different types of writes ovn-ic needs to do into two
> >> categories?
> >>
> >> a. writes that must be performed only by the "active" ovn-ic (the one
> >> holding the lock). E.g.: maintaining the ICSB.Datapath_Binding table.
> >>
> >> b. writes that can be performed by multiple ovn-ic simultaneously
> >> (without checking the DB lock). E.g., for ICSB.Port_Binding, an ovn-ic
> >> instance should never be (over)writing records created by other ovn-ic.
> >> So in theory they can all manage their own just fine without any locking.
> >
> > This is what I was planning initially with the proposal to set an AZ
> > as leader in IC_NB_Global:options. Events which require only one AZ to
> > insert data on ICSB (e.g., creating a new datapath for a transit
> > switch/router) would check if the AZ is the leader and then insert,
> > while other events where a non-leader AZ would need to insert data on
> > ICSB (e.g., creating a new port_binding) would work without any change.
> >
> > I will be doing a small proof of concept, but I think this will work.
>
> I think the idea from Dumitru was to still use database locking, but
> only for some parts. That would require having separate connections
> to the database: one with the lock and the other without. So, the
> instance that has the lock can do common operations through the locked
> connection and all the instances can do their own specific operations
> using a separate connection without a lock.
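
A minimal sketch of the write partitioning described above, not actual ovn-ic code. The enum, struct and function names are hypothetical and exist only for illustration. The `has_lock` flag stands in for whatever the locked IC-SB connection would report; in OVS the IDL exposes this through ovsdb_idl_set_lock() and ovsdb_idl_has_lock(), which is how ovn-northd picks its active instance. Ownership of per-AZ rows is modelled here as a plain AZ-name comparison, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of IC-SB writes (illustrative names). */
enum ic_write_kind {
    IC_WRITE_SHARED,   /* (a) e.g. Datapath_Binding: lock holder only. */
    IC_WRITE_PER_AZ,   /* (b) e.g. Port_Binding: each AZ owns its rows. */
};

/* Hypothetical per-daemon state.  'has_lock' mirrors what the locked
 * connection would report (assumption: refreshed on every main loop
 * iteration, since lock ownership can change at any time). */
struct ic_instance {
    const char *az_name;
    bool has_lock;
};

/* Returns true if 'ic' may perform a write of 'kind'.  'row_az' names the
 * AZ that owns the row and is ignored for shared rows. */
bool
ic_may_write(const struct ic_instance *ic, enum ic_write_kind kind,
             const char *row_az)
{
    assert(ic && ic->az_name);

    switch (kind) {
    case IC_WRITE_SHARED:
        /* Only the instance holding the IC-SB lock touches shared rows. */
        return ic->has_lock;
    case IC_WRITE_PER_AZ:
        /* Any instance may write, but only rows of its own AZ. */
        return row_az && !strcmp(ic->az_name, row_az);
    }
    return false;
}
```

With a check like this, only the lock holder would insert or delete Datapath_Binding rows, while every instance keeps managing the Port_Binding rows of its own AZ over the unlocked connection. If the lock holder goes away, the lock moves to another instance, so no leader has to be configured by hand.
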
>
> So, yes, it's similar to the IC_NB_Global:options proposal, but the
> leader is chosen automatically via the ovsdb locking mechanism, which
> avoids the problem of a leader going down.
>
> Best regards, Ilya Maximets.

Hi Ilya,

Thanks for the clear explanation, I understand it now.
I'm going to put together a PoC to test this out and will follow up with
the results.

Regards,
Tiago.

>
> >
> > Regards,
> > Tiago.
> >
> >>
> >>> I will be reading northd's source code to see if there is anything
> >>> similar to this already implemented, but I would like to know if you
> >>> have any clue as to how I could implement this.
> >>>
> >>> Regards,
> >>> Tiago.
> >>>
> >>
> >> Regards,
> >> Dumitru
> >>
> >>>> Each ovn-ic, both active and backup, is still responsible for writing to
> >>>> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB
> >>>> centralized databases.
> >>>>
> >>>> I didn't check the code for this in too much detail though, so there
> >>>> might be other things to consider.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Regards,
> >>>> Dumitru
> >>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Tiago Matos
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hi Tiago,
> >>>>>>
> >>>>>> I ran into similar issues when adding transit router support and have
> >>>>>> added a comment above. I have also been working on OVN-IC related
> >>>>>> features, so if you would like to discuss the above issue or other
> >>>>>> OVN-IC work further, I would like to help.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Mairtin
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Tiago Matos
> >>>>>
> >>>>
> >>>
> >>
> >
>

-- 
This message is intended only for the addresses listed in the initial
header. If you are not listed among the addresses in the header, we ask
that you completely disregard the content of this message; copying,
forwarding and/or carrying out the actions mentioned in it are immediately
void and prohibited.

Although Magazine Luiza takes all reasonable precautions to ensure that no
virus is present in this e-mail, the company cannot accept responsibility
for any loss or damage caused by this e-mail or its attachments.
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
