Hi Dumitru,
re-sending this email because I forgot to CC ovs-discuss, sorry.

On Tue, Feb 3, 2026 at 07:16, Dumitru Ceara <[email protected]> wrote:
>
> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
> > Hi Dumitru, Mairtin,
> > sorry for the delayed response.
> >
>
> Hi Tiago,
>
> > On Thu, Jan 29, 2026 at 10:38, Dumitru Ceara
> > <[email protected]> wrote:
> >>
> >> Hi Tiago, Mairtin,
> >>
> >> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
> >>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
> >>> <[email protected]> wrote:
> >>>>
> >>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis 
> >>>> wrote:
> >>>>> Hi everyone,
> >>>>>
> >>>>> I have been working on implementing incremental processing in OVN-IC and
> >>>>> encountered a design issue regarding how OVN-IC handles multi-AZ writes.
> >>>>>
> >>>>> The Issue
> >>>>> In a scenario where multiple AZs are connected via OVN-IC, certain events
> >>>>> trigger all AZs to attempt writing the same data to the ISB/INB
> >>>>> simultaneously. This race condition leads to a constraint violation, which
> >>>>> causes the transaction to fail and forces a full recompute.
> >>>>>
> >>>>> Example:
> >>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
> >>>>>
> >>>>>     if (ctx->ovnisb_txn) {
> >>>>>         /* Create ISB Datapath_Binding */
> >>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
> >>>>>             const struct icsbrec_datapath_binding *isb_dp =
> >>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
> >>>>>             if (!isb_dp) {
> >>>>>                 /* Allocate tunnel key */
> >>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>>>                                                  "transit switch datapath");
> >>>>>                 if (!dp_key) {
> >>>>>                     continue;
> >>>>>                 }
> >>>>>
> >>>>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
> >>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp,
> >>>>>                                                             ts->name);
> >>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>>>             } else if (dp_key_refresh) {
> >>>>>                 /* Refresh tunnel key since encap mode has changed. */
> >>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>>>                                                  "transit switch datapath");
> >>>>>                 if (dp_key) {
> >>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>>>                 }
> >>>>>             }
> >>>>>
> >>>>>             if (!isb_dp->type) {
> >>>>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
> >>>>>             }
> >>>>>
> >>>>>             if (!isb_dp->nb_ic_uuid) {
> >>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
> >>>>>                                                         &ts->header_.uuid, 1);
> >>>>>             }
> >>>>>         }
> >>>>>
> >>>>>         struct shash_node *node;
> >>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
> >>>>>             icsbrec_datapath_binding_delete(node->data);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>> When a new transit-switch is created, every AZ attempts to create the same
> >>>>> datapath_binding on the ISB. Only one request succeeds; the others fail
> >>>>> with a "constraint-violation."
> >>>>>
> >>>>> Impact:
> >>>>> This behavior negates the performance benefits of implementing incremental
> >>>>> processing, as the system falls back to a full recompute upon these
> >>>>> failures.
> >>>>>
> >>>>> For development purposes, I am currently ignoring these errors, but the
> >>>>> ideal way of fixing this issue is to have a mechanism where only a single
> >>>>> AZ handles the writes, but this would require implementing some consensus
> >>>>> protocol.
> >>>>>
> >>>>> Does anyone have any advice on how we can fix this issue?
> >>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
> >>>> enumerate_datapaths function, then attempts to add any missing datapaths.
> >>>> Since multiple AZs will attempt to add the same missing entry, all but the
> >>>> first will fail, causing transaction errors. Currently, ovn-ic will
> >>>> enumerate the ISB datapaths again, see the entry that succeeded and
> >>>> continue to create the NB entries in the local AZ. This does cause a
> >>>> transaction error on all but one AZ whenever a transit router is added,
> >>>> but we currently don't have a mechanism to manage this gracefully across
> >>>> multiple AZs.
> >>>
> >>> Hi Mairtin, thanks for the reply.
> >>>
> >>> Since there is no mechanism to manage which AZ should insert the data, the
> >>> only good solution I came up with, besides implementing a full-fledged
> >>> consensus algorithm like Raft to select a leader AZ, is to simply set an
> >>> option in IC_NB_Global to manually configure a specific AZ as the leader,
> >>> and in the code check whether the AZ is the leader or not.
> >>>
> >>> Example:
> >>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
> >>>
> >>> In the code:
> >>>
> >>> const struct icnbrec_ic_nb_global *icnb_global =
> >>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
> >>>
> >>> const struct nbrec_nb_global *nb_global =
> >>>     nbrec_nb_global_table_first(nb_global_table);
> >>>
> >>> /* smap_get() returns NULL when the option is not set. */
> >>> const char *leader = smap_get(&icnb_global->options, "leader");
> >>> if (leader && !strcmp(leader, nb_global->name)) {
> >>>     /* Insert logic here. */
> >>> }
> >>>
> >>> Do you have any opinion on this approach?
> >>>
> >>
> >> I was thinking of something a bit different (not too different though).
> >>
> >> The hierarchy is:
> >>
> >>              IC-NB
> >>                |
> >> ovn-ic (AZ1)  ovn-ic (AZ2)  ...  ovn-ic (AZN)
> >>                |
> >>              IC-SB
> >>
> >> Conceptually this is similar to the intra-az hierarchy:
> >>
> >>                       NB
> >>                       |
> >> ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
> >>                       |
> >>                       SB
> >>
> >> The way the instances synchronize is by taking the (single) SB database
> >> lock.  Only one northd succeeds, so that one becomes the "active".
> >>
> >> What if we do the same for ovn-ic?
> >>
>
> After a closer look at the ovn-ic code I realized this won't really work
> out of the box.  We really need all of the ovn-ic daemons to write
> "something" (either the actual ICSB.port_binding or a request towards
> the "leader") in the ICSB.  That's because ovn-ic running in AZ2 doesn't
> have access to the NB database running in AZ3.
>
> >> Make all ovn-ic try to take the IC-SB lock.  Only the one that succeeds
> >> becomes "active" and may write to the IC-SB.
> >>
> >> That has one implication though: the active instance (it can be any
> >> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
> >> datapaths for other AZs are up to date.  Today it only takes care of the
> >> resources for its own AZ.
> >
> > I think implementing a lock is the right path forward for handling these
> > issues.
> > My main question is around the "leader AZ" logic: specifically, how it
> > should handle IC-SB datapaths and port_bindings originating from other AZs.
> > One idea I'm considering is having non-leader AZs proxy their requests
> > to the leader for IC-SB insertion, but I am unsure how I would
> > implement this.
> >
>
> I think proxying commands might be complicated to get right.  What
> about the following approach?
>
> Can we partition the different types of writes ovn-ic needs to do into
> two categories?
>
> a. writes that must be performed only by the "active" ovn-ic (the one
> holding the lock).  E.g.: maintaining the ICSB.Datapath_Binding table.
>
> b. writes that can be performed by multiple ovn-ic simultaneously
> (without checking the DB lock).  E.g., for ICSB.Port_Binding, an ovn-ic
> instance should never be (over)writing records created by other ovn-ic.
> So in theory they can just all manage their own without any locking just
> fine.

This is what I was planning initially with the proposal to set an AZ
as leader in IC_NB_Global:options. Events which require only one AZ to
insert data in the ICSB (e.g. creating a new datapath for a transit
switch/router) would check if the AZ is the leader and then insert,
while other events where a non-leader AZ would need to insert data in
the ICSB (e.g. creating a new port_binding) would work without any change.

I will be doing a small proof of concept, but I think this will work.

Regards,
Tiago.

>
> > I will be reading northd's source code to see if there is anything
> > similar to this already implemented, but I would like to know if you
> > have any clue as to how I could implement this.
> >
> > Regards,
> > Tiago.
> >
>
> Regards,
> Dumitru
>
> >> Each ovn-ic, both active and backup, is still responsible for writing to
> >> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB
> >> centralized databases.
> >>
> >> I didn't check the code for this in too much detail though, so there
> >> might be other things to consider.
> >>
> >> What do you think?
> >>
> >> Regards,
> >> Dumitru
> >>
> >>>>>
> >>>>> Thanks,
> >>>>> Tiago Matos
> >>>>>
> >>>>
> >>>>
> >>>> Hi Tiago,
> >>>>
> >>>> I ran into similar issues when adding transit router support and have
> >>>> added a comment above. I also have been working on OVN-IC related
> >>>> features, so if you would like to discuss above issue further or other
> >>>> OVN-IC work I would like to help.
> >>>>
> >>>> Regards,
> >>>> Mairtin
> >>>>
> >>>
> >>>
> >>> Regards,
> >>> Tiago Matos
> >>>
> >>
> >
>

-- 




_'This message is intended only for the addresses listed in the initial
header. If you are not listed among the addresses in the header, we ask
that you completely disregard the content of this message; copying,
forwarding and/or carrying out the actions mentioned in it are
immediately void and prohibited.'_


*'Although Magazine Luiza takes all reasonable precautions to ensure that
no virus is present in this e-mail, the company cannot accept
responsibility for any loss or damage caused by this e-mail or its
attachments.'*



_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
