On Tue, Feb 3, 2026 at 15:14, Ilya Maximets
<[email protected]> wrote:
>
> On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
> > Hi Dumitru,
> > re-sending this email because I forgot to CC ovs-discuss, sorry.
> >
> > On Tue, Feb 3, 2026 at 07:16, Dumitru Ceara <[email protected]> wrote:
> >>
> >> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
> >>> Hi Dumitru, Mairtin,
> >>> sorry for delayed response.
> >>>
> >>
> >> Hi Tiago,
> >>
> >>> On Thu, Jan 29, 2026 at 10:38, Dumitru Ceara
> >>> <[email protected]> wrote:
> >>>>
> >>>> Hi Tiago, Mairtin,
> >>>>
> >>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
> >>>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis wrote:
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> I have been working on implementing incremental processing in OVN-IC and
> >>>>>>> encountered a design issue regarding how OVN-IC handles multi-AZ writes.
> >>>>>>>
> >>>>>>> The Issue
> >>>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain events
> >>>>>>> trigger all AZs to attempt writing the same data to the ISB/INB
> >>>>>>> simultaneously. This race condition leads to a constraint violation, which
> >>>>>>> causes the transaction to fail and forces a full recompute.
> >>>>>>>
> >>>>>>> Example:
> >>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
> >>>>>>>
> >>>>>>>     if (ctx->ovnisb_txn) {
> >>>>>>>         /* Create ISB Datapath_Binding */
> >>>>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
> >>>>>>>             const struct icsbrec_datapath_binding *isb_dp =
> >>>>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
> >>>>>>>             if (!isb_dp) {
> >>>>>>>                 /* Allocate tunnel key */
> >>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>>>>>                                                  "transit switch datapath");
> >>>>>>>                 if (!dp_key) {
> >>>>>>>                     continue;
> >>>>>>>                 }
> >>>>>>>
> >>>>>>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
> >>>>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp,
> >>>>>>>                                                             ts->name);
> >>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>>>>>             } else if (dp_key_refresh) {
> >>>>>>>                 /* Refresh tunnel key since encap mode has changed. */
> >>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>>>>>                                                  "transit switch datapath");
> >>>>>>>                 if (dp_key) {
> >>>>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>>>>>                 }
> >>>>>>>             }
> >>>>>>>
> >>>>>>>             if (!isb_dp->type) {
> >>>>>>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
> >>>>>>>             }
> >>>>>>>
> >>>>>>>             if (!isb_dp->nb_ic_uuid) {
> >>>>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
> >>>>>>>                                                         &ts->header_.uuid,
> >>>>>>>                                                         1);
> >>>>>>>             }
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         struct shash_node *node;
> >>>>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
> >>>>>>>             icsbrec_datapath_binding_delete(node->data);
> >>>>>>>         }
> >>>>>>>     }
> >>>>>>>
> >>>>>>> When a new transit-switch is created, every AZ attempts to create the same
> >>>>>>> datapath_binding on the ISB. Only one request succeeds; the others fail
> >>>>>>> with a "constraint-violation."
> >>>>>>>
> >>>>>>> Impact:
> >>>>>>> This behavior negates the performance benefits of implementing incremental
> >>>>>>> processing, as the system falls back to a full recompute upon these
> >>>>>>> failures.
> >>>>>>>
> >>>>>>> For development purposes, I am currently ignoring these errors, but the
> >>>>>>> ideal way of fixing this issue is to have a mechanism where only a single
> >>>>>>> AZ handles the writes, but this would require implementing some consensus
> >>>>>>> protocol.
> >>>>>>>
> >>>>>>> Does anyone have any advice on how we can fix this issue?
> >>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
> >>>>>> enumerate_datapaths function, then attempts to add missing datapaths.
> >>>>>> Since multiple AZs will attempt to add the same missing entry, all but the
> >>>>>> first will fail, causing transaction errors. Currently, ovn-ic will
> >>>>>> enumerate the ISB datapaths again, see the entry that succeeded and
> >>>>>> continue to create NB in the local AZ. This solution does cause a
> >>>>>> transaction error on all but one AZ whenever a transit router is added,
> >>>>>> but we currently don't have a mechanism to manage this gracefully across
> >>>>>> multiple AZs.
> >>>>>
> >>>>> Hi Mairtin, thanks for the reply.
> >>>>>
> >>>>> Since there is no mechanism to manage which AZ should insert the data,
> >>>>> the only good solution I came up with, besides implementing a
> >>>>> full-fledged consensus algorithm like Raft to select a leader AZ, is to
> >>>>> simply set an option in IC_NB_Global to manually configure a specific AZ
> >>>>> as the leader, and check in the code whether the AZ is the leader or not.
> >>>>>
> >>>>> Example:
> >>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
> >>>>>
> >>>>> In the code:
> >>>>>
> >>>>> const struct icnbrec_ic_nb_global *icnb_global =
> >>>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
> >>>>>
> >>>>> const struct nbrec_nb_global *nb_global =
> >>>>>     nbrec_nb_global_table_first(nb_global_table);
> >>>>>
> >>>>> /* smap_get() returns NULL when the option is not set. */
> >>>>> const char *leader = smap_get(&icnb_global->options, "leader");
> >>>>> if (leader && !strcmp(leader, nb_global->name)) {
> >>>>>     /* Insert logic here */
> >>>>> }
> >>>>>
> >>>>> Do you have any opinion on this approach?
> >>>>>
> >>>>
> >>>> I was thinking of something a bit different (not too different though).
> >>>>
> >>>> The hierarchy is:
> >>>>
> >>>>              IC-NB
> >>>>                |
> >>>> ovn-ic (AZ1)  ovn-ic (AZ2)  ...  ovn-ic (AZN)
> >>>>                |
> >>>>              IC-SB
> >>>>
> >>>> Conceptually this is similar to the intra-az hierarchy:
> >>>>
> >>>>                       NB
> >>>>                       |
> >>>> ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
> >>>>                       |
> >>>>                       SB
> >>>>
> >>>> The way the instances synchronize is by taking the (single) SB database
> >>>> lock.  Only one northd succeeds, so that one becomes the "active".
> >>>>
> >>>> What if we do the same for ovn-ic?
> >>>>
> >>
> >> After a closer look at the ovn-ic code I realized this won't really work
> >> out of the box.  We really need all of the ovn-ic daemons to write
> >> "something" (either the actual ICSB.port_binding or a request towards
> >> the "leader") in the ICSB.  That's because ovn-ic running in AZ2 doesn't
> >> have access to the NB database running in AZ3.
> >>
> >>>> Make all ovn-ic try to take the IC-SB lock.  Only the one that succeeds
> >>>> becomes "active" and may write to the IC-SB.
> >>>>
> >>>> That has one implication though: the active instance (it can be any
> >>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
> >>>> datapaths for other AZs are up to date.  Today it only takes care of the
> >>>> resources for its own AZ.
> >>>
> >>> I think implementing a lock is the right path forward for handling these
> >>> issues. My main question is around the "leader AZ" logic: specifically,
> >>> how it should handle IC-SB datapaths and port_bindings originating from
> >>> other AZs. One idea I’m considering is having non-leader AZs proxy their
> >>> requests to the leader for IC-SB insertion, but I am unsure how I would
> >>> implement this.
> >>>
> >>
> >> I think proxy-ing commands might be complicated to get right.  What
> >> about the following approach?
> >>
> >> Can we partition the different types of writes ovn-ic needs to do in two
> >> categories?
> >>
> >> a. writes that must be performed only by the "active" ovn-ic (the one
> >> holding the lock).  E.g.: maintaining the ICSB.Datapath_Binding table.
> >>
> >> b. writes that can be performed by multiple ovn-ic simultaneously
> >> (without checking the DB lock).  E.g., for ICSB.Port_Binding, an ovn-ic
> >> instance should never be (over)writing records created by other ovn-ic.
> >> So in theory they can just all manage their own without any locking just
> >> fine.
> >
> > This is what I was planning initially with the proposal to set an AZ
> > as leader in IC_NB_Global:options. Events which require only one AZ to
> > insert data on ICSB (ex: creating a new datapath for a transit
> > switch/router) would check if the AZ is the leader and then insert,
> > while other events where a non-leader AZ would need to insert data on
> > ICSB (ex: creating a new port_binding) would work without any change.
> >
> > I will be doing a small proof of concept, but I think this will work.
>
> I think the idea from Dumitru was to still use database locking, but
> only for some parts.  That would require having a separate connection
> to the database.  One with the lock and the other without.  So, the
> instance that has the lock can do common operations through the locked
> connection and all the instances can do their own specific operations
> using a separate connection without a lock.
>
> So, yes, it's similar to the IC_NB_Global:options proposal, but the
> leader is chosen automatically via the ovsdb locking mechanism, which
> avoids the problem of a leader going down.
>
> Best regards, Ilya Maximets.

Hi Ilya,
Thanks for the clear explanation, I understand it now. I’m going to
put together a PoC to test this out and will follow up with the
results.
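
As a starting point for that PoC, here is a toy model in plain C of how
I understand the scheme. Everything in it (toy_isb, may_write,
simulate_round, the write_kind enum) is a made-up name for illustration,
not OVN code, and the "lock" is just a flag. In ovn-ic itself I assume
the flag would come from the IDL lock API that ovn-northd already uses
(ovsdb_idl_set_lock() / ovsdb_idl_has_lock()) on the locked IC-SB
connection, but I still need to check how that fits.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ROWS 16

/* Toy stand-in for ICSB.Datapath_Binding with a unique index on the
 * transit switch name.  Hypothetical, for illustration only. */
struct toy_isb {
    const char *ts_names[MAX_ROWS];
    int n_rows;
};

/* Returns true on success, false on a uniqueness violation. */
static bool
toy_isb_insert(struct toy_isb *isb, const char *ts_name)
{
    for (int i = 0; i < isb->n_rows; i++) {
        if (!strcmp(isb->ts_names[i], ts_name)) {
            return false;
        }
    }
    assert(isb->n_rows < MAX_ROWS);
    isb->ts_names[isb->n_rows++] = ts_name;
    return true;
}

enum write_kind {
    WRITE_SHARED,   /* Category (a): e.g. Datapath_Binding rows. */
    WRITE_PER_AZ,   /* Category (b): e.g. this AZ's own Port_Binding rows. */
};

/* Shared rows may only be written by the lock holder; per-AZ rows may be
 * written by any instance. */
static bool
may_write(enum write_kind kind, bool has_lock)
{
    return kind == WRITE_PER_AZ || has_lock;
}

/* Every AZ reacts to the same new transit switch after seeing the same
 * snapshot in which its datapath is missing.  Returns the number of
 * inserts that fail.  With 'use_lock' false every AZ behaves as if it
 * were allowed to write shared rows, which models today's behavior. */
static int
simulate_round(int n_azs, bool use_lock, int lock_holder,
               const char *ts_name)
{
    struct toy_isb isb = { .n_rows = 0 };
    int failures = 0;

    for (int az = 0; az < n_azs; az++) {
        bool has_lock = !use_lock || az == lock_holder;

        if (!may_write(WRITE_SHARED, has_lock)) {
            continue;
        }
        if (!toy_isb_insert(&isb, ts_name)) {
            failures++;
        }
    }
    return failures;
}
```

With three AZs, simulate_round(3, false, 0, "ts1") reports two failed
inserts (all but the first), while simulate_round(3, true, 1, "ts1")
reports none, because only the lock holder attempts the shared write.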

Regards,
Tiago.

>
> >
> > Regards,
> > Tiago.
> >
> >>
> >>> I will be reading northd's source code to see if there is anything
> >>> similar to this already implemented, but I would like to know if you
> >>> have any clue as to how I could implement this.
> >>>
> >>> Regards,
> >>> Tiago.
> >>>
> >>
> >> Regards,
> >> Dumitru
> >>
> >>>> Each ovn-ic, both active and backup are still responsible for writing to
> >>>> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB
> >>>> centralized databases.
> >>>>
> >>>> I didn't check the code for this in too much detail though, so there
> >>>> might be other things to consider.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Regards,
> >>>> Dumitru
> >>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Tiago Matos
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hi Tiago,
> >>>>>>
> >>>>>> I ran into similar issues when adding transit router support and have
> >>>>>> added a comment above. I also have been working on OVN-IC related
> >>>>>> features, so if you would like to discuss above issue further or other
> >>>>>> OVN-IC work I would like to help.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Mairtin
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Tiago Matos
> >>>>>
> >>>>
> >>>
> >>
> >
>

-- 




_‘This message is intended only for the addresses listed in the initial
header. If you are not listed among the addresses in the header, we ask
that you completely disregard the content of this message; any copying,
forwarding and/or carrying out of the actions mentioned in it are
immediately void and prohibited’._


*‘Although Magazine Luiza takes all reasonable precautions to ensure that
no virus is present in this e-mail, the company cannot accept
responsibility for any loss or damage caused by this e-mail or its
attachments’.*



_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
