On 2/4/26 10:37 AM, Dumitru Ceara wrote:
> On 2/3/26 8:15 PM, Tiago Matos Carvalho Reis wrote:
>> On Tue, Feb 3, 2026 at 3:14 PM, Ilya Maximets
>> <[email protected]> wrote:
>>>
>>> On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>> Hi Dumitru,
>>>> re-sending this email because I forgot to CC ovs-discuss, sorry.
>>>>
>>>> On Tue, Feb 3, 2026 at 7:16 AM, Dumitru Ceara <[email protected]>
>>>> wrote:
>>>>>
>>>>> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
>>>>>> Hi Dumitru, Mairtin,
>>>>>> sorry for delayed response.
>>>>>>
>>>>>
>>>>> Hi Tiago,
>>>>>
>>>>>> On Thu, Jan 29, 2026 at 10:38 AM, Dumitru Ceara
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Tiago, Mairtin,
>>>>>>>
>>>>>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>>>>>> On Thu, Jan 29, 2026 at 9:11 AM, Mairtin O'Loingsigh
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis 
>>>>>>>>> wrote:
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I have been working on implementing incremental processing in
>>>>>>>>>> OVN-IC and encountered a design issue regarding how OVN-IC
>>>>>>>>>> handles multi-AZ writes.
>>>>>>>>>>
>>>>>>>>>> The Issue
>>>>>>>>>> In a scenario where multiple AZs are connected via OVN-IC,
>>>>>>>>>> certain events trigger all AZs to attempt writing the same data
>>>>>>>>>> to the ISB/INB simultaneously.  This race condition leads to a
>>>>>>>>>> constraint violation, which causes the transaction to fail and
>>>>>>>>>> forces a full recompute.
>>>>>>>>>>
>>>>>>>>>> Example:
>>>>>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
>>>>>>>>>>
>>>>>>>>>>     if (ctx->ovnisb_txn) {
>>>>>>>>>>         /* Create ISB Datapath_Binding */
>>>>>>>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>>>>>>             const struct icsbrec_datapath_binding *isb_dp =
>>>>>>>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>>>>>>             if (!isb_dp) {
>>>>>>>>>>                 /* Allocate tunnel key */
>>>>>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>>>>                                                  "transit switch datapath");
>>>>>>>>>>                 if (!dp_key) {
>>>>>>>>>>                     continue;
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>>>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp, ts->name);
>>>>>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>>>>             } else if (dp_key_refresh) {
>>>>>>>>>>                 /* Refresh tunnel key since encap mode has changed. */
>>>>>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>>>>>>                                                  "transit switch datapath");
>>>>>>>>>>                 if (dp_key) {
>>>>>>>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             if (!isb_dp->type) {
>>>>>>>>>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             if (!isb_dp->nb_ic_uuid) {
>>>>>>>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>>>>>>                                                         &ts->header_.uuid, 1);
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         struct shash_node *node;
>>>>>>>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>>>>>>             icsbrec_datapath_binding_delete(node->data);
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> When a new transit switch is created, every AZ attempts to
>>>>>>>>>> create the same datapath_binding on the ISB.  Only one request
>>>>>>>>>> succeeds; the others fail with a "constraint-violation".
>>>>>>>>>>
>>>>>>>>>> Impact:
>>>>>>>>>> This behavior negates the performance benefits of implementing
>>>>>>>>>> incremental processing, as the system falls back to a full
>>>>>>>>>> recompute upon these failures.
>>>>>>>>>>
>>>>>>>>>> For development purposes, I am currently ignoring these errors,
>>>>>>>>>> but the ideal way of fixing this issue is to have a mechanism
>>>>>>>>>> where only a single AZ handles the writes.  However, this would
>>>>>>>>>> require implementing some consensus protocol.
>>>>>>>>>>
>>>>>>>>>> Does anyone have any advice on how we can fix this issue?
>>>>>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
>>>>>>>>> enumerate_datapaths function and then attempts to add the missing
>>>>>>>>> datapaths.  Since multiple AZs will attempt to add the same
>>>>>>>>> missing entry, all but the first will fail, causing transaction
>>>>>>>>> errors.  Currently, ovn-ic will enumerate the ISB datapaths
>>>>>>>>> again, see the entry that succeeded, and continue to create the
>>>>>>>>> NB resources in the local AZ.  This solution does cause a
>>>>>>>>> transaction error on all but one AZ whenever a transit router is
>>>>>>>>> added, but we currently don't have a mechanism to manage this
>>>>>>>>> gracefully across multiple AZs.
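>>>>>>>>>
>>>>>>>>> For context, this failure surfaces at commit time as the IDL
>>>>>>>>> transaction finishing with TXN_ERROR.  A minimal sketch of the
>>>>>>>>> pattern (simplified; 'txn' stands in for ovn-ic's IC-SB
>>>>>>>>> transaction):
>>>>>>>>>
>>>>>>>>>     enum ovsdb_idl_txn_status status = ovsdb_idl_txn_commit(txn);
>>>>>>>>>     if (status == TXN_ERROR) {
>>>>>>>>>         /* Another AZ inserted the same row first, so the server
>>>>>>>>>          * rejected ours with a constraint violation.  The next
>>>>>>>>>          * iteration re-reads the ISB, finds the row that won and
>>>>>>>>>          * moves on, at the cost of a full recompute. */
>>>>>>>>>     }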
>>>>>>>>
>>>>>>>> Hi Mairtin, thanks for the reply.
>>>>>>>>
>>>>>>>> Since there is no mechanism to manage which AZ should insert the
>>>>>>>> data, the only good solution I came up with, besides implementing
>>>>>>>> a full-fledged consensus algorithm like Raft to select a leader
>>>>>>>> AZ, is to simply set an option in IC_NB_Global to manually
>>>>>>>> configure a specific AZ as the leader, and check in the code
>>>>>>>> whether the AZ is the leader or not.
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>>>>>>
>>>>>>>> In the code:
>>>>>>>>
>>>>>>>> const struct icnbrec_ic_nb_global *icnb_global =
>>>>>>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>>>>>>
>>>>>>>> const struct nbrec_nb_global *nb_global =
>>>>>>>>     nbrec_nb_global_table_first(nb_global_table);
>>>>>>>>
>>>>>>>> const char *leader = smap_get(&icnb_global->options, "leader");
>>>>>>>> if (leader && !strcmp(leader, nb_global->name)) {
>>>>>>>>     /* Insert logic here. */
>>>>>>>> }
>>>>>>>>
>>>>>>>> Do you have any opinion on this approach?
>>>>>>>>
>>>>>>>
>>>>>>> I was thinking of something a bit different (not too different though).
>>>>>>>
>>>>>>> The hierarchy is:
>>>>>>>
>>>>>>>              IC-NB
>>>>>>>                |
>>>>>>> ovn-ic (AZ1)  ovn-ic (AZ2)  ...  ovn-ic (AZN)
>>>>>>>                |
>>>>>>>              IC-SB
>>>>>>>
>>>>>>> Conceptually this is similar to the intra-az hierarchy:
>>>>>>>
>>>>>>>                       NB
>>>>>>>                       |
>>>>>>> ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
>>>>>>>                       |
>>>>>>>                       SB
>>>>>>>
>>>>>>> The way the instances synchronize is by taking the (single) SB database
>>>>>>> lock.  Only one northd succeeds, so that one becomes the "active".
>>>>>>>
>>>>>>> What if we do the same for ovn-ic?
>>>>>>>
>>>>>
>>>>> After a closer look at the ovn-ic code I realized this won't really work
>>>>> out of the box.  We really need all of the ovn-ic daemons to write
>>>>> "something" (either the actual ICSB.port_binding or a request towards
>>>>> the "leader") in the ICSB.  That's because ovn-ic running in AZ2 doesn't
>>>>> have access to the NB database running in AZ3.
>>>>>
>>>>>>> Make all ovn-ic try to take the IC-SB lock.  Only the one that succeeds
>>>>>>> becomes "active" and may write to the IC-SB.
>>>>>>>
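>>>>>>> A minimal sketch of how that could look, reusing the existing
>>>>>>> IDL lock API the same way ovn-northd does (the "ovn_ic_leader"
>>>>>>> lock name and the variable names are just examples):
>>>>>>>
>>>>>>>     /* At startup, ask the IC-SB IDL to acquire a named lock. */
>>>>>>>     ovsdb_idl_set_lock(ovnisb_idl, "ovn_ic_leader");
>>>>>>>
>>>>>>>     /* In the main loop, only the lock owner is "active". */
>>>>>>>     if (ovsdb_idl_has_lock(ovnisb_idl)) {
>>>>>>>         /* This instance may write shared IC-SB data. */
>>>>>>>     } else if (ovsdb_idl_is_lock_contended(ovnisb_idl)) {
>>>>>>>         /* Another ovn-ic holds the lock; act as backup. */
>>>>>>>     }
>>>>>>>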
>>>>>>> That has one implication though: the active instance (it can be any
>>>>>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>>>>>>> datapaths for other AZs are up to date.  Today it only takes care of the
>>>>>>> resources for its own AZ.
>>>>>>
>>>>>> I think implementing a lock is the right path forward for handling
>>>>>> these issues.  My main question is around the "leader AZ" logic:
>>>>>> specifically, how it should handle IC-SB datapaths and
>>>>>> port_bindings originating from other AZs.  One idea I'm considering
>>>>>> is having non-leader AZs proxy their requests to the leader for
>>>>>> IC-SB insertion, but I am unsure how I would implement this.
>>>>>>
>>>>>
>>>>> I think proxying commands might be complicated to get right.  What
>>>>> about the following approach?
>>>>>
>>>>> Can we partition the different types of writes ovn-ic needs to do
>>>>> into two categories?
>>>>>
>>>>> a. writes that must be performed only by the "active" ovn-ic (the
>>>>> one holding the lock).  E.g., maintaining the ICSB.Datapath_Binding
>>>>> table.
>>>>>
>>>>> b. writes that can be performed by multiple ovn-ic instances
>>>>> simultaneously (without checking the DB lock).  E.g., for
>>>>> ICSB.Port_Binding, an ovn-ic instance should never be (over)writing
>>>>> records created by other instances, so in theory they can all
>>>>> manage their own without any locking just fine.
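>>>>>
>>>>> As a rough conceptual sketch (the two helpers are hypothetical,
>>>>> only to illustrate the split), the main loop would gate just
>>>>> category (a) on the lock:
>>>>>
>>>>>     /* (a) Shared IC-SB state: only the lock owner maintains it. */
>>>>>     if (ovsdb_idl_has_lock(ovnisb_idl)) {
>>>>>         sync_shared_isb_data(ctx);   /* e.g. Datapath_Binding */
>>>>>     }
>>>>>
>>>>>     /* (b) Per-AZ state: every instance writes only its own rows,
>>>>>      * e.g. its own ICSB.Port_Binding records. */
>>>>>     sync_az_owned_isb_data(ctx);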
>>>>
>>>> This is what I was planning initially with the proposal to set an AZ
>>>> as leader in IC_NB_Global:options.  Events which require only one AZ
>>>> to insert data into the ICSB (e.g., creating a new datapath for a
>>>> transit switch/router) would check whether the AZ is the leader
>>>> before inserting, while other events where a non-leader AZ needs to
>>>> insert data into the ICSB (e.g., creating a new port_binding) would
>>>> work without any change.
>>>>
>>>> I will be doing a small proof of concept, but I think this will work.
>>>
>>> I think the idea from Dumitru was to still use database locking, but
>>> only for some parts.  That would require having a separate connection
>>> to the database: one with the lock and the other without.  So, the
>>> instance that has the lock can do common operations through the
>>> locked connection, and all the instances can do their own specific
>>> operations using a separate connection without a lock.
>>>
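>>> A minimal sketch of the two-connection variant (the variable names
>>> and the "ovn_ic_leader" lock name are illustrative, not actual
>>> ovn-ic code):
>>>
>>>     /* IC-SB connection for common data; it carries the lock. */
>>>     struct ovsdb_idl *isb_locked_idl =
>>>         ovsdb_idl_create(isb_remote, &icsbrec_idl_class, true, true);
>>>     ovsdb_idl_set_lock(isb_locked_idl, "ovn_ic_leader");
>>>
>>>     /* IC-SB connection for this AZ's own records; no lock, so its
>>>      * transactions are never held back by lock ownership. */
>>>     struct ovsdb_idl *isb_idl =
>>>         ovsdb_idl_create(isb_remote, &icsbrec_idl_class, true, true);
>>>
>>>     /* Shared-table transactions go through isb_locked_idl and are
>>>      * only attempted while ovsdb_idl_has_lock() returns true;
>>>      * per-AZ transactions always go through isb_idl. */
>>>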
> 
> That's a good summary, with one note: I didn't suggest having
> separate connections to the database.  The OVSDB lock is just a way
> to guarantee that a single "owner" exists for that lock, but it
> doesn't restrict what the lock actually means.  From the
> OVSDB RFC itself:
> 
> https://datatracker.ietf.org/doc/html/rfc7047#section-4.1.8
> 
> "The database
>    server supports an arbitrary number of locks, each of which is
>    identified by a client-defined ID.  At any given time, each lock may
>    have at most one owner.  The precise usage of a lock is determined by
>    the client.  For example, a set of clients may agree that a certain
>    table can only be written by the owner of a certain lock.  OVSDB
>    itself does not enforce any restrictions on how locks are used -- it
>    simply ensures that a lock has at most one owner."
> 
> So I thought that we could use the lock to determine if "the ovn-ic
> process is <the leader> responsible for syncing common data, e.g.,
> datapath_binding".
> 
> Regardless of being "leader" or not, each ovn-ic would still sync its
> other non-common data (e.g., port bindings) to the IC SB (through the
> same connection).
> 
> But now looking at the IDL/ovsdb-cs code, I think we're being too
> restrictive there and we don't let IDL/cs clients be as flexible as
> the RFC actually allows them to be:
> 
> /* Returns true if 'cs' can be sent a transaction now, false otherwise.  This
>  * is useful for optimization: there is no point in composing and sending a
>  * transaction if it returns false. */
> bool
> ovsdb_cs_may_send_transaction(const struct ovsdb_cs *cs)
> {
>     return (cs->session != NULL
>             && cs->state == CS_S_MONITORING
>             && (!cs->data.lock_name || ovsdb_cs_has_lock(cs)));
> }
> 
> Ilya, is this something to improve in the ovsdb-cs layer?

IDL doesn't allow for this use case today, that's true.  It only
handles one lock and acts defensively if this lock is not held
when the application attempts to send a transaction.

It should be possible to allow transactions even if we do not have
the configured lock, but it should be an opt-in behavior to avoid
breaking applications that do not check the locking state before
attempting their transactions.
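
For illustration, the opt-in could be a flag checked next to the lock
state (a rough sketch; the 'allow_txn_without_lock' member and its
setter below are hypothetical, not existing ovsdb-cs API):

void
ovsdb_cs_allow_txn_without_lock(struct ovsdb_cs *cs, bool allow)
{
    /* Opt-in for clients that define their own lock semantics and
     * want to transact even without owning the configured lock. */
    cs->allow_txn_without_lock = allow;
}

bool
ovsdb_cs_may_send_transaction(const struct ovsdb_cs *cs)
{
    return (cs->session != NULL
            && cs->state == CS_S_MONITORING
            && (cs->allow_txn_without_lock
                || !cs->data.lock_name
                || ovsdb_cs_has_lock(cs)));
}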

> 
> Until then, we do need separate connections to the IC SB: one locked, one not.
> 
>>> So, yes, it's similar to the IC_NB_Global:options proposal, but the
>>> leader is chosen automatically via the ovsdb locking mechanism, which
>>> avoids the problem of a leader going down.
>>>
> 
> Exactly.
> 
>>> Best regards, Ilya Maximets.
>>
>> Hi Ilya,
>> Thanks for the clear explanation, I understand it now. I’m going to
>> put together a PoC to test this out and will follow up with the
>> results.
>>
> 
> Ack, looking forward to seeing how that goes.
> 
> Thanks,
> Dumitru
> 
>> Regards,
>> Tiago.
>>
>>>
>>>>
>>>> Regards,
>>>> Tiago.
>>>>
>>>>>
>>>>>> I will be reading northd's source code to see if there is anything
>>>>>> similar to this already implemented, but I would like to know if
>>>>>> you have any clue as to how I could implement this.
>>>>>>
>>>>>> Regards,
>>>>>> Tiago.
>>>>>>
>>>>>
>>>>> Regards,
>>>>> Dumitru
>>>>>
>>>>>>> Each ovn-ic, both active and backup, is still responsible for
>>>>>>> writing to the per-AZ OVN NB database based on the contents of
>>>>>>> the centralized IC-NB and IC-SB databases.
>>>>>>>
>>>>>>> I didn't check the code for this in too much detail though, so
>>>>>>> there might be other things to consider.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Dumitru
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Tiago Matos
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Tiago,
>>>>>>>>>
>>>>>>>>> I ran into similar issues when adding transit router support and
>>>>>>>>> have added a comment above.  I have also been working on OVN-IC
>>>>>>>>> related features, so if you would like to discuss the above issue
>>>>>>>>> further, or other OVN-IC work, I would like to help.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mairtin
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tiago Matos
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 
