On 2/3/26 8:15 PM, Tiago Matos Carvalho Reis wrote:
> Em ter., 3 de fev. de 2026 às 15:14, Ilya Maximets
> <[email protected]> escreveu:
>>
>> On 2/3/26 6:46 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>> Hi Dumitru,
>>> re-sending this email because I forgot to CC ovs-discuss, sorry.
>>>
>>> Em ter., 3 de fev. de 2026 às 07:16, Dumitru Ceara <[email protected]> 
>>> escreveu:
>>>>
>>>> On 2/2/26 9:29 PM, Tiago Matos Carvalho Reis wrote:
>>>>> Hi Dumitru, Mairtin,
>>>>> sorry for delayed response.
>>>>>
>>>>
>>>> Hi Tiago,
>>>>
>>>>> Em qui., 29 de jan. de 2026 às 10:38, Dumitru Ceara
>>>>> <[email protected]> escreveu:
>>>>>>
>>>>>> Hi Tiago, Mairtin,
>>>>>>
>>>>>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>>>>>> Em qui., 29 de jan. de 2026 às 09:11, Mairtin O'Loingsigh
>>>>>>> <[email protected]> escreveu:
>>>>>>>>
>>>>>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis 
>>>>>>>> wrote:
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I have been working on implementing incremental processing in OVN-IC 
>>>>>>>>> and
>>>>>>>>> encountered a design issue regarding how OVN-IC handles multi-AZ 
>>>>>>>>> writes.
>>>>>>>>>
>>>>>>>>> The Issue
>>>>>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain 
>>>>>>>>> events
>>>>>>>>> trigger all AZs to attempt writing the same data to the ISB/INB
>>>>>>>>> simultaneously. This race condition leads to a constraint violation, 
>>>>>>>>> which
>>>>>>>>> causes the transaction to fail and forces a full recompute.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
>>>>>>>>>
>>>>>>>>>     if (ctx->ovnisb_txn) {
>>>>>>>>>         /* Create ISB Datapath_Binding */
>>>>>>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>>>>>             const struct icsbrec_datapath_binding *isb_dp =
>>>>>>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>>>>>             if (!isb_dp) {
>>>>>>>>>                 /* Allocate tunnel key */
>>>>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, 
>>>>>>>>> vxlan_mode,
>>>>>>>>>                                                  "transit switch 
>>>>>>>>> datapath");
>>>>>>>>>                 if (!dp_key) {
>>>>>>>>>                     continue;
>>>>>>>>>                 }
>>>>>>>>>
>>>>>>>>>                 isb_dp = 
>>>>>>>>> icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp,
>>>>>>>>> ts->name);
>>>>>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, 
>>>>>>>>> dp_key);
>>>>>>>>>             } else if (dp_key_refresh) {
>>>>>>>>>                 /* Refresh tunnel key since encap mode has changed. */
>>>>>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, 
>>>>>>>>> vxlan_mode,
>>>>>>>>>                                                  "transit switch 
>>>>>>>>> datapath");
>>>>>>>>>                 if (dp_key) {
>>>>>>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, 
>>>>>>>>> dp_key);
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>>             if (!isb_dp->type) {
>>>>>>>>>                 icsbrec_datapath_binding_set_type(isb_dp, 
>>>>>>>>> "transit-switch");
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>>             if (!isb_dp->nb_ic_uuid) {
>>>>>>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>>>>>                                                         
>>>>>>>>> &ts->header_.uuid,
>>>>>>>>> 1);
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         struct shash_node *node;
>>>>>>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>>>>>             icsbrec_datapath_binding_delete(node->data);
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> When a new transit-switch is created, every AZ attempts to create the 
>>>>>>>>> same
>>>>>>>>> datapath_binding on the ISB. Only one request succeeds; the others 
>>>>>>>>> fail
>>>>>>>>> with a "constraint-violation."
>>>>>>>>>
>>>>>>>>> Impact:
>>>>>>>>> This behavior negates the performance benefits of implementing 
>>>>>>>>> incremental
>>>>>>>>> processing, as the system falls back to a full recompute upon these
>>>>>>>>> failures.
>>>>>>>>>
>>>>>>>>> For development purposes, I am currently ignoring these errors, but 
>>>>>>>>> the
>>>>>>>>> ideal way of fixing this issue is to have a mechanism where only a 
>>>>>>>>> single
>>>>>>>>> AZ handles the writes but this would require implementing some 
>>>>>>>>> consensus
>>>>>>>>> protocol.
>>>>>>>>>
>>>>>>>>> Does anyone have any advice on how we can fix this issue?
>>>>>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in
>>>>>>>> enumerate_datapaths
>>>>>>>> function, then will attempt to add missing datapaths. Since multilpe 
>>>>>>>> AZs
>>>>>>>> will attempt to add the same missing entry, all but the first will fail
>>>>>>>> causing transaction errors. Currently, ovn-ic will enumerate the ISB
>>>>>>>> datapath again, see the entry that succeeded and continue to create NB
>>>>>>>> in local AZ. This solution does cause a transaction error on all but 1
>>>>>>>> AZ whenever a Transit router is added, but we currently dont have a
>>>>>>>> mechanism to manage this gracefully across multiple AZs.
>>>>>>>
>>>>>>> Hi Mairtin, thanks for the reply.
>>>>>>>
>>>>>>> Since there is no mechanism to manage which AZ should insert the data,
>>>>>>> the only good solution besides implementing a full-fledge consensus 
>>>>>>> algorithm
>>>>>>> like Raft to select a leader AZ,  that I came up with is to simply set 
>>>>>>> an option
>>>>>>> in IC_NB_Global to manually configure a specific AZ as a leader, and in 
>>>>>>> the
>>>>>>> code check if the AZ is the leader or not.
>>>>>>>
>>>>>>> Example:
>>>>>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>>>>>
>>>>>>> In the code:
>>>>>>>
>>>>>>> const struct icnbrec_ic_nb_global *icnb_global =
>>>>>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>>>>>
>>>>>>> const struct nbrec_nb_global *nb_global =
>>>>>>>     nbrec_nb_global_table_first(nb_global_table);
>>>>>>>
>>>>>>> const char *leader = smap_get(&icnb_global->options, "leader")
>>>>>>> if (!strcmp(leader, nb_global->name)) {
>>>>>>> // Insert logic here
>>>>>>> }
>>>>>>>
>>>>>>> Do you have any opinion on this approach?
>>>>>>>
>>>>>>
>>>>>> I was thinking of something a bit different (not too different though).
>>>>>>
>>>>>> The hierarchy is:
>>>>>>
>>>>>>              IC-NB
>>>>>>                |
>>>>>> ovn-ic (AZ1)  ovn-ic (AZ2)  ...  ovn-ic (AZN)
>>>>>>                |
>>>>>>              IC-SB
>>>>>>
>>>>>> Conceptually this is similar to the intra-az hierarchy:
>>>>>>
>>>>>>                       NB
>>>>>>                       |
>>>>>> ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
>>>>>>                       |
>>>>>>                       SB
>>>>>>
>>>>>> The way the instances synchronize is by taking the (single) SB database
>>>>>> lock.  Only one northd succeeds, so that one becomes the "active".
>>>>>>
>>>>>> What if we do the same for ovn-ic?
>>>>>>
>>>>
>>>> After a closer look at the ovn-ic code I realized this won't really work
>>>> out of the box.  We really need all of the ovn-ic daemons to write
>>>> "something" (either the actual ICSB.port_binding or a request towards
>>>> the "leader") in the ICSB.  That's because ovn-ic running in AZ2 doesn't
>>>> have access to the NB database running in AZ3.
>>>>
>>>>>> Make all ovn-ic try to take the IC-SB lock.  Only the one that succeeds
>>>>>> becomes "active" and may write to the IC-SB.
>>>>>>
>>>>>> That has one implication though: the active instance (it can be any
>>>>>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>>>>>> datapaths for other AZs are up to date.  Today it only takes care of the
>>>>>> resources for its own AZ.
>>>>>
>>>>> I think implementing a lock is the right path forward for handling these 
>>>>> issues.
>>>>> My main question is around the "leader AZ" logic: specifically, how it 
>>>>> should
>>>>> handle IC-SB datapaths and port_bindings originating from other AZs.
>>>>> One idea I’m considering is having non-leader AZs proxy their requests
>>>>> to the leader for IC-SB insertion, but I am unsure on how I would
>>>>> implement this.
>>>>>
>>>>
>>>> I think proxy-ing commands might be complicated to get right.  What
>>>> about the following approach?
>>>>
>>>> Can we partition the different types of writes ovn-ic needs to do in two
>>>> categories?
>>>>
>>>> a. writes that must be performed only by the "active" ovn-ic (the one
>>>> holding the lock).  E.g.: maintaining the ICSB.Datapah_Binding table.
>>>>
>>>> b. writes that can be performed by multiple ovn-ic simultaneously
>>>> (without checking the DB lock).  E.g., for ICSB.Port_Binding, an ovn-ic
>>>> instance should never be (over)writing records created by other ovn-ic.
>>>> So in theory they can just all manage their own without any locking just
>>>> fine.
>>>
>>> This is what I was planning initially with the proposal to set an AZ
>>> as leader in IC_NB_Global:options. Events which require only one AZ to
>>> insert data on ICSB (ex: creating a new datapath for a transit
>>> switch/router) would check if the AZ is the leader and then insert,
>>> while other events where a non-AZ leader would need to insert data on
>>> ICSB (ex: creating a new port_binding) would work without any change.
>>>
>>> I will be doing a small proof of concept, but I think this will work.
>>
>> I think, the idea from Dumitru was to still use database locking, but
>> only for some parts.  That would require having a separate connection
>> to the database.  One with the lock and the other without.  So, the
>> instance that has the lock can do common operations through the locked
>> connection and all the instnces can do their own specific operations
>> using a separate connection without a lock.
>>

That's a good summary, with one note: I didn't suggest having
separate connections to the database.  The OVSDB lock is just a way
to guarantee that a single "owner" exists for that lock, but it
doesn't restrict the semantics of what that lock is.  From the
OVSDB RFC itself:

https://datatracker.ietf.org/doc/html/rfc7047#section-4.1.8

"The database
   server supports an arbitrary number of locks, each of which is
   identified by a client-defined ID.  At any given time, each lock may
   have at most one owner.  The precise usage of a lock is determined by
   the client.  For example, a set of clients may agree that a certain
   table can only be written by the owner of a certain lock.  OVSDB
   itself does not enforce any restrictions on how locks are used -- it
   simply ensures that a lock has at most one owner."

So I thought that we could use the lock to determine if "the ovn-ic
process is <the leader> responsible for syncing common data, e.g.,
datapath_binding".

Regardless of being "leader" or not, each ovn-ic would still sync its
other non-common data (e.g., port bindings) to the IC SB (through the
same connection).

But now looking at the IDL/ovsdb-cs code, I think we're being too
restrictive there and we don't let IDL/cs clients be as flexible as
the RFC actually allows them to be:

/* Returns true if 'cs' can be sent a transaction now, false otherwise.  This
 * is useful for optimization: there is no point in composing and sending a
 * transaction if it returns false. */
bool
ovsdb_cs_may_send_transaction(const struct ovsdb_cs *cs)
{
    return (cs->session != NULL
            && cs->state == CS_S_MONITORING
            && (!cs->data.lock_name || ovsdb_cs_has_lock(cs)));
}

Ilya, is this something to improve in the ovsdb-cs layer?

Until then, we do need separate connections to the IC SB: one locked, one not.

>> So, yes, it's similar to the IC_NB_Global:options proposal, but the
>> leader is chosen automatically via the ovsdb locking mechanism, which
>> avoids the problem of a leader going down.
>>

Exactly.

>> Best regards, Ilya Maximets.
> 
> Hi Ilya,
> Thanks for the clear explanation, I understand it now. I’m going to
> put together a PoC to test this out and will follow up with the
> results.
> 

Ack, looking forward to see how that goes.

Thanks,
Dumitru

> Regards,
> Tiago.
> 
>>
>>>
>>> Regards,
>>> Tiago.
>>>
>>>>
>>>>> I will be reading northd's source code to see if there is anything
>>>>> similar to this
>>>>> already implemented, but I would like to know if you have any clue as to 
>>>>> how
>>>>> I could implement this.
>>>>>
>>>>> Regards,
>>>>> Tiago.
>>>>>
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>>>> Each ovn-ic, both active and backup are still responsible for writing to
>>>>>> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB
>>>>>> centralized databases.
>>>>>>
>>>>>> I didn't check the code for this into too many details though so there
>>>>>> might be other things to consider.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Regards,
>>>>>> Dumitru
>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Tiago Matos
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _?Esta mensagem ? direcionada apenas para os endere?os constantes no
>>>>>>>>> cabe?alho inicial. Se voc? n?o est? listado nos endere?os constantes 
>>>>>>>>> no
>>>>>>>>> cabe?alho, pedimos-lhe que desconsidere completamente o conte?do dessa
>>>>>>>>> mensagem e cuja c?pia, encaminhamento e/ou execu??o das a??es citadas 
>>>>>>>>> est?o
>>>>>>>>> imediatamente anuladas e proibidas?._
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *?**?Apesar do Magazine Luiza tomar
>>>>>>>>> todas as precau??es razo?veis para assegurar que nenhum v?rus esteja
>>>>>>>>> presente nesse e-mail, a empresa n?o poder? aceitar a 
>>>>>>>>> responsabilidade por
>>>>>>>>> quaisquer perdas ou danos causados por esse e-mail ou por seus 
>>>>>>>>> anexos?.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -------------- next part --------------
>>>>>>>>> An HTML attachment was scrubbed...
>>>>>>>>> URL: 
>>>>>>>>> <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20260128/90a7463f/attachment.htm>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Tiago,
>>>>>>>>
>>>>>>>> I ran into similar issues when adding transit router support and have
>>>>>>>> added a comment above. I also have been working on OVN-IC related
>>>>>>>> features, so if you would like to discuss above issue further or other
>>>>>>>> OVN-IC work I would like to help.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mairtin
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tiago Matos
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Reply via email to