On 1/29/26 3:26 PM, Mairtin O'Loingsigh wrote:
> On Thu, Jan 29, 2026 at 02:38:14PM +0100, Dumitru Ceara wrote:
>> Hi Tiago, Mairtin,
>>
>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
>>> <[email protected]> wrote:
>>>>
>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I have been working on implementing incremental processing in OVN-IC and
>>>>> encountered a design issue regarding how OVN-IC handles multi-AZ writes.
>>>>>
>>>>> The Issue
>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain events
>>>>> trigger all AZs to attempt writing the same data to the ISB/INB
>>>>> simultaneously. This race condition leads to a constraint violation, which
>>>>> causes the transaction to fail and forces a full recompute.
>>>>>
>>>>> Example:
>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
>>>>>
>>>>>     if (ctx->ovnisb_txn) {
>>>>>         /* Create ISB Datapath_Binding */
>>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>             const struct icsbrec_datapath_binding *isb_dp =
>>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>             if (!isb_dp) {
>>>>>                 /* Allocate tunnel key */
>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>                                                  "transit switch datapath");
>>>>>                 if (!dp_key) {
>>>>>                     continue;
>>>>>                 }
>>>>>
>>>>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp,
>>>>>                                                             ts->name);
>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>             } else if (dp_key_refresh) {
>>>>>                 /* Refresh tunnel key since encap mode has changed. */
>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>                                                  "transit switch datapath");
>>>>>                 if (dp_key) {
>>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>                 }
>>>>>             }
>>>>>
>>>>>             if (!isb_dp->type) {
>>>>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
>>>>>             }
>>>>>
>>>>>             if (!isb_dp->nb_ic_uuid) {
>>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>                                                         &ts->header_.uuid, 1);
>>>>>             }
>>>>>         }
>>>>>
>>>>>         struct shash_node *node;
>>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>             icsbrec_datapath_binding_delete(node->data);
>>>>>         }
>>>>>     }
>>>>>
>>>>> When a new transit-switch is created, every AZ attempts to create the same
>>>>> datapath_binding on the ISB. Only one request succeeds; the others fail
>>>>> with a "constraint-violation."
>>>>>
>>>>> Impact:
>>>>> This behavior negates the performance benefits of implementing incremental
>>>>> processing, as the system falls back to a full recompute upon these
>>>>> failures.
>>>>>
>>>>> For development purposes, I am currently ignoring these errors. The
>>>>> ideal fix is a mechanism where only a single AZ handles the writes,
>>>>> but this would require implementing some consensus protocol.
>>>>>
>>>>> Does anyone have any advice on how we can fix this issue?
>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
>>>> enumerate_datapaths function, then attempts to add missing datapaths.
>>>> Since multiple AZs will attempt to add the same missing entry, all but
>>>> the first will fail, causing transaction errors. Currently, ovn-ic will
>>>> enumerate the ISB datapaths again, see the entry that succeeded and
>>>> continue to create the NB entries in the local AZ. This does cause a
>>>> transaction error on all but one AZ whenever a transit router is added,
>>>> but we currently don't have a mechanism to manage this gracefully
>>>> across multiple AZs.
>>>
>>> Hi Mairtin, thanks for the reply.
>>>
>>> Since there is no mechanism to manage which AZ should insert the data,
>>> the only good solution I came up with, besides implementing a
>>> full-fledged consensus algorithm like Raft to select a leader AZ, is to
>>> simply set an option in IC_NB_Global that manually configures a specific
>>> AZ as the leader, and to check in the code whether the local AZ is the
>>> leader.
>>>
>>> Example:
>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>
>>> In the code:
>>>
>>> const struct icnbrec_ic_nb_global *icnb_global =
>>>     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>
>>> const struct nbrec_nb_global *nb_global =
>>>     nbrec_nb_global_table_first(nb_global_table);
>>>
>>> const char *leader = smap_get(&icnb_global->options, "leader");
>>> /* smap_get() returns NULL when the option is not set. */
>>> if (leader && !strcmp(leader, nb_global->name)) {
>>>     /* Insert logic here. */
>>> }
>>>
>>> Do you have any opinion on this approach?
>>>
>>
>> I was thinking of something a bit different (not too different though).
>>
>> The hierarchy is:
>>
>>              IC-NB
>>                |
>> ovn-ic (AZ1)  ovn-ic (AZ2)  ...  ovn-ic (AZN)
>>                |
>>              IC-SB
>>
>> Conceptually this is similar to the intra-az hierarchy:
>>
>>                       NB
>>                       |
>> ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
>>                       |
>>                       SB
>>
>> The way the instances synchronize is by taking the (single) SB database
>> lock.  Only one northd succeeds, so that one becomes the "active".
>>
>> What if we do the same for ovn-ic?
>>
>> Make all ovn-ic try to take the IC-SB lock.  Only the one that succeeds
>> becomes "active" and may write to the IC-SB.
>>
>> That has one implication though: the active instance (it can be any
>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>> datapaths for other AZs are up to date.  Today it only takes care of the
>> resources for its own AZ.
>>
>> Each ovn-ic instance, active or backup, is still responsible for writing
>> to the per-AZ OVN NB database based on the contents of the centralized
>> IC-NB and IC-SB databases.
>>
>> I didn't check the code for this in much detail though, so there
>> might be other things to consider.
>>
>> What do you think?
>>
>> Regards,
>> Dumitru
>>
>>>>>
>>>>> Thanks,
>>>>> Tiago Matos
>>>>>
>>>>
>>>>
>>>> Hi Tiago,
>>>>
>>>> I ran into similar issues when adding transit router support and have
>>>> added a comment above. I have also been working on OVN-IC related
>>>> features, so if you would like to discuss the above issue further, or
>>>> other OVN-IC work, I would be happy to help.
>>>>
>>>> Regards,
>>>> Mairtin
>>>>
>>>
>>>
>>> Regards,
>>> Tiago Matos
>>>
>>
> 
> Hi Dumitru,
> 

Hi Mairtin,

> A lock similar to northd's seems like a good solution. Do you think
> serializing access to the ISB might have a significant negative
> performance impact?
> 

It's hard to quantify, I guess.  On one hand we'd have a single ovn-ic
instance doing all the IC-SB writes (we're not really serializing, we're
doing RW / RO).  But on the other hand we wouldn't have transaction
failures every time a datapath is added.

However, I suspect that in most cases the data in the IC databases can
be handled by a single instance of ovn-ic without excessive resource usage.
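
To make the RW / RO point concrete, here is a toy model in C.  It is not
ovn-ic code and every name in it is made up for illustration.  All
instances react to the same new transit switch from the same stale
snapshot.  Without a lock each one tries the insert and all but one hit the
duplicate check, which stands in for the constraint violation.  With a
lock, only the assumed lock owner (instance 0) writes.  In a real
implementation the gate would presumably be ovsdb_idl_has_lock() on the
IC-SB IDL, the way ovn-northd does it for the SB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ROWS 16

/* Toy stand-in for the IC-SB Datapath_Binding table, keyed by transit
 * switch name.  Hypothetical; not the real OVSDB schema or IDL. */
struct toy_isb {
    const char *ts_names[MAX_ROWS];
    size_t n_rows;
};

/* Tries to insert a row for 'ts_name'.  Returns false if a row with the
 * same name already exists, which models the "constraint violation". */
bool
toy_isb_insert(struct toy_isb *isb, const char *ts_name)
{
    for (size_t i = 0; i < isb->n_rows; i++) {
        if (!strcmp(isb->ts_names[i], ts_name)) {
            return false;
        }
    }
    assert(isb->n_rows < MAX_ROWS);
    isb->ts_names[isb->n_rows++] = ts_name;
    return true;
}

/* Returns the number of failed transactions when 'n_az' ovn-ic instances
 * all react to the creation of transit switch 'ts_name'.  Each instance
 * is assumed to have seen a snapshot in which the row was still missing. */
size_t
toy_failed_txns(size_t n_az, bool use_lock, const char *ts_name)
{
    struct toy_isb isb = { .n_rows = 0 };
    size_t lock_owner = 0;      /* Assumption: instance 0 won the lock. */
    size_t failed = 0;

    for (size_t az = 0; az < n_az; az++) {
        if (use_lock && az != lock_owner) {
            continue;           /* Backup: read-only towards the IC-SB. */
        }
        if (!toy_isb_insert(&isb, ts_name)) {
            failed++;
        }
    }
    return failed;
}
```

With three AZs, toy_failed_txns(3, false, "ts1") counts two failed
transactions and toy_failed_txns(3, true, "ts1") counts none.  The model
says nothing about the extra work the active instance takes on for the
other AZs, which is the part that needs measuring.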

It would be great to test though.

Regards,
Dumitru

> Mairtin
> 

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
