On 2/5/25 2:00 PM, Dumitru Ceara wrote:
> On 2/5/25 11:50 AM, Aleksandr Smirnov (K2 Cloud) wrote:
>> Hi guys,
>>
> Hi Aleksandr,
>
>> We experienced some problems with CMS when watching nb_cfg/sb_cfg up.
>> A problem is that inappropriate time delay happens between nb_cfg bump
>> and sb_cfg bump.
>> A research shows that the reason is double run of northd engine combined
>> with improper sequence of calls
>> engine <-> counter update.
>>
>> Exactly, making the change in logical network structure combined with
>> nb_cfg bump results in the following
>> execution sequence:
>>
>> 1. 'update3' notification of changed made in ovn-nbctl
>> 2. engine recompute run, (long time here)
>> 3. 'transact' request to NB DB + SB DB, send results of engine's work
>> 4. 'update3' + 'reply' received. this update3 fires back our changes we
>> just sent to SB DB.
>> 5. engine recompute second run, (long time here, because change handles
>> are not implemented in branch-22)
>> 6. Call update_sequence_numbers(), sb_cfg is sent to NB DB.
>>
>> Thus, time delay between p4 - p6 need to be eliminated.
>>
>> For main branch, where all change handlers are implemented, p.5 takes a
>> short time so, problem is not so visible.
>>
> Is there a chance that we hit this on main too though?  E.g. when some
> of the change handlers fail to incrementally process inputs?

Yes, there is. I just checked a behavior of the main branch in the same 
test environment.
An execution sequence is the same.
So, if, by some reason, the engine falls info full recompute, the 
problem will becomes visible.

>> I have prepared a patch that is going to fix this problem in our
>> environment, made over branch-22.09, see in attachment.
>>
> I can't see the attachment (maybe mailman scrubbed it?).

Ok, place it in text right below:

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 1ebb85498..d0f9b96ad 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -536,6 +536,17 @@ update_sequence_numbers(int64_t loop_start_time,
          }
      }
  }
+
+static bool
+sb_cfg_is_updated(struct ovsdb_idl *ovnnb_idl,
+                  struct ovsdb_idl_loop *sb_loop)
+{
+    const struct nbrec_nb_global *nb = nbrec_nb_global_first(ovnnb_idl);
+
+    return (nb && sb_loop->cur_cfg && nb->sb_cfg != sb_loop->cur_cfg)
+           ? true : false;
+}
+
  
  static void
  usage(void)
@@ -853,6 +864,7 @@ main(int argc, char *argv[])
              simap_destroy(&usage);
          }

+        bool clear_idl_track = true;
          if (!state.paused) {
              if (!ovsdb_idl_has_lock(ovnsb_idl_loop.idl) &&
                  !ovsdb_idl_is_lock_contended(ovnsb_idl_loop.idl))
@@ -903,8 +915,26 @@ main(int argc, char *argv[])

              if (ovsdb_idl_has_lock(ovnsb_idl_loop.idl)) {
                  int64_t loop_start_time = time_wall_msec();
-                inc_proc_northd_run(ovnnb_txn, ovnsb_txn, recompute);
-                recompute = false;
+
+                /* Check if sb_cfg can be propagated.
+                   Meaning is that the engine already made a bunch of 
updates
+                   and those updated are already committed to SB DB.
+                   So, it's time to bump sb_cfg in the NB DB.
+                   However, the engine will run again as response to 
'update3'
+                   notification from the server that reflects our recent
+                  'transact'. This will delay sb_cfg bump for an 
inappropriate
+                  long time. So, the solution is to skip engine run for 
this
+                  iteration and make bump transaction that delivers sb_cfg
+                  immediately. Engine will be run during the next 
iteration.
+                  We also preserve DB tracked info for the next engine run
+                  (see clear_idl_track). */
+                if 
(!sb_cfg_is_updated(ovnnb_idl_loop.idl,&ovnsb_idl_loop)) {
+                    inc_proc_northd_run(ovnnb_txn, ovnsb_txn, recompute);
+                    recompute = false;
+                } else {
+                    clear_idl_track = false;
+                }
+
                  if (ovnsb_txn) {
                      check_and_add_supported_dhcp_opts_to_sb_db(
                                   ovnsb_txn, ovnsb_idl_loop.idl);
@@ -968,8 +998,10 @@ main(int argc, char *argv[])
              recompute = true;
          }

-        ovsdb_idl_track_clear(ovnnb_idl_loop.idl);
-        ovsdb_idl_track_clear(ovnsb_idl_loop.idl);
+        if (clear_idl_track) {
+            ovsdb_idl_track_clear(ovnnb_idl_loop.idl);
+            ovsdb_idl_track_clear(ovnsb_idl_loop.idl);
+        }

          unixctl_server_run(unixctl);
          unixctl_server_wait(unixctl);
-- 

>
> 22.09 is not actively supported anymore.  Our currently supported
> release branches are: 25.03 (starting next week), 24.09 and 24.03 LTS.
> 22.03 LTS is supported only for critical security fixes.
>
> https://www.ovn.org/en/releases/all_releases/
>
> So, do you think your patch will be beneficial on branch-24.0x too?
>
>> So guys, if you want to have it in official ovn repo, let me know, I
>> will post the patch to patchwork.
>>
>> best regards,
>>
>> Alexander
>>
> Regards,
> Dumitru
>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to