On 6/8/26 4:42 PM, [email protected] wrote:
> From: Loke Berne <[email protected]>
> 
> Large scale OVN deployments commonly disable the per-chassis nb_cfg
> write-back mechanism by setting options:enable_chassis_nb_cfg_update
> to false.  With thousands of hypervisors each writing their nb_cfg
> completion back to Chassis_Private on every generation, the resulting
> write amplification can overload the southbound OVSDB cluster.
> Disabling write-back eliminates this pressure but also removes the
> only existing signal for measuring how long a northbound change takes
> to reach each hypervisor.
> 
> OVN_Northbound already records nb_cfg_timestamp in NB_Global when
> ovn-northd advances nb_cfg, but hypervisors connect to the southbound
> database only.  This patch adds the same timestamp to SB_Global,
> written atomically with each nb_cfg update.  ovn-controller reads
> this value and stores it in the local OVS bridge external_ids as
> ovn-nb-cfg-sb-ts alongside the existing ovn-nb-cfg-ts (local
> completion time).  An external collector such as ovs_exporter can
> read both values from the bridge and compute per-chassis propagation
> latency histograms without any writes to the southbound database,
> keeping measurement overhead independent of fleet size.
> 
> Placing the timestamp in SB_Global rather than requiring collectors
> to reach the northbound database means it travels transparently
> through any relay or VPN between the southbound cluster and the
> hypervisor, naturally including that transit in the measurement.
> 
> Testing: confirmed in OVN sandbox and a two-container central/HV
> setup that nb_cfg_timestamp is written to SB_Global on each nb_cfg
> advance, propagated to br-int external_ids as ovn-nb-cfg-sb-ts, and
> continues to update correctly when enable_chassis_nb_cfg_update is
> set to false.
> 
> Signed-off-by: Loke Berne <[email protected]>
> Assisted-by: Claude Sonnet 4.6
> Submitted-at: https://github.com/ovn-org/ovn/pull/306
> Signed-off-by: Numan Siddique <[email protected]>
> ---

Hi Loke,

Thanks for the patch!

On top of Ilya's comment about the schema version, I have some of my own
too.

Let me know if you need help with getting v2 posted for review.  If
needed I can also do that for you if you point me to your dev branch or
PR once you have it.

>  br-controller/ovn-br-controller.c |  2 +-
>  controller/if-status.c            |  2 +-
>  controller/ovn-controller.c       | 83 +++++++++++++++++++++++++------
>  lib/ofctrl-seqno.c                | 21 +++++++-
>  lib/ofctrl-seqno.h                | 11 +++-
>  lib/test-ofctrl-seqno.c           |  2 +-
>  northd/ovn-northd.c               |  3 ++
>  ovn-sb.ovsschema                  |  5 +-
>  ovn-sb.xml                        |  9 ++++

I think we need to add a NEWS entry item for this user visible change.

Also, would it be possible to extend (or add a new) the "nb_cfg
timestamp" test in our testsuite to cover this new feture?

https://github.com/ovn-org/ovn/blob/main/tests/ovn.at#L31397

>  9 files changed, 115 insertions(+), 23 deletions(-)
> 
> diff --git a/br-controller/ovn-br-controller.c 
> b/br-controller/ovn-br-controller.c
> index 93526a2f6d..e20a110513 100644
> --- a/br-controller/ovn-br-controller.c
> +++ b/br-controller/ovn-br-controller.c
> @@ -299,7 +299,7 @@ main(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
>              ofctrl_seqno_update_create(
>                  ofctrl_seq_type_br_cfg,
>                  
> get_ovnbr_cfg(ovnbrrec_br_global_table_get(ovnbr_idl_loop.idl),
> -                              ovnbr_cond_seqno, ovnbr_expected_cond_seqno));
> +                              ovnbr_cond_seqno, ovnbr_expected_cond_seqno), 
> 0);
>  
>              br_ofctrls_put(ofctrl_seqno_get_req_cfg(),
>                             engine_node_changed(&en_lflow_output),
> diff --git a/controller/if-status.c b/controller/if-status.c
> index 6c6e9b27b1..65e745e250 100644
> --- a/controller/if-status.c
> +++ b/controller/if-status.c
> @@ -712,7 +712,7 @@ if_status_mgr_update(struct if_status_mgr *mgr,
>      if (new_ifaces) {
>          mgr->iface_seqno++;
>          ofctrl_seqno_update_create(mgr->iface_seq_type_pb_cfg,
> -                                   mgr->iface_seqno);
> +                                   mgr->iface_seqno, 0);
>          VLOG_DBG("Seqno requested: %"PRIu32, mgr->iface_seqno);
>      }
>  }
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index ad094a4543..765948442c 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -141,6 +141,7 @@ static unixctl_cb_func debug_delay_nb_cfg_report;
>  
>  #define OVS_NB_CFG_NAME "ovn-nb-cfg"
>  #define OVS_NB_CFG_TS_NAME "ovn-nb-cfg-ts"
> +#define OVS_NB_CFG_SB_TS_NAME "ovn-nb-cfg-sb-ts"

We should probably document this in ovn-controller.8.xml like we do with
the others.

>  #define OVS_STARTUP_TS_NAME "ovn-startup-ts"
>  
>  struct br_int_remote {
> @@ -825,28 +826,62 @@ struct ed_type_ct_zones {
>  };
>  
>  
> +/* Returns the current SB_Global.nb_cfg and, if 'ts_out' is non-NULL, also
> + * the matching SB_Global.nb_cfg_timestamp.  The pair is always read from
> + * the same SB_Global snapshot so callers can rely on (nb_cfg, ts) being
> + * consistent.
> + *
> + * 'nb_cfg_timestamp' is the wall-clock time northd wrote nb_cfg to SB.
> + * The delta between that and the local completion time is the per-chassis
> + * end-to-end propagation latency (northd compile + SB write + relay
> + * fan-out + ovn-controller engine + ofctrl barrier ack).
> + *
> + * If a monitor condition change is in flight the cached pair from the
> + * previous call is returned, because updates received between the request
> + * and the cond ack could be from before the SB_Global value we're trying
> + * to read.
> + */
>  static uint64_t
>  get_nb_cfg(const struct sbrec_sb_global_table *sb_global_table,
> -           unsigned int cond_seqno, unsigned int expected_cond_seqno)
> +           unsigned int cond_seqno, unsigned int expected_cond_seqno,
> +           int64_t *ts_out)

Should we return a struct here instead of returning half of the output
through the ts_out arg?

>  {
>      static uint64_t nb_cfg = 0;
> +    static int64_t nb_cfg_ts = 0;

Let's store nb_cfg_ts as uint64_t too?  Deep down, in the seqno
structures we store unsigned ints anyway.

>  
> -    /* Delay getting nb_cfg if there are monitor condition changes
> -     * in flight.  It might be that those changes would instruct the
> -     * server to send updates that happened before SB_Global.nb_cfg.
> -     */
> -    if (cond_seqno != expected_cond_seqno) {
> -        return nb_cfg;
> +    if (cond_seqno == expected_cond_seqno) {
> +        const struct sbrec_sb_global *sb
> +          = sbrec_sb_global_table_first(sb_global_table);

Nit: indentation is off here.

> +        nb_cfg = sb ? sb->nb_cfg : 0;
> +        nb_cfg_ts = sb ? sb->nb_cfg_timestamp : 0;
> +    }
> +
> +    if (ts_out) {
> +        *ts_out = nb_cfg_ts;
>      }
>  
> -    const struct sbrec_sb_global *sb
> -        = sbrec_sb_global_table_first(sb_global_table);
> -    nb_cfg = sb ? sb->nb_cfg : 0;
>      return nb_cfg;
>  }
>  
>  /* Propagates the local cfg seqno, 'cur_cfg', to the chassis_private record
>   * and to the local OVS DB.
> + *
> + * The br-int external_ids triplet (ovn-nb-cfg, ovn-nb-cfg-ts,
> + * ovn-nb-cfg-sb-ts) is stamped unconditionally, independent of
> + * 'enable_ch_nb_cfg_update'.  The SB chassis_private writeback remains
> + * gated by 'enable_ch_nb_cfg_update' for deployments that want to suppress
> + * the per-bump SB write load.  An external exporter watching br-int can
> + * compute the per-chassis propagation delta as
> + *     (ovn-nb-cfg-ts - ovn-nb-cfg-sb-ts)
> + * regardless of the writeback setting.
> + *
> + * 'ovn-nb-cfg-sb-ts' is the SB_Global.nb_cfg_timestamp that was paired
> + * with this cur_cfg at the moment the barrier was queued (snapshotted via
> + * ofctrl-seqno's req_ts).  This avoids the pitfall of pairing the just-
> + * acked cur_cfg with whatever SB_Global timestamp happens to be current
> + * now -- on a fast-churning fleet SB_Global may already have advanced
> + * past cur_cfg by the time the barrier acks, which would under-report
> + * the delta.
>   */
>  static void
>  store_nb_cfg(struct ovsdb_idl_txn *sb_txn, struct ovsdb_idl_txn *ovs_txn,
> @@ -858,6 +893,7 @@ store_nb_cfg(struct ovsdb_idl_txn *sb_txn, struct 
> ovsdb_idl_txn *ovs_txn,
>      struct ofctrl_acked_seqnos *acked_nb_cfg_seqnos =
>          ofctrl_acked_seqnos_get(ofctrl_seq_type_nb_cfg);
>      uint64_t cur_cfg = acked_nb_cfg_seqnos->last_acked;
> +    uint64_t cur_cfg_sb_ts = acked_nb_cfg_seqnos->last_acked_req_ts;
>      int64_t startup_ts = daemon_startup_ts();
>  
>      if (ovs_txn && br_int
> @@ -894,6 +930,13 @@ store_nb_cfg(struct ovsdb_idl_txn *sb_txn, struct 
> ovsdb_idl_txn *ovs_txn,
>                                                   cur_cfg_str);
>          ovsrec_bridge_update_external_ids_setkey(br_int, OVS_NB_CFG_TS_NAME,
>                                                   cur_cfg_ts_str);
> +        if (cur_cfg_sb_ts) {

Why do we skip the 0 case?  Should we clear if cur_cfg_sb_ts is 0 instead?

> +            char *sb_ts_str = xasprintf("%"PRIu64, cur_cfg_sb_ts);
> +            ovsrec_bridge_update_external_ids_setkey(br_int,
> +                                                     OVS_NB_CFG_SB_TS_NAME,
> +                                                     sb_ts_str);
> +            free(sb_ts_str);
> +        }
>          free(cur_cfg_ts_str);
>          free(cur_cfg_str);
>      }
> @@ -8200,12 +8243,20 @@ main(int argc, char *argv[])
>                                       chassis, mac_cache_data);
>                      }
>  
> -                    ofctrl_seqno_update_create(
> -                        ofctrl_seq_type_nb_cfg,
> -                        get_nb_cfg(sbrec_sb_global_table_get(
> -                                                       ovnsb_idl_loop.idl),
> -                                              ovnsb_cond_seqno,
> -                                              ovnsb_expected_cond_seqno));
> +                    /* Snapshot (nb_cfg, sb_ts) atomically from SB_Global
> +                     * and pair them through the barrier ack so the
> +                     * eventual completion can be attributed to the
> +                     * timestamp that corresponded to this exact nb_cfg
> +                     * generation -- not whatever SB_Global value has
> +                     * moved on to by the time the barrier acks. */
> +                    int64_t sb_nb_cfg_ts = 0;
> +                    uint64_t sb_nb_cfg = get_nb_cfg(
> +                        sbrec_sb_global_table_get(ovnsb_idl_loop.idl),
> +                        ovnsb_cond_seqno, ovnsb_expected_cond_seqno,
> +                        &sb_nb_cfg_ts);
> +                    ofctrl_seqno_update_create(ofctrl_seq_type_nb_cfg,
> +                                               sb_nb_cfg,
> +                                               (uint64_t) sb_nb_cfg_ts);

Is this cast really needed?

>  
>                      struct local_binding_data *binding_data =
>                          runtime_data ? &runtime_data->lbinding_data : NULL;
> diff --git a/lib/ofctrl-seqno.c b/lib/ofctrl-seqno.c
> index 83c17c0e52..7c613dc0a4 100644
> --- a/lib/ofctrl-seqno.c
> +++ b/lib/ofctrl-seqno.c
> @@ -36,6 +36,11 @@ struct ofctrl_seqno_update {
>                                  * application.
>                                  */
>      uint64_t req_cfg;          /* Application specific seqno. */
> +    uint64_t req_ts;           /* Opaque per-request timestamp captured by
> +                                * the caller at the moment the update was
> +                                * queued.  Carried through to the acked
> +                                * state so consumers can pair the acked
> +                                * seqno with the input that produced it. */
>  };
>  
>  /* List of in flight sequence number updates. */
> @@ -51,6 +56,9 @@ struct ofctrl_seqno_state {
>                                   * application consumed acked requests.
>                                   */
>      uint64_t cur_cfg;           /* Last acked application seqno. */
> +    uint64_t cur_cfg_req_ts;    /* req_ts that was paired with cur_cfg when
> +                                 * the update was queued.  0 if the caller
> +                                 * didn't supply a timestamp. */
>      uint64_t req_cfg;           /* Last requested application seqno. */
>  };
>  
> @@ -73,6 +81,7 @@ ofctrl_acked_seqnos_get(size_t seqno_type)
>      struct ofctrl_acked_seqnos *acked_seqnos = xmalloc(sizeof *acked_seqnos);
>      acked_seqnos->acked = vector_clone(&state->acked_cfgs);
>      acked_seqnos->last_acked = state->cur_cfg;
> +    acked_seqnos->last_acked_req_ts = state->cur_cfg_req_ts;
>  
>      vector_clear(&state->acked_cfgs);
>      if (vector_capacity(&state->acked_cfgs) >= VECTOR_THRESHOLD) {
> @@ -140,6 +149,7 @@ ofctrl_seqno_add_type(void)
>      struct ofctrl_seqno_state state = (struct ofctrl_seqno_state) {
>          .acked_cfgs = VECTOR_EMPTY_INITIALIZER(uint64_t),
>          .cur_cfg = 0,
> +        .cur_cfg_req_ts = 0,
>          .req_cfg = 0,
>      };
>      vector_push(&ofctrl_seqno_states, &state);
> @@ -149,9 +159,16 @@ ofctrl_seqno_add_type(void)
>  
>  /* Creates a new seqno update request for an application specific
>   * 'seqno_type'.
> + *
> + * 'req_ts' is an opaque per-request timestamp captured by the caller (for
> + * example, the SB_Global nb_cfg_timestamp at the moment we read 'new_cfg').
> + * It is carried unchanged to ofctrl_acked_seqnos_get() so consumers can
> + * pair the eventual ack with the input state that produced it.  Callers
> + * that don't need this pairing should pass 0.
>   */
>  void
> -ofctrl_seqno_update_create(size_t seqno_type, uint64_t new_cfg)
> +ofctrl_seqno_update_create(size_t seqno_type, uint64_t new_cfg,
> +                           uint64_t req_ts)

Maybe we should add a ofctrl_stamped_seqno_update_create() wrapper?  We
call ofctrl_seqno_update_create() with a non-zero req_ts only in one
place in our code base.

>  {
>      struct ofctrl_seqno_state *state = ofctrl_seqno_state_get(seqno_type);
>  
> @@ -169,6 +186,7 @@ ofctrl_seqno_update_create(size_t seqno_type, uint64_t 
> new_cfg)
>          .seqno_type = seqno_type,
>          .flow_cfg = ofctrl_req_seqno,
>          .req_cfg = new_cfg,
> +        .req_ts = req_ts,
>      };
>      vector_push(&ofctrl_seqno_updates, &update);
>  }
> @@ -190,6 +208,7 @@ ofctrl_seqno_run(uint64_t flow_cfg)
>          struct ofctrl_seqno_state *state =
>              ofctrl_seqno_state_get(update->seqno_type);
>          state->cur_cfg = update->req_cfg;
> +        state->cur_cfg_req_ts = update->req_ts;
>          vector_push(&state->acked_cfgs, &update->req_cfg);
>  
>          index++;
> diff --git a/lib/ofctrl-seqno.h b/lib/ofctrl-seqno.h
> index faa97cc535..66b666fe4d 100644
> --- a/lib/ofctrl-seqno.h
> +++ b/lib/ofctrl-seqno.h
> @@ -23,10 +23,18 @@
>  
>  /* Collection of acked ofctrl_seqno_update requests and the most recent
>   * 'last_acked' value.
> + *
> + * 'last_acked_req_ts' carries the opaque timestamp that was associated with
> + * 'last_acked' at the time the seqno was requested.  Consumers that don't
> + * pass a timestamp at create time will see 0 here.  The timestamp is
> + * preserved across the barrier so that applications can pair the acked
> + * config seqno with the input state that produced it (e.g. SB_Global
> + * nb_cfg_timestamp at the moment we asked OVS to barrier).
>   */
>  struct ofctrl_acked_seqnos {
>      struct vector acked;
>      uint64_t last_acked;
> +    uint64_t last_acked_req_ts;
>  };
>  
>  struct ofctrl_acked_seqnos *ofctrl_acked_seqnos_get(size_t seqno_type);
> @@ -35,7 +43,8 @@ bool ofctrl_acked_seqnos_contains(const struct 
> ofctrl_acked_seqnos *seqnos,
>                                    uint64_t val);
>  
>  size_t ofctrl_seqno_add_type(void);
> -void ofctrl_seqno_update_create(size_t seqno_type, uint64_t new_cfg);
> +void ofctrl_seqno_update_create(size_t seqno_type, uint64_t new_cfg,
> +                                uint64_t req_ts);
>  void ofctrl_seqno_run(uint64_t flow_cfg);
>  uint64_t ofctrl_seqno_get_req_cfg(void);
>  void ofctrl_seqno_flush(void);
> diff --git a/lib/test-ofctrl-seqno.c b/lib/test-ofctrl-seqno.c
> index 7d478c033e..b31a50ecdc 100644
> --- a/lib/test-ofctrl-seqno.c
> +++ b/lib/test-ofctrl-seqno.c
> @@ -123,7 +123,7 @@ test_ofctrl_seqno_ack_seqnos(struct ovs_cmdl_context *ctx)
>                                          &app_seqno)) {
>                  return;
>              }
> -            ofctrl_seqno_update_create(i, app_seqno);
> +            ofctrl_seqno_update_create(i, app_seqno, 0);
>          }
>      }
>      printf("ofctrl-seqno-req-cfg: %u\n", n_reqs);
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index c3c198f2f3..0eaef6de44 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -541,6 +541,7 @@ update_sequence_numbers(int64_t loop_start_time,
>       * Also set up to update sb_cfg once our southbound transaction commits. 
> */
>      if (nb->nb_cfg != sb->nb_cfg) {
>          sbrec_sb_global_set_nb_cfg(sb, nb->nb_cfg);
> +        sbrec_sb_global_set_nb_cfg_timestamp(sb, loop_start_time);
>          nbrec_nb_global_set_nb_cfg_timestamp(nb, loop_start_time);
>      }
>      sb_loop->next_cfg = nb->nb_cfg;
> @@ -944,6 +945,8 @@ main(int argc, char *argv[])
>  
>      /* Disable alerting for pure write-only columns. */
>      ovsdb_idl_omit_alert(ovnsb_idl_loop.idl, &sbrec_sb_global_col_nb_cfg);
> +    ovsdb_idl_omit_alert(ovnsb_idl_loop.idl,
> +                         &sbrec_sb_global_col_nb_cfg_timestamp);
>      ovsdb_idl_omit_alert(ovnsb_idl_loop.idl, &sbrec_address_set_col_name);
>      ovsdb_idl_omit_alert(ovnsb_idl_loop.idl, 
> &sbrec_address_set_col_addresses);
>      for (size_t i = 0; i < SBREC_LOGICAL_FLOW_N_COLUMNS; i++) {
> diff --git a/ovn-sb.ovsschema b/ovn-sb.ovsschema
> index d9a91739cc..afe691a0d3 100644
> --- a/ovn-sb.ovsschema
> +++ b/ovn-sb.ovsschema
> @@ -1,11 +1,12 @@
>  {
>      "name": "OVN_Southbound",
> -    "version": "21.8.0",
> -    "cksum": "614397313 36713",
> +    "version": "21.8.1",
> +    "cksum": "3241242866 36779",
>      "tables": {
>          "SB_Global": {
>              "columns": {
>                  "nb_cfg": {"type": {"key": "integer"}},
> +                "nb_cfg_timestamp": {"type": {"key": "integer"}},
>                  "external_ids": {
>                      "type": {"key": "string", "value": "string",
>                               "min": 0, "max": "unlimited"}},
> diff --git a/ovn-sb.xml b/ovn-sb.xml
> index e45b63d73f..c1883db2e3 100644
> --- a/ovn-sb.xml
> +++ b/ovn-sb.xml
> @@ -156,6 +156,15 @@
>          the southbound database to bring it up to date with these changes, it
>          updates this column to the same value.
>        </column>
> +
> +      <column name="nb_cfg_timestamp">
> +        The time at which <code>ovn-northd</code> last wrote
> +        <ref column="nb_cfg"/> to the southbound database, in milliseconds
> +        since the Unix epoch.  Set atomically with each update to
> +        <ref column="nb_cfg"/>.  Hypervisors read this value to measure
> +        end-to-end propagation latency from northbound commit to local
> +        datapath programming completion.
> +      </column>
>      </group>
>  
>      <group title="Common Columns">

Regards,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to