I have some comment myself regarding the documentation:

On Wed, Apr 6, 2022 at 4:36 PM Han Zhou <[email protected]> wrote:
>
> Whenever OpenFlow connection between ovn-controller and OVS is
> connected/reconnected, typically during ovn-controller/OVS
> restart/upgrade, ovn-controller would:
> 1. clears all the existing flows after the initial hand-shaking
> 2. compute the new flows
> 3. install the new flows to OVS.
>
> In large scale environments when there are a big number of flows
> needs to be computed by ovn-controller, the step 2 above can take
> very long time (from seconds to minutes, depending on the scale and
> topology), which would cause a data plane down time.

I should make it more clear that both step 2 and 3 can take a very long
time. This patch reduces the down time of step 2 only. And I should change
the word "avoid" in the patch title to "reduce".
(To avoid the down time introduced by step 3, I will work on another patch
to address it separately.)

>
> The purpose of this patch is to avoid data plane down time during
> restart/upgrade.

I should make it clear that this patch helps for ovn-controller
restart/upgrade, but not for OVS restart/upgrade, because as soon as OVS
restarts, the OVS flows are gone, and postponing the flow clearing won't
help for this.

> It adds a new state S_WAIT_BEFORE_CLEAR to the ofctrl
> state machine, so that ovn-controller waits for a period of time before
> clearing the existing OVS flows while it is computing the new flows.
> Ideally, ovn-controller should clear the flows after it completes the
> new flows computing. However, it is difficult for ovn-controller to
> judge if it has generated all the new flows (or the flows that are
> sufficient to avoid down time), because flows computation depends on
> iterations of SB monitor data processing and condition changes. So,
> instead of try to determine by ovn-controller itself, this patch
> provides a configuration "ovn-ofctrl-wait-before-clear" for users to
> determine the feasible time to wait, according to the scale. As
> mentioned in the document, the output of
>
>   ovn-appctl -t ovn-controller stopwatch/show flow-generation
>
> can help users determining what value to set.

This command alone may be insufficient. It would be better to trigger a
recompute before this command:
ovn-appctl -t ovn-controller inc-engine/recompute

I will update the document in v2. I think the rest of the patch is still
valid for review. Please review and comment.

Thanks,
Han

>
> Signed-off-by: Han Zhou <[email protected]>
> ---
>  controller/ofctrl.c             | 87 ++++++++++++++++++++++++++++-----
>  controller/ofctrl.h             |  4 +-
>  controller/ovn-controller.8.xml | 27 ++++++++++
>  controller/ovn-controller.c     |  6 +--
>  controller/pinctrl.c            |  2 +
>  tests/ovn-controller.at         | 52 ++++++++++++++++++++
>  6 files changed, 161 insertions(+), 17 deletions(-)
>
> diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> index a67dab024..8239c27f6 100644
> --- a/controller/ofctrl.c
> +++ b/controller/ofctrl.c
> @@ -47,6 +47,7 @@
>  #include "physical.h"
>  #include "openvswitch/rconn.h"
>  #include "socket-util.h"
> +#include "timeval.h"
>  #include "util.h"
>  #include "vswitch-idl.h"
>
> @@ -297,6 +298,7 @@ static unsigned int seqno;
>      STATE(S_NEW)                                \
>      STATE(S_TLV_TABLE_REQUESTED)                \
>      STATE(S_TLV_TABLE_MOD_SENT)                 \
> +    STATE(S_WAIT_BEFORE_CLEAR)                  \
>      STATE(S_CLEAR_FLOWS)                        \
>      STATE(S_UPDATE_FLOWS)
>  enum ofctrl_state {
> @@ -339,6 +341,14 @@ static uint64_t cur_cfg;
>  /* Current state. */
>  static enum ofctrl_state state;
>
> +/* The time (ms) to stay in the state S_WAIT_BEFORE_CLEAR. Read from
> + * external_ids: ovn-ofctrl-wait-before-clear. */
> +static unsigned int wait_before_clear_time = 0;
> +
> +/* The time when the state S_WAIT_BEFORE_CLEAR should complete.
> + * If the timer is not started yet, it is set to 0. */
> +static long long int wait_before_clear_expire = 0;
> +
>  /* Transaction IDs for messages in flight to the switch. */
>  static ovs_be32 xid, xid2;
>
> @@ -444,18 +454,19 @@ recv_S_NEW(const struct ofp_header *oh OVS_UNUSED,
>   * If we receive an NXT_TLV_TABLE_REPLY:
>   *
>   *     - If it contains our tunnel metadata option, assign its field ID
to
> - *       mff_ovn_geneve and transition to S_CLEAR_FLOWS.
> + *       mff_ovn_geneve and transition to S_WAIT_BEFORE_CLEAR.
>   *
>   *     - Otherwise, if there is an unused tunnel metadata field ID, send
>   *       NXT_TLV_TABLE_MOD and OFPT_BARRIER_REQUEST, and transition to
>   *       S_TLV_TABLE_MOD_SENT.
>   *
>   *     - Otherwise, log an error, disable Geneve, and transition to
> - *       S_CLEAR_FLOWS.
> + *       S_WAIT_BEFORE_CLEAR.
>   *
>   * If we receive an OFPT_ERROR:
>   *
> - *     - Log an error, disable Geneve, and transition to S_CLEAR_FLOWS.
*/
> + *     - Log an error, disable Geneve, and transition to
S_WAIT_BEFORE_CLEAR.
> + */
>
>  static void
>  run_S_TLV_TABLE_REQUESTED(void)
> @@ -482,7 +493,7 @@ process_tlv_table_reply(const struct
ofputil_tlv_table_reply *reply)
>                  return false;
>              } else {
>                  mff_ovn_geneve = MFF_TUN_METADATA0 + map->index;
> -                state = S_CLEAR_FLOWS;
> +                state = S_WAIT_BEFORE_CLEAR;
>                  return true;
>              }
>          }
> @@ -549,7 +560,7 @@ recv_S_TLV_TABLE_REQUESTED(const struct ofp_header
*oh, enum ofptype type,
>
>      /* Error path. */
>      mff_ovn_geneve = 0;
> -    state = S_CLEAR_FLOWS;
> +    state = S_WAIT_BEFORE_CLEAR;
>  }
>
>  /* S_TLV_TABLE_MOD_SENT, when NXT_TLV_TABLE_MOD and OFPT_BARRIER_REQUEST
> @@ -561,12 +572,12 @@ recv_S_TLV_TABLE_REQUESTED(const struct ofp_header
*oh, enum ofptype type,
>   *       raced with some other controller.  Transition to S_NEW.
>   *
>   *     - Otherwise, log an error, disable Geneve, and transition to
> - *       S_CLEAR_FLOWS.
> + *       S_WAIT_BEFORE_CLEAR.
>   *
>   * If we receive OFPT_BARRIER_REPLY:
>   *
>   *     - Set the tunnel metadata field ID to the one that we requested.
> - *       Transition to S_CLEAR_FLOWS.
> + *       Transition to S_WAIT_BEFORE_CLEAR.
>   */
>
>  static void
> @@ -581,7 +592,7 @@ recv_S_TLV_TABLE_MOD_SENT(const struct ofp_header
*oh, enum ofptype type,
>      if (oh->xid != xid && oh->xid != xid2) {
>          ofctrl_recv(oh, type);
>      } else if (oh->xid == xid2 && type == OFPTYPE_BARRIER_REPLY) {
> -        state = S_CLEAR_FLOWS;
> +        state = S_WAIT_BEFORE_CLEAR;
>      } else if (oh->xid == xid && type == OFPTYPE_ERROR) {
>          enum ofperr error = ofperr_decode_msg(oh, NULL);
>          if (error == OFPERR_NXTTMFC_ALREADY_MAPPED ||
> @@ -605,7 +616,36 @@ recv_S_TLV_TABLE_MOD_SENT(const struct ofp_header
*oh, enum ofptype type,
>      return;
>
>  error:
> -    state = S_CLEAR_FLOWS;
> +    state = S_WAIT_BEFORE_CLEAR;
> +}
> +
> +/* S_WAIT_BEFORE_CLEAR, we are almost ready to set up flows, but just
wait for
> + * a while until the initial flow compute to complete before we clear the
> + * existing flows in OVS, so that we won't end up with an empty flow
table,
> + * which may cause data plane down time. */
> +static void
> +run_S_WAIT_BEFORE_CLEAR(void)
> +{
> +    if (!wait_before_clear_time ||
> +        (wait_before_clear_expire &&
> +         time_msec() >= wait_before_clear_expire)) {
> +        wait_before_clear_expire = 0;
> +        state = S_CLEAR_FLOWS;
> +        return;
> +    }
> +
> +    if (!wait_before_clear_expire) {
> +        /* Start the timer. */
> +        wait_before_clear_expire = time_msec() + wait_before_clear_time;
> +    }
> +    poll_timer_wait_until(wait_before_clear_expire);
> +}
> +
> +static void
> +recv_S_WAIT_BEFORE_CLEAR(const struct ofp_header *oh, enum ofptype type,
> +                         struct shash *pending_ct_zones OVS_UNUSED)
> +{
> +    ofctrl_recv(oh, type);
>  }
>
>  /* S_CLEAR_FLOWS, after we've established a Geneve metadata field ID and
it's
> @@ -744,7 +784,9 @@ ofctrl_get_mf_field_id(void)
>   * hypervisor on which we are running.  Attempts to negotiate a Geneve
option
>   * field for class OVN_GENEVE_CLASS, type OVN_GENEVE_TYPE. */
>  void
> -ofctrl_run(const struct ovsrec_bridge *br_int, struct shash
*pending_ct_zones)
> +ofctrl_run(const struct ovsrec_bridge *br_int,
> +           const struct ovsrec_open_vswitch_table *ovs_table,
> +           struct shash *pending_ct_zones)
>  {
>      char *target = xasprintf("unix:%s/%s.mgmt", ovs_rundir(),
br_int->name);
>      if (strcmp(target, rconn_get_target(swconn))) {
> @@ -771,6 +813,16 @@ ofctrl_run(const struct ovsrec_bridge *br_int,
struct shash *pending_ct_zones)
>              }
>          }
>      }
> +    const struct ovsrec_open_vswitch *cfg =
> +        ovsrec_open_vswitch_table_first(ovs_table);
> +    ovs_assert(cfg);
> +    unsigned int _wait_before_clear_time =
> +        smap_get_uint(&cfg->external_ids,
"ovn-ofctrl-wait-before-clear", 0);
> +    if (_wait_before_clear_time != wait_before_clear_time) {
> +        VLOG_INFO("ofctrl-wait-before-clear is now %u ms (was %u ms)",
> +                  _wait_before_clear_time, wait_before_clear_time);
> +        wait_before_clear_time = _wait_before_clear_time;
> +    }
>
>      bool progress = true;
>      for (int i = 0; progress && i < 50; i++) {
> @@ -2456,16 +2508,25 @@ update_installed_flows_by_track(struct
ovn_desired_flow_table *flow_table,
>      }
>  }
>
> +bool
> +ofctrl_has_backlog(void)
> +{
> +    if (rconn_packet_counter_n_packets(tx_counter)
> +        || rconn_get_version(swconn) < 0) {
> +        return true;
> +    }
> +    return false;
> +}
> +
>  /* The flow table can be updated if the connection to the switch is up
and
>   * in the correct state and not backlogged with existing flow_mods.  (Our
>   * criteria for being backlogged appear very conservative, but the socket
>   * between ovn-controller and OVS provides some buffering.) */
> -bool
> +static bool
>  ofctrl_can_put(void)
>  {
>      if (state != S_UPDATE_FLOWS
> -        || rconn_packet_counter_n_packets(tx_counter)
> -        || rconn_get_version(swconn) < 0) {
> +        || ofctrl_has_backlog()) {
>          return false;
>      }
>      return true;
> diff --git a/controller/ofctrl.h b/controller/ofctrl.h
> index ad8f4be65..330b0b6ca 100644
> --- a/controller/ofctrl.h
> +++ b/controller/ofctrl.h
> @@ -28,6 +28,7 @@ struct hmap;
>  struct match;
>  struct ofpbuf;
>  struct ovsrec_bridge;
> +struct ovsrec_open_vswitch_table;
>  struct sbrec_meter_table;
>  struct shash;
>
> @@ -50,6 +51,7 @@ void ofctrl_init(struct ovn_extend_table *group_table,
>                   struct ovn_extend_table *meter_table,
>                   int inactivity_probe_interval);
>  void ofctrl_run(const struct ovsrec_bridge *br_int,
> +                const struct ovsrec_open_vswitch_table *,
>                  struct shash *pending_ct_zones);
>  enum mf_field_id ofctrl_get_mf_field_id(void);
>  void ofctrl_put(struct ovn_desired_flow_table *lflow_table,
> @@ -59,7 +61,7 @@ void ofctrl_put(struct ovn_desired_flow_table
*lflow_table,
>                  uint64_t nb_cfg,
>                  bool lflow_changed,
>                  bool pflow_changed);
> -bool ofctrl_can_put(void);
> +bool ofctrl_has_backlog(void);
>  void ofctrl_wait(void);
>  void ofctrl_destroy(void);
>  uint64_t ofctrl_get_cur_cfg(void);
> diff --git a/controller/ovn-controller.8.xml
b/controller/ovn-controller.8.xml
> index cc9a7d1c2..94d873d6a 100644
> --- a/controller/ovn-controller.8.xml
> +++ b/controller/ovn-controller.8.xml
> @@ -257,6 +257,33 @@
>          The default value is considered false if this option is not
defined.
>        </dd>
>
> +      <dt><code>external_ids:ovn-ofctrl-wait-before-clear</code></dt>
> +      <dd>
> +        The time, in milliseconds, to wait before clearing flows in OVS
after
> +        OpenFlow connection/reconnection during
<code>ovn-controller</code>
> +        initialization or OVS restart. The purpose of this wait is to
give
> +        time for <code>ovn-controller</code> to compute the new flows
before
> +        clearing existing ones, to avoid data plane down time during
> +        <code>ovn-controller</code> or OVS upgrade/restart at large scale
> +        environments where recomputing the flows takes more than a few
seconds
> +        or even longer. It is difficult for <code>ovn-controller</code>
> +        to determine when the new flows computing is completed, because
of the
> +        dynamics in the cloud environments, which is why this
configuration is
> +        provided for users to adjust based on the scale of the
environment. By
> +        default, it is 0, which means clearing existing flows without
waiting. Not
> +        setting the value, or setting it too small, may result in data
plane down
> +        time during upgrade/restart, while setting it too big may result
in extra
> +        control plane latency of applying new changes of CMS during
> +        upgrade/restart. It is recommended to run the below command in
the target
> +        environment and set this configuration slightly bigger than the
> +        <code>Maximum</code> shown in the output:
> +        <ul>
> +          <li>
> +            <code>ovn-appctl -t ovn-controller stopwatch/show
flow-generation</code>
> +          </li>
> +        </ul>
> +      </dd>
> +
>        <dt><code>external_ids:ovn-enable-lflow-cache</code></dt>
>        <dd>
>          The boolean flag indicates if <code>ovn-controller</code> should
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index dd52b45bf..bbfeab919 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -3705,7 +3705,7 @@ main(int argc, char *argv[])
>              if (br_int) {
>                  ct_zones_data = engine_get_data(&en_ct_zones);
>                  if (ct_zones_data) {
> -                    ofctrl_run(br_int, &ct_zones_data->pending);
> +                    ofctrl_run(br_int, ovs_table,
&ct_zones_data->pending);
>                  }
>
>                  if (chassis) {
> @@ -3720,7 +3720,7 @@ main(int argc, char *argv[])
>                      stopwatch_start(CONTROLLER_LOOP_STOPWATCH_NAME,
>                                      time_msec());
>                      if (ovnsb_idl_txn) {
> -                        if (!ofctrl_can_put()) {
> +                        if (ofctrl_has_backlog()) {
>                              /* When there are in-flight messages pending
to
>                               * ovs-vswitchd, we should hold on
recomputing so
>                               * that the previous flow installations
won't be
> @@ -3733,7 +3733,7 @@ main(int argc, char *argv[])
>                               * change tracking is improved, we can
simply skip
>                               * this round of engine_run and continue
processing
>                               * acculated changes incrementally later when
> -                             * ofctrl_can_put() returns true. */
> +                             * ofctrl_has_backlog() returns false. */
>                              engine_run(false);
>                          } else {
>                              engine_run(true);
> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
> index 25b37ee88..48bb56134 100644
> --- a/controller/pinctrl.c
> +++ b/controller/pinctrl.c
> @@ -3441,6 +3441,7 @@ pinctrl_set_br_int_name_(char *br_int_name)
>  {
>      if (br_int_name && (!pinctrl.br_int_name ||
strcmp(pinctrl.br_int_name,
>                                                         br_int_name))) {
> +        VLOG_INFO("pinctrl set br_int_name");
>          free(pinctrl.br_int_name);
>          pinctrl.br_int_name = xstrdup(br_int_name);
>          /* Notify pinctrl_handler that integration bridge is
> @@ -3479,6 +3480,7 @@ pinctrl_run(struct ovsdb_idl_txn *ovnsb_idl_txn,
>              const struct shash *local_active_ports_ipv6_pd,
>              const struct shash *local_active_ports_ras)
>  {
> +    VLOG_INFO("pinctrl_run");
>      ovs_mutex_lock(&pinctrl_mutex);
>      pinctrl_set_br_int_name_(br_int->name);
>      run_put_mac_bindings(ovnsb_idl_txn, sbrec_datapath_binding_by_key,
> diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> index 88f230541..88ec2f269 100644
> --- a/tests/ovn-controller.at
> +++ b/tests/ovn-controller.at
> @@ -2102,3 +2102,55 @@ AT_CHECK([echo $(($lflow_run_new -
$lflow_run_old))], [0], [2
>
>  OVN_CLEANUP([hv1])
>  AT_CLEANUP
> +
> +AT_SETUP([ovn-controller - ofctrl wait before clearing flows])
> +
> +ovn_start
> +
> +net_add n1
> +sim_add hv1
> +as hv1
> +check ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +check ovs-vsctl -- add-port br-int hv1-vif1 -- \
> +    set interface hv1-vif1 external-ids:iface-id=ls1-lp1
> +
> +check ovn-nbctl ls-add ls1
> +
> +check ovn-nbctl --wait=hv lsp-add ls1 ls1-lp1 \
> +-- lsp-set-addresses ls1-lp1 "f0:00:00:00:00:01 10.1.2.3"
> +
> +# Set 5 seconds wait time before clearing OVS flows.
> +check ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=5000
> +
> +# Stop ovn-controller
> +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +
> +# The old OVS flows should remain (this is regardless of the
configuration)
> +AT_CHECK([ovs-ofctl dump-flows br-int | grep 10.1.2.3], [0], [ignore])
> +
> +# Make a change to the ls1-lp1's IP
> +check ovn-nbctl --wait=sb lsp-set-addresses ls1-lp1 "f0:00:00:00:00:01
10.1.2.4"
> +
> +# Start ovn-controller, which should compute new flows but not apply them
> +# until the wait time is completed.
> +start_daemon ovn-controller
> +sleep 2
> +
> +# Check in the middle of the wait.
> +lflow_run_1=$(ovn-appctl -t ovn-controller coverage/read-counter
lflow_run)
> +AT_CHECK([ovs-ofctl dump-flows br-int | grep 10.1.2.3], [0], [ignore])
> +
> +sleep 5
> +
> +# Check after the wait
> +OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep 10.1.2.4], [0],
[ignore])
> +lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter
lflow_run)
> +
> +# Verify that the flow compute completed during the wait (after the wait
it
> +# merely installs the new flows).
> +AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2
> +])
> +
> +OVN_CLEANUP([hv1])
> +AT_CLEANUP
> --
> 2.30.2
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to