On 12/16/22 18:50, Kevin Traynor wrote:
> Sleep for an incremental amount of time if none of the Rx queues
> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
> on a polling iteration of the PMD.
>
> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
> sleep time to zero (i.e. no sleep).
>
> Sleep time will be increased by 1 uS on each iteration where
> the low load conditions remain up to a total of the max sleep
> time which has a default of 250 uS.
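In other words, the requested-sleep backoff described above boils down to something like the following standalone sketch.  The names and the helper-function form are illustrative only; the patch implements this logic inline in pmd_thread_main() using the PMD_PWR_* macros:

```c
#include <stdint.h>

#define MAX_BURST 32                     /* NETDEV_MAX_BURST in OVS. */
#define NO_SLEEP_THRESH (MAX_BURST / 2)  /* 16 pkts stops sleeping. */
#define SLEEP_INC_US 1                   /* Increment per iteration. */
#define MAX_SLEEP_US 250                 /* Default max sleep request. */

/* Returns the sleep time (in us) to request on the next iteration,
 * given the current request and the largest Rx burst seen on any
 * polled queue this iteration.  A burst of >= 16 pkts resets the
 * backoff to zero; otherwise the request grows by 1 us per iteration
 * up to the configured maximum. */
uint64_t
next_sleep_request(uint64_t cur_sleep_us, int max_rx_burst)
{
    if (max_rx_burst >= NO_SLEEP_THRESH) {
        /* Threshold hit: no sleep. */
        return 0;
    }
    return cur_sleep_us < MAX_SLEEP_US
           ? cur_sleep_us + SLEEP_INC_US : MAX_SLEEP_US;
}
```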
Hi, Kevin.  Thanks for the patch!  The feature seems interesting, at
least as an experimental feature for users to try out.  See some
comments below.

>
> The feature is off by default and can be enabled by:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
>
> The max sleep time per iteration can be set e.g. to set to 500 uS:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-maxsleep=500

Do we actually need two separate options for this?  What about dropping
the general 'pmd-powersave' option and only keeping the max sleep
configuration, with '0' being the default and meaning no sleep?  We may
recommend some value in the docs, but it will ultimately be the user's
decision.

We might also drop the 'powersave' part from the knob and just have
'pmd-max-sleep'.  But I have no strong opinion on this.

The single max sleep option can be extended in the future to accept a
list of 'core:value' pairs for fine-grained per-PMD control, if
necessary, without breaking backward compatibility.  But that is
probably not needed right now.

>
> Also add new stats to pmd-perf-show to get visibility of operation
> e.g.
>
> <snip>
>   - No-sleep hit:    36445  ( 98.4 % of busy it)
>     Sleep time:    3350902  uS ( 34 us/it avg.)
> <snip>
>
> Signed-off-by: Kevin Traynor <[email protected]>
>
> ---
> v2:
> - Updated to mark feature as experimental as there is still discussion
>   on it's operation and control knobs
> - Added pmd-powersave-maxsleep to set the max requested sleep time
> - Added unit tests for pmd-powersave and pmd-powersave-maxsleep config
>   knobs
> - Added docs to explain that requested sleep time and actual sleep time
>   may differ
> - Added actual measurement of sleep time instead of reporting requested
>   time
> - Removed Max sleep hit statistics
> - Added total sleep time statistic for the length of the measurement
>   period (avg.
>   uS per iteration still exists also)
> - Updated other statistics to account for sleep time
> - Some renaming
> - Replaced xnanosleep with nanosleep to avoid having to start/end
>   quiesce for every sleep (this may KO this feature on Windows)

Maybe convert the current xnanosleep() into a static
void xnanosleep__(uint64_t nanoseconds, bool need_to_quiesce)
and create two wrappers with true/false as arguments: xnanosleep() and
xnanosleep_no_quiesce()?  Or something like that?  I didn't test it,
but the current code might break the Windows build, and not only in
this particular function.

> - Limited max requested sleep to max PMD quiesce time (10 ms)
> - Adapted ALB measurement about whether a PMD is overloaded to account
>   for time spent sleeping
> ---
>  Documentation/topics/dpdk/pmd.rst | 46 +++++++++++++++++
>  lib/dpif-netdev-perf.c            | 26 ++++++++--
>  lib/dpif-netdev-perf.h            |  5 +-
>  lib/dpif-netdev.c                 | 86 +++++++++++++++++++++++++++++--
>  tests/pmd.at                      | 43 ++++++++++++++++
>  vswitchd/vswitch.xml              | 34 ++++++++++++
>  6 files changed, 229 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..abc552029 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,4 +312,50 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>  (in min) such that a reassignment is triggered at most every few hours.
>
> +PMD Power Saving (Experimental)
> +-------------------------------
> +
> +PMD threads constantly poll Rx queues which are assigned to them. In order to
> +reduce the CPU cycles they use, they can sleep for small periods of time
> +when there is no load or very-low load from all the Rx queues they poll.
> +
> +This can be enabled by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
> +
> +With this enabled a PMD may request to sleep by an incrementing amount of time
> +up to a maximum time.
> If at any point the threshold of at least half a batch of
> +packets (i.e. 16) is received from an Rx queue that the PMD is polling is met,
> +the sleep time will be reset to 0. (i.e. no sleep).
> +
> +The default maximum sleep time is set 250 us. A user can configure a new
> +maximum requested sleep time in uS. e.g. to set to 1 ms::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep=1000
> +
> +Sleeping in a PMD thread will mean there is a period of time when the PMD
> +thread will not process packets. Sleep times requested are not guaranteed
> +and can differ significantly depending on system configuration. The actual
> +time not processing packets will be determined by the sleep and processor
> +wake-up times and should be tested with each system configuration.
> +
> +Sleep time statistics for 10 secs can be seen with::
> +
> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
> +
> +Example output, showing that the 16 packet no-sleep threshhold occurred in
> +98.2% of busy iterations and there was an average sleep time of
> +33 us per iteration::
> +
> +   - No-sleep hit:   119043  ( 98.2 % of busy it)

I'm not sure the word 'hit' is appropriate here.  Maybe 'No-sleep
iterations'?  And the word 'it' is a bit confusing in this context; the
full word might be better for understanding.  It's fine to move the
alignment around in the output here.

> +     Sleep time:   10638025  uS ( 33 us/it avg.)

uS/us inconsistency.

> +
> +.. note::
> +
> +    If there is a sudden spike of packets while the PMD thread is sleeping and
> +    the processor is in a low-power state it may result in some lost packets or
> +    extra latency before the PMD thread returns to processing packets at full
> +    rate.
> +
> ..
> _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
>
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index a2a7d8f0b..16445c68f 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>      uint64_t idle_iter = s->pkts.bin[0];
>      uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
> +    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];
>
>      ds_put_format(str,
> @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>          " - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n"
>          " - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
> -        " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n",
> -        tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +        " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
> +        " - No-sleep hit:    %12"PRIu64" (%5.1f %% of busy it)\n"
> +        "   Sleep time:      %12.0f uS (%3.0f us/it avg.)\n",
> +        tot_iter,
> +        (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
>          tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
>          idle_iter, 100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>          busy_iter,
> -        100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +        100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +        no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
> +        tot_sleep_cycles * us_per_cycle,
> +        tot_iter ?
> (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
>      if (rx_packets > 0) {
>          ds_put_format(str,
> @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, bool no_sleep_hit,
> +                       uint64_t sleep_cycles, bool full_metrics)
>  {
>      uint64_t now_tsc = cycles_counter_update(s);
> @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      char *reason = NULL;
>
> -    cycles = now_tsc - s->start_tsc;
> +    cycles = now_tsc - s->start_tsc - sleep_cycles;
>      s->current.timestamp = s->iteration_cnt;
>      s->current.cycles = cycles;
> @@ -540,4 +549,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      histogram_add_sample(&s->pkts, rx_packets);
>
> +    if (no_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
> +    }
> +    if (sleep_cycles) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
> +    }
> +
>      if (!full_metrics) {
>          return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 9673dddd8..94ffb16cf 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,6 @@ enum pmd_stat_type {
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>      PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */

HIT/ITER - same naming concern as with the stats output above.

> +    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
>      PMD_N_STATS
> };
> @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, bool no_sleep_hit,
> +                       uint64_t sleep_cycles, bool full_metrics);
>
>  /* Formatting the output of commands.
> */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2c08a71c8..586abb58d 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {
>  #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>
> +/* Default max time in uS for a pmd thread to sleep based on load. */
> +#define PMD_PWR_MAX_SLEEP 250

We should have a _US suffix here.

> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)

Some spaces around the '/' would be nice.

> +/* Time in uS to increment a pmd thread sleep time. */
> +#define PMD_PWR_INC 1

And a _US suffix here, too.

> +
>  struct dpcls {
>      struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
> @@ -278,4 +285,7 @@ struct dp_netdev {
>      /* Enable collection of PMD performance metrics. */
>      atomic_bool pmd_perf_metrics;
> +    /* Enable PMD load based sleeping. */
> +    atomic_bool pmd_powersave;
> +    atomic_uint64_t pmd_max_sleep;
>      /* Enable the SMC cache from ovsdb config */
>      atomic_bool smc_enable_db;
> @@ -4767,5 +4777,8 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      uint8_t cur_rebalance_load;
>      uint32_t rebalance_load, rebalance_improve;
> +    uint64_t pmd_max_sleep, cur_pmd_max_sleep;
> +    bool powersave, cur_powersave;
>      bool log_autolb = false;
> +    bool log_pmdsleep = false;
>      enum sched_assignment_type pmd_rxq_assign_type;
>
> @@ -4915,4 +4928,28 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>
>      set_pmd_auto_lb(dp, autolb_state, log_autolb);
> +
> +    pmd_max_sleep = smap_get_ullong(other_config, "pmd-powersave-maxsleep",
> +                                    PMD_PWR_MAX_SLEEP);
> +    pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
> +    atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
> +    if (pmd_max_sleep != cur_pmd_max_sleep) {
> +        atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
> +        log_pmdsleep = true;
> +    }

We should probably enforce
the upper limit (10000 us).

> +
> +    powersave = smap_get_bool(other_config, "pmd-powersave", false);
> +    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
> +    if (powersave != cur_powersave) {
> +        atomic_store_relaxed(&dp->pmd_powersave, powersave);
> +        log_pmdsleep = true;
> +    }
> +
> +    if (log_pmdsleep) {
> +        VLOG_INFO("PMD powersave max sleep request is %"PRIu64" usecs.",
> +                  pmd_max_sleep);
> +        VLOG_INFO("PMD powersave is %s.",
> +                  powersave ? "enabled" : "disabled" );
> +    }
> +
>      return 0;
>  }
> @@ -6873,7 +6910,9 @@ pmd_thread_main(void *f_)
>      bool exiting;
>      bool reload;
> +    bool powersave;
>      int poll_cnt;
>      int i;
>      int process_packets = 0;
> +    uint64_t sleep_time = 0;
>
>      poll_list = NULL;
> @@ -6935,7 +6974,16 @@ reload:
>      for (;;) {
>          uint64_t rx_packets = 0, tx_packets = 0;
> +        bool nosleep_hit = false;
> +        uint64_t time_slept = 0;
> +        uint64_t max_sleep;
>
>          pmd_perf_start_iteration(s);
> +        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
> +        atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);
>
> +        if (!powersave) {
> +            /* Reset sleep_time as policy may have changed. */
> +            sleep_time = 0;
> +        }
>          atomic_read_relaxed(&pmd->dp->smc_enable_db,
>                              &pmd->ctx.smc_enable_db);
>
> @@ -6957,4 +7005,8 @@ reload:
>                                             poll_list[i].port_no);
>              rx_packets += process_packets;
> +            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
> +                nosleep_hit = true;
> +                sleep_time = 0;
> +            }
>          }
>
> @@ -6964,5 +7016,24 @@ reload:
>               * There was no time updates on current iteration. */
>              pmd_thread_ctx_time_update(pmd);
> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
> +                                                            sleep_time ?
> true : false);
> +    }
> +
> +    if (powersave) {
> +        if (sleep_time) {
> +            struct timespec ts_sleep;
> +            struct cycle_timer sleep_timer;
> +
> +            nsec_to_timespec(sleep_time * 1000, &ts_sleep);
> +            cycle_timer_start(&pmd->perf_stats, &sleep_timer);
> +            nanosleep(&ts_sleep, NULL);
> +            time_slept = cycle_timer_stop(&pmd->perf_stats,
> +                                          &sleep_timer);
> +        }
> +        if (sleep_time < max_sleep) {
> +            /* Increase sleep time for next iteration. */
> +            sleep_time += PMD_PWR_INC;
> +        } else {
> +            sleep_time = max_sleep;
> +        }
>      }
>
> @@ -7004,6 +7075,6 @@ reload:
>          }
>
> -        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> -                               pmd_perf_metrics_enabled(pmd));
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets, nosleep_hit,
> +                               time_slept, pmd_perf_metrics_enabled(pmd));
>      }
>      ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
> @@ -9855,5 +9926,5 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>  {
>      struct dpcls *cls;
> -    uint64_t tot_idle = 0, tot_proc = 0;
> +    uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
>      unsigned int pmd_load = 0;
>
> @@ -9872,8 +9943,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>      tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
>                 pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> +    tot_sleep = pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES] -
> +                    pmd->prev_stats[PMD_PWR_SLEEP_CYCLES];

Indentation is a bit off.

>
>      if (pmd_alb->is_enabled && !pmd->isolated) {
>          if (tot_proc) {
> -            pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> +            pmd_load = ((tot_proc * 100) /
> +                        (tot_idle + tot_proc + tot_sleep));
>          }
>
> @@ -9892,4 +9966,6 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>      pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
>          pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> +    pmd->prev_stats[PMD_PWR_SLEEP_CYCLES] =
> +        pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES];
>
>      /* Get the cycles that were used to process each queue and store.
> */
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 10879a349..fb1d86793 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -1193,2 +1193,45 @@ ovs-appctl: ovs-vswitchd: server returned an error
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
> +
> +dnl Check default state
> +AT_SETUP([PMD - pmd sleep])
> +OVS_VSWITCHD_START
> +OVS_WAIT_UNTIL([grep "PMD powersave is disabled." ovs-vswitchd.log])
> +
> +dnl Check can be enabled
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is enabled."])
> +
> +dnl Check can be disabled
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="false"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is disabled."])
> +
> +dnl Check default max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 250 usecs."])
> +
> +dnl Check low value max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="1"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 1 usecs."])
> +
> +dnl Check high value max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="10000"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 10000 usecs."])
> +
> +dnl Check setting max sleep to zero
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch .
> other_config:pmd-powersave-maxsleep="0"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 0 usecs."])
> +
> +dnl Check above high value max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="10001"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 10000 usecs."])
> +
> +OVS_VSWITCHD_STOP
> +AT_CLEANUP
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 928821a82..3ab3560f2 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -789,4 +789,38 @@
>          </p>
>        </column>
> +      <column name="other_config" key="pmd-powersave"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Enables PMD thread requesting to sleep for values up to the
> +          maximum PMD sleep time of
> +          <ref column="other_config" key="pmd-powersave-maxsleep"/>) per
> +          iteration. The request and value that will be used is based on the
> +          number of packets available from the Rx queues that the PMD polls.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option.
> +        </p>
> +        <p>
> +          Sleep time for the PMD will be reset to 0 uS (i.e. no sleep) when
> +          any Rx queue that it polls has 16 or more packets available for Rx.
> +        </p>
> +      </column>
> +      <column name="other_config" key="pmd-powersave-maxsleep"
> +              type='{"type": "integer",
> +                     "minInteger": 0, "maxInteger": 10000}'>
> +        <p>
> +          Specifies the maximum sleep time that will be requested in
> +          microseconds per iteration for a PMD with no or low load.
> +        </p>
> +        <p>
> +          The default value is <code>250 microseconds</code>.
> +        </p>
> +        <p>
> +          The maximum value is <code>10000 microseconds</code>.
> +        </p>
> +      </column>
>        <column name="other_config" key="userspace-tso-enable"
>                type='{"type": "boolean"}'>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
