On Fri, Jan 6, 2023 at 4:00 PM Kevin Traynor <[email protected]> wrote:
>
> Sleep for an incremental amount of time if none of the Rx queues
> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
> on a polling iteration of the PMD.
>
> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
> sleep time to zero (i.e. no sleep).
>
> Sleep time will be increased on each iteration where the low load
> conditions remain up to a total of the max sleep time which is set
> by the user e.g:
> ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
>
> The default pmd-maxsleep value is 0, which means that no sleeps
> will occur and the default behaviour is unchanged from previously.
>
> Also add new stats to pmd-perf-show to get visibility of operation
> e.g.
> ...
>    - sleep iterations:       153994  ( 76.8 % of iterations)
>    Sleep time:               9159399  us ( 46 us/iteration avg.)
> ...
>
> Signed-off-by: Kevin Traynor <[email protected]>
> ---
>  Documentation/topics/dpdk/pmd.rst | 51 ++++++++++++++++++++++++
>  NEWS                              |  3 ++
>  lib/dpif-netdev-perf.c            | 24 +++++++++---
>  lib/dpif-netdev-perf.h            |  5 ++-
>  lib/dpif-netdev.c                 | 64 +++++++++++++++++++++++++++++--
>  tests/pmd.at                      | 46 ++++++++++++++++++++++
>  vswitchd/vswitch.xml              | 26 +++++++++++++
>  7 files changed, 209 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 9006fd40f..89f6b3052 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>  (in min) such that a reassignment is triggered at most every few hours.
>
> +PMD Power Saving (Experimental)
> +-------------------------------

I would stick to: "PMD load based sleeping".
The power saving comes from external configuration that this patch
does not cover.

Maybe you could mention something about C-states, but that seems out of
scope for OVS itself.
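(Purely as an aside, since it is external configuration: the actual power
saving largely depends on which C-states the cpuidle driver is allowed to
enter while the PMD sleeps. On Linux that can be inspected and tuned along
these lines; the 10 us latency threshold below is only illustrative:)

```shell
# Show the C-states the cpuidle driver exposes (system-specific).
cpupower idle-info

# Illustrative only: disable idle states whose wakeup latency exceeds
# 10 us, trading some power saving for lower wakeup latency.
cpupower idle-set -D 10
```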


> +
> +PMD threads constantly poll Rx queues which are assigned to them. In order to
> +reduce the CPU cycles they use, they can sleep for small periods of time
> +when there is no load or very-low load on all the Rx queues they poll.
> +
> +This can be enabled by setting the max requested sleep time (in microseconds)
> +for a PMD thread::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
> +
> +Non-zero values will be rounded up to the nearest 10 microseconds to avoid
> +requesting very small sleep times.
> +
> +With a non-zero max value a PMD may request to sleep by an incrementing amount
> +of time up to the maximum time. If at any point the threshold of at least half
> +a batch of packets (i.e. 16) received from an Rx queue that the PMD is polling
> +is met, the requested sleep time will be reset to 0. At that point no sleeps
> +will occur until the no/low load conditions return.
> +
> +Sleeping in a PMD thread will mean there is a period of time when the PMD
> +thread will not process packets. Sleep times requested are not guaranteed
> +and can differ significantly depending on system configuration. The actual
> +time not processing packets will be determined by the sleep and processor
> +wake-up times and should be tested with each system configuration.
> +
> +Sleep time statistics for 10 secs can be seen with::
> +
> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
> +
> +Example output, showing that during the last 10 seconds, 76.8% of iterations
> +had a sleep of some length. The total amount of sleep time was 9.15 seconds and
> +the average sleep time per iteration was 46 microseconds::
> +
> +   - sleep iterations:       153994  ( 76.8 % of iterations)
> +   Sleep time:               9159399  us ( 46 us/iteration avg.)
> +
> +.. note::
> +
> +    If there is a sudden spike of packets while the PMD thread is sleeping and
> +    the processor is in a low-power state it may result in some lost packets or
> +    extra latency before the PMD thread returns to processing packets at full
> +    rate.
> +
> +.. note::
> +
> +    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
> +    will add overhead to requested sleep time.
> +
>  .. _ovs-vswitchd(8):
>      http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
> diff --git a/NEWS b/NEWS
> index 2f6ededfe..54d97825e 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -31,4 +31,7 @@ Post-v3.0.0
>       * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
>         the pmd usage of an Rx queue over a configurable time period.
> +     * Add new experiemental PMD load based sleeping feature. PMD threads can

*experimental


> +       request to sleep up to a user configured 'pmd-maxsleep' value under no
> +       and low load conditions.
>
>
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index a2a7d8f0b..bc6b779a7 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>      uint64_t idle_iter = s->pkts.bin[0];
>      uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t sleep_iter = stats[PMD_PWR_SLEEP_ITER];
> +    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];

I would remove _PWR_.


>
>      ds_put_format(str,
> @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>              "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>              "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
> -            tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> +            "  - sleep iterations: %12"PRIu64"  (%5.1f %% of iterations)\n"
> +            " Sleep time:          %12.0f  us (%3.0f us/iteration avg.)\n",

This gives:

pmd thread numa_id 1 core_id 5:

  Iterations:               884937  (361.43 us/it)
  - Used TSC cycles:   24829488529  (  1.2 % of total cycles)
  - idle iterations:        563472  (  1.4 % of used cycles)
  - busy iterations:        321465  ( 98.6 % of used cycles)
  - sleep iterations:       569487  ( 64.4 % of iterations)
 Sleep time:             310297274  us (351 us/iteration avg.)
 ^^^
I would add another space before Sleep so it aligns with the rest.

And maybe put the unit in the label, since no other stat has spelled out
its unit so far:
+            "  Sleep time (us):    %12.0f  (%3.0f us/iteration avg.)\n",

  Rx packets:             10000000  (13 Kpps, 2447 cycles/pkt)
  Datapath passes:        10000000  (1.00 passes/pkt)
  - PHWOL hits:                  0  (  0.0 %)
  - MFEX Opt hits:               0  (  0.0 %)


> +            tot_iter,
> +            (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
>              tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
>              idle_iter,
>              100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>              busy_iter,
> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +            sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
> +            tot_sleep_cycles * us_per_cycle,
> +            tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
>      if (rx_packets > 0) {
>          ds_put_format(str,
> @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, uint64_t sleep_cycles,
> +                       bool full_metrics)
>  {
>      uint64_t now_tsc = cycles_counter_update(s);
> @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      char *reason = NULL;
>
> -    cycles = now_tsc - s->start_tsc;
> +    cycles = now_tsc - s->start_tsc - sleep_cycles;
>      s->current.timestamp = s->iteration_cnt;
>      s->current.cycles = cycles;
> @@ -540,4 +549,9 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      histogram_add_sample(&s->pkts, rx_packets);
>
> +    if (sleep_cycles) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_ITER, 1);
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
> +    }
> +
>      if (!full_metrics) {
>          return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 9673dddd8..ebf776827 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,6 @@ enum pmd_stat_type {
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>      PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_SLEEP_ITER,     /* Iterations where a sleep has taken place. */
> +    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
>      PMD_N_STATS
>  };
> @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, uint64_t sleep_cycles,
> +                       bool full_metrics);
>
>  /* Formatting the output of commands. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7127068fe..af97f9a83 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -172,4 +172,9 @@ static struct odp_support dp_netdev_support = {
>  #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>
> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
> +/* Time in uS to increment a pmd thread sleep time. */
> +#define PMD_PWR_INC_US 10

Idem, no _PWR_.


> +
>  struct dpcls {
>      struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */

The rest lgtm.
With this fixed, you can add:
Reviewed-by: David Marchand <[email protected]>


-- 
David Marchand

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
