Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on an polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased by 1 uS on each iteration where
the low load conditions remain up to a total of the max sleep
time which has a default of 250 uS.

The feature is off by default and can be enabled by:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true

The max sleep time per iteration can be set e.g. to set to 500 uS:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-maxsleep=500

Also add new stats to pmd-perf-show to get visibility of operation
e.g.

<snip>
  - No-sleep hit:            36445  ( 98.4 % of busy it)
 Sleep time:               3350902  uS ( 34 us/it avg.)
<snip>

Signed-off-by: Kevin Traynor <ktray...@redhat.com>

---
v2:
- Updated to mark feature as experimental as there is still discussion
  on it's operation and control knobs
- Added pmd-powersave-maxsleep to set the max requested sleep time
- Added unit tests for pmd-powersave and pmd-powersave-maxsleep config
  knobs
- Added docs to explain that requested sleep time and actual sleep time
  may differ
- Added actual measurement of sleep time instead of reporting requested
  time
- Removed Max sleep hit statistics
- Added total sleep time statistic for the length of the measurement
  period (avg. uS per iteration still exists also)
- Updated other statistics to account for sleep time
- Some renaming
- Replaced xnanosleep with nanosleep to avoid having to start/end
  quiesce for every sleep (this may KO this feature on Windows)
- Limited max requested sleep to max PMD quiesce time (10 ms)
- Adapted ALB measurement about whether a PMD is overloaded to account
  for time spent sleeping
---
 Documentation/topics/dpdk/pmd.rst | 46 +++++++++++++++++
 lib/dpif-netdev-perf.c            | 26 ++++++++--
 lib/dpif-netdev-perf.h            |  5 +-
 lib/dpif-netdev.c                 | 86 +++++++++++++++++++++++++++++--
 tests/pmd.at                      | 43 ++++++++++++++++
 vswitchd/vswitch.xml              | 34 ++++++++++++
 6 files changed, 229 insertions(+), 11 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst 
b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..abc552029 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -312,4 +312,50 @@ reassignment due to PMD Auto Load Balance. For example, 
this could be set
 (in min) such that a reassignment is triggered at most every few hours.
 
+PMD Power Saving (Experimental)
+-------------------------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In order to
+reduce the CPU cycles they use, they can sleep for small periods of time
+when there is no load or very-low load from all the Rx queues they poll.
+
+This can be enabled by::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
+
+With this enabled a PMD may request to sleep by an incrementing amount of time
+up to a maximum time. If at any point the threshold of at least half a batch of
+packets (i.e. 16) is received from an Rx queue that the PMD is polling  is met,
+the sleep time will be reset to 0. (i.e. no sleep).
+
+The default maximum sleep time is set 250 us. A user can configure a new
+maximum requested sleep time in uS. e.g. to set to 1 ms::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep=1000
+
+Sleeping in a PMD thread will mean there is a period of time when the PMD
+thread will not process packets. Sleep times requested are not guaranteed
+and can differ significantly depending on system configuration. The actual
+time not processing packets will be determined by the sleep and processor
+wake-up times and should be tested with each system configuration.
+
+Sleep time statistics for 10 secs can be seen with::
+
+    $ ovs-appctl dpif-netdev/pmd-stats-clear \
+        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
+
+Example output, showing that the 16 packet no-sleep threshhold occurred in
+98.2% of busy iterations and there was an average sleep time of
+33 us per iteration::
+
+     No-sleep hit:             119043  ( 98.2 % of busy it)
+     Sleep time:              10638025 uS ( 33 us/it avg.)
+
+.. note::
+
+    If there is a sudden spike of packets while the PMD thread is sleeping and
+    the processor is in a low-power state it may result in some lost packets or
+    extra latency before the PMD thread returns to processing packets at full
+    rate.
+
 .. _ovs-vswitchd(8):
     http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index a2a7d8f0b..16445c68f 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct 
pmd_perf_stats *s,
     uint64_t idle_iter = s->pkts.bin[0];
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
+    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];
 
     ds_put_format(str,
@@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct 
pmd_perf_stats *s,
             "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
             "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
-            tot_iter, tot_cycles * us_per_cycle / tot_iter,
+            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - No-sleep hit:     %12"PRIu64"  (%5.1f %% of busy it)\n"
+            " Sleep time:          %12.0f  uS (%3.0f us/it avg.)\n",
+            tot_iter,
+            (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
             tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
             idle_iter,
             100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
             busy_iter,
-            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
+            tot_sleep_cycles * us_per_cycle,
+            tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
     if (rx_packets > 0) {
         ds_put_format(str,
@@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, bool no_sleep_hit,
+                       uint64_t sleep_cycles, bool full_metrics)
 {
     uint64_t now_tsc = cycles_counter_update(s);
@@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int 
rx_packets,
     char *reason = NULL;
 
-    cycles = now_tsc - s->start_tsc;
+    cycles = now_tsc - s->start_tsc - sleep_cycles;
     s->current.timestamp = s->iteration_cnt;
     s->current.cycles = cycles;
@@ -540,4 +549,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int 
rx_packets,
     histogram_add_sample(&s->pkts, rx_packets);
 
+    if (no_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
+    }
+    if (sleep_cycles) {
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
+    }
+
     if (!full_metrics) {
         return;
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 9673dddd8..94ffb16cf 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -81,4 +81,6 @@ enum pmd_stat_type {
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
     PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
+    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
     PMD_N_STATS
 };
@@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, bool no_sleep_hit,
+                       uint64_t sleep_cycles, bool full_metrics);
 
 /* Formatting the output of commands. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71c8..586abb58d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {
 #define PMD_RCU_QUIESCE_INTERVAL 10000LL
 
+/* Default max time in uS for a pmd thread to sleep based on load. */
+#define PMD_PWR_MAX_SLEEP 250
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
+/* Time in uS to increment a pmd thread sleep time. */
+#define PMD_PWR_INC 1
+
 struct dpcls {
     struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
@@ -278,4 +285,7 @@ struct dp_netdev {
     /* Enable collection of PMD performance metrics. */
     atomic_bool pmd_perf_metrics;
+    /* Enable PMD load based sleeping. */
+    atomic_bool pmd_powersave;
+    atomic_uint64_t pmd_max_sleep;
     /* Enable the SMC cache from ovsdb config */
     atomic_bool smc_enable_db;
@@ -4767,5 +4777,8 @@ dpif_netdev_set_config(struct dpif *dpif, const struct 
smap *other_config)
     uint8_t cur_rebalance_load;
     uint32_t rebalance_load, rebalance_improve;
+    uint64_t  pmd_max_sleep, cur_pmd_max_sleep;
+    bool powersave, cur_powersave;
     bool log_autolb = false;
+    bool log_pmdsleep = false;
     enum sched_assignment_type pmd_rxq_assign_type;
 
@@ -4915,4 +4928,28 @@ dpif_netdev_set_config(struct dpif *dpif, const struct 
smap *other_config)
 
     set_pmd_auto_lb(dp, autolb_state, log_autolb);
+
+    pmd_max_sleep = smap_get_ullong(other_config, "pmd-powersave-maxsleep",
+                                    PMD_PWR_MAX_SLEEP);
+    pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
+    atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
+    if (pmd_max_sleep != cur_pmd_max_sleep) {
+        atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
+        log_pmdsleep = true;
+    }
+
+    powersave = smap_get_bool(other_config, "pmd-powersave", false);
+    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
+    if (powersave != cur_powersave) {
+        atomic_store_relaxed(&dp->pmd_powersave, powersave);
+        log_pmdsleep = true;
+    }
+
+    if (log_pmdsleep) {
+        VLOG_INFO("PMD powersave max sleep request is %"PRIu64" usecs.",
+                  pmd_max_sleep);
+        VLOG_INFO("PMD powersave is %s.",
+                  powersave ? "enabled" : "disabled" );
+    }
+
     return 0;
 }
@@ -6873,7 +6910,9 @@ pmd_thread_main(void *f_)
     bool exiting;
     bool reload;
+    bool powersave;
     int poll_cnt;
     int i;
     int process_packets = 0;
+    uint64_t sleep_time = 0;
 
     poll_list = NULL;
@@ -6935,7 +6974,16 @@ reload:
     for (;;) {
         uint64_t rx_packets = 0, tx_packets = 0;
+        bool nosleep_hit = false;
+        uint64_t time_slept = 0;
+        uint64_t max_sleep;
 
         pmd_perf_start_iteration(s);
+        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
+        atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);
 
+        if (!powersave) {
+            /* Reset sleep_time as policy may have changed. */
+            sleep_time = 0;
+        }
         atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
 
@@ -6957,4 +7005,8 @@ reload:
                                            poll_list[i].port_no);
             rx_packets += process_packets;
+            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
+                nosleep_hit = true;
+                sleep_time = 0;
+            }
         }
 
@@ -6964,5 +7016,24 @@ reload:
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
+                                                   sleep_time ? true : false);
+        }
+
+        if (powersave) {
+            if (sleep_time) {
+                struct timespec ts_sleep;
+                struct cycle_timer sleep_timer;
+
+                nsec_to_timespec(sleep_time * 1000, &ts_sleep);
+                cycle_timer_start(&pmd->perf_stats, &sleep_timer);
+                nanosleep(&ts_sleep, NULL);
+                time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer);
+            }
+            if (sleep_time < max_sleep) {
+                /* Increase sleep time for next iteration. */
+                sleep_time += PMD_PWR_INC;
+            } else {
+                sleep_time = max_sleep;
+            }
         }
 
@@ -7004,6 +7075,6 @@ reload:
         }
 
-        pmd_perf_end_iteration(s, rx_packets, tx_packets,
-                               pmd_perf_metrics_enabled(pmd));
+        pmd_perf_end_iteration(s, rx_packets, tx_packets, nosleep_hit,
+                               time_slept, pmd_perf_metrics_enabled(pmd));
     }
     ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
@@ -9855,5 +9926,5 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread 
*pmd,
 {
     struct dpcls *cls;
-    uint64_t tot_idle = 0, tot_proc = 0;
+    uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
     unsigned int pmd_load = 0;
 
@@ -9872,8 +9943,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread 
*pmd,
             tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
+            tot_sleep = pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES] -
+                       pmd->prev_stats[PMD_PWR_SLEEP_CYCLES];
 
             if (pmd_alb->is_enabled && !pmd->isolated) {
                 if (tot_proc) {
-                    pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
+                    pmd_load = ((tot_proc * 100) /
+                                    (tot_idle + tot_proc + tot_sleep));
                 }
 
@@ -9892,4 +9966,6 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread 
*pmd,
         pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
                         pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
+        pmd->prev_stats[PMD_PWR_SLEEP_CYCLES] =
+                        pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES];
 
         /* Get the cycles that were used to process each queue and store. */
diff --git a/tests/pmd.at b/tests/pmd.at
index 10879a349..fb1d86793 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -1193,2 +1193,45 @@ ovs-appctl: ovs-vswitchd: server returned an error
 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+dnl Check default state
+AT_SETUP([PMD - pmd sleep])
+OVS_VSWITCHD_START
+OVS_WAIT_UNTIL([grep "PMD powersave is disabled." ovs-vswitchd.log])
+
+dnl Check can be enabled
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is 
enabled."])
+
+dnl Check can be disabled
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="false"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is 
disabled."])
+
+dnl Check default max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max 
sleep request is 250 usecs."])
+
+dnl Check low value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-powersave-maxsleep="1"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max 
sleep request is 1 usecs."])
+
+dnl Check high value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-powersave-maxsleep="10000"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max 
sleep request is 10000 usecs."])
+
+dnl Check setting max sleep to zero
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-powersave-maxsleep="0"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max 
sleep request is 0 usecs."])
+
+dnl Check above high value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-powersave-maxsleep="10001"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max 
sleep request is 10000 usecs."])
+
+OVS_VSWITCHD_STOP
+AT_CLEANUP
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 928821a82..3ab3560f2 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -789,4 +789,38 @@
         </p>
       </column>
+      <column name="other_config" key="pmd-powersave"
+              type='{"type": "boolean"}'>
+        <p>
+          Enables PMD thread requesting to sleep for values up to the
+          maximum PMD sleep time of
+          <ref column="other_config" key="pmd-powersave-maxsleep"/>) per
+          iteration. The request and value that will be used is based on the
+          number of packets available from the Rx queues that the PMD polls.
+        </p>
+        <p>
+          The default value is <code>false</code>.
+        </p>
+        <p>
+          Set this value to <code>true</code> to enable this option.
+        </p>
+        <p>
+          Sleep time for the PMD will be reset to 0 uS (i.e. no sleep) when
+          any Rx queue that it polls has 16 or more packets available for Rx.
+        </p>
+      </column>
+      <column name="other_config" key="pmd-powersave-maxsleep"
+              type='{"type": "integer",
+                     "minInteger": 0, "maxInteger": 10000}'>
+        <p>
+          Specifies the maximum sleep time that will be requested in
+          microseconds per iteration for a PMD with no or low load.
+        </p>
+        <p>
+          The default value is <code>250 microseconds</code>.
+        </p>
+        <p>
+          The maximum value is <code>10000 microseconds</code>.
+        </p>
+      </column>
       <column name="other_config" key="userspace-tso-enable"
               type='{"type": "boolean"}'>
-- 
2.38.1

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to