On 12/16/22 18:50, Kevin Traynor wrote:
> Sleep for an incremental amount of time if none of the Rx queues
> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
> on a polling iteration of the PMD.
>
> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
> sleep time to zero (i.e. no sleep).
>
> Sleep time will be increased by 1 uS on each iteration where
> the low load conditions remain up to a total of the max sleep
> time which has a default of 250 uS.
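In other words, the requested-sleep backoff described above boils down to something like the following standalone sketch.  The names and the helper-function form are illustrative only; the patch implements this logic inline in pmd_thread_main() using the PMD_PWR_* macros:

```c
#include <stdint.h>

#define MAX_BURST 32                     /* NETDEV_MAX_BURST in OVS. */
#define NO_SLEEP_THRESH (MAX_BURST / 2)  /* 16 pkts stops sleeping. */
#define SLEEP_INC_US 1                   /* Increment per iteration. */
#define MAX_SLEEP_US 250                 /* Default max sleep request. */

/* Returns the sleep time (in us) to request on the next iteration,
 * given the current request and the largest Rx burst seen on any
 * polled queue this iteration.  A burst of >= 16 pkts resets the
 * backoff to zero; otherwise the request grows by 1 us per iteration
 * up to the configured maximum. */
uint64_t
next_sleep_request(uint64_t cur_sleep_us, int max_rx_burst)
{
    if (max_rx_burst >= NO_SLEEP_THRESH) {
        /* Threshold hit: no sleep. */
        return 0;
    }
    return cur_sleep_us < MAX_SLEEP_US
           ? cur_sleep_us + SLEEP_INC_US : MAX_SLEEP_US;
}
```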
Hi, Kevin.  Thanks for the patch!  The feature seems interesting, at
least as an experimental feature for users to try out.  See some
comments below.

>
> The feature is off by default and can be enabled by:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
>
> The max sleep time per iteration can be set e.g. to set to 500 uS:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-maxsleep=500

Do we actually need two separate options for this?  What about dropping
the general 'pmd-powersave' option and only keeping the max sleep
configuration, with '0' being the default and meaning no sleep?  We may
recommend some value in the docs, but it will ultimately be the user's
decision.

We might also drop the 'powersave' part from the knob and just have
'pmd-max-sleep'.  But I have no strong opinion on this.

The single max sleep option can be extended in the future to accept a
list of 'core:value' pairs for fine-grained per-PMD control, if
necessary, without breaking backward compatibility.  But that is
probably not needed right now.

>
> Also add new stats to pmd-perf-show to get visibility of operation
> e.g.
>
> <snip>
>   - No-sleep hit:    36445  ( 98.4 % of busy it)
>     Sleep time:    3350902  uS ( 34 us/it avg.)
> <snip>
>
> Signed-off-by: Kevin Traynor <[email protected]>
>
> ---
> v2:
> - Updated to mark feature as experimental as there is still discussion
>   on it's operation and control knobs
> - Added pmd-powersave-maxsleep to set the max requested sleep time
> - Added unit tests for pmd-powersave and pmd-powersave-maxsleep config
>   knobs
> - Added docs to explain that requested sleep time and actual sleep time
>   may differ
> - Added actual measurement of sleep time instead of reporting requested
>   time
> - Removed Max sleep hit statistics
> - Added total sleep time statistic for the length of the measurement
>   period (avg.
>   uS per iteration still exists also)
> - Updated other statistics to account for sleep time
> - Some renaming
> - Replaced xnanosleep with nanosleep to avoid having to start/end
>   quiesce for every sleep (this may KO this feature on Windows)

Maybe convert the current xnanosleep() into a static
void xnanosleep__(uint64_t nanoseconds, bool need_to_quiesce)
and create two wrappers with true/false as arguments: xnanosleep() and
xnanosleep_no_quiesce()?  Or something like that?  I didn't test it,
but the current code might break the Windows build, and not only in
this particular function.

> - Limited max requested sleep to max PMD quiesce time (10 ms)
> - Adapted ALB measurement about whether a PMD is overloaded to account
>   for time spent sleeping
> ---
>  Documentation/topics/dpdk/pmd.rst | 46 +++++++++++++++++
>  lib/dpif-netdev-perf.c            | 26 ++++++++--
>  lib/dpif-netdev-perf.h            |  5 +-
>  lib/dpif-netdev.c                 | 86 +++++++++++++++++++++++++++++--
>  tests/pmd.at                      | 43 ++++++++++++++++
>  vswitchd/vswitch.xml              | 34 ++++++++++++
>  6 files changed, 229 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..abc552029 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,4 +312,50 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>  (in min) such that a reassignment is triggered at most every few hours.
>
> +PMD Power Saving (Experimental)
> +-------------------------------
> +
> +PMD threads constantly poll Rx queues which are assigned to them. In order to
> +reduce the CPU cycles they use, they can sleep for small periods of time
> +when there is no load or very-low load from all the Rx queues they poll.
> +
> +This can be enabled by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
> +
> +With this enabled a PMD may request to sleep by an incrementing amount of time
> +up to a maximum time.
> If at any point the threshold of at least half a batch of
> +packets (i.e. 16) is received from an Rx queue that the PMD is polling is met,
> +the sleep time will be reset to 0. (i.e. no sleep).
> +
> +The default maximum sleep time is set 250 us. A user can configure a new
> +maximum requested sleep time in uS. e.g. to set to 1 ms::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep=1000
> +
> +Sleeping in a PMD thread will mean there is a period of time when the PMD
> +thread will not process packets. Sleep times requested are not guaranteed
> +and can differ significantly depending on system configuration. The actual
> +time not processing packets will be determined by the sleep and processor
> +wake-up times and should be tested with each system configuration.
> +
> +Sleep time statistics for 10 secs can be seen with::
> +
> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
> +
> +Example output, showing that the 16 packet no-sleep threshhold occurred in
> +98.2% of busy iterations and there was an average sleep time of
> +33 us per iteration::
> +
> +   - No-sleep hit:   119043  ( 98.2 % of busy it)

I'm not sure the word 'hit' is appropriate here.  Maybe 'No-sleep
iterations'?  And the word 'it' is a bit confusing in this context; the
full word might be better for understanding.  It's fine to move the
alignment around in the output here.

> +     Sleep time:   10638025  uS ( 33 us/it avg.)

uS/us inconsistency.

> +
> +.. note::
> +
> +    If there is a sudden spike of packets while the PMD thread is sleeping and
> +    the processor is in a low-power state it may result in some lost packets or
> +    extra latency before the PMD thread returns to processing packets at full
> +    rate.
> +
> ..
> _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
>
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index a2a7d8f0b..16445c68f 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>      uint64_t idle_iter = s->pkts.bin[0];
>      uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
> +    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];
>
>      ds_put_format(str,
> @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>          " - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n"
>          " - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
> -        " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n",
> -        tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +        " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
> +        " - No-sleep hit:    %12"PRIu64" (%5.1f %% of busy it)\n"
> +        "   Sleep time:      %12.0f uS (%3.0f us/it avg.)\n",
> +        tot_iter,
> +        (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
>          tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
>          idle_iter, 100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>          busy_iter,
> -        100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +        100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +        no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
> +        tot_sleep_cycles * us_per_cycle,
> +        tot_iter ?
> (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
>      if (rx_packets > 0) {
>          ds_put_format(str,
> @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, bool no_sleep_hit,
> +                       uint64_t sleep_cycles, bool full_metrics)
>  {
>      uint64_t now_tsc = cycles_counter_update(s);
> @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      char *reason = NULL;
>
> -    cycles = now_tsc - s->start_tsc;
> +    cycles = now_tsc - s->start_tsc - sleep_cycles;
>      s->current.timestamp = s->iteration_cnt;
>      s->current.cycles = cycles;
> @@ -540,4 +549,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      histogram_add_sample(&s->pkts, rx_packets);
>
> +    if (no_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
> +    }
> +    if (sleep_cycles) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
> +    }
> +
>      if (!full_metrics) {
>          return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 9673dddd8..94ffb16cf 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,6 @@ enum pmd_stat_type {
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>      PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */

HIT/ITER - same naming concern as with the stats output above.

> +    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
>      PMD_N_STATS
> };
> @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, bool no_sleep_hit,
> +                       uint64_t sleep_cycles, bool full_metrics);
>
>  /* Formatting the output of commands.
> */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2c08a71c8..586abb58d 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {
>  #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>
> +/* Default max time in uS for a pmd thread to sleep based on load. */
> +#define PMD_PWR_MAX_SLEEP 250

We should have a _US suffix here.

> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)

Some spaces around the '/' would be nice.

> +/* Time in uS to increment a pmd thread sleep time. */
> +#define PMD_PWR_INC 1

And a _US suffix here, too.

> +
>  struct dpcls {
>      struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
> @@ -278,4 +285,7 @@ struct dp_netdev {
>      /* Enable collection of PMD performance metrics. */
>      atomic_bool pmd_perf_metrics;
> +    /* Enable PMD load based sleeping. */
> +    atomic_bool pmd_powersave;
> +    atomic_uint64_t pmd_max_sleep;
>      /* Enable the SMC cache from ovsdb config */
>      atomic_bool smc_enable_db;
> @@ -4767,5 +4777,8 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      uint8_t cur_rebalance_load;
>      uint32_t rebalance_load, rebalance_improve;
> +    uint64_t pmd_max_sleep, cur_pmd_max_sleep;
> +    bool powersave, cur_powersave;
>      bool log_autolb = false;
> +    bool log_pmdsleep = false;
>      enum sched_assignment_type pmd_rxq_assign_type;
>
> @@ -4915,4 +4928,28 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>
>      set_pmd_auto_lb(dp, autolb_state, log_autolb);
> +
> +    pmd_max_sleep = smap_get_ullong(other_config, "pmd-powersave-maxsleep",
> +                                    PMD_PWR_MAX_SLEEP);
> +    pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
> +    atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
> +    if (pmd_max_sleep != cur_pmd_max_sleep) {
> +        atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
> +        log_pmdsleep = true;
> +    }

We should probably enforce
the upper limit (10000 us).

> +
> +    powersave = smap_get_bool(other_config, "pmd-powersave", false);
> +    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
> +    if (powersave != cur_powersave) {
> +        atomic_store_relaxed(&dp->pmd_powersave, powersave);
> +        log_pmdsleep = true;
> +    }
> +
> +    if (log_pmdsleep) {
> +        VLOG_INFO("PMD powersave max sleep request is %"PRIu64" usecs.",
> +                  pmd_max_sleep);
> +        VLOG_INFO("PMD powersave is %s.",
> +                  powersave ? "enabled" : "disabled" );
> +    }
> +
>      return 0;
>  }
> @@ -6873,7 +6910,9 @@ pmd_thread_main(void *f_)
>      bool exiting;
>      bool reload;
> +    bool powersave;
>      int poll_cnt;
>      int i;
>      int process_packets = 0;
> +    uint64_t sleep_time = 0;
>
>      poll_list = NULL;
> @@ -6935,7 +6974,16 @@ reload:
>      for (;;) {
>          uint64_t rx_packets = 0, tx_packets = 0;
> +        bool nosleep_hit = false;
> +        uint64_t time_slept = 0;
> +        uint64_t max_sleep;
>
>          pmd_perf_start_iteration(s);
> +        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
> +        atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);
>
> +        if (!powersave) {
> +            /* Reset sleep_time as policy may have changed. */
> +            sleep_time = 0;
> +        }
>          atomic_read_relaxed(&pmd->dp->smc_enable_db,
>                              &pmd->ctx.smc_enable_db);
>
> @@ -6957,4 +7005,8 @@ reload:
>                                             poll_list[i].port_no);
>              rx_packets += process_packets;
> +            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
> +                nosleep_hit = true;
> +                sleep_time = 0;
> +            }
>          }
>
> @@ -6964,5 +7016,24 @@ reload:
>               * There was no time updates on current iteration. */
>              pmd_thread_ctx_time_update(pmd);
> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
> +                                                            sleep_time ?
> true : false);
> +    }
> +
> +    if (powersave) {
> +        if (sleep_time) {
> +            struct timespec ts_sleep;
> +            struct cycle_timer sleep_timer;
> +
> +            nsec_to_timespec(sleep_time * 1000, &ts_sleep);
> +            cycle_timer_start(&pmd->perf_stats, &sleep_timer);
> +            nanosleep(&ts_sleep, NULL);
> +            time_slept = cycle_timer_stop(&pmd->perf_stats,
> +                                          &sleep_timer);
> +        }
> +        if (sleep_time < max_sleep) {
> +            /* Increase sleep time for next iteration. */
> +            sleep_time += PMD_PWR_INC;
> +        } else {
> +            sleep_time = max_sleep;
> +        }
>      }
>
> @@ -7004,6 +7075,6 @@ reload:
>          }
>
> -        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> -                               pmd_perf_metrics_enabled(pmd));
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets, nosleep_hit,
> +                               time_slept, pmd_perf_metrics_enabled(pmd));
>      }
>      ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
> @@ -9855,5 +9926,5 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>  {
>      struct dpcls *cls;
> -    uint64_t tot_idle = 0, tot_proc = 0;
> +    uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
>      unsigned int pmd_load = 0;
>
> @@ -9872,8 +9943,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>      tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
>                 pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> +    tot_sleep = pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES] -
> +                    pmd->prev_stats[PMD_PWR_SLEEP_CYCLES];

Indentation is a bit off.

>
>      if (pmd_alb->is_enabled && !pmd->isolated) {
>          if (tot_proc) {
> -            pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
> +            pmd_load = ((tot_proc * 100) /
> +                        (tot_idle + tot_proc + tot_sleep));
>          }
>
> @@ -9892,4 +9966,6 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>      pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
>          pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> +    pmd->prev_stats[PMD_PWR_SLEEP_CYCLES] =
> +        pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES];
>
>      /* Get the cycles that were used to process each queue and store.
> */
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 10879a349..fb1d86793 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -1193,2 +1193,45 @@ ovs-appctl: ovs-vswitchd: server returned an error
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
> +
> +dnl Check default state
> +AT_SETUP([PMD - pmd sleep])
> +OVS_VSWITCHD_START
> +OVS_WAIT_UNTIL([grep "PMD powersave is disabled." ovs-vswitchd.log])
> +
> +dnl Check can be enabled
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is enabled."])
> +
> +dnl Check can be disabled
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="false"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is disabled."])
> +
> +dnl Check default max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 250 usecs."])
> +
> +dnl Check low value max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="1"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 1 usecs."])
> +
> +dnl Check high value max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="10000"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 10000 usecs."])
> +
> +dnl Check setting max sleep to zero
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch .
> other_config:pmd-powersave-maxsleep="0"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 0 usecs."])
> +
> +dnl Check above high value max sleep
> +get_log_next_line_num
> +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="10001"])
> +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 10000 usecs."])
> +
> +OVS_VSWITCHD_STOP
> +AT_CLEANUP
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 928821a82..3ab3560f2 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -789,4 +789,38 @@
>          </p>
>        </column>
> +      <column name="other_config" key="pmd-powersave"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Enables PMD thread requesting to sleep for values up to the
> +          maximum PMD sleep time of
> +          <ref column="other_config" key="pmd-powersave-maxsleep"/>) per
> +          iteration. The request and value that will be used is based on the
> +          number of packets available from the Rx queues that the PMD polls.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option.
> +        </p>
> +        <p>
> +          Sleep time for the PMD will be reset to 0 uS (i.e. no sleep) when
> +          any Rx queue that it polls has 16 or more packets available for Rx.
> +        </p>
> +      </column>
> +      <column name="other_config" key="pmd-powersave-maxsleep"
> +              type='{"type": "integer",
> +                     "minInteger": 0, "maxInteger": 10000}'>
> +        <p>
> +          Specifies the maximum sleep time that will be requested in
> +          microseconds per iteration for a PMD with no or low load.
> +        </p>
> +        <p>
> +          The default value is <code>250 microseconds</code>.
> +        </p>
> +        <p>
> +          The maximum value is <code>10000 microseconds</code>.
> +        </p>
> +      </column>
>        <column name="other_config" key="userspace-tso-enable"
>                type='{"type": "boolean"}'>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
