Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling iteration of the PMD.
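The back-off policy this describes (increment the requested sleep while load stays low, reset to zero once the 16-packet threshold is hit, cap at the configured maximum) can be sketched as a small standalone helper. This is an illustrative sketch with a hypothetical function name, not code from the patch, which implements the same logic inline in pmd_thread_main():

```c
#include <stdint.h>

/* Constants mirroring the patch defaults. */
#define PMD_PWR_INC 1                 /* uS added per low-load iteration. */
#define PMD_PWR_NO_SLEEP_THRESH 16    /* Rx pkts that reset sleep to zero. */

/* Hypothetical helper: given the packets received from an Rx queue this
 * iteration, the current requested sleep time, and the configured maximum,
 * return the sleep time (in uS) to request on the next iteration. */
static uint64_t
next_sleep_time(int rx_packets, uint64_t sleep_time, uint64_t max_sleep)
{
    if (rx_packets >= PMD_PWR_NO_SLEEP_THRESH) {
        /* Load detected: stop sleeping entirely. */
        return 0;
    }
    if (sleep_time < max_sleep) {
        /* Low load continues: back off by another increment. */
        return sleep_time + PMD_PWR_INC;
    }
    /* Already at the cap (e.g. the 250 uS default). */
    return max_sleep;
}
```

With the default cap of 250 uS, a fully idle PMD ramps from 0 to the cap over 250 iterations and stays there until traffic arrives.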
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the sleep time to zero (i.e. no sleep). Sleep time will be increased by 1 uS on each iteration where the low load conditions remain, up to the max sleep time, which has a default of 250 uS. The feature is off by default and can be enabled by: ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true The max sleep time per iteration can be set; e.g. to set it to 500 uS: ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-maxsleep=500 Also add new stats to pmd-perf-show to get visibility of operation, e.g. <snip> - No-sleep hit: 36445 ( 98.4 % of busy it) Sleep time: 3350902 uS ( 34 us/it avg.) <snip> Signed-off-by: Kevin Traynor <ktray...@redhat.com> --- v2: - Updated to mark feature as experimental as there is still discussion on its operation and control knobs - Added pmd-powersave-maxsleep to set the max requested sleep time - Added unit tests for pmd-powersave and pmd-powersave-maxsleep config knobs - Added docs to explain that requested sleep time and actual sleep time may differ - Added actual measurement of sleep time instead of reporting requested time - Removed Max sleep hit statistics - Added total sleep time statistic for the length of the measurement period (avg. 
uS per iteration statistic still exists) - Updated other statistics to account for sleep time - Some renaming - Replaced xnanosleep with nanosleep to avoid having to start/end quiesce for every sleep (this may KO this feature on Windows) - Limited max requested sleep to max PMD quiesce time (10 ms) - Adapted ALB measurement about whether a PMD is overloaded to account for time spent sleeping --- Documentation/topics/dpdk/pmd.rst | 46 +++++++++++++++++ lib/dpif-netdev-perf.c | 26 ++++++++-- lib/dpif-netdev-perf.h | 5 +- lib/dpif-netdev.c | 86 +++++++++++++++++++++++++++++-- tests/pmd.at | 43 ++++++++++++++++ vswitchd/vswitch.xml | 34 ++++++++++++ 6 files changed, 229 insertions(+), 11 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index b259cc8b3..abc552029 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -312,4 +312,50 @@ reassignment due to PMD Auto Load Balance. For example, this could be set (in min) such that a reassignment is triggered at most every few hours. +PMD Power Saving (Experimental) +------------------------------- + +PMD threads constantly poll Rx queues which are assigned to them. In order to +reduce the CPU cycles they use, they can sleep for small periods of time +when there is no load or very-low load from all the Rx queues they poll. + +This can be enabled by:: + + $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true" + +With this enabled a PMD may request to sleep by an incrementing amount of time +up to a maximum time. If at any point the threshold of at least half a batch of +packets (i.e. 16) received from an Rx queue that the PMD is polling is met, +the sleep time will be reset to 0 (i.e. no sleep). + +The default maximum sleep time is set to 250 us. A user can configure a new +maximum requested sleep time in uS, e.g. to set it to 1 ms:: + + $ ovs-vsctl set open_vswitch . 
other_config:pmd-powersave-maxsleep=1000 + +Sleeping in a PMD thread will mean there is a period of time when the PMD +thread will not process packets. Sleep times requested are not guaranteed +and can differ significantly depending on system configuration. The actual +time not processing packets will be determined by the sleep and processor +wake-up times and should be tested with each system configuration. + +Sleep time statistics for 10 secs can be seen with:: + + $ ovs-appctl dpif-netdev/pmd-stats-clear \ + && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show + +Example output, showing that the 16 packet no-sleep threshold occurred in +98.2% of busy iterations and there was an average sleep time of +33 us per iteration:: + + No-sleep hit: 119043 ( 98.2 % of busy it) + Sleep time: 10638025 uS ( 33 us/it avg.) + +.. note:: + + If there is a sudden spike of packets while the PMD thread is sleeping and + the processor is in a low-power state it may result in some lost packets or + extra latency before the PMD thread returns to processing packets at full + rate. + .. _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index a2a7d8f0b..16445c68f 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s, uint64_t idle_iter = s->pkts.bin[0]; uint64_t busy_iter = tot_iter >= idle_iter ? 
tot_iter - idle_iter : 0; + uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT]; + uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES]; ds_put_format(str, @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s, " - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n" " - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n" - " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n", - tot_iter, tot_cycles * us_per_cycle / tot_iter, + " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n" + " - No-sleep hit: %12"PRIu64" (%5.1f %% of busy it)\n" + " Sleep time: %12.0f uS (%3.0f us/it avg.)\n", + tot_iter, + (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter, tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz, idle_iter, 100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles, busy_iter, - 100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles); + 100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles, + no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0, + tot_sleep_cycles * us_per_cycle, + tot_iter ? 
(tot_sleep_cycles * us_per_cycle) / tot_iter : 0); if (rx_packets > 0) { ds_put_format(str, @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex) void pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, - int tx_packets, bool full_metrics) + int tx_packets, bool no_sleep_hit, + uint64_t sleep_cycles, bool full_metrics) { uint64_t now_tsc = cycles_counter_update(s); @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, char *reason = NULL; - cycles = now_tsc - s->start_tsc; + cycles = now_tsc - s->start_tsc - sleep_cycles; s->current.timestamp = s->iteration_cnt; s->current.cycles = cycles; @@ -540,4 +549,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, histogram_add_sample(&s->pkts, rx_packets); + if (no_sleep_hit) { + pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1); + } + if (sleep_cycles) { + pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles); + } + if (!full_metrics) { return; diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 9673dddd8..94ffb16cf 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -81,4 +81,6 @@ enum pmd_stat_type { PMD_CYCLES_ITER_BUSY, /* Cycles spent in busy iterations. */ PMD_CYCLES_UPCALL, /* Cycles spent processing upcalls. */ + PMD_PWR_NO_SLEEP_HIT, /* Iterations with Rx above no-sleep thresh. */ + PMD_PWR_SLEEP_CYCLES, /* Total cycles slept to save power. */ PMD_N_STATS }; @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s); void pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, - int tx_packets, bool full_metrics); + int tx_packets, bool no_sleep_hit, + uint64_t sleep_cycles, bool full_metrics); /* Formatting the output of commands. 
*/ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 2c08a71c8..586abb58d 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = { #define PMD_RCU_QUIESCE_INTERVAL 10000LL +/* Default max time in uS for a pmd thread to sleep based on load. */ +#define PMD_PWR_MAX_SLEEP 250 +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */ +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2) +/* Time in uS to increment a pmd thread sleep time. */ +#define PMD_PWR_INC 1 + struct dpcls { struct cmap_node node; /* Within dp_netdev_pmd_thread.classifiers */ @@ -278,4 +285,7 @@ struct dp_netdev { /* Enable collection of PMD performance metrics. */ atomic_bool pmd_perf_metrics; + /* Enable PMD load based sleeping. */ + atomic_bool pmd_powersave; + atomic_uint64_t pmd_max_sleep; /* Enable the SMC cache from ovsdb config */ atomic_bool smc_enable_db; @@ -4767,5 +4777,8 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) uint8_t cur_rebalance_load; uint32_t rebalance_load, rebalance_improve; + uint64_t pmd_max_sleep, cur_pmd_max_sleep; + bool powersave, cur_powersave; bool log_autolb = false; + bool log_pmdsleep = false; enum sched_assignment_type pmd_rxq_assign_type; @@ -4915,4 +4928,28 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) set_pmd_auto_lb(dp, autolb_state, log_autolb); + + pmd_max_sleep = smap_get_ullong(other_config, "pmd-powersave-maxsleep", + PMD_PWR_MAX_SLEEP); + pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep); + atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep); + if (pmd_max_sleep != cur_pmd_max_sleep) { + atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep); + log_pmdsleep = true; + } + + powersave = smap_get_bool(other_config, "pmd-powersave", false); + atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave); + if (powersave != cur_powersave) { + atomic_store_relaxed(&dp->pmd_powersave, 
powersave); + log_pmdsleep = true; + } + + if (log_pmdsleep) { + VLOG_INFO("PMD powersave max sleep request is %"PRIu64" usecs.", + pmd_max_sleep); + VLOG_INFO("PMD powersave is %s.", + powersave ? "enabled" : "disabled" ); + } + return 0; } @@ -6873,7 +6910,9 @@ pmd_thread_main(void *f_) bool exiting; bool reload; + bool powersave; int poll_cnt; int i; int process_packets = 0; + uint64_t sleep_time = 0; poll_list = NULL; @@ -6935,7 +6974,16 @@ reload: for (;;) { uint64_t rx_packets = 0, tx_packets = 0; + bool nosleep_hit = false; + uint64_t time_slept = 0; + uint64_t max_sleep; pmd_perf_start_iteration(s); + atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave); + atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep); + if (!powersave) { + /* Reset sleep_time as policy may have changed. */ + sleep_time = 0; + } atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db); @@ -6957,4 +7005,8 @@ reload: poll_list[i].port_no); rx_packets += process_packets; + if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) { + nosleep_hit = true; + sleep_time = 0; + } } @@ -6964,5 +7016,24 @@ reload: * There was no time updates on current iteration. */ pmd_thread_ctx_time_update(pmd); - tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false); + tx_packets = dp_netdev_pmd_flush_output_packets(pmd, + sleep_time ? true : false); + } + + if (powersave) { + if (sleep_time) { + struct timespec ts_sleep; + struct cycle_timer sleep_timer; + + nsec_to_timespec(sleep_time * 1000, &ts_sleep); + cycle_timer_start(&pmd->perf_stats, &sleep_timer); + nanosleep(&ts_sleep, NULL); + time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer); + } + if (sleep_time < max_sleep) { + /* Increase sleep time for next iteration. 
*/ + sleep_time += PMD_PWR_INC; + } else { + sleep_time = max_sleep; + } } @@ -7004,6 +7075,6 @@ reload: } - pmd_perf_end_iteration(s, rx_packets, tx_packets, - pmd_perf_metrics_enabled(pmd)); + pmd_perf_end_iteration(s, rx_packets, tx_packets, nosleep_hit, + time_slept, pmd_perf_metrics_enabled(pmd)); } ovs_mutex_unlock(&pmd->perf_stats.stats_mutex); @@ -9855,5 +9926,5 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, { struct dpcls *cls; - uint64_t tot_idle = 0, tot_proc = 0; + uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0; unsigned int pmd_load = 0; @@ -9872,8 +9943,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] - pmd->prev_stats[PMD_CYCLES_ITER_BUSY]; + tot_sleep = pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES] - + pmd->prev_stats[PMD_PWR_SLEEP_CYCLES]; if (pmd_alb->is_enabled && !pmd->isolated) { if (tot_proc) { - pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc)); + pmd_load = ((tot_proc * 100) / + (tot_idle + tot_proc + tot_sleep)); } @@ -9892,4 +9966,6 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, pmd->prev_stats[PMD_CYCLES_ITER_BUSY] = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY]; + pmd->prev_stats[PMD_PWR_SLEEP_CYCLES] = + pmd->perf_stats.counters.n[PMD_PWR_SLEEP_CYCLES]; /* Get the cycles that were used to process each queue and store. */ diff --git a/tests/pmd.at b/tests/pmd.at index 10879a349..fb1d86793 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -1193,2 +1193,45 @@ ovs-appctl: ovs-vswitchd: server returned an error OVS_VSWITCHD_STOP AT_CLEANUP + +dnl Check default state +AT_SETUP([PMD - pmd sleep]) +OVS_VSWITCHD_START +OVS_WAIT_UNTIL([grep "PMD powersave is disabled." ovs-vswitchd.log]) + +dnl Check can be enabled +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-powersave="true"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is enabled."]) + +dnl Check can be disabled +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="false"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave is disabled."]) + +dnl Check default max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 250 usecs."]) + +dnl Check low value max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="1"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 1 usecs."]) + +dnl Check high value max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="10000"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 10000 usecs."]) + +dnl Check setting max sleep to zero +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep="0"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 0 usecs."]) + +dnl Check above high value max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-powersave-maxsleep="10001"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD powersave max sleep request is 10000 usecs."]) + +OVS_VSWITCHD_STOP +AT_CLEANUP diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 928821a82..3ab3560f2 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -789,4 +789,38 @@ </p> </column> + <column name="other_config" key="pmd-powersave" + type='{"type": "boolean"}'> + <p> + Enables a PMD thread to request to sleep for values up to the + maximum PMD sleep time of + <ref column="other_config" key="pmd-powersave-maxsleep"/> per + iteration. The sleep value requested is based on the + number of packets available from the Rx queues that the PMD polls. + </p> + <p> + The default value is <code>false</code>. + </p> + <p> + Set this value to <code>true</code> to enable this option. + </p> + <p> + Sleep time for the PMD will be reset to 0 uS (i.e. no sleep) when + any Rx queue that it polls has 16 or more packets available for Rx. + </p> + </column> + <column name="other_config" key="pmd-powersave-maxsleep" + type='{"type": "integer", + "minInteger": 0, "maxInteger": 10000}'> + <p> + Specifies the maximum sleep time that will be requested in + microseconds per iteration for a PMD with no or low load. + </p> + <p> + The default value is <code>250 microseconds</code>. + </p> + <p> + The maximum value is <code>10000 microseconds</code>. + </p> + </column> <column name="other_config" key="userspace-tso-enable" type='{"type": "boolean"}'> -- 2.38.1 _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev