On 09/01/2023 15:23, David Marchand wrote:
On Fri, Jan 6, 2023 at 4:00 PM Kevin Traynor <[email protected]> wrote:

Sleep for an incremental amount of time if none of the Rx queues assigned to
a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling
iteration of the PMD. Upon detecting the threshold of >= 16 pkts on an Rxq,
reset the sleep time to zero (i.e. no sleep). Sleep time will be increased on
each iteration where the low-load conditions remain, up to the max sleep time
set by the user, e.g.:

    ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps will occur
and the default behaviour is unchanged from before.

Also add new stats to pmd-perf-show to get visibility of operation, e.g.:

    ...
    - sleep iterations:       153994  ( 76.8 % of iterations)
      Sleep time:            9159399  us ( 46 us/iteration avg.)
    ...

Signed-off-by: Kevin Traynor <[email protected]>
---
 Documentation/topics/dpdk/pmd.rst | 51 ++++++++++++++++++++++++
 NEWS                              |  3 ++
 lib/dpif-netdev-perf.c            | 24 +++++++++---
 lib/dpif-netdev-perf.h            |  5 ++-
 lib/dpif-netdev.c                 | 64 +++++++++++++++++++++++++++++--
 tests/pmd.at                      | 46 ++++++++++++++++++++++
 vswitchd/vswitch.xml              | 26 +++++++++++++
 7 files changed, 209 insertions(+), 10 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index 9006fd40f..89f6b3052 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance.
 For example, this could be set (in min) such that a reassignment is
 triggered at most every few hours.
+PMD Power Saving (Experimental)
+-------------------------------

I would stick to: "PMD load based sleeping". The power saving comes from some
external configuration that this patch does not cover.
Yes, you are right, I should have updated that title in v3.
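For readers following along, the back-off described in the commit message (grow the requested sleep in fixed steps while load stays low, reset to zero as soon as any polled Rx queue delivers at least half a batch) can be sketched as below. This is an illustrative standalone version under assumed names (`next_sleep_us`, `SLEEP_THRESH`, `SLEEP_INC_US`), not the patch's actual code:

```c
#include <stdint.h>

/* Illustrative constants mirroring the commit message: half a 32-packet
 * batch (16 pkts) stops sleeping; the sleep request grows in 10 us steps
 * up to the user-configured pmd-maxsleep value. */
#define SLEEP_THRESH 16   /* pkts */
#define SLEEP_INC_US 10   /* us */

/* Return the sleep time (us) to request on the next iteration, given the
 * current request, the configured maximum, and the largest number of
 * packets seen on any polled Rx queue in this iteration. */
static uint64_t
next_sleep_us(uint64_t cur_us, uint64_t max_us, int max_rxq_pkts)
{
    if (max_rxq_pkts >= SLEEP_THRESH) {
        return 0;                      /* Load detected: stop sleeping. */
    }
    if (cur_us + SLEEP_INC_US < max_us) {
        return cur_us + SLEEP_INC_US;  /* Back off a little further. */
    }
    return max_us;                     /* Cap at pmd-maxsleep. */
}
```

With `pmd-maxsleep=0` (the default) the function always returns 0, matching the documented "no sleeps occur" behaviour.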
Maybe you could mention something about c-states, but it seems out of OVS scope itself.
I gave it a quick mention so the reader is aware that there are external system configuration dependencies if they want to achieve some power saving, but I agree that how to enable C-states, etc. is out of scope for OVS.
"Any potential power saving from PMD load based sleeping is dependent on the system configuration (e.g. enabling processor C-states) and workloads."
+
+PMD threads constantly poll Rx queues which are assigned to them. In order to
+reduce the CPU cycles they use, they can sleep for small periods of time
+when there is no load or very-low load on all the Rx queues they poll.
+
+This can be enabled by setting the max requested sleep time (in microseconds)
+for a PMD thread::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
+
+Non-zero values will be rounded up to the nearest 10 microseconds to avoid
+requesting very small sleep times.
+
+With a non-zero max value a PMD may request to sleep by an incrementing amount
+of time up to the maximum time. If at any point at least half a batch of
+packets (i.e. 16) is received from an Rx queue that the PMD is polling, the
+requested sleep time will be reset to 0. At that point no sleeps will occur
+until the no/low load conditions return.
+
+Sleeping in a PMD thread will mean there is a period of time when the PMD
+thread will not process packets. Sleep times requested are not guaranteed
+and can differ significantly depending on system configuration. The actual
+time not processing packets will be determined by the sleep and processor
+wake-up times and should be tested with each system configuration.
+
+Sleep time statistics for the last 10 seconds can be seen with::
+
+    $ ovs-appctl dpif-netdev/pmd-stats-clear \
+        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
+
+Example output, showing that during the last 10 seconds, 76.8% of iterations
+had a sleep of some length. The total amount of sleep time was 9.15 seconds and
+the average sleep time per iteration was 46 microseconds::
+
+   - sleep iterations:       153994  ( 76.8 % of iterations)
+     Sleep time:            9159399  us ( 46 us/iteration avg.)
+
+.. note::
+
+    If there is a sudden spike of packets while the PMD thread is sleeping and
+    the processor is in a low-power state it may result in some lost packets or
+    extra latency before the PMD thread returns to processing packets at full
+    rate.
+
+.. note::
+
+    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
+    will add overhead to requested sleep time.
+
 .. _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html

diff --git a/NEWS b/NEWS
index 2f6ededfe..54d97825e 100644
--- a/NEWS
+++ b/NEWS
@@ -31,4 +31,7 @@ Post-v3.0.0
     * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
       the pmd usage of an Rx queue over a configurable time period.
+    * Add new experiemental PMD load based sleeping feature. PMD threads can

*experimental
Fixed
+      request to sleep up to a user configured 'pmd-maxsleep' value under no
+      and low load conditions.

diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index a2a7d8f0b..bc6b779a7 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
     uint64_t idle_iter = s->pkts.bin[0];
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+    uint64_t sleep_iter = stats[PMD_PWR_SLEEP_ITER];
+    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];

I would remove _PWR_.
Removed it. I also changed to 'PMD_CYCLES_SLEEP' as the other PMD cycles counters use the 'PMD_CYCLES_' format
     ds_put_format(str,
@@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
               "  - Used TSC cycles: %12"PRIu64"  (%5.1f %% of total cycles)\n"
               "  - idle iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n"
-              "  - busy iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n",
-              tot_iter, tot_cycles * us_per_cycle / tot_iter,
+              "  - busy iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n"
+              "  - sleep iterations: %12"PRIu64"  (%5.1f %% of iterations)\n"
+              "    Sleep time: %12.0f us (%3.0f us/iteration avg.)\n",

This gives:

pmd thread numa_id 1 core_id 5:
  Iterations:               884937  (361.43 us/it)
  - Used TSC cycles:   24829488529  (  1.2 % of total cycles)
  - idle iterations:        563472  (  1.4 % of used cycles)
  - busy iterations:        321465  ( 98.6 % of used cycles)
  - sleep iterations:       569487  ( 64.4 % of iterations)
    Sleep time:        310297274 us (351 us/iteration avg.)
    ^^^

I would add another space before Sleep so it aligns with the rest. And maybe put the unit as a comment, since no other stat detailed its unit so far:

+              "     Sleep time (us): %12.0f  (%3.0f us/iteration avg.)\n",
Reformatted to match this suggestion.
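As a side note for readers, the two printed values come straight from the counters in the quoted hunk. A hypothetical standalone recomputation (the function names `sleep_iter_pct` and `sleep_us_per_iter` are illustrative, not from the patch) could look like:

```c
#include <stdint.h>

/* Percentage of iterations in which a sleep took place. */
static double
sleep_iter_pct(uint64_t sleep_iter, uint64_t tot_iter)
{
    return tot_iter ? 100.0 * sleep_iter / tot_iter : 0;
}

/* Average sleep time per iteration in microseconds, converting the
 * slept TSC cycles with the TSC frequency (tsc_hz). */
static double
sleep_us_per_iter(uint64_t tot_sleep_cycles, uint64_t tot_iter,
                  double tsc_hz)
{
    double us_per_cycle = 1000000.0 / tsc_hz;
    return tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0;
}
```

The `tot_iter ? ... : 0` guards mirror the ones in the patch, so a freshly cleared stats set does not divide by zero.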
  Rx packets:             10000000  (13 Kpps, 2447 cycles/pkt)
  Datapath passes:        10000000  (1.00 passes/pkt)
  - PHWOL hits:                  0  ( 0.0 %)
  - MFEX Opt hits:               0  ( 0.0 %)

+              tot_iter,
+              (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
               tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
               idle_iter, 100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
               busy_iter,
-              100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+              100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+              sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
+              tot_sleep_cycles * us_per_cycle,
+              tot_iter ? (tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
     if (rx_packets > 0) {
         ds_put_format(str,
@@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, uint64_t sleep_cycles,
+                       bool full_metrics)
 {
     uint64_t now_tsc = cycles_counter_update(s);
@@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     char *reason = NULL;
-    cycles = now_tsc - s->start_tsc;
+    cycles = now_tsc - s->start_tsc - sleep_cycles;
     s->current.timestamp = s->iteration_cnt;
     s->current.cycles = cycles;
@@ -540,4 +549,9 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     histogram_add_sample(&s->pkts, rx_packets);
+    if (sleep_cycles) {
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_ITER, 1);
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
+    }
+
     if (!full_metrics) {
         return;

diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 9673dddd8..ebf776827 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -81,4 +81,6 @@ enum pmd_stat_type {
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
     PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_PWR_SLEEP_ITER,     /* Iterations where a sleep has taken place. */
+    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
     PMD_N_STATS
};
@@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, uint64_t sleep_cycles,
+                       bool full_metrics);

 /* Formatting the output of commands. */

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 7127068fe..af97f9a83 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -172,4 +172,9 @@ static struct odp_support dp_netdev_support = {
 #define PMD_RCU_QUIESCE_INTERVAL 10000LL
+
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
+/* Time in uS to increment a pmd thread sleep time. */
+#define PMD_PWR_INC_US 10

Idem, no _PWR_.
I removed the _PWR_, but PMD_INC_US looked a bit ambiguous, so I changed them to both have a 'PMD_SLEEP_' prefix, and then '_THRESH' and '_INC_US'.
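For completeness, the "rounded up to the nearest 10 microseconds" behaviour that the documentation hunk describes amounts to a round-up to the next multiple of the sleep increment. A hypothetical standalone helper (`round_sleep_req` is an illustrative name, not the patch's code):

```c
#include <stdint.h>

#define SLEEP_INC_US 10   /* Sleep increment from the docs, in us. */

/* Round a requested pmd-maxsleep value up to the next multiple of the
 * 10 us sleep increment; 0 keeps sleeping disabled. */
static uint64_t
round_sleep_req(uint64_t req_us)
{
    if (!req_us) {
        return 0;
    }
    return (req_us + SLEEP_INC_US - 1) / SLEEP_INC_US * SLEEP_INC_US;
}
```

So a configured value of 1 us becomes 10 us, while 500 us stays 500 us, avoiding requests for sleep times too small to be meaningful.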
+
 struct dpcls {
     struct cmap_node node;  /* Within dp_netdev_pmd_thread.classifiers */

The rest lgtm. With this fixed, you can add:

Reviewed-by: David Marchand <[email protected]>
Thanks again for reviewing and trying it out. The changes in v4 are probably small enough to keep your Reviewed-by, but maybe it's better to ask you to resend it to confirm. The only functional change is updating the pmd ctx timer after the sleep.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
