On Fri, Jan 6, 2023 at 4:00 PM Kevin Traynor <[email protected]> wrote:
>
> Sleep for an incremental amount of time if none of the Rx queues
> assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
> on a polling iteration of the PMD.
>
> Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
> sleep time to zero (i.e. no sleep).
>
> Sleep time will be increased on each iteration where the low load
> conditions remain, up to a total of the max sleep time which is set
> by the user, e.g.:
>
>     ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
>
> The default pmd-maxsleep value is 0, which means that no sleeps
> will occur and the default behaviour is unchanged from previously.
>
> Also add new stats to pmd-perf-show to get visibility of operation,
> e.g.:
>
>     ...
>     - sleep iterations:       153994  ( 76.8 % of iterations)
>       Sleep time:          9159399 us (  46 us/iteration avg.)
>     ...
>
> Signed-off-by: Kevin Traynor <[email protected]>
> ---
>  Documentation/topics/dpdk/pmd.rst | 51 ++++++++++++++++++++++++
>  NEWS                              |  3 ++
>  lib/dpif-netdev-perf.c            | 24 +++++++++---
>  lib/dpif-netdev-perf.h            |  5 ++-
>  lib/dpif-netdev.c                 | 64 +++++++++++++++++++++++++++++--
>  tests/pmd.at                      | 46 ++++++++++++++++++++++
>  vswitchd/vswitch.xml              | 26 +++++++++++++
>  7 files changed, 209 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 9006fd40f..89f6b3052 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -325,4 +325,55 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
>  (in min) such that a reassignment is triggered at most every few hours.
>
> +PMD Power Saving (Experimental)
> +-------------------------------
I would stick to: "PMD load based sleeping". The power saving comes from
some external configuration that this patch does not cover. Maybe you
could mention something about C-states, but that seems out of scope for
OVS itself.

> +
> +PMD threads constantly poll Rx queues which are assigned to them. In order to
> +reduce the CPU cycles they use, they can sleep for small periods of time
> +when there is no load or very-low load on all the Rx queues they poll.
> +
> +This can be enabled by setting the max requested sleep time (in microseconds)
> +for a PMD thread::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
> +
> +Non-zero values will be rounded up to the nearest 10 microseconds to avoid
> +requesting very small sleep times.
> +
> +With a non-zero max value a PMD may request to sleep by an incrementing amount
> +of time up to the maximum time. If at any point the threshold of at least half
> +a batch of packets (i.e. 16) received from an Rx queue that the PMD is
> +polling is met, the requested sleep time will be reset to 0. At that point no
> +sleeps will occur until the no/low load conditions return.
> +
> +Sleeping in a PMD thread will mean there is a period of time when the PMD
> +thread will not process packets. Sleep times requested are not guaranteed
> +and can differ significantly depending on system configuration. The actual
> +time not processing packets will be determined by the sleep and processor
> +wake-up times and should be tested with each system configuration.
> +
> +Sleep time statistics for 10 secs can be seen with::
> +
> +    $ ovs-appctl dpif-netdev/pmd-stats-clear \
> +        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
> +
> +Example output, showing that during the last 10 seconds, 76.8% of iterations
> +had a sleep of some length.
The total amount of sleep time was 9.15 seconds and
> +the average sleep time per iteration was 46 microseconds::
> +
> +    - sleep iterations:       153994  ( 76.8 % of iterations)
> +      Sleep time:          9159399 us (  46 us/iteration avg.)
> +
> +.. note::
> +
> +    If there is a sudden spike of packets while the PMD thread is sleeping and
> +    the processor is in a low-power state it may result in some lost packets or
> +    extra latency before the PMD thread returns to processing packets at full
> +    rate.
> +
> +.. note::
> +
> +    Default Linux kernel hrtimer resolution is set to 50 microseconds so this
> +    will add overhead to requested sleep time.
> +
>  .. _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
> diff --git a/NEWS b/NEWS
> index 2f6ededfe..54d97825e 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -31,4 +31,7 @@ Post-v3.0.0
>     * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
>       the pmd usage of an Rx queue over a configurable time period.
> +   * Add new experiemental PMD load based sleeping feature. PMD threads can

*experimental

> +     request to sleep up to a user configured 'pmd-maxsleep' value under no
> +     and low load conditions.
>
>
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index a2a7d8f0b..bc6b779a7 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,6 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>      uint64_t idle_iter = s->pkts.bin[0];
>      uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t sleep_iter = stats[PMD_PWR_SLEEP_ITER];
> +    uint64_t tot_sleep_cycles = stats[PMD_PWR_SLEEP_CYCLES];

I would remove _PWR_.
>
>      ds_put_format(str,
> @@ -236,11 +238,17 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>          " - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n"
>          " - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
> -        " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n",
> -        tot_iter, tot_cycles * us_per_cycle / tot_iter,
> +        " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
> +        " - sleep iterations: %12"PRIu64" (%5.1f %% of iterations)\n"
> +        " Sleep time: %12.0f us (%3.0f us/iteration avg.)\n",

This gives:

pmd thread numa_id 1 core_id 5:

  Iterations:               884937  (361.43 us/it)
  - Used TSC cycles:   24829488529  (  1.2 % of total cycles)
  - idle iterations:        563472  (  1.4 % of used cycles)
  - busy iterations:        321465  ( 98.6 % of used cycles)
  - sleep iterations:       569487  ( 64.4 % of iterations)
  Sleep time:            310297274 us (351 us/iteration avg.)
  ^^^

I would add another space before Sleep so it aligns with the rest.
And maybe put the unit as a comment, since no other stat detailed its
unit so far:

+        "   Sleep time (us): %12.0f (%3.0f us/iteration avg.)\n",

  Rx packets:             10000000  (13 Kpps, 2447 cycles/pkt)
  Datapath passes:        10000000  (1.00 passes/pkt)
  - PHWOL hits:                  0  ( 0.0 %)
  - MFEX Opt hits:               0  ( 0.0 %)
"""

> +        tot_iter,
> +        (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
>          tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
>          idle_iter,
>          100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>          busy_iter,
> -        100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +        100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +        sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
> +        tot_sleep_cycles * us_per_cycle,
> +        tot_iter ?
(tot_sleep_cycles * us_per_cycle) / tot_iter : 0);
>      if (rx_packets > 0) {
>          ds_put_format(str,
> @@ -519,5 +527,6 @@ OVS_REQUIRES(s->stats_mutex)
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, uint64_t sleep_cycles,
> +                       bool full_metrics)
>  {
>      uint64_t now_tsc = cycles_counter_update(s);
> @@ -526,5 +535,5 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      char *reason = NULL;
>
> -    cycles = now_tsc - s->start_tsc;
> +    cycles = now_tsc - s->start_tsc - sleep_cycles;
>      s->current.timestamp = s->iteration_cnt;
>      s->current.cycles = cycles;
> @@ -540,4 +549,9 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>      histogram_add_sample(&s->pkts, rx_packets);
>
> +    if (sleep_cycles) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_ITER, 1);
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_CYCLES, sleep_cycles);
> +    }
> +
>      if (!full_metrics) {
>          return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 9673dddd8..ebf776827 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,6 @@ enum pmd_stat_type {
>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>      PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_SLEEP_ITER,     /* Iterations where a sleep has taken place. */
> +    PMD_PWR_SLEEP_CYCLES,   /* Total cycles slept to save power. */
>      PMD_N_STATS
>  };
> @@ -409,5 +411,6 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>  void
>  pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, uint64_t sleep_cycles,
> +                       bool full_metrics);
>
>  /* Formatting the output of commands.
*/
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7127068fe..af97f9a83 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -172,4 +172,9 @@ static struct odp_support dp_netdev_support = {
>  #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>
> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
> +/* Time in uS to increment a pmd thread sleep time. */
> +#define PMD_PWR_INC_US 10

Idem, no _PWR_.

> +
>  struct dpcls {
>      struct cmap_node node;  /* Within dp_netdev_pmd_thread.classifiers */

The rest lgtm. With this fixed, you can add:
Reviewed-by: David Marchand <[email protected]>


-- 
David Marchand

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
