On 15/12/2022 01:12, Thilak Raj Surendra Babu wrote:
Hi Kevin,
Apologies for the late reply.
Please find my replies inline.


Hi Thilak,

Thanks for your comments. Replies below.

Just a small thing on email format. If possible, could you send replies inline (i.e. not top-posted, with the original mail lines prefixed with ">"), or if not, mark your comments with your name? This is just to make it easier for anyone else reading the thread to know who is saying what!

thanks,
Kevin.

Thanks
Thilak Raj S

-----Original Message-----
From: Kevin Traynor <[email protected]>
Sent: 08 December 2022 06:39
To: Thilak Raj Surendra Babu <[email protected]>; [email protected]
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH] dpif-netdev: Load based PMD sleeping.

On 07/12/2022 09:28, Thilak Raj Surendra Babu wrote:
Hi Kevin,
Thank you for the patch. It is a pretty clean approach to power saving.

Hi Thilak, Thank you for testing.

I tried the patch; below is what I see.

Ping test:
Ping from a physical machine to the br0 IP address.

Poll:
[root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.061 ms
64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.050 ms
64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.049 ms
64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.048 ms
64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.048 ms
64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.047 ms
64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.047 ms
64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.047 ms
64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.049 ms

--- 10.15.251.33 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9ms
rtt min/avg/max/mdev = 0.047/0.052/0.081/0.013 ms

With your changes:
[root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.189 ms
64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.367 ms
64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.351 ms
64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.315 ms
64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.240 ms
64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.204 ms
64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.204 ms
64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.138 ms
64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.137 ms
64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.417 ms

--- 10.15.251.33 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9ms
rtt min/avg/max/mdev = 0.137/0.256/0.417/0.094 ms

At a high level, I was thinking of the points below.
Please let me know what you think.

1. Can we have the option of making a particular PMD thread sleep, instead of this being a global setting?
There are some interfaces whose flows are very sensitive to latency (e.g. heartbeats).

Having only a per-PMD setting means that the user would need to know which PMDs are being used for that port (could be multiple) and react to any changes in assignments or changes to PMD cores.

It also means that anyone who wants it on all PMD cores now needs to be aware of which PMDs are being used, track changes, and then re-set the option on the new PMDs.

So I think a per-PMD config as a replacement for a global one is too difficult for a user.

*If* per-PMD functionality were needed, a better option would be to keep the global pmd-powersave setting AND add a per-PMD no-sleep override (default false).

The expectation then would be that if a user is concerned about the latency on a particular port, they would be prepared to tune and pin that port's Rx queue(s) to a specific PMD and set that PMD to never sleep, regardless of the global pmd-powersave mode.

It preserves the easy-to-use general setting for users, while giving the extra functionality to a user who has special requirements for some interfaces and is willing to go deeper into tuning.

It still introduces some extra complexity for the user, and a bit for the code too, so we'd need to know that it's needed.

Note, by default it would not have an impact on the global pmd-powersave config option. In this way a per-PMD override is add-on functionality and could be done at a later time.

That is certainly something that could be added if users felt it was needed, and we can plan to do it, but I feel it would be better to first bed down that the core operation of sleeping is useful/correct, before adding extra ways to specify on/off.

So you are suggesting an override at PMD level instead of having this option itself at PMD level? i.e.:
ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
and
ovs-vsctl set open_vswitch . other_config:pmd-powersave-no-sleep=<list of cores for which the override applies>


Yes, that's what I was thinking.

I was thinking we could have a config like this:
ovs-vsctl set open_vswitch . other_config:pmd-powersave-cores=<list of cores which will sleep, or "all">
If "all" is given, it implies all cores will be in power-save mode.
I was also thinking that whenever there are both sleeping and non-sleeping PMDs, it would make sense to explicitly pin the rxqs to the PMD threads.

To my mind the above felt more natural, but I don't have a strong preference either way.
I also agree that we should get the core operation right first, before spending much time on the knobs.


For mixed cases (wanting some PMDs able to sleep and some not), I'd have a preference for opting some PMD cores out, rather than in.

The reasoning is that if cores are used for pinning then the user must already know about them and they are fixed. Other PMDs could be added/removed/changed and the user would not need to know the specifics to keep those other PMDs sleeping. On the other hand, it is nice to have a one-liner too. We can review this again later when we're happy with the core operation.
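
(For illustration only, a minimal sketch of the opt-out check against a hypothetical other_config:pmd-powersave-no-sleep core list. None of these names are from the patch.)

    /* Hypothetical sketch: decide whether a given PMD core is allowed to
     * sleep, based on a global enable plus an opt-out core list such as
     * pmd-powersave-no-sleep="4,6".  Names are illustrative only. */
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    static bool
    pmd_core_may_sleep(bool powersave, const char *no_sleep_list,
                       unsigned core_id)
    {
        if (!powersave) {
            return false;          /* Global setting wins when disabled. */
        }
        if (!no_sleep_list) {
            return true;           /* No overrides: all PMDs may sleep. */
        }

        /* Walk the comma-separated core list looking for this core. */
        char *copy = strdup(no_sleep_list);
        bool may_sleep = true;
        for (char *tok = strtok(copy, ","); tok; tok = strtok(NULL, ",")) {
            if (strtoul(tok, NULL, 10) == core_id) {
                may_sleep = false; /* Core is opted out of sleeping. */
                break;
            }
        }
        free(copy);
        return may_sleep;
    }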

2. I was thinking about using umwait to reduce the latency of the first packet (but portability becomes a challenge, I suppose). That said, this patch is generic enough to be adopted on all platforms.

At a quick look, yes, portability would be an issue. I will look into it some more.
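
(For reference, a rough and deliberately non-portable sketch of the umwait idea on CPUs with WAITPKG. It needs -mwaitpkg and a supporting CPU, which is exactly the portability concern above; it only illustrates the idea and is not proposed code.)

    /* Arm a monitor on an rx descriptor address and sleep in a light
     * low-power state until it is written or a TSC deadline passes. */
    #include <x86intrin.h>
    #include <stdint.h>

    static inline void
    wait_for_rx(volatile void *rx_desc_addr, uint64_t max_tsc_cycles)
    {
        uint64_t deadline = __rdtsc() + max_tsc_cycles;

        _umonitor((void *) rx_desc_addr);  /* Monitor the descriptor address. */
        _umwait(1 /* C0.1 */, deadline);   /* Wait for a write or the deadline. */
    }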

3. Instead of sleeping on the immediately following iteration when the previous iteration had less than half of the burst size, can we have logic that records a window of received packets (say, over the last 60 iterations) and decides to sleep based on that?

We could delay sleeping until an arbitrary number of iterations with all Rx queues < BURST/2, but in some cases that might mean we don't sleep where we could have. The other issue would be picking that arbitrary number - I wouldn't want a user setting something so low-level.
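
(To make the trade-off concrete, a minimal sketch of such a consecutive low-load check. The 60-iteration threshold is an arbitrary example and none of these names are from the patch.)

    /* Only start sleeping after N consecutive iterations where every rxq
     * stayed below the no-sleep threshold. */
    #include <stdbool.h>
    #include <stdint.h>

    #define ITER_THRESH 60            /* Arbitrary illustrative window. */

    static uint64_t low_load_iters;   /* Consecutive low-load iterations. */

    static bool
    should_sleep_this_iteration(bool any_rxq_hit_thresh)
    {
        if (any_rxq_hit_thresh) {
            low_load_iters = 0;       /* Busy rxq seen: reset the window. */
            return false;
        }
        if (low_load_iters < ITER_THRESH) {
            low_load_iters++;         /* Still building up confidence. */
            return false;
        }
        return true;                  /* N quiet iterations in a row: sleep. */
    }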

I was thinking along these lines, as I wanted to put the latency requirements of certain flows ahead of the power-saving feature.
But if we could have a system where there are threads which sleep to save power and threads which poll, we should be fine.


ok, thanks for confirming.

I am trying to avoid scenarios where a PMD thread has two interfaces and one interface has < BURST/2 packets coming in, which induces latency for the packets received on the other interface.


It would be right for the PMD to sleep if both interfaces have < BURST/2 packets.
Waiting an arbitrary number of iterations would likely just delay that.
My bad - for some reason I initially thought the code which decides the sleep was within the for loop going over the rxqs.
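
(For clarity, the per-iteration flow in the patch is roughly the simplified paraphrase below, not the exact code: all rxqs are polled first, then a single sleep decision is made. poll_rxq() stands in for the real rxq processing function.)

    #include <stdbool.h>
    #include <stdint.h>

    #define PMD_PWR_MAX_SLEEP 250000            /* ns, from the patch. */
    #define PMD_PWR_INC 1000                    /* ns, from the patch. */
    #define PMD_PWR_NO_SLEEP_THRESH 16          /* NETDEV_MAX_BURST / 2. */

    extern int poll_rxq(int rxq_idx);           /* Stand-in for rxq polling. */
    extern void xnanosleep(uint64_t ns);        /* OVS sleep helper. */

    static void
    pmd_iteration(int poll_cnt, bool powersave, uint64_t *sleep_time)
    {
        for (int i = 0; i < poll_cnt; i++) {    /* 1. Poll every rxq first. */
            if (poll_rxq(i) >= PMD_PWR_NO_SLEEP_THRESH) {
                *sleep_time = 0;                /* Load seen: reset backoff. */
            }
        }

        if (!powersave) {
            *sleep_time = 0;
            return;
        }
        if (*sleep_time) {                      /* 2. Single sleep decision. */
            xnanosleep(*sleep_time);
        }
        if (*sleep_time < PMD_PWR_MAX_SLEEP) {
            *sleep_time += PMD_PWR_INC;         /* Grow the next sleep. */
        }
    }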

I suppose one thing to mention is that this focuses on the transition and edge cases, and it is hard to get them perfect for all traffic scenarios.

There will always be some trade-off between sleeping sooner or later, i.e. do we sleep sooner and try to save power, or wait a bit longer and be more sure that the traffic is not bursty. I would hope the trade-off only matters for a very small window and that the majority of the time it's clear whether we should sleep or not.
I agree that there are quite a few edge cases. I think if we have the option to put only a few threads to sleep and leave other threads polling, then traffic that is sensitive to the edge cases can be placed on a polling thread.

Thanks, really appreciate your thoughts and feedback.

Kevin.

Thanks
Thilak Raj S

-----Original Message-----
From: Kevin Traynor <[email protected]>
Sent: 29 November 2022 06:02
To: [email protected]
Cc: [email protected]; [email protected];
[email protected]; Thilak Raj Surendra Babu
<[email protected]>; Kevin Traynor <[email protected]>
Subject: [PATCH] dpif-netdev: Load based PMD sleeping.

Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling iteration of the PMD.

Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.

Sleep time will be increased by 1 us on each iteration where the low-load conditions remain, up to a total of 250 us.

The feature is off by default and can be enabled by:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true

Also add new stats to pmd-perf-show to get visibility of operation:
Iterations:
...
    - No-sleep thresh:           576  ( 98.1 % of busy it)
    - Max sleeps:              19424  (167 us/it avg.)
...

Signed-off-by: Kevin Traynor <[email protected]>
---
   Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
   lib/dpif-netdev-perf.c            | 25 ++++++++++++--
   lib/dpif-netdev-perf.h            |  7 +++-
   lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
   vswitchd/vswitch.xml              | 17 ++++++++++
   5 files changed, 118 insertions(+), 7 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst
b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..a42e34956 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -312,4 +312,24 @@ reassignment due to PMD Auto Load Balance. For example, 
this could be set  (in min) such that a reassignment is triggered at most every 
few hours.
+PMD Power Saving
+----------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In
+order to reduce the CPU cycles they use, they can sleep for small
+periods of time when there is no load or very-low load from the Rx queues it 
polls.
+
+This can be enabled by::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
Should we think about having this per PMD thread, instead of applying it to all the PMD threads?
+
+With this enabled a PMD may sleep by an incrementing amount of time
+up to a total of 250 uS after polling all Rx queues assigned to it,
+if none of the Rx queues have at least half a batch of packets (i.e. 16).
+
+.. note::
+
+    Sleeping in a PMD thread may cause extra packet latency due to the sleep
+    and possible processor low-power states which may be invoked.
+
   .. _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index a2a7d8f0b..1bdeda265 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -231,4 +231,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
       uint64_t idle_iter = s->pkts.bin[0];
       uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter -
idle_iter : 0;
+    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
+    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
+    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
ds_put_format(str,
@@ -236,5 +239,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
               "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
               "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
+            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
+            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
               tot_iter, tot_cycles * us_per_cycle / tot_iter,
              tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
@@ -242,5 +247,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
               100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
               busy_iter,
-            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
+            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) / tot_iter : 0);
       if (rx_packets > 0) {
           ds_put_format(str,
@@ -519,5 +526,7 @@ OVS_REQUIRES(s->stats_mutex)
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, bool max_sleep_hit,
+                       bool no_sleep_hit, uint64_t sleep_time,
+                       bool full_metrics)
   {
       uint64_t now_tsc = cycles_counter_update(s);
@@ -540,4 +549,14 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     histogram_add_sample(&s->pkts, rx_packets);
+    if (no_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
+    }
+    if (max_sleep_hit) {
+        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
+    }
+    if (sleep_time) {
+        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
+    }
+
       if (!full_metrics) {
           return;
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 9673dddd8..c3d6cd435 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -81,4 +81,7 @@ enum pmd_stat_type {
       PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
       PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
+    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
+    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
       PMD_N_STATS
   };
@@ -409,5 +412,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, bool max_sleep_hit,
+                       bool no_sleep_hit, uint64_t sleep_time,
+                       bool full_metrics);
 /* Formatting the output of commands. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71c8..91a323c74 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {
#define PMD_RCU_QUIESCE_INTERVAL 10000LL
+/* Max time in nanoseconds for a pmd thread to sleep based on load. */
+#define PMD_PWR_MAX_SLEEP 250000
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
+/* Time in nanosecond to increment a pmd thread sleep time. */
+#define PMD_PWR_INC 1000
Should we make these tunable/configurable (MAX_SLEEP and PWR_INC) instead of having them be constants?
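
(If they were made configurable, reading them in dpif_netdev_set_config() could look roughly like the sketch below. The key names are invented for illustration; smap_get_int() is the existing OVS helper, and PMD_PWR_MAX_SLEEP/PMD_PWR_INC are the constants defined in the hunk above.)

    /* Hypothetical sketch, not part of the patch: read the sleep tunables
     * from other_config instead of using fixed constants. */
    #include <stdint.h>
    #include "smap.h"

    static void
    read_powersave_tunables(const struct smap *other_config,
                            uint64_t *max_sleep_ns, uint64_t *sleep_inc_ns)
    {
        /* Values are given in microseconds in the database and converted
         * to nanoseconds for internal use. */
        int max_us = smap_get_int(other_config, "pmd-powersave-max-sleep",
                                  PMD_PWR_MAX_SLEEP / 1000);
        int inc_us = smap_get_int(other_config, "pmd-powersave-sleep-inc",
                                  PMD_PWR_INC / 1000);

        *max_sleep_ns = (uint64_t) max_us * 1000;
        *sleep_inc_ns = (uint64_t) inc_us * 1000;
    }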
+
   struct dpcls {
       struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
@@ -278,4 +285,6 @@ struct dp_netdev {
       /* Enable collection of PMD performance metrics. */
       atomic_bool pmd_perf_metrics;
+    /* Enable PMD load based sleeping. */
+    atomic_bool pmd_powersave;
       /* Enable the SMC cache from ovsdb config */
       atomic_bool smc_enable_db;
@@ -4915,4 +4924,17 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     set_pmd_auto_lb(dp, autolb_state, log_autolb);
+
+    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
+    bool cur_powersave;
+    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
+    if (powersave != cur_powersave) {
+        atomic_store_relaxed(&dp->pmd_powersave, powersave);
+        if (powersave) {
+            VLOG_INFO("PMD powersave is enabled.");
+        } else {
+            VLOG_INFO("PMD powersave is disabled.");
+        }
+    }
+
       return 0;
   }
@@ -6873,7 +6895,9 @@ pmd_thread_main(void *f_)
       bool exiting;
       bool reload;
+    bool powersave;
       int poll_cnt;
       int i;
       int process_packets = 0;
+    uint64_t sleep_time = 0;
poll_list = NULL;
@@ -6935,7 +6959,14 @@ reload:
       for (;;) {
           uint64_t rx_packets = 0, tx_packets = 0;
+        bool nosleep_hit = false;
+        bool max_sleep_hit = false;
+        uint64_t time_slept = 0;
pmd_perf_start_iteration(s);
-
+        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
+        if (!powersave) {
+            /* Reset sleep_time as policy may have changed. */
+            sleep_time = 0;
+        }
           atomic_read_relaxed(&pmd->dp->smc_enable_db,
&pmd->ctx.smc_enable_db);
@@ -6957,4 +6988,8 @@ reload:
                                              poll_list[i].port_no);
               rx_packets += process_packets;
+            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
+                nosleep_hit = true;
+                sleep_time = 0;
+            }
           }
@@ -6964,5 +6999,19 @@ reload:
                * There was no time updates on current iteration. */
               pmd_thread_ctx_time_update(pmd);
-            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
+                                                   sleep_time ? true : false);
+        }
+
+        if (powersave) {
+            if (sleep_time) {
+                xnanosleep(sleep_time);
+                time_slept = sleep_time;
+            }
+            if (sleep_time < PMD_PWR_MAX_SLEEP) {
+                /* Increase potential sleep time for next iteration. */
+                sleep_time += PMD_PWR_INC;
+            } else {
+                max_sleep_hit = true;
+            }
My understanding here is that the above code will lead to an interface with fewer packets coming in causing latency for the packets arriving on the other interfaces handled by the same PMD.
Please correct me if I am wrong.

Let's say we have 3 interfaces being serviced by this PMD: the first interface receives 8 pkts, the second receives 5 pkts and the third receives 30 pkts.
After servicing the first interface, the sleep time increases to 1 us. Before servicing the second interface it will sleep 1 us and increment the sleep time, and before servicing the third interface it will sleep 2 us (so even though the third interface receives > PMD_PWR_NO_SLEEP_THRESH packets, packets on that interface will suffer extra latency).
If the order of the interfaces was changed, the latency suffered by the packets on these interfaces would change.

I am thinking that if we hold a history of some x iterations and sleep based on that history, the order of the interfaces will not affect the latency suffered by the packets, and we could possibly have even latencies across interfaces.

           }
@@ -7004,5 +7053,6 @@ reload:
           }
-        pmd_perf_end_iteration(s, rx_packets, tx_packets,
+        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
+                               nosleep_hit, time_slept,
                                  pmd_perf_metrics_enabled(pmd));
       }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 928821a82..d51a486d8 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -789,4 +789,21 @@
           </p>
         </column>
+      <column name="other_config" key="pmd-powersave"
+              type='{"type": "boolean"}'>
+        <p>
+         Enables PMD thread sleeping up to 250 uS per iteration based on
+         rx queue batch sizes.
+        </p>
+        <p>
+          The default value is <code>false</code>.
+        </p>
+        <p>
+          Set this value to <code>true</code> to enable this option.
+        </p>
+        <p>
+         Sleep time will be reset to 0 when any rx queue has a batch of
+         16 or more packets.
+        </p>
+      </column>
         <column name="other_config" key="userspace-tso-enable"
                 type='{"type": "boolean"}'>
--
2.38.1


