Hi Kevin,
Apologies for the late reply.
Please find my replies inline.

Thanks
Thilak Raj S

-----Original Message-----
From: Kevin Traynor <[email protected]> 
Sent: 08 December 2022 06:39
To: Thilak Raj Surendra Babu <[email protected]>; [email protected]
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH] dpif-netdev: Load based PMD sleeping.

On 07/12/2022 09:28, Thilak Raj Surendra Babu wrote:
> Hi Kevin,
> Thank you for the patch. It is pretty clean for power saving.

Hi Thilak, Thank you for testing.

> I tried the patch; below is what I see.
> 
> Ping test:
> Ping from physical machine to br0 ip-address.
> 
> Poll:
> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
> PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.081 ms
> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.061 ms
> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.050 ms
> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.049 ms
> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.048 ms
> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.048 ms
> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.047 ms
> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.049 ms
> 
> --- 10.15.251.33 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9ms
> rtt min/avg/max/mdev = 0.047/0.052/0.081/0.013 ms
> 
> With your changes:
> [root@dargil-4 pktgen-dpdk]# ping 10.15.251.33 -i 0.001 -c 10
> PING 10.15.251.33 (10.15.251.33) 56(84) bytes of data.
> 64 bytes from 10.15.251.33: icmp_seq=1 ttl=64 time=0.189 ms
> 64 bytes from 10.15.251.33: icmp_seq=2 ttl=64 time=0.367 ms
> 64 bytes from 10.15.251.33: icmp_seq=3 ttl=64 time=0.351 ms
> 64 bytes from 10.15.251.33: icmp_seq=4 ttl=64 time=0.315 ms
> 64 bytes from 10.15.251.33: icmp_seq=5 ttl=64 time=0.240 ms
> 64 bytes from 10.15.251.33: icmp_seq=6 ttl=64 time=0.204 ms
> 64 bytes from 10.15.251.33: icmp_seq=7 ttl=64 time=0.204 ms
> 64 bytes from 10.15.251.33: icmp_seq=8 ttl=64 time=0.138 ms
> 64 bytes from 10.15.251.33: icmp_seq=9 ttl=64 time=0.137 ms
> 64 bytes from 10.15.251.33: icmp_seq=10 ttl=64 time=0.417 ms
> 
> --- 10.15.251.33 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9ms
> rtt min/avg/max/mdev = 0.137/0.256/0.417/0.094 ms
> 
> At a high level, I was thinking of the below.
> Please let me know what you think?
> 
> 1. Can we have the option of making a particular PMD thread sleep, instead of
> it being a global setting? There are some interfaces whose flows are very
> sensitive to latency (e.g. heartbeats).

To just have a per-PMD setting means that the user would need to know which
PMDs were being used for that port (could be multiple) and react to any changes
in assignments or changes to PMD cores.

It also means that anyone who wants it on all PMD cores now needs to be aware
of which PMDs are being used, track changes, and then re-apply the setting to
the new PMDs.

So I think a per-PMD config as a replacement for the global one is too
difficult for a user.

*If* per-PMD functionality were needed, a better option would be to keep the
global pmd-powersave setting AND add a per-PMD no-sleep override (default
false).

The expectation then would be that if a user is concerned about the latency on
a particular port, they would be prepared to tune and pin that port's Rx
queue(s) to a specific PMD and set that PMD to never sleep, regardless of the
global pmd-powersave mode.
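For example, something along these lines (the no-sleep option below is only
illustrative and does not exist today; pmd-rxq-affinity is the existing
pinning mechanism):

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
    # Pin dpdk0 rxq 0 to the PMD on core 3.
    $ ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3"
    # Hypothetical per-PMD override: the PMD on core 3 never sleeps.
    $ ovs-vsctl set Open_vSwitch . other_config:pmd-no-sleep-cores="3"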

It preserves the easy-to-use general setting for users, but it gives the extra
functionality to a user who has special requirements for some interfaces and is
willing to go deeper into tuning.

It still introduces some extra complexity for the user, and a bit for the code
too, so we'd need to know that it's needed.

Note, by default it would not have an impact on a global pmd-powersave config 
option. In this way a per-pmd override is an add-on functionality and could be 
done at a later time.

That is certainly something that could be added if users feel it is needed, and
we can plan to do it, but I feel it would be better to first bed down that the
core operation of sleeping is useful/correct, before adding extra ways to
specify on/off.

So you are suggesting an override at the PMD level instead of having this
option itself at the PMD level, i.e.:
ovs-vsctl set open_vswitch . other_config:pmd-powersave="true" and
ovs-vsctl set open_vswitch . other_config:pmd-powersave-no-sleep=<list of cores
for which the override applies>

I was thinking, if we could have a config like this:
ovs-vsctl set open_vswitch . other_config:pmd-powersave-cores=<list of cores
which will sleep, or "all">
If "all" is given, it implies all cores will be in power-save mode.
I was also thinking that whenever there are both sleeping and non-sleeping
PMDs, it would make sense to explicitly pin the Rx queues to the PMD threads.
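As a rough sketch of what I mean (pmd-powersave-cores is hypothetical;
pmd-rxq-affinity is the existing pinning knob), with PMD threads on cores 4, 5
and 6:

    ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-cores="4,5"
    # Keep the latency-sensitive port on the always-polling PMD on core 6.
    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:6"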

To my mind the above felt more natural, but I don't have a strong preference
either way.
I also agree that we should get the core operation right first, before spending
much time on the knobs.

> 2. I was thinking about using umwait to reduce the latency of the first
> packet (but portability becomes a challenge, I suppose). But I guess this
> patch is generic enough to be adopted on all platforms.

At a quick look, yes, portability would be an issue. I will look into it some more.

> 3. Instead of sleeping on the next immediate iteration if the previous
> iteration had less than half of the burst size, can we have logic where we
> record a window of packets received (say the last 60 iterations) and decide
> to sleep based on that?

We could delay sleeping until an arbitrary number of iterations with all Rx
queues < BURST/2, but in some cases that might mean we don't sleep where we
might have been able to. The other issue would be picking the arbitrary number
- I wouldn't want a user setting something so low level.

I was thinking along these lines, as I wanted to put the latency requirements
for certain flows ahead of the power-saving feature.
But if we could have a system where there are threads which sleep to save power
and threads which poll, we should be fine.
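Just to make the history idea from point 3 concrete, the kind of gate I had in
mind is roughly the below (a standalone sketch, not patch code; the names and
the 60-iteration window are arbitrary):

    #include <stdbool.h>
    #include <stdint.h>

    /* Arbitrary number of consecutive low-load iterations required. */
    #define LOW_LOAD_WINDOW 60

    /* Allow sleeping only after LOW_LOAD_WINDOW consecutive iterations in
     * which every polled Rx queue stayed below the no-sleep threshold.
     * An iteration with any busy Rx queue resets the history. */
    static bool
    sleep_allowed(bool iter_was_low_load, uint32_t *low_load_iters)
    {
        if (!iter_was_low_load) {
            *low_load_iters = 0;
            return false;
        }
        if (*low_load_iters < LOW_LOAD_WINDOW) {
            (*low_load_iters)++;
        }
        return *low_load_iters >= LOW_LOAD_WINDOW;
    }

But as you say, picking the window value is the awkward part, so I am fine with
leaving this out if the sleeping/polling thread split covers the latency
concern.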

> I am trying to avoid scenarios where a PMD thread has two interfaces and one
> interface has < BURST/2 packets coming in, which induces latency for the
> packets received on the other interface.
> 

It would be right for the PMD to sleep if both interfaces have < BURST/2.
Waiting an arbitrary number of iterations would likely just delay that.
My bad, for some reason I initially thought the code which decides the sleep
was within the for loop going over the Rx queues.

I suppose one thing to mention is that this focuses on the transition and edge
cases, and it is hard to get them perfect for all traffic scenarios.

There will always be some trade-off between sleeping sooner or later,
i.e. do we sleep sooner and try to save power, or wait a bit longer and be more
sure that the traffic is not bursty? I would hope the trade-off only matters
for a very small window and that the majority of the time it's clear whether we
should sleep or not.
Agreed that there are quite a few edge cases. I think if we have the option to
put only a few threads to sleep and keep the other threads polling, then
traffic that is sensitive to these edge cases can be placed on a polling
thread.

Thanks, really appreciate your thoughts and feedback.

Kevin.

> Thanks
> Thilak Raj S
> 
> -----Original Message-----
> From: Kevin Traynor <[email protected]>
> Sent: 29 November 2022 06:02
> To: [email protected]
> Cc: [email protected]; [email protected]; 
> [email protected]; Thilak Raj Surendra Babu 
> <[email protected]>; Kevin Traynor <[email protected]>
> Subject: [PATCH] dpif-netdev: Load based PMD sleeping.
> 
> Sleep for an incremental amount of time if none of the Rx queues assigned to 
> a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling
> iteration of the PMD.
> 
> Upon detecting >= 16 pkts on an Rxq, reset the sleep time to zero.
> 
> Sleep time will be increased by 1 uS on each iteration where the low load
> conditions remain, up to a total of 250 uS.
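If I am reading the constants right (PMD_PWR_INC = 1 uS, PMD_PWR_MAX_SLEEP =
250 uS), that means roughly 250 consecutive low-load iterations to reach the
maximum sleep, with a cumulative 0 + 1 + ... + 249 uS, i.e. about 31 ms, of
sleeping during the ramp-up.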
> 
> The feature is off by default and can be enabled by:
> ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true
> 
> Also add new stats to pmd-perf-show to get visibility of operation:
> Iterations:
> ...
>    - No-sleep thresh:           576  ( 98.1 % of busy it)
>    - Max sleeps:              19424  (167 us/it avg.)
> ...
> 
> Signed-off-by: Kevin Traynor <[email protected]>
> ---
>   Documentation/topics/dpdk/pmd.rst | 20 +++++++++++
>   lib/dpif-netdev-perf.c            | 25 ++++++++++++--
>   lib/dpif-netdev-perf.h            |  7 +++-
>   lib/dpif-netdev.c                 | 56 +++++++++++++++++++++++++++++--
>   vswitchd/vswitch.xml              | 17 ++++++++++
>   5 files changed, 118 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8b3..a42e34956 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -312,4 +312,24 @@ reassignment due to PMD Auto Load Balance. For example, 
> this could be set  (in min) such that a reassignment is triggered at most 
> every few hours.
>   
> +PMD Power Saving
> +----------------
> +
> +PMD threads constantly poll Rx queues which are assigned to them. In
> +order to reduce the CPU cycles they use, they can sleep for small
> +periods of time when there is no load or very-low load from the Rx queues
> +they poll.
> +
> +This can be enabled by::
> +
> +    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
> Should we think about having this per PMD thread instead of applying it to all
> the PMD threads?
> +
> +With this enabled a PMD may sleep by an incrementing amount of time 
> +up to a total of 250 uS after polling all Rx queues assigned to it, 
> +if none of the Rx queues have at least half a batch of packets (i.e. 16).
> +
> +.. note::
> +
> +    Sleeping in a PMD thread may cause extra packet latency due to the sleep
> +    and possible processor low-power states which may be invoked.
> +
>   .. _ovs-vswitchd(8):
>       http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index a2a7d8f0b..1bdeda265 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -231,4 +231,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>       uint64_t idle_iter = s->pkts.bin[0];
>       uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
> +    uint64_t no_sleep_hit = stats[PMD_PWR_NO_SLEEP_HIT];
> +    uint64_t max_sleep_hit = stats[PMD_PWR_MAX_SLEEP_HIT];
> +    uint64_t tot_sleep_time = stats[PMD_PWR_SLEEP_TIME];
>   
>       ds_put_format(str,
> @@ -236,5 +239,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>               "  - Used TSC cycles:  %12"PRIu64"  (%5.1f %% of total cycles)\n"
>               "  - idle iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> -            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n",
> +            "  - busy iterations:  %12"PRIu64"  (%5.1f %% of used cycles)\n"
> +            "  - No-sleep thresh:  %12"PRIu64"  (%5.1f %% of busy it)\n"
> +            "  - Max sleeps:       %12"PRIu64"  (%3"PRIu64" us/it avg.)\n",
>               tot_iter, tot_cycles * us_per_cycle / tot_iter,
>               tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
> @@ -242,5 +247,7 @@ pmd_perf_format_overall_stats(struct pmd_perf_stats *s,
>               100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>               busy_iter,
> -            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
> +            no_sleep_hit, busy_iter ? 100.0 * no_sleep_hit / busy_iter : 0,
> +            max_sleep_hit, tot_iter ? (tot_sleep_time / 1000) / tot_iter : 0);
>       if (rx_packets > 0) {
>           ds_put_format(str,
> @@ -519,5 +526,7 @@ OVS_REQUIRES(s->stats_mutex)
>   void
>   pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics)
> +                       int tx_packets, bool max_sleep_hit,
> +                       bool no_sleep_hit, uint64_t sleep_time,
> +                       bool full_metrics)
>   {
>       uint64_t now_tsc = cycles_counter_update(s);
> @@ -540,4 +549,14 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>       histogram_add_sample(&s->pkts, rx_packets);
>   
> +    if (no_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_NO_SLEEP_HIT, 1);
> +    }
> +    if (max_sleep_hit) {
> +        pmd_perf_update_counter(s, PMD_PWR_MAX_SLEEP_HIT, 1);
> +    }
> +    if (sleep_time) {
> +        pmd_perf_update_counter(s, PMD_PWR_SLEEP_TIME, sleep_time);
> +    }
> +
>       if (!full_metrics) {
>           return;
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 9673dddd8..c3d6cd435 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -81,4 +81,7 @@ enum pmd_stat_type {
>       PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>       PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
> +    PMD_PWR_NO_SLEEP_HIT,   /* Iterations with Rx above no-sleep thresh. */
> +    PMD_PWR_MAX_SLEEP_HIT,  /* Number of times max sleep value hit. */
> +    PMD_PWR_SLEEP_TIME,     /* Total time slept to save power. */
>       PMD_N_STATS
>   };
> @@ -409,5 +412,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s);
>   void
>   pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
> -                       int tx_packets, bool full_metrics);
> +                       int tx_packets, bool max_sleep_hit,
> +                       bool no_sleep_hit, uint64_t sleep_time,
> +                       bool full_metrics);
>   
>   /* Formatting the output of commands. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2c08a71c8..91a323c74 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -170,4 +170,11 @@ static struct odp_support dp_netdev_support = {  
> #define PMD_RCU_QUIESCE_INTERVAL 10000LL
>   
> +/* Max time in nanoseconds for a pmd thread to sleep based on load. */
> +#define PMD_PWR_MAX_SLEEP 250000
> +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
> +#define PMD_PWR_NO_SLEEP_THRESH (NETDEV_MAX_BURST/2)
> +/* Time in nanosecond to increment a pmd thread sleep time. */
> +#define PMD_PWR_INC 1000
> Should we make them tunable/configurable (MAX_SLEEP and PWR_INC) instead of
> them being constants?
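If they were made configurable, I imagine something like the below would do
(option names and units purely illustrative, these knobs do not exist today):
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-max-sleep=500
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-sleep-inc=2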
> +
>   struct dpcls {
>       struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
> @@ -278,4 +285,6 @@ struct dp_netdev {
>       /* Enable collection of PMD performance metrics. */
>       atomic_bool pmd_perf_metrics;
> +    /* Enable PMD load based sleeping. */
> +    atomic_bool pmd_powersave;
>       /* Enable the SMC cache from ovsdb config */
>       atomic_bool smc_enable_db;
> @@ -4915,4 +4924,17 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>   
>       set_pmd_auto_lb(dp, autolb_state, log_autolb);
> +
> +    bool powersave = smap_get_bool(other_config, "pmd-powersave", false);
> +    bool cur_powersave;
> +    atomic_read_relaxed(&dp->pmd_powersave, &cur_powersave);
> +    if (powersave != cur_powersave) {
> +        atomic_store_relaxed(&dp->pmd_powersave, powersave);
> +        if (powersave) {
> +            VLOG_INFO("PMD powersave is enabled.");
> +        } else {
> +            VLOG_INFO("PMD powersave is disabled.");
> +        }
> +    }
> +
>       return 0;
>   }
> @@ -6873,7 +6895,9 @@ pmd_thread_main(void *f_)
>       bool exiting;
>       bool reload;
> +    bool powersave;
>       int poll_cnt;
>       int i;
>       int process_packets = 0;
> +    uint64_t sleep_time = 0;
>   
>       poll_list = NULL;
> @@ -6935,7 +6959,14 @@ reload:
>       for (;;) {
>           uint64_t rx_packets = 0, tx_packets = 0;
> +        bool nosleep_hit = false;
> +        bool max_sleep_hit = false;
> +        uint64_t time_slept = 0;
>   
>           pmd_perf_start_iteration(s);
> -
> +        atomic_read_relaxed(&pmd->dp->pmd_powersave, &powersave);
> +        if (!powersave) {
> +            /* Reset sleep_time as policy may have changed. */
> +            sleep_time = 0;
> +        }
>           atomic_read_relaxed(&pmd->dp->smc_enable_db, 
> &pmd->ctx.smc_enable_db);
>   
> @@ -6957,4 +6988,8 @@ reload:
>                                              poll_list[i].port_no);
>               rx_packets += process_packets;
> +            if (process_packets >= PMD_PWR_NO_SLEEP_THRESH) {
> +                nosleep_hit = true;
> +                sleep_time = 0;
> +            }
>           }
>   
> @@ -6964,5 +6999,19 @@ reload:
>                * There was no time updates on current iteration. */
>               pmd_thread_ctx_time_update(pmd);
> -            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
> +                                                   sleep_time ? true : false);
> +        }
> +
> +        if (powersave) {
> +            if (sleep_time) {
> +                xnanosleep(sleep_time);
> +                time_slept = sleep_time;
> +            }
> +            if (sleep_time < PMD_PWR_MAX_SLEEP) {
> +                /* Increase potential sleep time for next iteration. */
> +                sleep_time += PMD_PWR_INC;
> +            } else {
> +                max_sleep_hit = true;
> +            }
> My understanding here is that the above code will lead to an interface with
> fewer packets coming in causing latency for the packets arriving on the other
> interface handled by the same PMD.
> Please correct me if I am wrong.
> 
> Let's say we have 3 interfaces being serviced by this PMD.
> The first interface receives 8 pkts, the second receives 5 pkts and the third
> receives 30 pkts.
> After servicing the first interface, the sleep time increases to 1 us.
> Before servicing the second interface it will sleep 1 us and increment the
> sleep time to 2 us, and before servicing the third interface it will sleep
> 2 us (so even though the third interface receives > PMD_PWR_NO_SLEEP_THRESH,
> packets on that interface will suffer latency).
> If the order of the interfaces was changed, the latency suffered by the
> packets on these interfaces would change.
> 
> I am thinking, if we hold a history of some x iterations and sleep based on
> that history, the order of the interfaces will not affect the latency
> suffered by the packets, and we could possibly have even latencies across
> the interfaces.
> 
>           }
>   
> @@ -7004,5 +7053,6 @@ reload:
>           }
>   
> -        pmd_perf_end_iteration(s, rx_packets, tx_packets,
> +        pmd_perf_end_iteration(s, rx_packets, tx_packets, max_sleep_hit,
> +                               nosleep_hit, time_slept,
>                                  pmd_perf_metrics_enabled(pmd));
>       }
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 928821a82..d51a486d8 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -789,4 +789,21 @@
>           </p>
>         </column>
> +      <column name="other_config" key="pmd-powersave"
> +              type='{"type": "boolean"}'>
> +        <p>
> +         Enables PMD thread sleeping up to 250 uS per iteration based on
> +         rx queue batch sizes.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option.
> +        </p>
> +        <p>
> +         Sleep time will be reset to 0 when any rx queue has a batch of
> +         16 or more packets.
> +        </p>
> +      </column>
>         <column name="other_config" key="userspace-tso-enable"
>                 type='{"type": "boolean"}'>
> --
> 2.38.1
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
