On 21/12/2022 23:43, Ilya Maximets wrote:
On 12/20/22 19:35, Kevin Traynor wrote:
On 19/12/2022 16:18, Ilya Maximets wrote:
On 12/16/22 18:50, Kevin Traynor wrote:
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased by 1 us on each iteration where
the low load conditions remain, up to a total of the max sleep
time, which has a default of 250 us.

Hi, Kevin.  Thanks for the patch!

The feature seems interesting.  At least, as an experimental feature
for users to try out.  See some comments below.


Hi Ilya,

Thanks for reviewing. Comments below,

thanks,
Kevin.


The feature is off by default and can be enabled by:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave=true

The max sleep time per iteration can be set, e.g. to 500 us:
ovs-vsctl set Open_vSwitch . other_config:pmd-powersave-maxsleep=500

Do we actually need two separate options for this?
What about dropping the general 'pmd-powersave' option and only
keeping the max sleep configuration with '0' being a default and
meaning no sleep?

We may recommend some value in the docs, but it will ultimately be
a user's decision.


We definitely don't need both to operate now. There's just a couple of things 
to consider about the interface.

One is whether having a default is useful for users, or whether we can expect 
them to be ok with coming up with a value. In some ways it's nice that they 
have to pick a value, because it forces them to be aware of the latency 
trade-off :-)

Yeah, I think that is a good thing.  :)


Another thing is how much to abstract this from the user, i.e. whether we want 
the user to have a powersaving feature, in which we may change the 
implementation (think interrupts), or a pmd sleep feature which can (hopefully) 
result in power saving for them.

I would say that users will likely need a separate knob to
manage interrupts or other stuff.  My guess is that users
of this feature will want to understand what is happening
under the hood, because of possible trade-offs.


I suppose a plus for only a max-sleep setting is that it's easier to start with 
one config knob and add another if needed. Also, now that it is limited to 
10 ms, it is not as easy for a user to put in something crazy.

OTOH, if the implementation changed or became some mix of interrupts and 
sleeps, the user would likely need to change their commands.

Just sharing thoughts above, I'm ok with single max-sleep param for now as 
there is no concrete plan for changing the underlying implementation at present.

Interested to hear what anyone else thinks about the user controls.

I'd say that it's hard to predict what will be useful and
what will not be.  That's why we have 'experimental' features
and ability to more or less easily change some APIs around them.
I'd still vote for a single max-sleep option for now to not
overcomplicate the configuration while we still don't fully
understand how the feature will be used.


Sure, I went with a single control knob, 'pmd-maxsleep', in v3.


We might also drop the 'powersave' part from the knob and just have
'pmd-max-sleep'.  But I have no strong opinion on this.


Sure, the 'powersave' was to tie the config knobs together, so we could remove 
it if we are removing the =true knob.

The single max sleep option can be extended in the future to
accept a list of 'core:value' pairs for a fine grained per-PMD
control, if necessary, without breaking backward compatibility.
But that is probably not needed right now.


True. Something along these lines is already requested by Thilak.


Also add new stats to pmd-perf-show to get visibility of operation
e.g.

<snip>
    - No-sleep hit:            36445  ( 98.4 % of busy it)
   Sleep time:               3350902  uS ( 34 us/it avg.)
<snip>

Signed-off-by: Kevin Traynor <[email protected]>

---
v2:
- Updated to mark feature as experimental as there is still discussion
    on its operation and control knobs
- Added pmd-powersave-maxsleep to set the max requested sleep time
- Added unit tests for pmd-powersave and pmd-powersave-maxsleep config
    knobs
- Added docs to explain that requested sleep time and actual sleep time
    may differ
- Added actual measurement of sleep time instead of reporting requested
    time
- Removed Max sleep hit statistics
- Added total sleep time statistic for the length of the measurement
    period (avg. uS per iteration still exists also)
- Updated other statistics to account for sleep time
- Some renaming
- Replaced xnanosleep with nanosleep to avoid having to start/end
    quiesce for every sleep (this may KO this feature on Windows)

Maybe convert a current xnanosleep with a

static void
xnanosleep__(uint64_t nanoseconds, bool need_to_quiesce)

and create 2 wrappers with true/false as arguments:
xnanosleep() and xnanosleep_no_quiesce() ?
Or something like that?


Yes, I had thought about doing something like that, but figured it could be 
extended to Windows later... however, I see the next comment, so it seems 
better to do it now.

I didn't test, but the current code might break the windows build,
not only this particular function.


I had not thought about that!

- Limited max requested sleep to max PMD quiesce time (10 ms)
- Adapted ALB measurement about whether a PMD is overloaded to account
    for time spent sleeping
---
   Documentation/topics/dpdk/pmd.rst | 46 +++++++++++++++++
   lib/dpif-netdev-perf.c            | 26 ++++++++--
   lib/dpif-netdev-perf.h            |  5 +-
   lib/dpif-netdev.c                 | 86 +++++++++++++++++++++++++++++--
   tests/pmd.at                      | 43 ++++++++++++++++
   vswitchd/vswitch.xml              | 34 ++++++++++++
   6 files changed, 229 insertions(+), 11 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8b3..abc552029 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -312,4 +312,50 @@ reassignment due to PMD Auto Load Balance. For example, this could be set
 (in min) such that a reassignment is triggered at most every few hours.
+
+PMD Power Saving (Experimental)
+-------------------------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In order to
+reduce the CPU cycles they use, they can sleep for small periods of time
+when there is no load or very-low load from all the Rx queues they poll.
+
+This can be enabled by::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave="true"
+
+With this enabled a PMD may request to sleep by an incrementing amount of time
+up to a maximum time. If at any point the threshold of at least half a batch of
+packets (i.e. 16) received from an Rx queue that the PMD is polling is met,
+the sleep time will be reset to 0 (i.e. no sleep).
+
+The default maximum sleep time is 250 us. A user can configure a new
+maximum requested sleep time in us, e.g. to set it to 1 ms::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-powersave-maxsleep=1000
+
+Sleeping in a PMD thread will mean there is a period of time when the PMD
+thread will not process packets. Sleep times requested are not guaranteed
+and can differ significantly depending on system configuration. The actual
+time not processing packets will be determined by the sleep and processor
+wake-up times and should be tested with each system configuration.
+
+Sleep time statistics for 10 secs can be seen with::
+
+    $ ovs-appctl dpif-netdev/pmd-stats-clear \
+        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
+
+Example output, showing that the 16 packet no-sleep threshold occurred in
+98.2% of busy iterations and there was an average sleep time of
+33 us per iteration::
+
+     No-sleep hit:             119043  ( 98.2 % of busy it)

I'm not sure the word 'hit' is appropriate here.
Maybe 'No-sleep iterations' ?


Hmm, it's really not reporting the number of iterations that there was 
sleep/no-sleep. It's reporting on whether the Rxq's are above/below the 
threshold that can trigger a sleep.

This was definitely not clear as I thought it is a number
of iterations on which we didn't sleep. :)


I have to agree it was not very intuitive for the user.


It is a subtle difference. At present it is almost the same thing, but we could 
change it to, say, only start sleeping after we have n iterations where the 
threshold is not met, etc.

Now I'm thinking maybe the iterations where there was a sleep/no-sleep might be 
a simpler and more intuitive stat. For that, your naming suggestion makes 
perfect sense in the context of the other stats:

   Iterations:             21957560  (0.10 us/it)
   - Used TSC cycles:    5563094477  ( 87.8 % of total cycles)
   - idle iterations:      21957561  (100.0 % of used cycles)
   - busy iterations:             0  (  0.0 % of used cycles)
   - No-sleep hit:                0  (  0.0 % of busy it)

I'm not sure which way is better, but at least we need
to make sure that users understand what these stats mean
without digging extensively into documentation or code.


I changed it to number of iterations that a sleep took place in, and the % of the total iterations that it represents. I think that's a much clearer stat for the user. Thanks!


And the word 'it' is a bit confusing in this context,
the full word might be better for understanding.


Yeah, I agree

It's fine to move around alignment in the output here.

+     Sleep time:              10638025 uS ( 33 us/it avg.)


uS/us inconsistency.


Ack


<snipped>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
