> From: Coyle, David [mailto:[email protected]]
> Sent: Wednesday, 3 May 2023 17.32
>
> Hi Morten
>
> > From: Morten Brørup <[email protected]>
> >
> > > From: David Coyle [mailto:[email protected]]
> > > Sent: Wednesday, 3 May 2023 13.39
> > >
> > > This is NOT for upstreaming. This is being submitted to allow early
> > > comparison testing with the preferred solution, which will add TPAUSE
> > > power management support to the ring library through the addition of
> > > callbacks. Initial stages of the preferred solution are available at
> > > http://dpdk.org/patch/125454.
> > >
> > > This patch adds functionality directly to rte_ring_dequeue functions
> > > to monitor the empty reads of the ring. When a configurable number of
> > > empty reads is reached, a TPAUSE instruction is triggered by using
> > > rte_power_pause() on supported architectures. rte_pause() is used on
> > > other architectures. The functionality can be included or excluded at
> > > compilation time using the RTE_RING_PMGMT flag. If included, the new
> > > API can be used to enable/disable the feature on a per-ring basis.
> > > Other related settings can also be configured using the API.
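[Aside for readers skimming the thread: a rough sketch of the mechanism
described above could look like the following. This is not the actual patch;
RING_EMPTY_READ_LIMIT, RING_PAUSE_TSC and ring_pm_dequeue_burst() are invented
names, and the threshold values are assumptions.]

#include <rte_ring.h>
#include <rte_pause.h>
#include <rte_cycles.h>
#include <rte_power_intrinsics.h>

#define RING_EMPTY_READ_LIMIT 64   /* assumed per-ring empty-read threshold */
#define RING_PAUSE_TSC        1024 /* assumed pause duration in TSC ticks */

/* Count consecutive empty dequeues; once the threshold is reached, pause the
 * core with rte_power_pause() (TPAUSE) where supported, rte_pause() elsewhere. */
static inline unsigned int
ring_pm_dequeue_burst(struct rte_ring *r, void **objs, unsigned int n,
                      unsigned int *empty_reads)
{
    unsigned int cnt = rte_ring_dequeue_burst(r, objs, n, NULL);

    if (cnt != 0) {
        *empty_reads = 0;
        return cnt;
    }

    if (++(*empty_reads) < RING_EMPTY_READ_LIMIT)
        return 0;

    if (rte_power_pause(rte_rdtsc() + RING_PAUSE_TSC) != 0)
        rte_pause(); /* TPAUSE not supported on this CPU/architecture */

    *empty_reads = 0;
    return 0;
}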
> >
> > I don't understand why DPDK developers keep spending time on trying to
> > invent methods to determine application busyness based on entry/exit
> > points in a variety of libraries, when the application is in a much
> > better position to determine busyness. All of these "busyness measuring"
> > library extensions have their own specific assumptions and weird
> > limitations.
> >
> > I do understand that the goal is power saving, which certainly is
> > relevant! I only criticize the measuring methods.
> >
> > For reference, we implemented something very simple in our application
> > framework:
> > 1. When each pipeline stage has completed a burst, it reports if it was
> > busy or not.
> > 2. If the pipeline busyness is low, we take a nap to save some power.
> >
> > And here is the magic twist to this simple algorithm:
> > 3. A pipeline stage is not considered busy unless it processed a full
> > burst, and is ready to process more packets immediately. This
> > interpretation of busyness has a significant impact on the percentage of
> > time spent napping during the low-traffic hours.
> >
> > This algorithm was very quickly implemented. It might not be perfect,
> > and we do intend to improve it (also to determine CPU utilization on a
> > scale that the end user can translate to a linear interpretation of how
> > busy the system is). But I seriously doubt that any of the proposed
> > "busyness measuring" library extensions are any better.
> >
> > So: The application knows better, please spend your precious time on
> > something useful instead.
> >
> > @David, my outburst is not directed at you specifically. Generally, I do
> > appreciate experimenting as a good way of obtaining knowledge. So thank
> > you for sharing your experiments with this audience!
> >
> > PS: If cruft can be disabled at build time, I generally don't object to
> > it.
>
> [DC] Appreciate that feedback, and it is certainly another way of looking
> at and tackling the problem that we are ultimately trying to solve (i.e.
> power saving).
>
> The problem however is that we work with a large number of ISVs and
> operators, each with their own workload architecture and implementation.
> That means we would have to work individually with each of these to
> integrate this type of pipeline-stage-busyness algorithm into their
> applications. And as these applications are usually commercial,
> non-open-source applications, that could prove to be very difficult.
>
> Also most ISVs and operators don't want to have to worry about changing
> their application, especially their fast-path dataplane, in order to get
> power savings. They prefer for it to just happen without them caring
> about the finer details.
>
> For these reasons, consolidating the busyness algorithms down into the
> DPDK libraries and PMDs is currently the preferred solution. As you say
> though, the libraries and PMDs may not be in the best position to
> determine the busyness of the pipeline, but it provides a good balance
> between achieving power savings and ease of adoption.
Thank you for describing the business logic driving this technical approach.
Now I get it!
Automagic busyness monitoring and power management would be excellent. But what
I see on the mailing list is a bunch of incoherent attempts at doing this. (And
I don't mean your patches, I mean all the patches for automagic power
management.) And the cost is not insignificant: Pollution of DPDK all over the
place, in both drivers and libraries.
I would much rather see a top-down approach, so we could all work towards a
unified solution.
However, I understand that customers are impatient, so I accept that in reality
we have to live with these weird "code injection" based solutions until
something sane becomes available. If they were clearly marked as temporary
workarounds until a proper solution is provided, I might object less to them.
(Again, not just your patches, but all the patches of this sort.)
>
> It's also worth calling out again that this patch is only to allow early
> testing by some customers of the benefit of adding TPAUSE support to the
> ring library. We don't intend on this patch being upstreamed. The
> preferred longer-term solution is to use callbacks from the ring library
> to initiate the pause (either via the DPDK power management API or
> through functions that an ISV may write themselves). This is mentioned in
> the commit message.
Noted!
>
> Also, the pipeline stage busyness algorithm that you have added to your
> pipeline - have you ever considered implementing this in DPDK as a
> generic library? This could certainly be of benefit to other DPDK
> application developers, and having this mechanism in DPDK could again
> ease the adoption and realisation of power savings for others. I
> understand though if this is your own secret sauce and you want to keep
> it like that :)
Power saving is important for the environment (to save the planet and all
that), so everyone should contribute, if they have a good solution. So even if
our algorithm had a significant degree of innovation, we would probably choose
to make it public anyway. Open sourcing it also makes it possible for chip
vendors like Intel to fine tune it more than we can ourselves, which also comes
back to benefit us. All products need some sort of power saving to stay
competitive, but power saving algorithms are not an area we want to pursue
for competitive purposes in our products.
Our algorithm is too simple to make a library at this point, but I have been
thinking about how we can make it a generic library when it has matured some
more. I will take into consideration your point that many customers need it
to be injected invisibly.
Our current algorithm works like this:
while (running) {
    int more = 0;
    more += stage1();
    more += stage2();
    more += stage3();
    if (!more)
        sleep();
}
Each pipeline stage only returns 1 if it processed a full burst. Furthermore,
if a pipeline stage processed a full burst, but happens to know that no more
data is readily available for it, it returns 0 instead.
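As a minimal sketch of that convention (illustrative only; BURST and the stage
signature are assumptions, not our actual code), a stage consuming from an
rte_ring could look like this:

#include <rte_ring.h>

#define BURST 32

/* Return 1 ("more") only if a full burst was processed AND the ring still
 * holds objects; a full burst with nothing left behind it returns 0. */
static int
stage2(struct rte_ring *r)
{
    void *objs[BURST];
    unsigned int avail = 0;
    unsigned int n = rte_ring_dequeue_burst(r, objs, BURST, &avail);

    /* ... process the n dequeued objects ... */

    return n == BURST && avail > 0;
}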
Obviously, the sleep() duration must be short enough that the NIC RX
descriptor rings do not overflow before the ingress pipeline stage is
serviced again.
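To put a rough number on that bound (back-of-the-envelope figures, not
measurements): at 10 Gbit/s line rate with 64-byte packets, roughly 14.88
million packets arrive per second, so a 4096-entry RX descriptor ring fills in
about 275 microseconds; the sleep duration plus the worst-case pipeline
iteration time must stay well below that.

#include <stdio.h>

int main(void)
{
    const double pps = 14.88e6;          /* 10 GbE, 64-byte packets, line rate */
    const unsigned int ring_size = 4096; /* assumed RX descriptor ring size */

    /* Worst-case time until the RX ring overflows if not serviced. */
    printf("RX ring fills in ~%.0f us\n", ring_size / pps * 1e6); /* ~275 us */
    return 0;
}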
Changing the algorithm to "more" (1 = more work expected by the pipeline stage)
from "busy" (1 = some work done by the pipeline stage) has the consequence that
sleep() is called more often, which has the follow-on consequence that the
ingress stage is called less often, and thus more often has a full burst to
process.
We know from our in-house profiler that processing a full burst provides *much*
higher execution efficiency (cycles/packet) than processing a few packets. This
is public knowledge - after all, this is the whole point of DPDK's vector
packet processing design! Nonetheless, it might surprise some people how much
the efficiency (cycles/packet) increases when processing a full burst compared
to processing just a few packets. I will leave it up to the readers to make
their own experiments. :-)
Our initial "busy" algorithm behaved like this:
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
No packets to process (we are lucky this time!), sleep briefly,
Repeat.
So we switched to our "more" algorithm, which behaves like this:
Process a few packets (at low efficiency), sleep briefly,
Process a full burst of packets (at high efficiency), don't sleep,
Repeat.
Instead of processing e.g. 8 small bursts per sleep, we now process only 2
bursts per sleep. And the bigger of the two bursts is processed at higher
efficiency.
We can improve this algorithm in some areas...
E.g. some of our pipeline stages also know that they are not going to do
any more work for the next X nanoseconds, but we don't use that
information in our power management algorithm yet. The sleep duration could
depend on this.
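A sketch of that possible improvement (all names below are hypothetical; the
per-stage *_idle_ns() hints and the cap are assumptions): each stage does its
work and returns how many nanoseconds it expects to stay idle (0 if more work
is pending), and the control loop sleeps for the shortest hint, capped so the
RX rings cannot overflow.

#include <stdint.h>
#include <rte_common.h>
#include <rte_cycles.h>

#define MAX_SLEEP_US 100 /* cap derived from the RX ring overflow bound above */

/* Hypothetical per-stage hints, implemented by the application. */
static uint64_t stage1_idle_ns(void);
static uint64_t stage2_idle_ns(void);
static uint64_t stage3_idle_ns(void);

static void
control_loop(volatile int *running)
{
    while (*running) {
        uint64_t idle_ns = MAX_SLEEP_US * 1000;

        idle_ns = RTE_MIN(idle_ns, stage1_idle_ns());
        idle_ns = RTE_MIN(idle_ns, stage2_idle_ns());
        idle_ns = RTE_MIN(idle_ns, stage3_idle_ns());

        if (idle_ns >= 1000)
            rte_delay_us_sleep((unsigned int)(idle_ns / 1000));
    }
}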
Also, we don't use the CPU power management states yet. I assume that doing
some work for 20 us at half clock speed is more power conserving than doing the
same work at full speed for 10 us and then sleeping for 10 us. That's another
potential improvement.
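A rough sketch of that direction, assuming the DPDK power library is used for
per-lcore frequency scaling (the threshold is made up, and rte_power_init()
must have been called for the lcore beforehand):

#include <rte_power.h>
#include <rte_lcore.h>

static void
adjust_core_frequency(unsigned int consecutive_idle_loops)
{
    unsigned int lcore = rte_lcore_id();

    if (consecutive_idle_loops > 1000)
        rte_power_freq_down(lcore); /* step the core frequency down */
    else if (consecutive_idle_loops == 0)
        rte_power_freq_max(lcore);  /* back to full speed when busy */
}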
What we need in a generic power management helper library are functions to feed
it with the application's perception of how much work is being done, and
functions to tell whether we can sleep and/or whether we should change the power
management states of the individual CPU cores.
Such a unified power management helper (or "busyness") library could perhaps
also be fed with data directly from the drivers and libraries to support the
customer use cases you described.
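To make that concrete, the interface of such a library might look roughly like
this (every name below is invented for illustration; nothing like it exists in
DPDK today):

#include <stdint.h>
#include <stdbool.h>

/* Fed by the application - or injected into drivers and libraries - to report
 * how much work a component just did and whether it expects more. */
void busyness_report(unsigned int lcore_id, unsigned int work_done,
                     bool more_work_expected);

/* Queried by the control loop: may this lcore sleep now, and for how long? */
bool busyness_may_sleep(unsigned int lcore_id, uint64_t *max_sleep_ns);

/* Queried by the control loop: which power state should this lcore use? */
int busyness_recommended_pstate(unsigned int lcore_id);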
-Morten