Hi Anthony,

On Thu, Apr 18, 2024 at 12:52:14PM +0200, Anthony Harivel wrote:
> Date: Thu, 18 Apr 2024 12:52:14 +0200
> From: Anthony Harivel <ahari...@redhat.com>
> Subject: Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu
> > The package energy consumption includes core part and uncore part, where
> > uncore part consumption may not be able to be scaled based on vCPU
> > runtime ratio.
> >
> > When the uncore part consumption is small, the error in this part is
> > small, but if it is large, then the error generated by scaling by vCPU
> > runtime will be large.
> >
> So far we can only work with what Intel is giving us i.e Package power 
> plane and DRAM power plane on server, which is the main target of 
> this feature. Maybe in the future, Intel will expand the core power 
> plane and the uncore power plane to server class CPU ?

I'm not asking about future features. I'd like to illustrate the impact
of the uncore part (iGPU/NPU on Client or various accelerators on Server)
on this algorithm: the consumption of the uncore part is complex and not
necessarily linearly related to the vCPU task running time.

It might be worth stating the potential impact of the uncore part on
accuracy in the doc (I doubt that heavy uncore consumption will even
affect the consistency of the energy trend as you said).

Anyway, clearer scenarios help this feature get used.
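
To make the concern concrete, here is a minimal sketch (not from the
patch; the function names and the core/uncore split are made up for
illustration). It assumes the vCPU really causes only its tick share of
the core energy and none of the uncore energy. Under that assumption the
error of scaling the whole package delta by the tick ratio equals the
uncore energy times that ratio:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): attribute a package energy
 * delta, in joules, to one vCPU thread by its share of the elapsed ticks.
 */
static double attribute_by_ticks(double pkg_delta_j,
                                 uint64_t vcpu_ticks, uint64_t total_ticks)
{
    assert(total_ticks > 0 && vcpu_ticks <= total_ticks);
    return pkg_delta_j * (double)vcpu_ticks / (double)total_ticks;
}

/*
 * Illustration only: estimate minus "actual", where "actual" assumes the
 * vCPU caused its tick share of the core energy and no uncore energy.
 */
static double attribution_error(double core_j, double uncore_j,
                                uint64_t vcpu_ticks, uint64_t total_ticks)
{
    double estimate = attribute_by_ticks(core_j + uncore_j,
                                         vcpu_ticks, total_ticks);
    double actual = core_j * (double)vcpu_ticks / (double)total_ticks;

    return estimate - actual;
}
```

With 80 J of core energy, 20 J of uncore energy and a vCPU holding 50 of
100 ticks, the estimate is 50 J against 40 J under this assumption, so
the error is 10 J. With no uncore consumption the error is 0.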

> > May I ask what your usage scenario is? Is there significant uncore
> > consumption (e.g. GPU)?
> >
> Same answer as above: uncore/graphics power plane is only available on 
> client class CPU. 

Yes, the iGPU is, but servers may have other accelerators, e.g.,
DSA/IAA/QAT on SPR.

> > Also, I think of a generic question is whether the error in this
> > calculation is measurable? Like comparing the RAPL status of the same
> > workload on Guest and bare metal to check the error.
> >
> > IIUC, this calculation is highly affected by native/sibling Guests,
> > especially in cloud scenarios where there are multiple Guests, the
> > accuracy of this algorithm needs to be checked.
> >
> Indeed, depending on where your vCPUs are running within the package (on 
> the native or sibling CPU), you might observe different power 
> consumption levels. However, I don't consider this to be a problem, as 
> the ratio calculation takes into account the vCPU's location.
> We also need to approach the measurement differently. Due to the 
> complexity of factors influencing power consumption, we must compare 
> what is comparable. If you require precise power consumption data, 
> use a power meter on the PSU of the server. It will provide the
> ultimate judgment. However, if you need an estimation to optimize 
> software workloads in a guest, then this feature could be useful. All my 
> tests have consistently shown reproducible output in terms of power 
> consumption, which has convinced me that we can effectively work with 
> it.

Thanks, another mail in which you illustrated that the trend is


> >
> > In addition, RAPL is basically a CPU feature, I think it would be more
> > appropriate to make it as a x86 CPU's property.
> >
> > Your RAPL support actually provides a framework for assisting KVM
> > emulation in userspace, so this informs other feature support (maybe model
> > specific, or architectural) in the future. Enabling/disabling CPU features
> > via -cpu looks more natural.
> This is totally dependant of KVM because it used the KVM MSR Filtering 
> to access userspace when a specific MSR is required.

Yes, but other KVM-based features (fully hardware-virtualized ones) are
also configured by -cpu. RAPL is still a CPU feature and just needs
KVM's help.
> I can try to find a way to use -cpu for this feature and check if KVM is 
> activated or not. 


> >
> > I understand tick would ignore frequency changes, e.g., HWP's auto-pilot
> > or turbo boost. All these CPU frequency change would impact on core energy
> > consumption.
> >
> > I think the better way would be to use APERF counter, but unfortunately it
> > lacks virtualization support (for Intel).
> >
> > Due to such considerations, it may be more worthwhile to evaluate the
> > accuracy of this tick-based algorithm.
> >
> I've evaluated such things with another tool called Kepler [1]. This 
> tool calculate the power ratio with metrics from RAPL and uses either 
> eBPF or the tick based systems for time metrics.

Thanks for this information! I understand the current tick-based
algorithm is a common approximation in the industry (as in Kepler), right?

> The eBPF part [2] is 
> triggered on each 'finish_task_switch' of Thread and calculate the delta 
> of cpu cycle, cache miss, cpu time, etc. Very complex. My tests showed 
> that the difference between using eBPF and tick based ratio is really 
> not that important. Maybe on some special cases, using eBPF would show 
> a way better accuracy but I'm not aware of that.

Good to know!

Just curious: when using Kepler in the Guest to optimize a task (your
use case), is it only necessary that the trends are similar, with no
requirement on the accuracy of the values?
> [1]: https://github.com/sustainable-computing-io/kepler
> [2]: 
> https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c


> >
> > > +        /* Sleep a short period while the other threads are working */
> > > +        usleep(MSR_ENERGY_THREAD_SLEEP_US);
> >
> > Is it possible to passively read the energy status? i.e. access the Host
> > MSR and calculate the energy consumption for the Guest when the Guest
> > triggers the relevant exit.
> >
> Yes it could be possible. But what I wanted to avoid with my approach is 
> the overhead it could take when accessing a RAPL MSR in the Guest.
> The value is always available and return very quickly. 
> I'm not sure about the overhead if we have to access the MSR,
> then do the calculation and so on.
> > I think this might make the error larger, but not sure the error would
> > be so large as to be unacceptable.
> >
> I'm a bit concerned about the potential overflow in calculation if the 
> time in between is too big. 
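
On the overflow concern: the energy status field is a 32-bit counter
that wraps, and the energy unit exponent comes from bits 12:8 of
MSR_RAPL_POWER_UNIT. A minimal sketch of a wrap-aware delta follows; the
helper names are hypothetical and not from the patch, and it is only
correct if at most one wrap happens between two reads:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: delta between two raw reads of a 32-bit wrapping
 * energy counter. Unsigned subtraction is modulo 2^32, so a single wrap
 * between the two reads is handled without a special case.
 */
static uint32_t energy_delta_raw(uint32_t prev, uint32_t curr)
{
    return curr - prev;
}

/*
 * Hypothetical helper: convert raw counter units to microjoules.
 * One raw unit is 1 / 2^esu_exp joules, where esu_exp is the energy
 * status unit exponent read from MSR_RAPL_POWER_UNIT.
 */
static uint64_t raw_to_microjoules(uint32_t raw, unsigned esu_exp)
{
    assert(esu_exp < 32);
    /* raw < 2^32, so raw * 10^6 fits comfortably in 64 bits */
    return ((uint64_t)raw * 1000000ULL) >> esu_exp;
}
```

So the passive approach stays safe as long as the Guest (or a fallback
timer) reads often enough that the counter cannot wrap twice in between.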



> > > +int is_rapl_enabled(void)
> > > +{
> > > +    const char *path = "/sys/class/powercap/intel-rapl/enabled";
> >
> > This field does not ensure the existence of RAPL MSRs since TPMI would
> > also enable this field. (See powercap_register_control_type() in
> > drivers/powercap/intel_rapl_{msr,tpmi}.c)
> >
> But this is exactly what it is intended to do. Whether it is MSR or TPMI,
> checking this, tell me that RAPL is activated or not. If it is 
> activated...
> > We can read RAPL MSRs directly. If we get 0 or failure, then there's no
> > RAPL MSRs on Host.
> >
> ... then it is safe to access the RAPL MSR. I would not like to access 
> this on a XYZ cpu without knowing I can !

RAPL will disappear and this field can't ease your concern. ;-)

Even if there are no commercially available machines yet, the logic of
the Linux driver shows that relying on this field is unreliable.

For model-specific features, either use the CPU model ID to know whether
the feature is supported, or access the relevant MSR directly. Even if
the feature is not supported, the register address should not be used for
other purposes, so I don't think there is an information leak problem here.
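
A rough sketch of the "access the MSR directly" option, assuming the
Linux msr driver node (/dev/cpu/N/msr) is available to the process. The
helper names are hypothetical and not from the patch; 0x611 is the
MSR_PKG_ENERGY_STATUS address from the Intel SDM:

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for pread() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define MSR_PKG_ENERGY_STATUS 0x611

/*
 * Hypothetical helper: read one 64-bit MSR through an msr-style device
 * node, where the file offset selects the register.
 * Returns 0 on success or a negative errno value.
 */
static int read_msr(const char *path, uint32_t reg, uint64_t *val)
{
    int fd, ret = 0;
    ssize_t n;

    assert(path != NULL && val != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -errno;
    }
    n = pread(fd, val, sizeof(*val), reg);
    if (n < 0) {
        ret = -errno;
    } else if (n != (ssize_t)sizeof(*val)) {
        ret = -EIO; /* short read: treat as failure */
    }
    close(fd);
    return ret;
}

/* Treat a failed read or an all-zero value as "no RAPL MSR on Host". */
static int host_has_pkg_energy_msr(const char *msr_path)
{
    uint64_t val = 0;

    if (read_msr(msr_path, MSR_PKG_ENERGY_STATUS, &val) != 0) {
        return 0;
    }
    return val != 0;
}
```

Something like host_has_pkg_energy_msr("/dev/cpu/0/msr") could then
replace the sysfs "enabled" check, independent of whether the powercap
entry was registered by the MSR or the TPMI driver.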

> Thanks a lot for all your feedback Zhao !

You're welcome! I've also been thinking about how to support the
emulation of other thermal features in user space in QEMU, as you did
with RAPL.

