On Wed, Jul 16, 2008 at 01:56:51PM +0800, Tian, Kevin wrote:

> How many VMs did you run in this test?
100 idle.

> All the VMs are idle except the one where your benchmark runs?

Yes.

> How about the actual effect when several VMs are doing some stuff?

If there are multiple VMs that are busy, the busy ones will fight among
themselves for CPU time.  I still see some priority boost, just not as
much.

> There's another scenario where some VMs don't support cpufreq while
> others do.  Is it unfair here to renice only the latter when the
> former is not being 'nice' at all?
>
> Guess this feature has to be applied with some qualifications, e.g.
> in a group of VMs with known-identical PM abilities...

Agreed, it's not very convenient for guests that don't know about
cpufreq, though I was planning to make it so that unaware guests get no
priority boost/reduction.

> You can report the constant-tsc feature in cpuid virtualization.  Of
> course, if the physical TSC is unstable, it's another story how to
> mark the guest TSC untrustworthy.  (e.g. Marcelo is developing one
> method by simulating C2)

I wonder how stable the virtual tsc is...?  Will have to study this.

> This description seems to mismatch the implementation, which pushes
> +10 and -10 for the 1 and 3 cases.  Maybe I misinterpret the code?

Nope, that's a mistake on my part.

> One interesting point is the initial value of the PERF_CTL MSR.  The
> current 'zero' doesn't reflect a meaningful state to the guest, since
> there's no perf entry in the ACPI table to carry such a value.  One
> likely result is that the guest would think the current freq is 0
> when initializing the ACPI cpufreq driver.  So it would make more
> sense to set the initial value to 2 (P1), keeping the default nice
> value, or even 3 (P0), if you take that state as IDA-style, which may
> over-clock but offers no guarantee.

Indeed.  I had pondered this point considerably myself.  For this RFC I
decided that I could leave the MSR as zero as a way of detecting a
guest that didn't know anything, in case that ability is useful.
However, the Linux drivers seem to give you either 0MHz or some
arbitrary p-state, so I think I'll change it to value 1.

> More critical points to think through, if this feature is expected to
> see real use, are the definition of the exposed virtual freq states
> and how those states can be mapped to scheduler knobs.  Inappropriate
> exposure may cause the guest to bounce excessively between virtual
> freq points.  For example, the 'nice' value is only a relative hint
> to the scheduler, and there's no guarantee that CPU cycles are added
> in proportion to the 'nice' change.

IDA has the same problem... the T61 BIOS "compensates" for this fakery
by reporting a frequency of $max_freq + 1, so if you're smart then
you'll somehow know that you might see a boost that you can't measure.
:P  I suppose the problem here is that p-states were designed on the
assumption that you're directly manipulating hardware speeds, whereas
what we really want in both this patch and IDA are qualitative values
("medium speed", "highest speed", "ludicrous speed?").

> There's even the case where the guest requests the lowest speed while
> the actual cpu cycles allocated to it stay about the same as in the
> last epoch at high speed.  This will fool the guest into thinking
> that the lowest speed satisfies its requirement.

On the other hand, if you get the same performance at both high and low
speeds, then it doesn't really matter which one you choose.  At least
not until the load changes.
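Going back to the +10/-10 mapping for a second: the whole thing really
is just a handful of lines.  A rough sketch (function name, state
numbering, and nice deltas here are illustrative, not lifted from the
patch):

    /*
     * Map the virtual P-state the guest last wrote into its PERF_CTL
     * MSR onto a nice delta for the vcpu thread.  Zero means "guest
     * never touched the MSR", so cpufreq-unaware guests keep the
     * default priority.
     */
    static int pstate_to_nice_delta(unsigned int pstate)
    {
            switch (pstate) {
            case 3:                 /* P0: highest speed, boost priority */
                    return -10;
            case 1:                 /* P2: lowest speed, deprioritize */
                    return 10;
            case 0:                 /* MSR untouched: unaware guest */
            case 2:                 /* P1: middle speed, default nice */
            default:
                    return 0;
            }
    }

The initial-value question above then just becomes deciding which of
those cases the MSR should start in.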
I suppose the next question is, how much software depends on knowing
the exact CPU frequency, and are workload schedulers smart enough to
realize that performance characteristics can change over time
(throttling, TM1/TM2, etc)?  Inasmuch as you ever actually know, since
with hardware coordination of cpufreq the hardware can do whatever it
wants.

> It's similar to the requirement on core-based hardware coordination
> logic, where some feedback mechanism (e.g. the APERF/MPERF MSR pair)
> is required to reveal the actual freq over the last sampling period.
> Here the VM case may need a similar virtualized feedback mechanism.
> Not sure whether the 'actual' freq is easily deduced, however.

I don't think it's easily deduced.  I also don't think APERF/MPERF are
emulated in kvm yet.  I suppose it wouldn't be difficult to add those
two, though measuring that might be a bit messy.  Maybe the cheap
workaround for now is to report the CPU speeds in the table as n-1, n,
n+1.

> Maybe it's worthwhile to compare the freq change count for the same
> benchmark between VM and native, and, more interesting, to see what
> the effect is when multiple VMs all make use of such features.  For
> example, is the expected effect counteracted, with only overhead
> added?  Are any strange behaviors exposed, given that a real 'nice'
> value wouldn't normally change this often, at the dozens-of-ms
> level? :-)

I'll run some benchmarks and see what happens over the next week.

--D
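P.S.  For reference, the APERF/MPERF bookkeeping I have in mind is only
the usual ratio calculation -- a sketch in plain C, with made-up names
and no overflow handling, not anything from the patch:

    /*
     * Effective frequency over one sampling period, derived from the
     * APERF/MPERF deltas.  MPERF ticks at the base frequency while
     * APERF ticks at the frequency actually delivered, so the ratio
     * rescales the base frequency to what really ran.
     */
    static unsigned long long effective_khz(unsigned long long aperf_delta,
                                            unsigned long long mperf_delta,
                                            unsigned long long base_khz)
    {
            if (!mperf_delta)
                    return base_khz;        /* no reference ticks; punt */
            return aperf_delta * base_khz / mperf_delta;
    }

The messy part for kvm would presumably be deciding what a virtual
APERF should count -- cycles the vcpu actually got to run, most
likely -- while a virtual MPERF ticks at the advertised base rate.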