>From: Darrick J. Wong
>Sent: 16 July 2008 7:18
>
>I envision four scenarios:
>
>0. Guests that don't know about cpufreq still run at whatever nice level
>they started with.
>
>1. If we have a system with a lot of idle VMs, they will all run with +5
>nice and this patch has no effect.
>
>2. If we have a system with a lot of busy VMs, they all run with -5 nice
>and this patch also has no effect.
>
>3. If, however, we have a lot of idle VMs and a few busy ones, then the
>-5 nice of the busy VMs will get those VMs extra CPU time.  On a really
>crummy FPU microbenchmark I have, the score goes from about 500 to 2000
>with the patch applied, though of course YMMV.  In some respects this
How many VMs did you run in this test? Were all the VMs idle except the
one running your benchmark? What is the actual effect when several VMs
are doing real work at once?

There's also the scenario where some VMs don't support cpufreq while
others do. Isn't it unfair to renice only the latter when the former
aren't being 'nice' at all? I guess this feature has to be applied with
some qualification, e.g. within a group of VMs known to have the same
PM capabilities...

>
>There are some warts to this patch--most notably, the current
>implementation uses the Intel MSRs and EST feature flag ... even if the
>guest reports the CPU as being AuthenticAMD.  Also, there could be
>timing problems introduced by this change--the OS thinks the CPU
>frequency changes, but I don't know the effect on the guest CPU TSCs.

You can report the constant-TSC feature through CPUID virtualization.
Of course, if the physical TSC is unstable, it's another story how to
mark the guest TSC as untrustworthy (e.g. Marcelo is developing one
method by simulating C2).

>
>Control values are as follows:
>0: Nobody's touched cpufreq.  nice is whatever the default is.
>1: Lowest speed.  nice +5.
>2: Medium speed.  nice is reset.
>3: High speed.  nice -5.

This description seems to mismatch the implementation, which pushes +10
and -10 for the 1 and 3 cases. Or maybe I misread the code?

One interesting point is the initial value of the PERF_CTL MSR. The
current value of zero doesn't reflect a meaningful state to the guest,
since there's no perf entry in the ACPI table carrying such a value.
One likely result is that the guest would think the current frequency
is 0 when initializing the ACPI cpufreq driver. So it would make more
sense to set the initial value to 2 (P1), which keeps the default nice
value, or even 3 (P0), if you treat that state as IDA-style: it may
over-clock but gives no assurance.

A more critical point to think through, if this feature is expected to
see real use, is the definition of the exposed virtual frequency states
and how those states map to scheduler knobs.
Inappropriate exposure may cause the guest to bounce excessively
between virtual frequency points. For example, the 'nice' value is
only a relative hint to the scheduler, and there's no guarantee that
CPU cycles are added in the same proportion as the 'nice' value
changes. There's even the case where the guest requests the lowest
speed while the actual CPU cycles allocated to it stay similar to the
last epoch, when it was at high speed. This will fool the guest into
thinking the lowest speed can satisfy its requirement.

It's similar to the requirement on core-based hardware coordination
logic, where some feedback mechanism (e.g. the APERF/MPERF MSR pair)
is required to reveal the actual frequency over the last sampling
period. The VM case may need a similar virtualized feedback mechanism,
though I'm not sure the 'actual' frequency is easily deduced.

Maybe it's worthwhile to compare the frequency-change count for the
same benchmark between a VM and native. More interesting still: what
is the effect when multiple VMs all make use of this feature? For
example, is the expected benefit counteracted, with only overhead
added? Are any strange behaviors exposed, given that in real use
'nice' wouldn't normally change every few dozen milliseconds? :-)

Thanks,
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
