On 05/10/2020 16:22, Jan Kiszka wrote:
> On 05.10.20 16:15, Ralf Ramsauer wrote:
>>
>>
>> On 05/10/2020 15:36, Jan Kiszka wrote:
>>> On 05.10.20 15:33, Ralf Ramsauer wrote:
>>>>
>>>>
>>>> On 04/10/2020 22:16, Ralf Ramsauer wrote:
>>>>> On 10/4/20 8:38 PM, Jan Kiszka wrote:
>>>>>> On 03.10.20 01:56, Ralf Ramsauer wrote:
>>>>>>> On x86_64 systems, this test inmate measures the time that is required
>>>>>>> to read a value from main memory. Via rdtsc, it measures the CPU cycles
>>>>>>> required for the access. The access can happen either cached or
>>>>>>> uncached. In the uncached case, the cache line is flushed before the
>>>>>>> access.
>>>>>>>
>>>>>>> This tool repeats the measurement 10e6 times and outputs the average
>>>>>>> number of cycles required for the access. Before starting the actual
>>>>>>> measurement, a dummy run determines the average overhead of a single
>>>>>>> measurement.
>>>>>>>
>>>>>>> And that's pretty useful, because this tool gives a lot of insight
>>>>>>> into differences between the root and the non-root cell: with little
>>>>>>> effort, we can also run it on Linux.
>>>>>>>
>>>>>>> If the 'overhead' time differs between root and non-root cell, this can
>>>>>>> be an indicator that there might be some timing or speed differences
>>>>>>> between the root and non-root cell.
>>>>>>>
>>>>>>> If the 'uncached' or 'cached' average time differs between the non-root
>>>>>>> and root cell, it's an indicator that both might have different hardware
>>>>>>> configurations / setups.
>>>>>>>
>>>>>>> The host tool can be compiled with:
>>>>>>> $ gcc -Os -Wall -Wextra -fno-stack-protector -mno-red-zone -o 
>>>>>>> cache-timing ./inmates/tests/x86/cache-timings-host.c
>>>>>>>
>>>>>>> Signed-off-by: Ralf Ramsauer <[email protected]>
>>>>>>> ---
>>>>>>>
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> what do you think about a test inmate like this one? It's still an
>>>>>>> RFC patch, as I'm not sure whether the measurement setup is correct.
>>>>>>> In particular, I might have too many fences.
>>>>>>>
>>>>>>> This test could be extended to run permanently and show the results
>>>>>>> of the last 1e3, 1e5 and 1e6 runs. With that, the tool could be used
>>>>>>> to monitor influences of the root cell on the non-root cell's caches.
>>>>>>
>>>>>> Such benchmarks aren't bad. However, the current form does not qualify
>>>>>> for the test folder yet IMHO: no functional test, no easy evaluation of
>>>>>> benchmark results to derive a pass/fail criterion.
>>>>>
>>>>> Ack, will move it to demos/. Before posting a v2: did you have the
>>>>> chance to look at the usage of the fences? I suspect I might have
>>>>> messed something up there.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Aaand btw: on a Xeon Gold 5118, we get the following values on
>>>>>>> Linux and in the non-root cell, respectively:
>>>>>>>
>>>>>>> Linux:
>>>>>>> $ ./cache-timing
>>>>>>> Measurement rounds: 10000000
>>>>>>> Determining measurement overhead...
>>>>>>>   -> Average measurement overhead: 37 cycles
>>>>>>> Measuring uncached memory access...
>>>>>>>   -> Average uncached memory access: 222 cycles
>>>>>>> Measuring cached memory access...
>>>>>>>   -> Average cached memory access: 9 cycles
>>>>>>>
>>>>>>
>>>>>> Linux native or Linux in Jailhouse?
>>>>>>
>>>>>>> Non-Root:
>>>>>>> Cell "apic-demo" can be loaded
>>>>>>> Started cell "apic-demo"
>>>>>>> CPU 3 received SIPI, vector 100
>>>>>>> Measurement rounds: 10000000
>>>>>>> Determining measurement overhead...
>>>>>>>   -> Average measurement overhead: 82 cycles
>>>>>>> Measuring uncached memory access...
>>>>>>>   -> Average uncached memory access: 247 cycles
>>>>>>> Measuring cached memory access...
>>>>>>>   -> Average cached memory access: 19 cycles
>>>>>>
>>>>>> How does this compare to Linux in Jailhouse (if the above was native)?
>>>>>
>>>>> Ok, the following table shows the three numbers for
>>>>> overhead / uncached / cached:
>>>>>
>>>>> Measurement            | OH |  U$ | $
>>>>> -----------------------+----+-----+-----
>>>>> Linux native           | 37 | 222 |  9
>>>>> Linux root             | 37 | 226 |  9
>>>>> Linux non-root         | 37 | 215 |  9
>>>>> libinmate non-root [1] | 82 | 266 | 19
>>>>> libinmate non-root [2] | 36 | 217 |  8
>>>>
>>>> Okay, fasten seatbelts, here's another one:
>>>>
>>>> $ jh cell create my-cell
>>>> $ jh cell load my-cell apic-demo.bin
>>>> $ jh cell start my-cell
>>>> [snip]
>>>> Timer fired, jitter:    728 ns, min:    655 ns, max:    899 ns
>>>>
>>>> And that one:
>>>> $ jh cell linux my-cell [...]
>>>> $ jh cell load my-cell apic-demo.bin
>>>> $ jh cell start my-cell
>>>> [snip]
>>>> Timer fired, jitter:    332 ns, min:    267 ns, max:    461 ns
>>>>
>>>> Wow.
>>>
>>> Power management? We eventually need to look into those nasty details...
>>
>> Yes, very likely. I can confirm that it is most probably power
>> management. It looks like the following happens: the CPU gets throttled
>> by the root cell's Linux when the CPU is offlined.
>>
>> When we later run apic-demo on that CPU, we run it on a throttled CPU.
>> But if we load Linux on the very same cell before apic-demo, Linux will
>> take care of power management and bring the CPU up again.
>>
>> By default, my non-root Linux uses the performance cpufreq governor and
>> configures everything to maximum speed.
>>
>> To confirm my assumption: If I set the powersave governor before
>> reloading the cell with apic-demo, I get worse latencies again.
>>
>> So this issue is definitely related to power management.
>>
> 
> This whole topic consists of three aspects at least:
> 
>  - understand what all can be controlled on Intel (and possibly also
>    AMD) CPUs and what cross-core effects it has
>  - model access control in the hypervisor
>  - make use of those tunings in our bare-metal cells
> 
> We can probably pull the last item to the front as it will provide
> direct input to the others.

Just for the record, this is my local hack to put the inmate CPU into the
highest P-state:

diff --git a/inmates/lib/x86/setup.c b/inmates/lib/x86/setup.c
index 807db99e..9c03ca3b 100644
--- a/inmates/lib/x86/setup.c
+++ b/inmates/lib/x86/setup.c
@@ -42,6 +42,10 @@

 #define AUTHENTIC_AMD(n)       (((const u32 *)"AuthenticAMD")[n])

+#define MSR_IA32_PERF_CTRL     0x199
+#define MSR_IA32_MISC_ENABLE   0x1a0
+#define MAX_PSTATE             23
+
 void *stack = (void*)stack_top;

 struct desc_table_reg {
@@ -72,4 +76,9 @@ void arch_init_early(void)
        dtr.limit = sizeof(idt) - 1;
        dtr.base = (unsigned long)&idt;
        asm volatile("lidt %0" : : "m" (dtr));
+
+       u64 perf = read_msr(MSR_IA32_PERF_CTRL);
+       printk("CPU booted with: %llx\n", perf);
+       perf = MAX_PSTATE << 8;
+       write_msr(MSR_IA32_PERF_CTRL, perf);
 }


You can determine MAX_PSTATE with:
$ cat /sys/devices/system/cpu/intel_pstate/num_pstates

in the root cell.

  Ralf

-- 
You received this message because you are subscribed to the Google Groups 
"Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/jailhouse-dev/4dbc53d4-893c-ae9a-f231-18e790b8901f%40oth-regensburg.de.

Reply via email to