On 05.10.20 16:15, Ralf Ramsauer wrote:
>
>
> On 05/10/2020 15:36, Jan Kiszka wrote:
>> On 05.10.20 15:33, Ralf Ramsauer wrote:
>>>
>>>
>>> On 04/10/2020 22:16, Ralf Ramsauer wrote:
>>>> On 10/4/20 8:38 PM, Jan Kiszka wrote:
>>>>> On 03.10.20 01:56, Ralf Ramsauer wrote:
>>>>>> On x86_64 systems, this test inmate measures the time that is required
>>>>>> to read a value from main memory. Via rdtsc, it measures the CPU cycles
>>>>>> that are required for the access. Access can happen either cached or
>>>>>> uncached. In the uncached case, the cache line is flushed before the
>>>>>> access.
>>>>>>
>>>>>> This tool repeats the measurement 10e6 times and outputs the average
>>>>>> number of cycles required for the access. Before running the actual
>>>>>> measurement, a dummy test is used to determine the average overhead of
>>>>>> a single measurement.
>>>>>>
>>>>>> And that's pretty useful, because this tool gives a lot of insight into
>>>>>> differences between the root and the non-root cell: with tiny effort,
>>>>>> we can also run it on Linux.
>>>>>>
>>>>>> If the 'overhead' time differs between root and non-root cell, this can
>>>>>> be an indicator that there might be some timing or speed differences
>>>>>> between the root and non-root cell.
>>>>>>
>>>>>> If the 'uncached' or 'cached' average time differs between the non-root
>>>>>> and root cell, it's an indicator that both might have different
>>>>>> hardware configurations / setups.
>>>>>>
>>>>>> The host tool can be compiled with:
>>>>>> $ gcc -Os -Wall -Wextra -fno-stack-protector -mno-red-zone \
>>>>>>       -o cache-timing ./inmates/tests/x86/cache-timings-host.c
>>>>>>
>>>>>> Signed-off-by: Ralf Ramsauer <[email protected]>
>>>>>> ---
>>>>>>
>>>>>> Hi Jan,
>>>>>>
>>>>>> what do you think about a test inmate like this one? It's still an RFC
>>>>>> patch, as I'm not sure whether the measurement setup is correct. In
>>>>>> particular, I might have too many fences.
>>>>>>
>>>>>> This test could be extended to run permanently and show the results of
>>>>>> the last 1e3, 1e5 and 1e6 runs. With that, this tool could be used to
>>>>>> monitor influences of the root cell on the non-root cell's caches.
>>>>>
>>>>> Such benchmarks aren't bad. However, the current form does not qualify
>>>>> for the test folder yet IMHO: no functional test, no easy evaluation of
>>>>> benchmark results in order to generate a pass/fail criterion.
>>>>
>>>> Ack, will move it to demos/. Before posting a v2: did you have the
>>>> chance to look at the usage of the fences? I suspect I might have
>>>> messed something up there.
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Aaand btw: On a Xeon Gold 5118, we get the following values on Linux
>>>>>> resp. in the non-root cell:
>>>>>>
>>>>>> Linux:
>>>>>> $ ./cache-timing
>>>>>> Measurement rounds: 10000000
>>>>>> Determining measurement overhead...
>>>>>> -> Average measurement overhead: 37 cycles
>>>>>> Measuring uncached memory access...
>>>>>> -> Average uncached memory access: 222 cycles
>>>>>> Measuring cached memory access...
>>>>>> -> Average cached memory access: 9 cycles
>>>>>
>>>>> Linux native or Linux in Jailhouse?
>>>>>
>>>>>> Non-Root:
>>>>>> Cell "apic-demo" can be loaded
>>>>>> Started cell "apic-demo"
>>>>>> CPU 3 received SIPI, vector 100
>>>>>> Measurement rounds: 10000000
>>>>>> Determining measurement overhead...
>>>>>> -> Average measurement overhead: 82 cycles
>>>>>> Measuring uncached memory access...
>>>>>> -> Average uncached memory access: 247 cycles
>>>>>> Measuring cached memory access...
>>>>>> -> Average cached memory access: 19 cycles
>>>>>
>>>>> How does this compare to Linux in Jailhouse (if the above was native)?
>>>>
>>>> Ok, the following table shows the three numbers for
>>>> overhead / uncached / cached:
>>>>
>>>> Measurement            | OH | U$  | $
>>>> -----------------------+----+-----+-----
>>>> Linux native           | 37 | 222 | 9
>>>> Linux root             | 37 | 226 | 9
>>>> Linux non-root         | 37 | 215 | 9
>>>> libinmate non-root [1] | 82 | 266 | 19
>>>> libinmate non-root [2] | 36 | 217 | 8
>>>
>>> Okay, fasten seatbelts, here's another one:
>>>
>>> $ jh cell create my-cell
>>> $ jh cell load my-cell apic-demo.bin
>>> $ jh cell start my-cell
>>> [snip]
>>> Timer fired, jitter: 728 ns, min: 655 ns, max: 899 ns
>>>
>>> And that one:
>>> $ jh cell linux my-cell [...]
>>> $ jh cell load my-cell apic-demo.bin
>>> $ jh cell start my-cell
>>> [snip]
>>> Timer fired, jitter: 332 ns, min: 267 ns, max: 461 ns
>>>
>>> Wow.
>>
>> Power management? We eventually need to look into those nasty details...
>
> Yes, very likely. I can confirm that it's probably power management. It
> looks like the following happens: the CPU gets throttled by the root
> cell's Linux when offlining the CPU.
>
> When we later run apic-demo on that CPU, we run it on a throttled CPU.
> But if we load Linux into the very same cell before apic-demo, Linux
> will take care of power management and bring the CPU up to speed again.
>
> By default, my non-root Linux uses the performance cpufreq governor and
> configures everything to max speed.
>
> To confirm my assumption: if I set the powersave governor before
> reloading the cell with apic-demo, I get worse latencies again.
>
> So this issue must definitely be related to power management somehow.
>
This whole topic consists of at least three aspects:

- understand what all can be controlled on Intel (and eventually also
  AMD) CPUs and what cross-core effects it has
- model access control in the hypervisor
- make use of those tunings in our bare-metal cells

We can probably pull the last item to the front, as it will provide
direct input to the others.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

-- 
You received this message because you are subscribed to the Google Groups "Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/jailhouse-dev/d704b7bd-d150-b762-3811-ad0bdc839eef%40siemens.com.
