Hi Ralf,
On Thursday, January 23, 2020 at 5:02:40 AM UTC-7, Ralf Ramsauer wrote:
>
> On 23/01/2020 08:57, Michael Hinton wrote:
> > Here’s my setup: I’ve got a 6-core Intel x86-64 CPU running Kubuntu
> > 19.10. I have an inmate that is given a single core and runs a
> > single-threaded workload. For comparison, I also run the same workload
> > in Linux under Jailhouse.
>
> What CPU exactly?
>
The inmate gets CPU 2 (.cpus = {0b0100}). I could try moving it to another
core to see if that makes a difference.
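For reference, the relevant bit of the cell config looks like this (just a
sketch of the field; the full config is in the link quoted further down):

	.cpus = {
		0b0100, /* bit 2 set -> only CPU 2 is assigned to the inmate */
	},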
> > For a SHA3 workload with the same 20 MiB input, the root Linux cell (and
> > no inmate running) takes about 2 seconds, while the inmate (and an idle
> > root cell) takes about 2.8 seconds. That is a worrisome discrepancy, and
> > I need to understand why it’s 0.8 s slower.
>
> What about CPU power features? cpufreq? turbo boost? ...
>
I have already turned off Hardware P-states when booting Linux, and that
stopped them from affecting my inmate benchmarks:

hintron@bazooka:~/code/jailhouse/mgh/scripts$ grep "no_hwp" /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off memmap=82M\\\$0x3a000000 kvm_intel.nested=1 intel_pstate=no_hwp acpi=force"
As for Turbo Boost, I've been trying to turn it off, but the inmate just
runs at the maximum Turbo Boost frequency when it starts up, which in this
case is 3.9 GHz. I've even measured the frequency in the inmate with the
APERF and MPERF MSRs to verify this. When I change the maximum Turbo Boost
frequency in Linux to 3.7 GHz using CoreFreq, that is what the inmate runs
at when it starts.
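In case it helps, this is roughly how I read them (a sketch, not my exact
code; the rdmsr helper is mine, MSRs 0xe7/0xe8 are IA32_MPERF/IA32_APERF,
and 3,700,000,000 is the fixed TSC frequency mentioned below):

#include <stdint.h>

#define MSR_IA32_MPERF	0xe7		/* counts at the fixed TSC frequency */
#define MSR_IA32_APERF	0xe8		/* counts at the actual core frequency */
#define TSC_FREQ_HZ	3700000000ULL

static inline uint64_t rdmsr(uint32_t msr)
{
	uint32_t lo, hi;

	asm volatile("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
	return ((uint64_t)hi << 32) | lo;
}

/* Average effective frequency over the workload: TSC_freq * dAPERF/dMPERF */
static uint64_t effective_freq_hz(void (*workload)(void))
{
	uint64_t m0 = rdmsr(MSR_IA32_MPERF);
	uint64_t a0 = rdmsr(MSR_IA32_APERF);

	workload();

	uint64_t m1 = rdmsr(MSR_IA32_MPERF);
	uint64_t a1 = rdmsr(MSR_IA32_APERF);

	/* 128-bit intermediate so the multiplication cannot overflow */
	return (uint64_t)((__uint128_t)TSC_FREQ_HZ * (a1 - a0) / (m1 - m0));
}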
> SHA3 is only computationally 'expensive', right? So it's neither memory
> intensive nor should it trap.
>
Yes, SHA3 is computationally expensive; VTune does not show it as memory
bound. The other two workloads are more memory bound.
By "trap" do you mean a vmexit or something else?
> > You can see that the inmate and the Linux wrapper both execute the same
> > function, sha3_mgh(). It's the same C code.
> >
> >
> > The other workloads I run are intentionally more memory intensive. They
> > see a much worse slowdown. For my CSB workload, the root cell takes only
> > 0.05 s for a 20 MiB input, while the inmate takes 1.48 s (30x
> > difference). And for my Random Access workload, the root cell takes 0.08
> > s while the inmate takes 3.29 s for a 20 MiB input (40x difference).
>
> Now this sounds pretty much like what I once had: too little cache for
> the inmate.
>
> BTW: For a sound comparison, you would need to take care to have a
> comparable initial hardware state: E.g., you need to take care that
> workloads in root-cell and non-root inmate are both either uncached or
> cached when starting the code.
>
I run 10 consecutive iterations for each workload, so that should flush
things out, right? But that's fine-tuning, and it won't explain a 40x
difference.
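For what it's worth, the timing around each iteration is basically this
(a sketch with illustrative names; the cycles-to-time conversion is the
same TSC trick I describe in the quoted part below):

#include <stdint.h>

#define TSC_FREQ_HZ	3700000000ULL	/* fixed TSC frequency of my CPU */
#define ITERATIONS	10

static uint64_t results_ms[ITERATIONS];

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static void run_benchmark(void (*workload)(void))
{
	for (int i = 0; i < ITERATIONS; i++) {
		uint64_t start = rdtsc();

		workload();

		/* Convert raw cycles to milliseconds directly instead of
		 * going through tsc_read_ns(), so nothing overflows. */
		results_ms[i] = (rdtsc() - start) / (TSC_FREQ_HZ / 1000);
	}
}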
>
> I recommend deactivating hyperthreading.
> If your inmate just gets one sibling, the other one will still belong to
> Linux, which could, when utilised, steal a lot of power. So either
> disable HT or assign both siblings to the inmate.
>
Yeah, I deactivated HT a while ago, because I realized there could be
significant coupling between logical threads on the same core, like you
mentioned. So there are 6 cores on my CPU without HT.
> >
> https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/configs/x86/bazooka-inmate.c
>
> >
> >
> > I did do some modifications to Jailhouse with VMX and the preemption
> > timer, but any slowdown that I may have inadvertently introduced should
> > apply equally to the inmate and root cell.
> >
> >
> > It’s possible that I am measuring the duration of the inmate
> > incorrectly. But the number of vmexits I measure for the inmate and root
> > seem to roughly correspond with the duration. I also made sure to avoid
>
> Yeah, I would also expect that: Your code only utilises memory + CPU,
> almost no I/O.
>
> > tsc_read_ns() by instead recording the TSC cycles and deriving the
> > duration by dividing by 3,700,000,000 (the unchanging TSC frequency of
> > my processor). Without this, the time recorded would overflow after
> > something like 1.2 seconds.
> >
> >
> > I'm wondering if something else is causing unexpected delays: using
> > IVSHMEM, memory mapping extra memory pages and using it to hold my
> > input, printing to a virtual console in addition to a serial console,
> > disabling hardware p-states, turbo boost in the root cell, maybe the
> > workload code is being compiled to different instructions for the inmate
> > vs. Linux, etc.
>
> The latter one: You definitely need to check that. If your Linux
> compiler generates (e.g.) AVX code and your inmate compiler doesn't,
> the numbers won't be comparable.
>
Ok, I compared the assembly of the two builds and they were very
different. It turns out that I was using a different version of GCC *and*
my machine compiles with PIC by default.
> You could also try to link the same library object to your target
> binaries -- the build system is your friend.
>
That is a great idea. I tried it today to see if I get better results,
making sure to link the same workload object files into both builds. That,
combined with disabling PIC and using the same GCC version, brought the
SHA3 workload in Linux down from 2.0 s to *1.2* s. So now the discrepancy
is even larger: 2.8 s (inmate) vs. 1.2 s (Linux). I suspect disabling PIC
made the biggest difference, though I'm not sure exactly how PIC interacts
with the hypervisor. At any rate, I'm still quite lost as to how there can
be a 1.6 s difference between the inmate and Linux.
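To keep the builds from diverging again, a guard like this in the shared
workload source catches a PIC mismatch at compile time (just a sketch;
__PIC__/__PIE__ are standard GCC predefines):

/* Refuse to build the workload as position-independent code, so the
 * inmate and the Linux wrapper link byte-identical workload objects. */
#if defined(__PIC__) || defined(__PIE__)
#error "Build the workload object with -fno-pic -fno-pie for both targets"
#endif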
I did notice something strange, though: the CPU features reported in
/proc/cpuinfo differ between cores. That makes `jailhouse hardware check`
fail, since the check assumes all CPUs have the same features. When the
machine boots they are identical, but after running for a while they are
not, and I'm not sure what activates the extra features. Right now, CPUs 0
and 2 show three extra flags: md_clear, flush_l1d, and ssbd. All three are
hardware bug mitigation features. But can that explain a 1.6 s difference?
And why would some CPUs run the mitigations while others do not? That
seems fishy to me. I'll do more testing to see whether running the inmate
on a core without those features currently active improves performance
(it should, but by how much?).
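One way to check that from inside the inmate would be to read
IA32_SPEC_CTRL (MSR 0x48) on whatever core it lands on and look at the
SSBD bit (a sketch, assuming the CPU actually enumerates SSBD and that
Jailhouse passes the MSR read through):

#include <stdint.h>

#define MSR_IA32_SPEC_CTRL	0x48
#define SPEC_CTRL_SSBD		(1ULL << 2)	/* per the Intel SDM */

static inline uint64_t rdmsr(uint32_t msr)
{
	uint32_t lo, hi;

	asm volatile("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
	return ((uint64_t)hi << 32) | lo;
}

/* Note: reading this MSR #GPs on CPUs that don't advertise SSBD/IBRS. */
static int ssbd_active_on_this_core(void)
{
	return (rdmsr(MSR_IA32_SPEC_CTRL) & SPEC_CTRL_SSBD) != 0;
}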
> > Sorry for all the detail, but I am grasping at straws at this point. Any
> > ideas at what I could look into are appreciated.
>
> Benchmarking is fun. Especially getting the hardware under control. :-)
>
> Ralf
>
"fun" :)
Thanks,
Michael