On Sat, 25 Jan 2020 23:07:53 -0800
Michael Hinton <[email protected]> wrote:
> Hi Henning,
>
> On Thursday, January 23, 2020 at 5:15:08 AM UTC-7, Henning Schild
> wrote:
> >
> > Thanks,
> >
> > that is a lot of information. I would say that is CPU- and
> > memory-bound work. It should not cause exits at all, maybe a few
> > for getting the input in and the output out. Reading ivshmem
> > should not trap,
>
> So IVSHMEM won't trap/cause a vmexit? What are the other potential
> causes for traps, then? My inmate doesn't access any other
> resources.
>
> > writing output to a console should be avoided within the measured
> > time.
>
> Before starting this thread, I found that I had accidentally done
> this, and after removing the console print I shaved 300 ms off the
> inmate time. But I don't see any more prints that could happen.
>
> This is how I measure the workload in the inmate:
> https://github.com/hintron/jailhouse/blob/ba0c5f9cc28edf43ab6970cdaddafea77dd07b4c/inmates/demos/x86/mgh-demo.c#L1117-L1121
> I then print the cycle count and divide it by 3,700,000 manually to
> get the total duration, to avoid overflows and truncations.
>
> > If you need to use something that traps, see if you can "batch"
> > things, i.e. do not read/write in byte-sized chunks.
> > For truly memory-bound applications the mapping of the memory
> > matters. The bigger the pages in the page table (and the nested
> > page table), the better. You might be able to read performance
> > counters and look at TLB misses.
>
> I'll need to look into that.
>
> > Not sure what Jailhouse exactly does to mitigate Spectre etc., but
> > these mitigations often have a severe effect on "memory
> > performance".
> >
> > I would for sure have a look at aligning the CFLAGS used for the
> > Linux application and the inmate.
>
> Ok, I'll double-check CFLAGS as well. I'm now using the exact same
> object files for the workload functions, but the discrepancy got
> worse :D
>
> > The first things to compare are "native Linux", "root cell Linux
> > under Jailhouse" and "non-root cell Linux under Jailhouse". If the
> > third is better than your inmate, your inmate's environment is
> > likely the cause.
>
> Yes, I've been looking at those three cases, and Linux under
> Jailhouse is only 30 ms slower than Linux not under Jailhouse, while
> both of those are way faster than the inmate. So that tells me that
> there is something going wrong with the inmate.

Ok, so we are just looking for differences between the inmate and
Linux as a non-root cell, because the Jailhouse/virtualization
overhead is acceptable or known. In that case a memory-bound workload
boils down to the memory mapping, TLB misses, or CAT (Intel Cache
Allocation Technology).
And a CPU-bound workload could be an issue with the FPU. If your
binary uses FPU instructions but is able to fall back to soft-fpu, you
should check which path it actually takes in the inmate.
Henning

> Thanks for the help,
> -Michael
>
> > Henning
> >
> > On Wed, 22 Jan 2020 23:57:29 -0800
> > Michael Hinton <[email protected]> wrote:
> >
> > > Ralf, Henning,
> > >
> > > Thanks for the quick response, and sorry for the delay.
> > >
> > > Here's my setup: I've got a 6-core Intel x86-64 CPU running
> > > Kubuntu 19.10. I have an inmate that is given a single core and
> > > runs a single-threaded workload. For comparison, I also run the
> > > same workload in Linux under Jailhouse.
> > >
> > > For a SHA3 workload with the same 20 MiB input, the root Linux
> > > cell (with no inmate running) takes about 2 seconds, while the
> > > inmate (with an idle root cell) takes about 2.8 seconds. That is
> > > a worrisome discrepancy, and I need to understand why it's 0.8 s
> > > slower.
> > >
> > > This is the SHA3 workload:
> > > https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/inmates/lib/mgh-sha3.c#L185-L208
> > >
> > > This is the Linux wrapper for the SHA3 workload:
> > > https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/mgh/workloads/src/sha3-512.c#L166-L168
> > >
> > > This is the inmate program calling the SHA3 workload:
> > > https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/inmates/demos/x86/mgh-demo.c#L370-L379
> > >
> > > You can see that the inmate and the Linux wrapper both execute
> > > the same function, sha3_mgh(). It's the same C code.
> > >
> > > The other workloads I run are intentionally more memory
> > > intensive, and they see a much worse slowdown. For my CSB
> > > workload, the root cell takes only 0.05 s for a 20 MiB input,
> > > while the inmate takes 1.48 s (a 30x difference). And for my
> > > Random Access workload, the root cell takes 0.08 s while the
> > > inmate takes 3.29 s for a 20 MiB input (a 40x difference).
> > >
> > > Here are the root and inmate cell configs, respectively:
> > > https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/configs/x86/bazooka-root.c
> > > https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/configs/x86/bazooka-inmate.c
> > >
> > > I did make some modifications to Jailhouse involving VMX and the
> > > preemption timer, but any slowdown that I may have inadvertently
> > > introduced should apply equally to the inmate and the root cell.
> > >
> > > It's possible that I am measuring the duration of the inmate
> > > incorrectly. But the number of vmexits I measure for the inmate
> > > and the root cell seems to roughly correspond with the duration.
> > > I also made sure to avoid tsc_read_ns() by instead recording the
> > > TSC cycles and deriving the duration by dividing by
> > > 3,700,000,000 (the fixed TSC frequency of my processor). Without
> > > this, the recorded time would overflow after something like
> > > 1.2 seconds.
> > >
> > > I'm wondering if something else is causing unexpected delays:
> > > using IVSHMEM, memory-mapping extra memory pages and using them
> > > to hold my input, printing to a virtual console in addition to a
> > > serial console, disabling hardware P-states and turbo boost in
> > > the root cell, the workload code perhaps being compiled to
> > > different instructions for the inmate vs. Linux, etc.
> > >
> > > Sorry for all the detail, but I am grasping at straws at this
> > > point. Any ideas on what I could look into are appreciated.
> > >
> > > Thanks,
> > > Michael
> > >
> > > On Monday, January 20, 2020 at 6:46:32 AM UTC-7, Henning Schild
> > > wrote:
> > > >
> > > > On Sun, 19 Jan 2020 23:45:46 -0800
> > > > Michael Hinton <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have found that running code in an inmate is a lot slower
> > > > > than running that same code in the root cell on my x86
> > > > > machine. I am not sure why.
> > > >
> > > > Can you elaborate on "code" and "a lot"? Maybe roughly tell us
> > > > what your test case does and how severe your slowdown is. A
> > > > synthetic microbenchmark to measure context switching?
> > > >
> > > > As Ralf already said, anything causing "exits" can be subject
> > > > to slowdown. But that should be roughly the same for the root
> > > > cell or any non-root cell. Is it truly the "same" code?
> > > >
> > > > And of course anything accessing shared resources can be
> > > > slowed down by the sharing. Caches/buses ... but I would not
> > > > expect "a lot".
> > > >
> > > > regards,
> > > > Henning

--
You received this message because you are subscribed to the Google
Groups "Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/jailhouse-dev/20200127081602.08ea3fd6%40md1za8fc.ad001.siemens.net.
