On 23/01/2020 13:15, Henning Schild wrote:
> Thanks,
>
> that is a lot of information. I would say that is CPU and memory bound
> work. It should not cause exits at all, maybe a few for getting the
> input in and the output out. Reading ivshmem should not trap; writing
> output to a console should be avoided within the measured time.
> If you need to use something that traps, see if you can "batch" things.
> I.e. do not read/write in byte-chunks.
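
A minimal sketch of the batching idea, assuming the expensive path is a
write that traps on every call. trapping_write() is a hypothetical
stand-in that only counts calls; struct batch, batch_write() and
batch_flush() are illustrative names, not Jailhouse APIs.

```c
#include <stddef.h>
#include <string.h>

#define BATCH_CAPACITY 4096

/* Hypothetical stand-in for an operation that traps to the hypervisor,
 * e.g. a write to an emulated console. Here it only counts calls. */
static unsigned long trap_calls;

static void trapping_write(const char *data, size_t len)
{
	(void)data;
	(void)len;
	trap_calls++;
}

struct batch {
	char buf[BATCH_CAPACITY];
	size_t used;
};

/* Hand everything collected so far to the trapping path in one call. */
static void batch_flush(struct batch *b)
{
	if (b->used > 0) {
		trapping_write(b->buf, b->used);
		b->used = 0;
	}
}

/* Queue bytes in ordinary memory; flush only when the buffer is full. */
static void batch_write(struct batch *b, const char *data, size_t len)
{
	while (len > 0) {
		size_t room = BATCH_CAPACITY - b->used;
		size_t n = len < room ? len : room;

		memcpy(b->buf + b->used, data, n);
		b->used += n;
		data += n;
		len -= n;
		if (b->used == BATCH_CAPACITY)
			batch_flush(b);
	}
}
```

Writing 10000 single bytes through batch_write() and then calling
batch_flush() once reaches the trapping path 3 times instead of 10000.
Calling batch_flush() after the measured region also keeps the last
write out of the timing.
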
>
> For truly memory bound applications the mapping of the memory matters.
> The bigger the pages in the pagetable (and the nested pagetable) the
> better. You might be able to read performance counters and look at TLB
> misses.
Good point. But I guess that can't explain a 40x slowdown.
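
To put rough numbers on the page-size effect, here is a small
arithmetic sketch. The 4 KiB and 2 MiB page sizes and the four-level
tables are the usual x86-64 values and only an assumption about this
setup; the helper names are made up for illustration.

```c
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Number of pages (distinct translations) needed to map `bytes` of
 * memory with pages of `page_size` bytes, rounding up. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}

/* Upper bound on memory references for one TLB miss when both the
 * guest and the nested page tables must be walked. Every guest level
 * needs a full nested walk plus the guest entry itself, and the final
 * guest-physical address needs one more nested walk. Page-walk caches
 * are ignored, so real hardware usually does better. */
static unsigned int nested_walk_refs(unsigned int guest_levels,
				     unsigned int nested_levels)
{
	return guest_levels * nested_levels + guest_levels + nested_levels;
}
```

For the 20 MiB input, pages_needed(20 * MIB, 4 * KIB) is 5120 while
pages_needed(20 * MIB, 2 * MIB) is 10, and nested_walk_refs(4, 4) is 24
against 4 references for a plain four-level walk. That makes misses
noticeably more expensive, but by itself it is hard to stretch to 40x.
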
> Not sure what Jailhouse exactly does to mitigate Spectre etc. but these
> mitigations often have a severe effect on "memory performance".
On x86: Nothing that would affect inmates in their execution.
Ralf
>
> I would for sure have a look at aligning the CFLAGS used for the Linux
> application and the inmate.
>
> The first things to compare are "native Linux", "root cell Linux under
> Jailhouse" and "non-root cell Linux under Jailhouse". If the third is
> better than your inmate, your inmate's environment is likely the cause.
>
> Henning
>
> On Wed, 22 Jan 2020 23:57:29 -0800
> Michael Hinton <[email protected]> wrote:
>
>> Ralf, Henning,
>>
>>
>> Thanks for the quick response, and sorry for the delay.
>>
>> Here’s my setup: I’ve got a 6-core Intel x86-64 CPU running Kubuntu
>> 19.10. I have an inmate that is given a single core and runs a
>> single-threaded workload. For comparison, I also run the same
>> workload in Linux under Jailhouse.
>>
>> For a SHA3 workload with the same 20 MiB input, the root Linux cell
>> (and no inmate running) takes about 2 seconds, while the inmate (and
>> an idle root cell) takes about 2.8 seconds. That is a worrisome
>> discrepancy, and I need to understand why it’s 0.8 s slower.
>>
>> This is the SHA3 workload:
>> https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/inmates/lib/mgh-sha3.c#L185-L208
>>
>> This is the Linux wrapper for the SHA3 workload:
>> https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/mgh/workloads/src/sha3-512.c#L166-L168
>>
>> This is the inmate program calling the SHA3 workload:
>> https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/inmates/demos/x86/mgh-demo.c#L370-L379
>>
>> You can see that the inmate and the Linux wrapper both execute the
>> same function, sha3_mgh(). It's the same C code.
>>
>> The other workloads I run are intentionally more memory intensive.
>> They see a much worse slowdown. For my CSB workload, the root cell
>> takes only 0.05 s for a 20 MiB input, while the inmate takes 1.48 s
>> (30x difference). And for my Random Access workload, the root cell
>> takes 0.08 s while the inmate takes 3.29 s for a 20 MiB input (40x
>> difference).
>>
>> Here are the root and inmate cell configs, respectively:
>>
>> https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/configs/x86/bazooka-root.c
>>
>> https://github.com/hintron/jailhouse/blob/76e6d446ca682f73679616a0f3df8ac79f4a1cde/configs/x86/bazooka-inmate.c
>>
>> I did do some modifications to Jailhouse with VMX and the preemption
>> timer, but any slowdown that I may have inadvertently introduced
>> should apply equally to the inmate and root cell.
>>
>> It’s possible that I am measuring the duration of the inmate
>> incorrectly. But the number of vmexits I measure for the inmate and
>> root seems to roughly correspond with the duration. I also made sure
>> to avoid tsc_read_ns() by instead recording the TSC cycles and
>> deriving the duration by dividing by 3,700,000,000 (the unchanging
>> TSC frequency of my processor). Without this, the time recorded would
>> overflow after something like 1.2 seconds.
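
The cycle-based measurement described above could look roughly like
this. It is only a sketch: the 3.7 GHz constant comes from the message,
read_tsc() and cycles_to_us() are illustrative names, and integer
microseconds are used on the assumption that floating point is not
convenient in a bare-metal inmate.

```c
#include <stdint.h>

/* Assumed constant TSC frequency, taken from the message above. */
#define TSC_FREQ_HZ 3700000000ULL

#if defined(__x86_64__) || defined(__i386__)
/* Read the full 64-bit time stamp counter (GCC/Clang inline asm). */
static inline uint64_t read_tsc(void)
{
	uint32_t lo, hi;

	__asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}
#endif

/* Convert a cycle count to whole microseconds. Dividing by cycles per
 * microsecond (3700 here) avoids multiplying a large count by a large
 * constant, which is where intermediate overflows tend to come from. */
static uint64_t cycles_to_us(uint64_t cycles)
{
	return cycles / (TSC_FREQ_HZ / 1000000ULL);
}
```

Usage would be start = read_tsc(), run the workload, then
cycles_to_us(read_tsc() - start). For reference, cycles_to_us(1ULL << 32)
is 1160801, i.e. about 1.16 s, which is close to the 1.2 s mentioned
above. Whether a 32-bit truncation is the actual cause of that overflow
is an assumption, not something this thread confirms.
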
>>
>>
>> I'm wondering if something else is causing unexpected delays: using
>> IVSHMEM, memory mapping extra memory pages and using them to hold my
>> input, printing to a virtual console in addition to a serial console,
>> disabling hardware p-states, turbo boost in the root cell, or maybe
>> the workload code being compiled to different instructions for the
>> inmate vs. Linux.
>>
>> Sorry for all the detail, but I am grasping at straws at this point.
>> Any ideas on what I could look into are appreciated.
>>
>> Thanks,
>> Michael
>>
>> On Monday, January 20, 2020 at 6:46:32 AM UTC-7, Henning Schild wrote:
>>>
>>> On Sun, 19 Jan 2020 23:45:46 -0800
>>> Michael Hinton <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have found that running code in an inmate is a lot slower than
>>>> running that same code in the root cell on my x86 machine. I am
>>>> not sure why.
>>>
>>> Can you elaborate on "code" and "a lot"? Maybe roughly tell us what
>>> your testcase does and how severe your slowdown is. Is it a synthetic
>>> microbenchmark to measure context switching?
>>>
>>> As Ralf already said, anything causing "exits" can be subject to
>>> slowdown. But that should be roughly the same for the root cell or
>>> any non-root cell. Is it truly the "same" code?
>>>
>>> And of course anything accessing shared resources can be slowed down
>>> by the sharing. Caches/buses ... but I would not expect "a lot".
>>>
>>> regards,
>>> Henning
>>>
>>
>