The performance impact shouldn't be too bad. I did some scalability tests using LU from SPLASH-2 some years ago. IIRC, I was using an 8-core Westmere-EX based system at the time. Native throughput for that benchmark was ~30 GIPS on 8 cores. When running in KVM, I got something like ~15 GIPS with a 1ms quantum and ~10 GIPS with a 0.5ms quantum. Unfortunately, I don't have that data for any Arm-based system.
Turning on the HDLCD will probably reduce throughput quite a bit, but it should be running in a functional refresh mode (10Hz by default) when running in KVM. It's far from optimised, but it should work. We had some KMI issues last time I looked at this. IIRC, the KMI model doesn't clear interrupts correctly, which confuses the interrupt model in the kernel.

Setting up event queues for KVM automatically would definitely be desirable. As you, painfully, noticed, this is currently the responsibility of the config script. The Arm example scripts do it already and should work out of the box. I suspect it might be tricky to get this right from inside the simulator without some re-architecting of the simulator core. What we would have to do is add an API to allocate semi-private EQs from inside C++. Since Python currently provides a plain EQ index and the queues are allocated in C++ at instantiation time, we would have to either defer EQ allocation until init() is called or create a better mechanism for allocating EQs from Python instead of passing a plain EQ index. We still want a way to force the old behaviour when simulating single-core systems, since that makes debugging a lot easier.

Cheers,
Andreas

On 29/03/2018 01:14, Gabe Black wrote:

Ok, I think I figured it out, and it all has to do with the simulation quantum. If the quantum is too big, the kernel might poke hardware and expect to get an interrupt within a certain period of time. It could be that the CPU gets to the end of its timeout before the simulated hardware has had a chance to trigger an interrupt, even though the interrupt would happen first if the event queues were held in tighter sync. If I decrease the size of the quantum (per your suggestion) from 500ms to 1ms, then I see the errors from the keyboard/mouse drivers and the ATA driver go away, at least in the one CPU/multiple event queue configuration. I'm going to do some more testing to make sure there isn't some other problem that pops up, and also to characterize the performance impact, which I'm hopeful won't be too bad.

Also, I was thinking it would be nice if KVM CPUs could set up their event queues in some more automatic, less error-prone way. Before I knew that they needed their own event queue (which I think is just institutional knowledge that isn't documented/warned about/etc.?), I had no idea what was going wrong when just dropping in some KVM CPUs in place of regular CPUs. I don't have a fully fleshed out plan for how to do that, but it doesn't *seem* like something that should be that hard to do.

Gabe

On Mon, Mar 26, 2018 at 7:06 PM, Gabe Black <[email protected]> wrote:

I looked into this a little further, and I see the same problem happen with one CPU but with the CPU and the devices in different event queues. I haven't figured out exactly where things go wrong, but it looks like a write DMA is set up but doesn't happen for some reason. I'm not sure if the DMA starts but then gets stuck, or if it never starts at all. It could also be that the DMA happens, but the completion event (which is what doesn't seem to happen) is mishandled because of the additional event queue. I turned on the DMA debug flag, but that produced so much debug output that my tools are crashing. I'll have to see what I can do to narrow things down a bit.

Gabe

On Thu, Mar 22, 2018 at 11:28 AM, Gabe Black <[email protected]> wrote:

Ok, thanks. We're deciding internally what approach to use to tackle this.

Gabe
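For reference, the quantum override and the per-CPU event queue setup discussed above look roughly like this in a gem5 Python config script. This is a minimal sketch loosely modelled on _build_kvm in the Arm example scripts; the variable names (root, cpus) and the 0.5ms value are assumptions that need adapting to the script in use.

    import m5
    from m5.util import convert

    # Assumed to already exist in the config script: 'root' is the Root
    # object and 'cpus' is the list of KVM CPUs.

    # Use a smaller simulation quantum so simulated devices get a chance
    # to raise interrupts before a KVM CPU runs past a driver timeout.
    root.sim_quantum = m5.ticks.fromSeconds(convert.toLatency('0.5ms'))

    # Put all simulated devices on event queue 0 and give each KVM CPU a
    # private event queue. CPU child objects normally inherit the CPU's
    # queue, so remap them back to the device queue.
    if len(cpus) > 1:
        device_eq = 0
        first_cpu_eq = 1
        for idx, cpu in enumerate(cpus):
            for obj in cpu.descendants():
                obj.eventq_index = device_eq
            cpu.eventq_index = first_cpu_eq + idx

The important detail is the remapping of CPU descendants back to the device queue; as noted elsewhere in this thread, leaving devices on a CPU's private queue causes exactly the kind of trouble described here.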
On Wed, Mar 21, 2018 at 3:01 AM, Andreas Sandberg <[email protected]> wrote:

Hi Gabe,

There are issues with the IDE model that prevent it from working with in-kernel GIC emulation. I believe the model doesn't clear interrupts correctly, which confuses the host kernel. I tried to debug this at some point, but wasn't able to make much immediate progress and decided it wasn't worth the effort.

The VirtIO block device doesn't suffer from this problem. Using the VirtIO device by default seems like a good idea to me. It doesn't simulate any timing, but that might not be a huge deal since the IDE device doesn't provide realistic timing anyway. It would be really awesome if we had a modern storage controller (e.g., NVMe or AHCI) and proper storage timing models.

Cheers,
Andreas

On 20/03/2018 23:38, Gabe Black wrote:

My next question is about disks. I see that the fs_bigLITTLE.py script uses PciVirtIO to set up its disks, where I'm using IDE, which I inherited from the fs.py scripts I used as reference. The problem I'm seeing is that the IDE controllers seem to be mangling commands and dropping interrupts, so this difference looks particularly suspicious. Is there a KVM-related reason you're using PciVirtIO? Is this something that *should* work with IDE but doesn't, or do I have to use PciVirtIO for things to work properly? I'm not familiar with PciVirtIO beyond briefly skimming the source for it in gem5. Is this something we should consider using globally as a replacement for IDE, even in simulations where we're trying to be really realistic?

Thanks again for all the help.

Gabe

On Tue, Mar 20, 2018 at 3:14 PM, Gabe Black <[email protected]> wrote:

Ok, that (multiple event queues) made things way better. There are still some glitches to figure out, but at least it makes good forward progress at a reasonable speed. Thanks!

Gabe

On Mon, Mar 19, 2018 at 5:12 PM, Gabe Black <[email protected]> wrote:

This is on a Chromebook based on the RK3399 with only ~4GB of RAM, which is not ideal, although we have a bigger machine in the works for the future. I agree with your reasoning and don't think option 1 is a problem. We're using static DTBs, so I don't think that's an issue either. In my script, I'm not doing anything smart with the event queues, so that's likely at least part of the problem. When I tried using fs_bigLITTLE.py I ran into what looked like a similar issue, so that might not be the whole story, but it's definitely something I should fix up. I'll let you know how that goes!

Gabe

On Mon, Mar 19, 2018 at 4:30 AM, Andreas Sandberg <[email protected]> wrote:

Hmm, OK, this is very strange. What type of hardware are you running on? Is it an A57-based chip or something else? Also, what's your simulation quantum? I have been able to run with a 0.5ms quantum (5e8 ticks).

I think the following trace of two CPUs running in KVM should be roughly equivalent to the trace you shared earlier. It was generated on a commercially available 8xA57 (16GiB RAM) using the following command (gem5 rev 9dc44b417):

gem5.opt -r --debug-flags Kvm,KvmIO,KvmRun configs/example/arm/fs_bigLITTLE.py \
    --sim-quantum '0.5ms' \
    --cpu-type kvm --big-cpus 0 --little-cpus 2 \
    --dtb system/arm/dt/armv8_gem5_v1_2cpu.dtb --kernel vmlinux.aarch64.4.4-d318f95d0c

Note that the tick counts are a bit weird since we have three different event queues at play (one for devices and one per CPU).
0: system.littleCluster.cpus0: KVM: Executing for 500000000 ticks
0: system.littleCluster.cpus1: KVM: Executing for 500000000 ticks
0: system.littleCluster.cpus0: KVM: Executed 79170 instructions in 176363 cycles (88181504 ticks, sim cycles: 176363).
88182000: system.littleCluster.cpus0: handleKvmExit (exit_reason: 6)
88182000: system.littleCluster.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c090024, len: 4)
88332000: system.littleCluster.cpus0: Entering KVM...
88332000: system.littleCluster.cpus0: KVM: Executing for 411668000 ticks
88332000: system.littleCluster.cpus0: KVM: Executed 4384 instructions in 16854 cycles (8427000 ticks, sim cycles: 16854).
96759000: system.littleCluster.cpus0: handleKvmExit (exit_reason: 6)
96759000: system.littleCluster.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c090030, len: 4)
0: system.littleCluster.cpus1: KVM: Executed 409368 instructions in 666400 cycles (333200000 ticks, sim cycles: 666400).
333200000: system.littleCluster.cpus1: Entering KVM...
333200000: system.littleCluster.cpus1: KVM: Executing for 166800000 ticks
96909000: system.littleCluster.cpus0: Entering KVM...
96909000: system.littleCluster.cpus0: KVM: Executing for 403091000 ticks
96909000: system.littleCluster.cpus0: KVM: Executed 4384 instructions in 15257 cycles (7628500 ticks, sim cycles: 15257).
104538000: system.littleCluster.cpus0: handleKvmExit (exit_reason: 6)
104538000: system.littleCluster.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c0100a0, len: 4)
333200000: system.littleCluster.cpus1: KVM: Executed 47544 instructions in 200820 cycles (100410000 ticks, sim cycles: 200820).
433610000: system.littleCluster.cpus1: Entering KVM...
433610000: system.littleCluster.cpus1: KVM: Executing for 66390000 ticks
104688000: system.littleCluster.cpus0: Entering KVM...
104688000: system.littleCluster.cpus0: KVM: Executing for 395312000 ticks
104688000: system.littleCluster.cpus0: KVM: Executed 4382 instructions in 14942 cycles (7471000 ticks, sim cycles: 14942).

Comparing this trace to yours, I'd say that the frequent KVM exits look a bit suspicious. I would expect secondary CPUs to make very little progress while the main CPU initializes the system and starts the early boot code. There are a couple of possibilities that might be causing issues:

1) There is some CPU ID weirdness that confuses the boot code and puts both CPUs in the holding pen. This seems unlikely since there are some writes to the UART.

2) Some device is incorrectly mapped to the CPU event queues and causes frequent KVM exits. Have a look at _build_kvm in fs_bigLITTLE.py; it doesn't use configs/common, so no need to tear your eyes out. ;) Do you map event queues in the same way? It maps all simulated devices to one event queue and the CPUs to private event queues. It's important to remap CPU child devices to the device queue instead of the CPU queue. Failing to do this will cause chaos, madness, and quite possibly result in Armageddon.

3) You're using DTB autogeneration. This doesn't work for KVM guests due to issues with the timer interrupt specification. We have a patch for the timer that we are testing internally. Sorry. :(

Regards,
Andreas

On 16/03/2018 23:20, Gabe Black wrote:

Ok, diving into this a little deeper, it looks like execution is progressing but is making very slow progress for some reason. I added a call to "dump()" before each ioctl invocation which enters the VM and looked at the PC to get an idea of what it was up to.
I made sure to put that before the timers to avoid taking up VM time with printing debug stuff. In any case, I see that neither CPU gets off of PC 0 for about 2ms of simulated time (~500Hz), and that's EXTREMELY slow for a CPU which is supposed to be running in the ballpark of 2GHz. It's not clear to me why it's making such slow progress, but that would explain why I'm getting very little out on the simulated console. It's just taking forever to make it that far. Any idea why it's going so slow, or how to debug further?

Gabe

On Wed, Mar 14, 2018 at 7:42 PM, Gabe Black <[email protected]> wrote:

Some output which I think is suspicious:

55462000: system.cpus0: Entering KVM...
55462000: system.cpus0: KVM: Executing for 1506000 ticks
55462000: system.cpus0: KVM: Executed 5159 instructions in 13646 cycles (6823000 ticks, sim cycles: 13646).
56968000: system.cpus1: Entering KVM...
56968000: system.cpus1: KVM: Executing for 5317000 ticks
56968000: system.cpus1: KVM: Executed 7229 instructions in 14379 cycles (7189500 ticks, sim cycles: 14379).
62285000: system.cpus0: Entering KVM...
62285000: system.cpus0: KVM: Executing for 1872500 ticks
62285000: system.cpus0: KVM: Executed 5159 instructions in 13496 cycles (6748000 ticks, sim cycles: 13496).
64157500: system.cpus1: Entering KVM...
64157500: system.cpus1: KVM: Executing for 4875500 ticks
64157500: system.cpus1: KVM: Executed 6950 instructions in 13863 cycles (6931500 ticks, sim cycles: 13863).
69033000: system.cpus0: Entering KVM...
69033000: system.cpus0: KVM: Executing for 2056000 ticks
69033000: system.cpus0: KVM: Executed 5159 instructions in 13454 cycles (6727000 ticks, sim cycles: 13454).
71089000: system.cpus1: Entering KVM...
71089000: system.cpus1: KVM: Executing for 4671000 ticks
71089000: system.cpus1: KVM: Executed 6950 instructions in 13861 cycles (6930500 ticks, sim cycles: 13861).
75760000: system.cpus0: Entering KVM...
75760000: system.cpus0: KVM: Executing for 2259500 ticks
75760000: system.cpus0: KVM: Executed 5159 instructions in 13688 cycles (6844000 ticks, sim cycles: 13688).
[...]
126512000: system.cpus0: handleKvmExit (exit_reason: 6)
126512000: system.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c090024, len: 4)
126512000: system.cpus0: In updateThreadContext():
[...]
126512000: system.cpus0: PC := 0xd8 (t: 0, a64: 1)

On Wed, Mar 14, 2018 at 7:37 PM, Gabe Black <[email protected]> wrote:

I tried it just now, and I still don't see anything on the console. I switched back to using my own script since it's a bit simpler (it doesn't use all the configs/common stuff), and started looking at the KVM debug output. I see that both CPUs claim to execute instructions, although cpu1 didn't take an exit in the output I was looking at. cpu0 took four exits: two which touched some UART registers, and two which touched RealView registers, the V2M_SYS_CFGDATA and V2M_SYS_CFGCTRL registers judging by the comments in the bootloader assembly file. After that they claim to be doing stuff, although I see no further console output or KVM exits. The accesses themselves and their PCs are from the bootloader blob, and so I'm pretty confident that it's starting that and executing some of those instructions.

One thing that looks very odd, now that I think about it, is that the KVM messages about entering and executing instructions (like those below) seem to say that cpu0 has executed thousands of instructions, but the exits I see seem to correspond to the first maybe 50 instructions it should be seeing in the bootloader blob. Are those values bogus for some reason? Is there some existing debug output which would let me see where KVM thinks it is periodically, to see if it's in the kernel or if it went bananas and is executing random memory somewhere? Or if it just got stuck waiting for some event that's not going to show up? Are there any important CLs which haven't made their way into upstream somehow?

Gabe

On Wed, Mar 14, 2018 at 4:28 AM, Andreas Sandberg <[email protected]> wrote:

Have you tried using the fs_bigLITTLE script in configs/example/arm? That's the script I have been using for testing. I just tested the script with 8 little CPUs and 0 big CPUs and it seems to work. Timing is a bit temperamental though, so you might need to override the simulation quantum. The default is 1ms; you might need to decrease it to something slightly smaller (I'm currently using 0.5ms). Another caveat is that there seem to be some issues related to dtb auto-generation that affect KVM guests. We are currently testing a solution for this issue.

Cheers,
Andreas

On 12/03/2018 22:26, Gabe Black wrote:

I'm trying to run in FS mode, to boot Android/Linux.

Gabe

On Mon, Mar 12, 2018 at 3:26 PM, Dutu, Alexandru <[email protected]> wrote:

Hi Gabe,

Are you running SE or FS mode?

Thanks,
Alex

-----Original Message-----
From: gem5-dev [mailto:[email protected]] On Behalf Of Gabe Black
Sent: Friday, March 9, 2018 5:46 PM
To: gem5 Developer List <[email protected]>
Subject: [gem5-dev] Multicore ARM v8 KVM based simulation

Hi folks. I have a config script set up where I can run a KVM-based ARM v8 simulation just fine when I have a single CPU in it, but when I try running with more than one CPU, it just seems to get lost and not do anything. Is this a configuration that's supported? If so, are there any caveats to how it's set up? I may be missing something simple, but it's not apparent to me at the moment.

Gabe
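For reference, the PciVirtIO disk setup that came up earlier in the thread looks roughly like this in a gem5 config script. This is a minimal sketch loosely following the Arm example scripts; the disk image path, the system object, and the attach_pci helper are assumptions that depend on the base system class in use.

    from m5.objects import PciVirtIO, VirtIOBlock, CowDiskImage

    # Wrap the raw image in a copy-on-write layer so the guest doesn't
    # modify the underlying file.
    image = CowDiskImage()
    image.child.image_file = '/path/to/disk.img'  # hypothetical path

    # Expose the image to the guest as a VirtIO block device on PCI.
    system.pci_vio_block = PciVirtIO(vio=VirtIOBlock(image=image))

    # attach_pci is the helper used by the Arm example scripts' system
    # class; other configs attach the device to the PCI host / IO bus
    # directly.
    system.attach_pci(system.pci_vio_block)

As noted earlier in the thread, the VirtIO device doesn't model any timing, so this is a convenience for KVM runs rather than a realistic storage model.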
