The performance impact shouldn't be too bad. I did some scalability tests using LU from SPLASH-2 some years ago. IIRC, I was using an 8-core Westmere-EX based system at the time. Native throughput for that benchmark was ~30 GIPS on 8 cores. When running in KVM, I got something like ~15 GIPS with a 1ms quantum and ~10 GIPS with a 0.5ms quantum. Unfortunately, I don't have that data for any Arm-based system.
Turning on the HDLCD will probably reduce throughput quite a bit, but it should be running in a functional refresh mode (10Hz by default) when running in KVM. It's far from optimised, but it should work. We had some KMI issues last time I looked at this. IIRC, the KMI model doesn't clear interrupts correctly, which confuses the interrupt model in the kernel.

Setting up event queues for KVM automatically would definitely be desirable. As you, painfully, noticed, this is currently the responsibility of the config script. The Arm example scripts do it already and should work out of the box. I suspect it might be tricky to get this right from inside the simulator without some re-architecting of the simulator core. What we would have to do is add an API to allocate semi-private EQs from inside C++. Since Python currently provides a plain EQ index and the queues are allocated in C++ at instantiation time, we would have to either defer EQ allocation until init() is called or create a better mechanism for allocating EQs from Python instead of passing a plain EQ index. We still want a way to force the old behaviour when simulating single-core systems, since that makes debugging a lot easier.

Cheers,
Andreas

On 29/03/2018 01:14, Gabe Black wrote:

Ok, I think I figured it out, and it all has to do with the simulation quantum. If the quantum is too big, the kernel might poke hardware and expect to get an interrupt within a certain period of time. It could be that the CPU gets to the end of its timeout before the simulated hardware has had a chance to trigger an interrupt, even though the interrupt would happen first if the event queues were held in tighter sync. If I decrease the size of the quantum (per your suggestion) from 500ms to 1ms, then I see the errors from the keyboard/mouse drivers and the ATA driver go away, at least in the one CPU/multiple event queue configuration. I'm going to do some more testing to make sure there isn't some other problem that pops up, and also to characterize the performance impact, which I'm hopeful won't be too bad.

Also, I was thinking it would be nice if KVM CPUs could set up their event queues in some more automatic, less error-prone way. Before I knew that they needed their own event queue (which I think is just institutional knowledge that isn't documented/warned about/etc.?), I had no idea what was going wrong when just dropping in some KVM CPUs in place of regular CPUs. I don't have a fully fleshed out plan for how to do that, but it doesn't *seem* like something that should be that hard to do.

Gabe

On Mon, Mar 26, 2018 at 7:06 PM, Gabe Black <[email protected]> wrote:

I looked into this a little further, and I see the same problem happen with one CPU but with the CPU and the devices in different event queues. I haven't figured out exactly where things go wrong, but it looks like a write DMA is set up but doesn't happen for some reason. I'm not sure if the DMA starts but then gets stuck, or if it never starts at all. It could also be that the DMA happens, but the completion event (which is what doesn't seem to happen) is mishandled because of the additional event queue. I turned on the DMA debug flag, but that produced so much debug output that my tools are crashing. I'll have to see what I can do to narrow things down a bit.

Gabe

On Thu, Mar 22, 2018 at 11:28 AM, Gabe Black <[email protected]> wrote:

Ok, thanks. We're deciding internally what approach to use to tackle this.

Gabe
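For reference, the quantum override and the per-CPU event queue setup discussed above look roughly like this in a gem5 Python config script. This is a minimal sketch loosely modelled on _build_kvm in the Arm example scripts; the variable names (root, cpus) and the 0.5ms value are assumptions that need adapting to the script in use.

    import m5
    from m5.util import convert

    # Assumed to already exist in the config script: 'root' is the Root
    # object and 'cpus' is the list of KVM CPUs.

    # Use a smaller simulation quantum so simulated devices get a chance
    # to raise interrupts before a KVM CPU runs past a driver timeout.
    root.sim_quantum = m5.ticks.fromSeconds(convert.toLatency('0.5ms'))

    # Put all simulated devices on event queue 0 and give each KVM CPU a
    # private event queue. CPU child objects normally inherit the CPU's
    # queue, so remap them back to the device queue.
    if len(cpus) > 1:
        device_eq = 0
        first_cpu_eq = 1
        for idx, cpu in enumerate(cpus):
            for obj in cpu.descendants():
                obj.eventq_index = device_eq
            cpu.eventq_index = first_cpu_eq + idx

The important detail is the remapping of CPU descendants back to the device queue; as noted elsewhere in this thread, leaving devices on a CPU's private queue causes exactly the kind of trouble described here.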
On Wed, Mar 21, 2018 at 3:01 AM, Andreas Sandberg <[email protected]> wrote:

Hi Gabe,

There are issues with the IDE model that prevent it from working with in-kernel GIC emulation. I believe the model doesn't clear interrupts correctly, which confuses the host kernel. I tried to debug this at some point, but wasn't able to make much immediate progress and decided it wasn't worth the effort.

The VirtIO block device doesn't suffer from this problem. Using the VirtIO device by default seems like a good idea to me. It doesn't simulate any timing, but that might not be a huge deal since the IDE device doesn't provide realistic timing anyway. It would be really awesome if we had a modern storage controller (e.g., NVMe or AHCI) and proper storage timing models.

Cheers,
Andreas

On 20/03/2018 23:38, Gabe Black wrote:

My next question is about disks. I see that the fs_bigLITTLE.py script uses PciVirtIO to set up its disks, where I'm using IDE, which I inherited from the fs.py scripts I used as reference. The problem I'm seeing is that the IDE controllers seem to be mangling commands and dropping interrupts, so this difference looks particularly suspicious. Is there a KVM-related reason you're using PciVirtIO? Is this something that *should* work with IDE but doesn't, or do I have to use PciVirtIO for things to work properly? I'm not familiar with PciVirtIO beyond briefly skimming the source for it in gem5. Is this something we should consider using globally as a replacement for IDE, even in simulations where we're trying to be really realistic?

Thanks again for all the help.

Gabe

On Tue, Mar 20, 2018 at 3:14 PM, Gabe Black <[email protected]> wrote:

Ok, that (multiple event queues) made things way better. There are still some glitches to figure out, but at least it makes good forward progress at a reasonable speed. Thanks!

Gabe

On Mon, Mar 19, 2018 at 5:12 PM, Gabe Black <[email protected]> wrote:

This is on a Chromebook based on the RK3399 with only ~4GB of RAM, which is not ideal, although we have a bigger machine in the works for the future. I agree with your reasoning and don't think option 1 is a problem. We're using static DTBs, so I don't think that's an issue either. In my script, I'm not doing anything smart with the event queues, so that's likely at least part of the problem. When I tried using fs_bigLITTLE.py I ran into what looked like a similar issue, so that might not be the whole story, but it's definitely something I should fix up. I'll let you know how that goes!

Gabe

On Mon, Mar 19, 2018 at 4:30 AM, Andreas Sandberg <[email protected]> wrote:

Hmm, OK, this is very strange. What type of hardware are you running on? Is it an A57-based chip or something else? Also, what's your simulation quantum? I have been able to run with a 0.5ms quantum (5e8 ticks).

I think the following trace of two CPUs running in KVM should be roughly equivalent to the trace you shared earlier. It was generated on a commercially available 8xA57 (16GiB RAM) using the following command (gem5 rev 9dc44b417):

gem5.opt -r --debug-flags Kvm,KvmIO,KvmRun configs/example/arm/fs_bigLITTLE.py \
    --sim-quantum '0.5ms' \
    --cpu-type kvm --big-cpus 0 --little-cpus 2 \
    --dtb system/arm/dt/armv8_gem5_v1_2cpu.dtb --kernel vmlinux.aarch64.4.4-d318f95d0c

Note that the tick counts are a bit weird since we have three different event queues at play (one for devices and one per CPU).
0: system.littleCluster.cpus0: KVM: Executing for 500000000 ticks
0: system.littleCluster.cpus1: KVM: Executing for 500000000 ticks
0: system.littleCluster.cpus0: KVM: Executed 79170 instructions in 176363 cycles (88181504 ticks, sim cycles: 176363).
88182000: system.littleCluster.cpus0: handleKvmExit (exit_reason: 6)
88182000: system.littleCluster.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c090024, len: 4)
88332000: system.littleCluster.cpus0: Entering KVM...
88332000: system.littleCluster.cpus0: KVM: Executing for 411668000 ticks
88332000: system.littleCluster.cpus0: KVM: Executed 4384 instructions in 16854 cycles (8427000 ticks, sim cycles: 16854).
96759000: system.littleCluster.cpus0: handleKvmExit (exit_reason: 6)
96759000: system.littleCluster.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c090030, len: 4)
0: system.littleCluster.cpus1: KVM: Executed 409368 instructions in 666400 cycles (333200000 ticks, sim cycles: 666400).
333200000: system.littleCluster.cpus1: Entering KVM...
333200000: system.littleCluster.cpus1: KVM: Executing for 166800000 ticks
96909000: system.littleCluster.cpus0: Entering KVM...
96909000: system.littleCluster.cpus0: KVM: Executing for 403091000 ticks
96909000: system.littleCluster.cpus0: KVM: Executed 4384 instructions in 15257 cycles (7628500 ticks, sim cycles: 15257).
104538000: system.littleCluster.cpus0: handleKvmExit (exit_reason: 6)
104538000: system.littleCluster.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c0100a0, len: 4)
333200000: system.littleCluster.cpus1: KVM: Executed 47544 instructions in 200820 cycles (100410000 ticks, sim cycles: 200820).
433610000: system.littleCluster.cpus1: Entering KVM...
433610000: system.littleCluster.cpus1: KVM: Executing for 66390000 ticks
104688000: system.littleCluster.cpus0: Entering KVM...
104688000: system.littleCluster.cpus0: KVM: Executing for 395312000 ticks
104688000: system.littleCluster.cpus0: KVM: Executed 4382 instructions in 14942 cycles (7471000 ticks, sim cycles: 14942).

Comparing this trace to yours, I'd say that the frequent KVM exits look a bit suspicious. I would expect secondary CPUs to make very little progress while the main CPU initializes the system and starts the early boot code. There are a couple of possibilities that might be causing issues:

1) There is some CPU ID weirdness that confuses the boot code and puts both CPUs in the holding pen. This seems unlikely since there are some writes to the UART.

2) Some device is incorrectly mapped to the CPU event queues and causes frequent KVM exits. Have a look at _build_kvm in fs_bigLITTLE.py; it doesn't use configs/common, so no need to tear your eyes out. ;) Do you map event queues in the same way? It maps all simulated devices to one event queue and the CPUs to private event queues. It's important to remap CPU child devices to the device queue instead of the CPU queue. Failing to do this will cause chaos, madness, and quite possibly result in Armageddon.

3) You're using DTB autogeneration. This doesn't work for KVM guests due to issues with the timer interrupt specification. We have a patch for the timer that we are testing internally. Sorry. :(

Regards,
Andreas

On 16/03/2018 23:20, Gabe Black wrote:

Ok, diving into this a little deeper, it looks like execution is progressing but is making very slow progress for some reason. I added a call to "dump()" before each ioctl invocation which enters the VM and looked at the PC to get an idea of what it was up to.
I made sure to put that before the timers to avoid taking up VM time with printing debug stuff. In any case, I see that neither CPU gets off of PC 0 for about 2ms of simulated time (~500Hz), and that's EXTREMELY slow for a CPU which is supposed to be running in the ballpark of 2GHz. It's not clear to me why it's making such slow progress, but that would explain why I'm getting very little out on the simulated console. It's just taking forever to make it that far. Any idea why it's going so slow, or how to debug further?

Gabe

On Wed, Mar 14, 2018 at 7:42 PM, Gabe Black <[email protected]> wrote:

Some output which I think is suspicious:

55462000: system.cpus0: Entering KVM...
55462000: system.cpus0: KVM: Executing for 1506000 ticks
55462000: system.cpus0: KVM: Executed 5159 instructions in 13646 cycles (6823000 ticks, sim cycles: 13646).
56968000: system.cpus1: Entering KVM...
56968000: system.cpus1: KVM: Executing for 5317000 ticks
56968000: system.cpus1: KVM: Executed 7229 instructions in 14379 cycles (7189500 ticks, sim cycles: 14379).
62285000: system.cpus0: Entering KVM...
62285000: system.cpus0: KVM: Executing for 1872500 ticks
62285000: system.cpus0: KVM: Executed 5159 instructions in 13496 cycles (6748000 ticks, sim cycles: 13496).
64157500: system.cpus1: Entering KVM...
64157500: system.cpus1: KVM: Executing for 4875500 ticks
64157500: system.cpus1: KVM: Executed 6950 instructions in 13863 cycles (6931500 ticks, sim cycles: 13863).
69033000: system.cpus0: Entering KVM...
69033000: system.cpus0: KVM: Executing for 2056000 ticks
69033000: system.cpus0: KVM: Executed 5159 instructions in 13454 cycles (6727000 ticks, sim cycles: 13454).
71089000: system.cpus1: Entering KVM...
71089000: system.cpus1: KVM: Executing for 4671000 ticks
71089000: system.cpus1: KVM: Executed 6950 instructions in 13861 cycles (6930500 ticks, sim cycles: 13861).
75760000: system.cpus0: Entering KVM...
75760000: system.cpus0: KVM: Executing for 2259500 ticks
75760000: system.cpus0: KVM: Executed 5159 instructions in 13688 cycles (6844000 ticks, sim cycles: 13688).
[...]
126512000: system.cpus0: handleKvmExit (exit_reason: 6)
126512000: system.cpus0: KVM: Handling MMIO (w: 1, addr: 0x1c090024, len: 4)
126512000: system.cpus0: In updateThreadContext():
[...]
126512000: system.cpus0: PC := 0xd8 (t: 0, a64: 1)

On Wed, Mar 14, 2018 at 7:37 PM, Gabe Black <[email protected]> wrote:

I tried it just now, and I still don't see anything on the console. I switched back to using my own script since it's a bit simpler (it doesn't use all the configs/common stuff), and started looking at the KVM debug output. I see that both CPUs claim to execute instructions, although cpu1 didn't take an exit in the output I was looking at. cpu0 took four exits: two which touched some UART registers, and two which touched RealView registers, the V2M_SYS_CFGDATA and V2M_SYS_CFGCTRL registers judging by the comments in the bootloader assembly file. After that they claim to be doing stuff, although I see no further console output or KVM exits. The accesses themselves and their PCs are from the bootloader blob, and so I'm pretty confident that it's starting that and executing some of those instructions.

One thing that looks very odd, now that I think about it, is that the KVM messages about entering and executing instructions (like those below) seem to say that cpu0 has executed thousands of instructions, but the exits I see seem to correspond to the first maybe 50 instructions it should be seeing in the bootloader blob. Are those values bogus for some reason? Is there some existing debug output which would let me see where KVM thinks it is periodically, to see if it's in the kernel or if it went bananas and is executing random memory somewhere? Or if it just got stuck waiting for some event that's not going to show up? Are there any important CLs which haven't made their way into upstream somehow?

Gabe

On Wed, Mar 14, 2018 at 4:28 AM, Andreas Sandberg <[email protected]> wrote:

Have you tried using the fs_bigLITTLE script in configs/example/arm? That's the script I have been using for testing. I just tested the script with 8 little CPUs and 0 big CPUs and it seems to work. Timing is a bit temperamental though, so you might need to override the simulation quantum. The default is 1ms; you might need to decrease it to something slightly smaller (I'm currently using 0.5ms). Another caveat is that there seem to be some issues related to dtb auto-generation that affect KVM guests. We are currently testing a solution for this issue.

Cheers,
Andreas

On 12/03/2018 22:26, Gabe Black wrote:

I'm trying to run in FS mode, to boot Android/Linux.

Gabe

On Mon, Mar 12, 2018 at 3:26 PM, Dutu, Alexandru <[email protected]> wrote:

Hi Gabe,

Are you running SE or FS mode?

Thanks,
Alex

-----Original Message-----
From: gem5-dev [mailto:[email protected]] On Behalf Of Gabe Black
Sent: Friday, March 9, 2018 5:46 PM
To: gem5 Developer List <[email protected]>
Subject: [gem5-dev] Multicore ARM v8 KVM based simulation

Hi folks. I have a config script set up where I can run a KVM-based ARM v8 simulation just fine when I have a single CPU in it, but when I try running with more than one CPU, it just seems to get lost and not do anything. Is this a configuration that's supported? If so, are there any caveats to how it's set up? I may be missing something simple, but it's not apparent to me at the moment.

Gabe
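For reference, the PciVirtIO disk setup that came up earlier in the thread looks roughly like this in a gem5 config script. This is a minimal sketch loosely following the Arm example scripts; the disk image path, the system object, and the attach_pci helper are assumptions that depend on the base system class in use.

    from m5.objects import PciVirtIO, VirtIOBlock, CowDiskImage

    # Wrap the raw image in a copy-on-write layer so the guest doesn't
    # modify the underlying file.
    image = CowDiskImage()
    image.child.image_file = '/path/to/disk.img'  # hypothetical path

    # Expose the image to the guest as a VirtIO block device on PCI.
    system.pci_vio_block = PciVirtIO(vio=VirtIOBlock(image=image))

    # attach_pci is the helper used by the Arm example scripts' system
    # class; other configs attach the device to the PCI host / IO bus
    # directly.
    system.attach_pci(system.pci_vio_block)

As noted earlier in the thread, the VirtIO device doesn't model any timing, so this is a convenience for KVM runs rather than a realistic storage model.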
