Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests

Anthony Liguori Fri, 05 Dec 2008 07:42:19 -0800

Andre Przywara wrote:

Anthony,
This patch series needs to be posted to qemu-devel. I know qemu doesn't do true SMP yet, but it will in the relatively near future. Either way, some of the design points need review from a larger audience than present on kvm-devel.
OK, I already started looking at that. The first patch applies with only some fuzz, so no problems here. The second patch could be changed to promote the values via the firmware configuration interface only, leaving the host side pinning alone (which wouldn't make much sense without true SMP anyway).
The third patch is actually BOCHS BIOS, and I am confused here:
I see the host side of the firmware config interface in QEMU SVN, but neither in the BOCHS CVS nor in the qemu/pc-bios/bios.diff there is any sign of usage from the BIOS side.

Really? I assumed it was there. I'll look this afternoon and if it isn't, I'll apply those patches to bios.diff and update the bios.

Is the kvm-patched qemu the only user of the interface? If so I would have to introduce the interface to QEMU's bios.diff (or better send to bochs-developers?) Do you know what BOCHS version the bios.diff applies against? Is that the 2.3.7 release?

Unfortunately, we don't track what version of the BOCHS BIOS is in the tree. Usually, it's a SVN snapshot. I'm going to change this the next time I update the BIOS though.

I'm not a big fan of the libnuma dependency. I'm willing to concede this if there's a wide agreement that we should support this directly in QEMU.
As long as QEMU is not true SMP, libnuma is rather useless. One could pin the memory to the appropriate host nodes, but without the proper scheduling this doesn't make much sense. And rescheduling the qemu process each time a new VCPU is scheduled doesn't seem so smart, either.

Even if it's not useful, I'd still like to add it to QEMU. That's one less thing that has to be merged from KVM into QEMU.

I don't think there's such a thing as a casual NUMA user. The default NUMA policy in Linux is node-local memory. As long as a VM is smaller than a single node, everything will work out fine.
Almost right, but simply calling qemu-system-x86_64 can lead to bad situations. I lately saw that VCPU #0 was scheduled on one node and VCPU #1 on another. This leads to random (probably excessive) remote accesses from the VCPUs, since the guest assumes uniform memory

That seems like Linux is behaving badly, no? Can you describe the situation more?

Of course one could cure this small guest case with numactl, but in my experience the existence of this tool isn't as well-known as one would expect.

NUMA systems are expensive. If a customer cares about performance (as opposed to just getting more memory), then I think tools like numactl are pretty well known.

In the event that the VM is larger than a single node, if a user is creating it via qemu-system-x86_64, they're going to either not care at all about NUMA, or be familiar enough with the numactl tools that they'll probably just want to use that. Once you've got your head around the fact that VCPUs are just threads and the memory is just a shared memory segment, any knowledgeable sysadmin will have no problem doing whatever sort of NUMA layout they want.
Really? How do you want to assign certain _parts_ of guest memory with numactl? (Let alone the rather weird way of using -mem-path, which is much easier done within QEMU).

I don't think -mem-path is weird at all. In fact, I'd be inclined to use shared memory by default and create a temporary file name. Then provide a monitor interface to look up that file name so that an explicit -mem-path isn't required anymore.

The same applies to the threads. You can assign _all_ the threads to certain nodes, but pinning single threads only requires some tedious work (QEMU monitor or top, then taskset -p). Isn't that OK if qemu would do this automatically (or at least give some support here)?
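
For illustration only, the manual per-thread pinning described above (find each VCPU thread id, then run taskset -p on it) could be scripted roughly as follows. This is a minimal sketch and not part of the patch series: the helper names parse_cpulist, node_cpus and pin_vcpus are hypothetical, the thread ids are assumed to have been collected by hand (QEMU monitor or top), and the node-to-CPU mapping is read from the Linux sysfs file /sys/devices/system/node/nodeN/cpulist.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node):
    """Return the set of host CPUs that belong to the given host NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pin_vcpus(vcpu_tids, vcpu_to_node):
    """Pin each VCPU thread to the CPUs of its assigned host node.

    vcpu_tids:    {vcpu index: host thread id}, assumed to be collected by hand
    vcpu_to_node: {vcpu index: host node number}, a hypothetical layout choice
    """
    for vcpu, tid in vcpu_tids.items():
        # Equivalent in effect to running "taskset -p" on the thread id.
        os.sched_setaffinity(tid, node_cpus(vcpu_to_node[vcpu]))
```

A call such as pin_vcpus({0: 4711, 1: 4712}, {0: 0, 1: 1}) would keep VCPU #0 on host node 0 and VCPU #1 on host node 1; the thread ids here are made up.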

Most VMs are going to be created through management tools so I don't think it's an issue. I'd rather provide the best mechanisms for management tools to have the most flexibility.

The other case is where management tools are creating VMs. In this case, it's probably better to use numactl as an external tool because then it keeps things consistent wrt CPU pinning.
There's also a good argument for not introducing CPU pinning directly to QEMU. There are multiple ways to effectively do CPU pinning. You can use taskset, you can use cpusets or even something like libcgroup.
I agree that pinning isn't the last word on the subject, but it works pretty well. But I wouldn't load the admin with the burden of pinning, but let this be done by QEMU/KVM. Maybe one could introduce a way to tell QEMU/KVM to not pin the threads.


This is where things start to get ugly...

I also had the idea to start with some sort of pinning (either automatically or user-chosen) and lift the affinity later (after the thread has done something and touched some memory). In this case Linux could (but probably will not easily) move the thread to another node. One could think about triggering this from a management app: If the app detects a congestion on one node, it could first lift the affinity restriction of some VCPU threads to achieve a better load balancing. If the situation persists (and doesn't turn out to be a short time peak), the manager could migrate the memory too and pin the VCPUs to the new node. I thought the migration and temporary un-pinning could be implemented in the monitor.
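
The "pin first, lift the affinity later" idea above can be sketched from the outside with the Linux affinity calls. This is only an illustration under stated assumptions, not how QEMU/KVM or any management tool does it: temporarily_pinned is a hypothetical helper, and the thread id and CPU set are whatever the caller picked.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporarily_pinned(tid, cpus):
    """Pin thread `tid` to `cpus`, then restore its previous affinity on exit.

    While pinned, memory the thread touches for the first time is normally
    allocated on the local node (the default Linux policy mentioned earlier
    in the thread); afterwards the scheduler may move the thread again.
    """
    original = os.sched_getaffinity(tid)
    os.sched_setaffinity(tid, cpus)
    try:
        yield
    finally:
        # Lift the restriction by restoring the affinity mask seen before pinning.
        os.sched_setaffinity(tid, original)
```

A management app could hold the pin during guest start-up and drop it once the guest has touched its memory; migrating the memory afterwards, as suggested above, is a separate step that this sketch does not cover.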

The other issue with pinning is what happens after live migration? What about single-machine load balancing? Regardless of whether we bake in libnuma control or not, I think an interface on the command line is not terribly interesting because it's too static. I think a monitor interface is what we'd really want if we integrated with libnuma.


Regards,

Anthony Liguori

