On 09.08.22 12:56, Joao Martins wrote:
> On 7/21/22 13:07, David Hildenbrand wrote:
>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>> Michal.
>>
>> Setting the CPU affinity of threads from inside QEMU usually isn't
>> easily possible, because we don't want QEMU -- once started and
>> running guest code -- to be able to mess up the system. QEMU disallows
>> relevant syscalls using seccomp, such that any such invocation will
>> fail.
>>
>> Especially for memory preallocation in memory backends, the CPU
>> affinity of the preallocation threads can significantly affect guest
>> startup time, for example, when running large VMs backed by
>> huge/gigantic pages, because of NUMA effects. For NUMA-aware
>> preallocation, we have to set the CPU affinity. However:
>>
>> (1) Once preallocation threads are created during preallocation,
>>     management tools can no longer intervene to change the affinity.
>>     These threads are created automatically on demand.
>> (2) QEMU cannot easily set the CPU affinity itself.
>> (3) The CPU affinity derived from the NUMA bindings of the memory
>>     backend might not necessarily be exactly the CPUs we actually
>>     want to use (e.g., CPU-less NUMA nodes, CPUs that are pinned/used
>>     for other VMs).
>>
>> There is an easy "workaround". If we have a thread with the right CPU
>> affinity, we can simply create new threads on demand via that
>> prepared context. So, all we have to do is set up and create such a
>> context ahead of time, and then configure preallocation to create new
>> threads via that environment.
>>
>> So, let's introduce a user-creatable "thread-context" object that
>> essentially consists of a context thread used to create new threads.
>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>> "node-affinity" property), or upper layers can extract the thread id
>> ("thread-id" property) to configure it externally.
>>
>> Make memory backends consume a thread-context object (via the
>> "prealloc-context" property) and use it when preallocating to create
>> new threads with the desired CPU affinity. Further, to make it easier
>> to use, allow creation of "thread-context" objects, including setting
>> the CPU affinity directly from QEMU, *before* enabling the sandbox
>> option.
>>
>>
>> Quick test on a system with 2 NUMA nodes:
>>
>> Without CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>         -nographic -monitor stdio
>>
>>     real    0m5.383s
>>     real    0m3.499s
>>     real    0m5.129s
>>     real    0m4.232s
>>     real    0m5.220s
>>     real    0m4.288s
>>     real    0m3.582s
>>     real    0m4.305s
>>     real    0m5.421s
>>     real    0m4.502s
>>
>> -> It heavily depends on the scheduler's CPU selection
>>
>> With CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object thread-context,id=tc1,node-affinity=0 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>         -sandbox enable=on,resourcecontrol=deny \
>>         -nographic -monitor stdio
>>
>>     real    0m1.959s
>>     real    0m1.942s
>>     real    0m1.943s
>>     real    0m1.941s
>>     real    0m1.948s
>>     real    0m1.964s
>>     real    0m1.949s
>>     real    0m1.948s
>>     real    0m1.941s
>>     real    0m1.937s
>>
>> On reasonably large VMs, the speedup can be quite significant.
>>
> Really awesome work!

Thanks!

> I am not sure I picked this up well while reading the series, but it
> seems to me that prealloc is still serialized per memory backend when
> solely configured via the command line, right?

I think it's serialized in any case, even when preallocation is
triggered manually using prealloc=on. I might be wrong, but any kind of
object creation or property change should be serialized by the BQL.

In theory, we can "easily" preallocate in our helper --
qemu_prealloc_mem() -- concurrently when we don't have to bother with
handling SIGBUS -- that is, when the kernel supports
MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels,
we'll serialize in there as well.
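To illustrate why MADV_POPULATE_WRITE (Linux 5.14+) is what makes this
MT-safe: preallocation failures are reported via the return value
instead of via SIGBUS, so no process-wide signal handler has to be
installed. A minimal sketch of that kernel interface (simplified, with
a made-up helper name -- not the actual qemu_prealloc_mem()
implementation):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23      /* older headers lack it */
    #endif

    /* Preallocate all pages in [area, area + size). */
    static int prealloc_range(void *area, size_t size)
    {
        /*
         * Write-fault every page in the range. In contrast to manually
         * writing to each page, errors (e.g., running out of huge
         * pages) are reported via the return value instead of raising
         * SIGBUS, so concurrent callers don't have to coordinate a
         * process-wide signal handler.
         */
        if (madvise(area, size, MADV_POPULATE_WRITE)) {
            fprintf(stderr, "preallocation failed: %s\n",
                    strerror(errno));
            return -errno;
        }
        return 0;
    }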
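As an aside, the "workaround" from the cover letter boils down to the
fact that, on Linux, a new thread inherits the CPU affinity of the
thread that creates it. A heavily simplified sketch with made-up names
(the real thread-context object keeps a persistent context thread
around and asks it to spawn threads on demand):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void *worker_fn(void *arg)
    {
        /* Runs with the affinity inherited from the context thread. */
        return NULL;
    }

    /* The "context" thread: anything it spawns inherits its affinity. */
    static void *context_fn(void *arg)
    {
        pthread_t worker;

        pthread_create(&worker, NULL, worker_fn, NULL);
        pthread_join(worker, NULL);
        return NULL;
    }

    int main(void)
    {
        pthread_t context;
        pthread_attr_t attr;
        cpu_set_t set;

        /*
         * Pin the context thread (e.g., to the CPUs of one NUMA node)
         * ahead of time, while affinity changes are still permitted.
         */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

        pthread_create(&context, &attr, context_fn, NULL);
        pthread_join(context, NULL);
        return 0;
    }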
> Meaning when we start prealloc, we wait until the memory backend's
> thread-context action is completed (per memory backend), even if other
> to-be-configured memory backends will use a thread-context on a
> separate set of pinned CPUs on another node ... and wouldn't, in
> theory, "need" to wait until the former prealloc finishes?

Yes. This series only takes care of NUMA-aware preallocation, but
doesn't preallocate multiple memory backends in parallel.

In theory, it would be quite easy to preallocate concurrently: simply
create the memory backend objects passed on the QEMU cmdline
concurrently from multiple threads. In practice, I think we have to be
careful with the BQL, but it doesn't sound horribly complicated to
achieve. We could perform all synchronized work under the BQL and only
trigger the actual expensive preallocation (-> qemu_prealloc_mem()),
which we know is MT-safe, with the BQL released.

> Unless, as you alluded to in one of the last patches: we can pass
> these thread-contexts with prealloc=off (and prealloc-context=NNN)
> while qemu is paused (-S) and have different QMP clients set
> prealloc=on, and thus prealloc would happen concurrently per node?

I think we will serialize in any case when modifying properties. Can
you give it a shot and see if it works as of now? I doubt it, but I
might be wrong.

> We were thinking of extending it to leverage per-socket bandwidth,
> essentially to parallelize this even further (we saw improvements with
> something like that, but haven't tried this series yet). Likely this
> is already possible with your work and I didn't pick up on it, hence
> just making sure this is the case :)

With this series, you can essentially tell QEMU which physical CPUs to
use for preallocating a given memory backend. But memory backends are
not created+preallocated concurrently yet.

-- 
Thanks,

David / dhildenb