> Also, good to say why multiple nodes per device are needed.

This is to support the GPU's Multi-Instance GPU (MIG) feature
(https://www.nvidia.com/en-in/technologies/multi-instance-gpu/), which allows
partitioning the GPU device resources (including device memory) into several
isolated instances. We create multiple NUMA nodes so that each partition can
get its own node. The partitions are not fixed: they can be created, deleted,
and resized (in memory) at runtime. That is why these nodes are tagged as
MEM_AFFINITY_HOTPLUGGABLE; it gives the VM the flexibility to associate a
desired partition/range of device memory with a node, and to adjust that
association later. Note that we are replicating the bare-metal behavior here.
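To make the MEM_AFFINITY_HOTPLUGGABLE point concrete, the effect we are after
is roughly the following (a sketch only, not the actual patch;
num_dev_nodes, first_dev_node, dev_mem_base and dev_mem_size are illustrative
placeholders):

    /*
     * Sketch: emit one hotpluggable SRAT memory affinity entry per device
     * NUMA node, so the guest can later bind a MIG partition's memory range
     * to whichever of these nodes it likes.  Uses build_srat_memory() from
     * hw/acpi/aml-build.h.
     */
    for (int i = 0; i < num_dev_nodes; i++) {
        build_srat_memory(table_data, dev_mem_base, dev_mem_size,
                          first_dev_node + i,
                          MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE);
    }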
I will also put this detail in the cover letter in the next version.

> QEMU have already means to assign NUMA node affinity
> to PCI hierarchies in generic way by using a PBX per node
> (also done 'backwards') by setting node option on it.
> So every device behind it should belong to that node as well
> and guest OS shall pickup device affinity from PCI tree it belongs to.

Yes, but the problem is that only one node can be associated this way, and we
have several.
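For reference, the generic mechanism you mention looks roughly like this on
the command line (placeholder values):

    -numa node,nodeid=0 \
    -numa node,nodeid=1 \
    -device pxb-pcie,id=pci.1,bus_nr=32,numa_node=1,bus=pcie.0 \
    -device pcie-root-port,id=rp1,bus=pci.1 \
    -device vfio-pci,host=0000:01:00.0,bus=rp1

The vfio-pci device inherits node 1 from the single numa_node set on the
pxb-pcie above it, so the PCI tree can express only one node per device,
whereas here each device needs several nodes (one per adjustable partition).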