On 12.11.21 14:27, Igor Mammedov wrote:
> On Wed, 10 Nov 2021 12:01:11 +0100
> David Hildenbrand <da...@redhat.com> wrote:
> 
>> On 10.11.21 11:33, Igor Mammedov wrote:
>>> On Fri, 5 Nov 2021 23:47:37 +1100
>>> Gavin Shan <gs...@redhat.com> wrote:
>>> 
>>>> Hi Drew and Igor,
>>>> 
>>>> On 11/2/21 6:39 PM, Andrew Jones wrote:
>>>>> On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:
>>>>>> 
>>>>>> Yeah, I agree. I don't see a strong reason to expose these empty
>>>>>> nodes for now. Please ignore the patch.
>>>>>> 
>>>>> 
>>>>> So was describing empty numa nodes on the command line ever a
>>>>> reasonable thing to do? What happens on x86 machine types when
>>>>> describing empty numa nodes? I'm starting to think that the
>>>>> solution all along was just to error out when a numa node has
>>>>> memory size = 0...
>>> 
>>> Memory-less nodes are fine as long as there is another type of device
>>> that describes a node (apic/gic/...).
>>> But there is no way in the spec to describe completely empty nodes,
>>> and I dislike adding out-of-spec entries just to fake an empty node.
>>> 
>> 
>> There are reasonable *upcoming* use cases for initially completely
>> empty NUMA nodes with virtio-mem: being able to expose a dynamic
>> amount of performance-differentiated memory to a VM. I don't know of
>> any existing use cases that would require that as of now.
>> 
>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>> this memory is exposed via cpu-less, special nodes. In contrast to
>> real HW, the memory is hotplugged later (I don't think HW supports
>> hotplug like that yet, but it might just be a matter of time).
> 
> I suppose some of that may be covered by GENERIC_AFFINITY entries in
> SRAT, some by MEMORY entries. Or nodes created dynamically like with
> normal hotplug memory.
> 
> 
>> The same should be true when using DIMMs instead of virtio-mem in this
>> example.
>> 
>>> 
>>>> Sorry for the delay as I spent a few days looking into the Linux
>>>> virtio-mem driver. I'm afraid we still need this patch for ARM64. I
>>>> don't think x86
>>> 
>>> does it behave the same way when using pc-dimm hotplug instead of
>>> virtio-mem?
>>> 
>>> CCing David
>>> as it might be a virtio-mem issue.
>> 
>> Can someone share the details why it's a problem on arm64 but not on
>> x86-64? I assume this really only applies when having a dedicated,
>> empty node -- correct?
>> 
>>> 
>>> PS:
>>> maybe for virtio-mem-pci, we need to add a GENERIC_AFFINITY entry
>>> into SRAT and describe it as a PCI device (we don't do that yet, if
>>> I'm not mistaken).
>> 
>> virtio-mem exposes the PXM itself, and avoids exposing its memory via
>> any kind of platform-specific firmware maps. The PXM gets translated
>> in the guest accordingly. For now there was no need to expose this in
>> SRAT -- the SRAT is really only used to expose the maximum possible
>> PFN to the VM, just like it would have to be used to expose "this is
>> a possible node".
>> 
>> Of course, we could use any other paravirtualized interface to expose
>> both pieces of information. For example, on s390x, I'll have to
>> introduce a new hypercall to query the "device memory region" to
>> detect the maximum possible PFN, because existing interfaces don't
>> allow for that. For now we're reusing SRAT to expose the "maximum
>> possible PFN" simply because it's easy to re-use.
>> 
>> But I assume that hotplugging a DIMM to an empty node will have
>> similar issues on arm64.
>> 
>>> 
>>>> has this issue even though I didn't experiment on X86. For example,
>>>> I have the following command lines. The hot-added memory is put
>>>> into node#0 instead of node#2, which is wrong.
>> 
>> I assume Linux will always fall back to node 0 if node X is not
>> possible when translating the PXM.
> 
> I tested how x86 behaves with pc-dimm, and it seems that a fc43 guest
> works only sometimes.
> cmd:
>  -numa node,memdev=mem,cpus=0 -numa node,cpus=1 -numa node -numa node
> 
> 1: hotplug into the empty last node creates a new node dynamically
> 2: hotplug into an intermediate empty node (last-1) is broken, memory
>    goes into the first node
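
For completeness, a rough reproducer for case 2 might look like the
following (untested sketch; the accelerator, smp/slots/maxmem values
and the ids are made up and would need adjusting):

  qemu-system-x86_64 -accel kvm -smp 2 -m 4G,slots=4,maxmem=8G \
      -object memory-backend-ram,id=mem,size=4G \
      -numa node,memdev=mem,cpus=0 -numa node,cpus=1 \
      -numa node -numa node

and then, in the QEMU monitor, hotplugging a pc-dimm into the
intermediate empty node (node 2):

  (qemu) object_add memory-backend-ram,id=mem1,size=1G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=2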
See my other reply: the reason is that we (QEMU) indicate all
hotpluggable memory as belonging to the last NUMA node. When processing
that SRAT entry, Linux maps that PXM to an actual node.
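
In the guest, that is visible in the boot log (sketch; the addresses
and node numbers are made up and depend on the configuration):

  $ dmesg | grep "SRAT: Node"
  ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0xbfffffff]
  ACPI: SRAT: Node 3 PXM 3 [mem 0x140000000-0x33fffffff] hotplug

All hotpluggable memory shows up as a single range tagged "hotplug",
associated with the PXM of the last node.

-- 
Thanks,

David / dhildenb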