On 12.11.21 14:27, Igor Mammedov wrote:
> On Wed, 10 Nov 2021 12:01:11 +0100
> David Hildenbrand <da...@redhat.com> wrote:
> 
>> On 10.11.21 11:33, Igor Mammedov wrote:
>>> On Fri, 5 Nov 2021 23:47:37 +1100
>>> Gavin Shan <gs...@redhat.com> wrote:
>>> 
>>>> Hi Drew and Igor,
>>>> 
>>>> On 11/2/21 6:39 PM, Andrew Jones wrote:
>>>>> On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:
>>>>>> 
>>>>>> Yeah, I agree. I don't see a strong reason to expose these empty
>>>>>> nodes for now. Please ignore the patch.
>>>>>> 
>>>>> 
>>>>> So was describing empty numa nodes on the command line ever a
>>>>> reasonable thing to do? What happens on x86 machine types when
>>>>> describing empty numa nodes? I'm starting to think that the
>>>>> solution all along was just to error out when a numa node has
>>>>> memory size = 0...
>>> 
>>> Memory-less nodes are fine as long as there is another type of device
>>> that describes a node (apic/gic/...).
>>> But there is no way in the spec to describe completely empty nodes,
>>> and I dislike adding out-of-spec entries just to fake an empty node.
>>> 
>> 
>> There are reasonable *upcoming* use cases for initially completely
>> empty NUMA nodes with virtio-mem: being able to expose a dynamic
>> amount of performance-differentiated memory to a VM. I don't know of
>> any existing use cases that would require that as of now.
>> 
>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>> this memory is exposed via cpu-less, special nodes. In contrast to
>> real HW, the memory is hotplugged later (I don't think HW supports
>> hotplug like that yet, but it might just be a matter of time).
> 
> I suppose some of that may be covered by GENERIC_AFFINITY entries in
> SRAT, some by MEMORY entries. Or nodes created dynamically like with
> normal hotplug memory.
> 
> 
>> The same should be true when using DIMMs instead of virtio-mem in this
>> example.
>> 
>>> 
>>>> Sorry for the delay as I spent a few days looking into the Linux
>>>> virtio-mem driver. I'm afraid we still need this patch for ARM64. I
>>>> don't think x86
>>> 
>>> does it behave the same way when using pc-dimm hotplug instead of
>>> virtio-mem?
>>> 
>>> CCing David
>>> as it might be a virtio-mem issue.
>> 
>> Can someone share the details why it's a problem on arm64 but not on
>> x86-64? I assume this really only applies when having a dedicated,
>> empty node -- correct?
>> 
>>> 
>>> PS:
>>> maybe for virtio-mem-pci, we need to add a GENERIC_AFFINITY entry
>>> into SRAT and describe it as a PCI device (we don't do that yet, if
>>> I'm not mistaken).
>> 
>> virtio-mem exposes the PXM itself, and avoids exposing its memory via
>> any kind of platform-specific firmware maps. The PXM gets translated
>> in the guest accordingly. For now there was no need to expose this in
>> SRAT -- the SRAT is really only used to expose the maximum possible
>> PFN to the VM, just like it would have to be used to expose "this is
>> a possible node".
>> 
>> Of course, we could use any other paravirtualized interface to expose
>> both pieces of information. For example, on s390x, I'll have to
>> introduce a new hypercall to query the "device memory region" to
>> detect the maximum possible PFN, because existing interfaces don't
>> allow for that. For now we're reusing SRAT to expose the "maximum
>> possible PFN" simply because it's easy to re-use.
>> 
>> But I assume that hotplugging a DIMM to an empty node will have
>> similar issues on arm64.
>> 
>>> 
>>>> has this issue even though I didn't experiment on X86. For example,
>>>> I have the following command lines. The hot-added memory is put
>>>> into node#0 instead of node#2, which is wrong.
>> 
>> I assume Linux will always fall back to node 0 if node X is not
>> possible when translating the PXM.
> 
> I tested how x86 behaves with pc-dimm, and it seems that a fc43 guest
> works only sometimes.
> cmd:
>  -numa node,memdev=mem,cpus=0 -numa node,cpus=1 -numa node -numa node
> 
> 1: hotplug into the empty last node creates a new node dynamically
> 2: hotplug into an intermediate empty node (last-1) is broken, memory
>    goes into the first node
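
For completeness, a rough reproducer for case 2 might look like the
following (untested sketch; the accelerator, smp/slots/maxmem values
and the ids are made up and would need adjusting):

  qemu-system-x86_64 -accel kvm -smp 2 -m 4G,slots=4,maxmem=8G \
      -object memory-backend-ram,id=mem,size=4G \
      -numa node,memdev=mem,cpus=0 -numa node,cpus=1 \
      -numa node -numa node

and then, in the QEMU monitor, hotplugging a pc-dimm into the
intermediate empty node (node 2):

  (qemu) object_add memory-backend-ram,id=mem1,size=1G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=2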
See my other reply: the reason is that we (QEMU) indicate all
hotpluggable memory as belonging to the last NUMA node. When processing
that SRAT entry, Linux maps that PXM to an actual node.
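
In the guest, that is visible in the boot log (sketch; the addresses
and node numbers are made up and depend on the configuration):

  $ dmesg | grep "SRAT: Node"
  ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0xbfffffff]
  ACPI: SRAT: Node 3 PXM 3 [mem 0x140000000-0x33fffffff] hotplug

All hotpluggable memory shows up as a single range tagged "hotplug",
associated with the PXM of the last node.

-- 
Thanks,

David / dhildenb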