On 12/4/23 10:21, David Hildenbrand wrote:
> On 01.12.23 10:07, Michal Prívozník wrote:
>> On 11/27/23 14:55, David Hildenbrand wrote:
>>> On 27.11.23 14:37, David Hildenbrand wrote:
>>>> On 27.11.23 13:32, Michal Privoznik wrote:
>>>>> Simple reproducer:
>>>>> qemu.git $ ./build/qemu-system-x86_64 \
>>>>> -m size=8389632k,slots=16,maxmem=25600000k \
>>>>> -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
>>>>> -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>>>>
>>>>> With current master I get:
>>>>>
>>>>> qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid argument
>>>>>
>>>>> The problem is that the memory size (8193MiB) is not an integer
>>>>> multiple of the underlying pagesize (2MiB), which triggers a check
>>>>> inside of madvise(), since we can't really set a madvise() policy
>>>>> just to a fraction of a page.
>>>>
>>>> I thought we would just always fail creating something that doesn't
>>>> really make any sense.
>>>>
>>>> Why would we want to support that case?
>>>>
>>>> Let me dig, I thought we would have had some check there at some point
>>>> that would make that fail (especially: RAM block not aligned to the
>>>> pagesize).
>>>
>>> At least memory-backend-memfd properly fails for that case:
>>>
>>> $ ./build/qemu-system-x86_64 -object memory-backend-memfd,hugetlb=on,size=3m,id=tmp
>>> qemu-system-x86_64: failed to resize memfd to 3145728: Invalid argument
>>>
>>> memory-backend-file ends up creating a new file:
>>>
>>> $ ./build/qemu-system-x86_64 -object memory-backend-file,share=on,mem-path=/dev/hugepages/tmp,size=3m,id=tmp
>>>
>>> $ stat /dev/hugepages/tmp
>>>   File: /dev/hugepages/tmp
>>>   Size: 4194304   Blocks: 0   IO Block: 2097152   regular file
>>>
>>> ... and ends up sizing it properly aligned to the huge page size.
>>>
>>> Seems to be due to:
>>>
>>>     if (memory < block->page_size) {
>>>         error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
>>>                    "or larger than page size 0x%zx",
>>>                    memory, block->page_size);
>>>         return NULL;
>>>     }
>>>
>>>     memory = ROUND_UP(memory, block->page_size);
>>>
>>>     /*
>>>      * ftruncate is not supported by hugetlbfs in older
>>>      * hosts, so don't bother bailing out on errors.
>>>      * If anything goes wrong with it under other filesystems,
>>>      * mmap will fail.
>>>      *
>>>      * Do not truncate the non-empty backend file to avoid corrupting
>>>      * the existing data in the file. Disabling shrinking is not
>>>      * enough. For example, the current vNVDIMM implementation stores
>>>      * the guest NVDIMM labels at the end of the backend file. If the
>>>      * backend file is later extended, QEMU will not be able to find
>>>      * those labels. Therefore, extending the non-empty backend file
>>>      * is disabled as well.
>>>      */
>>>     if (truncate && ftruncate(fd, offset + memory)) {
>>>         perror("ftruncate");
>>>     }
>>>
>>> So we create a bigger file and map the bigger file and also have a
>>> RAMBlock that is bigger. So we'll also consume more memory.
>>>
>>> ... but the memory region is smaller and we tell the VM that it has
>>> less memory. Lot of work with no obvious benefit, and only some
>>> memory waste :)
>>>
>>> We better should have just rejected such memory backends right from
>>> the start. But now it's likely too late.
>>>
>>> I suspect other things like
>>>  * qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
>>>  * qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
>>>
>>> fail, but we don't care for hugetlb at least regarding merging
>>> and don't even log an error.
>>>
>>> But QEMU_MADV_DONTDUMP might also be broken, because that
>>> qemu_madvise() call will just fail.
>>>
>>> Your fix would be correct. But I do wonder if we want to just let that
>>> case fail and warn users that they are doing something that doesn't
>>> make too much sense.
>>
>> Yeah, what's suspicious is: if the size is smaller than the page size we
>> error out, but if it's larger (but still not aligned) we accept that.
>> I'm failing to see the reasoning there. Looks like the ROUND_UP() was
>> added in v0.13.0-rc0~1201^2~4 (though it's done with some bit magic) and
>> the check itself was added in v2.8.0-rc0~30^2~23. So it's a bit late, yes.
>
> Yeah.
>
>> OTOH - if users want to waste resources, should we stop them? For
>
> It's all inconsistent, including memfd handling or what you noted above.
>
> For example, having a 1025 MiB guest on gigantic pages, consuming 2 GiB,
> really is just absolutely stupid.
>
> Likely the user wants to know about such mistakes instead of making QEMU
> silence all side effects of that. :)
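
To make the waste concrete, here is a tiny standalone illustration (plain C,
not QEMU code; the ROUND_UP macro just mirrors the helper quoted above) of
what the reproducer's 8193 MiB backend on 2 MiB pages, and a 1025 MiB backend
on 1 GiB gigantic pages, round up to:

#include <inttypes.h>
#include <stdio.h>

/* Same rounding the quoted QEMU snippet performs via ROUND_UP(). */
#define ROUND_UP(n, d) ((((n) + (d) - 1) / (d)) * (d))

static void show(uint64_t size, uint64_t page_size)
{
    uint64_t rounded = ROUND_UP(size, page_size);

    printf("requested %" PRIu64 ", file becomes %" PRIu64 ", %" PRIu64
           " bytes never visible to the guest\n",
           size, rounded, rounded - size);
}

int main(void)
{
    show(8590983168ULL, 2ULL << 20);  /* 8193 MiB on 2 MiB pages */
    show(1025ULL << 20, 1ULL << 30);  /* 1025 MiB on 1 GiB pages */
    return 0;
}

In the first case only the 1 MiB tail of the last 2 MiB page is wasted; in
the gigantic-page case almost half of the allocated 2 GiB is.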

Agreed. As I said, consistency should win here.

>
>> instance, when user requests more vCPUs than physical CPUs a warning is
>> printed:
>>
>> $ ./build/qemu-system-x86_64 -accel kvm -smp cpus=128
>> qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested
>> (128) exceeds the recommended cpus supported by KVM (8)
>
> But that case is still reasonable for testing guest behavior with many
> vCPUs, or migrating from a machine with more vCPUs.
>
> Here, the guest will actually see all vCPUs. In comparison, the memory
> waste here will never ever be consumable by the VM.

Good point.

>>
>> but that's about it. So maybe the error can be demoted to just a warning?
>
> The question is what we want to do, for example, with the
> qemu_madvise(QEMU_MADV_DONTDUMP). It will similarly simply fail.

It will. But the retval of qemu_madvise() is not checked here, and in
qemu_ram_setup_dump() a failure only results in a warning.

> I'm curious, are there real customers running into that?

No, I haven't seen any bug report from a customer; it was just our QE that
ran into this issue: https://issues.redhat.com/browse/RHEL-1127 (I've asked
to make the issue public).

> We could fix it all that but always warn when something like that is
> being done.

Fair enough. After all of this, I'm inclined to turn this into a proper
error and reject sizes that are not aligned to the page size. There's no
real benefit in allowing them, and on top of that the original bug report
is about the cryptic error message.

Michal
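
For illustration, a minimal sketch of the stricter check discussed above,
sitting next to the existing "must be equal to or larger than page size"
check quoted earlier in the thread. QEMU_IS_ALIGNED() and error_setg() are
existing helpers, but the exact placement and message here are only an
assumption, not a tested patch:

    /* Sketch only: reject sizes that are not a multiple of the backing
     * page size instead of silently rounding the file up. Variable names
     * follow the snippet quoted above. */
    if (!QEMU_IS_ALIGNED(memory, block->page_size)) {
        error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be a multiple "
                   "of page size 0x%zx", memory, block->page_size);
        return NULL;
    }

With something like this in place, the reproducer would fail up front with a
clear message instead of hitting the cryptic "cannot bind memory to host NUMA
nodes: Invalid argument" much later.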