On 12/4/23 10:21, David Hildenbrand wrote:
> On 01.12.23 10:07, Michal Prívozník wrote:
>> On 11/27/23 14:55, David Hildenbrand wrote:
>>> On 27.11.23 14:37, David Hildenbrand wrote:
>>>> On 27.11.23 13:32, Michal Privoznik wrote:
>>>>> Simple reproducer:
>>>>> qemu.git $ ./build/qemu-system-x86_64 \
>>>>> -m size=8389632k,slots=16,maxmem=25600000k \
>>>>> -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
>>>>> -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>>>>
>>>>> With current master I get:
>>>>>
>>>>> qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid argument
>>>>>
>>>>> The problem is that the memory size (8193MiB) is not an integer
>>>>> multiple of the underlying pagesize (2MiB), which triggers a check
>>>>> inside of madvise(), since we can't really set a madvise() policy
>>>>> just to a fraction of a page.
>>>>
>>>> I thought we would just always fail creating something that doesn't
>>>> really make any sense.
>>>>
>>>> Why would we want to support that case?
>>>>
>>>> Let me dig, I thought we would have had some check there at some point
>>>> that would make that fail (especially: RAM block not aligned to the
>>>> pagesize).
>>>
>>> At least memory-backend-memfd properly fails for that case:
>>>
>>> $ ./build/qemu-system-x86_64 -object memory-backend-memfd,hugetlb=on,size=3m,id=tmp
>>> qemu-system-x86_64: failed to resize memfd to 3145728: Invalid argument
>>>
>>> memory-backend-file ends up creating a new file:
>>>
>>> $ ./build/qemu-system-x86_64 -object memory-backend-file,share=on,mem-path=/dev/hugepages/tmp,size=3m,id=tmp
>>>
>>> $ stat /dev/hugepages/tmp
>>>   File: /dev/hugepages/tmp
>>>   Size: 4194304   Blocks: 0   IO Block: 2097152   regular file
>>>
>>> ... and ends up sizing it properly aligned to the huge page size.
>>>
>>> Seems to be due to:
>>>
>>>     if (memory < block->page_size) {
>>>         error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
>>>                    "or larger than page size 0x%zx",
>>>                    memory, block->page_size);
>>>         return NULL;
>>>     }
>>>
>>>     memory = ROUND_UP(memory, block->page_size);
>>>
>>>     /*
>>>      * ftruncate is not supported by hugetlbfs in older
>>>      * hosts, so don't bother bailing out on errors.
>>>      * If anything goes wrong with it under other filesystems,
>>>      * mmap will fail.
>>>      *
>>>      * Do not truncate the non-empty backend file to avoid corrupting
>>>      * the existing data in the file. Disabling shrinking is not
>>>      * enough. For example, the current vNVDIMM implementation stores
>>>      * the guest NVDIMM labels at the end of the backend file. If the
>>>      * backend file is later extended, QEMU will not be able to find
>>>      * those labels. Therefore, extending the non-empty backend file
>>>      * is disabled as well.
>>>      */
>>>     if (truncate && ftruncate(fd, offset + memory)) {
>>>         perror("ftruncate");
>>>     }
>>>
>>> So we create a bigger file and map the bigger file and also have a
>>> RAMBlock that is bigger. So we'll also consume more memory.
>>>
>>> ... but the memory region is smaller and we tell the VM that it has
>>> less memory. Lot of work with no obvious benefit, and only some
>>> memory waste :)
>>>
>>> We better should have just rejected such memory backends right from
>>> the start. But now it's likely too late.
>>>
>>> I suspect other things like
>>>  * qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
>>>  * qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
>>>
>>> fail, but we don't care for hugetlb at least regarding merging
>>> and don't even log an error.
>>>
>>> But QEMU_MADV_DONTDUMP might also be broken, because that
>>> qemu_madvise() call will just fail.
>>>
>>> Your fix would be correct. But I do wonder if we want to just let that
>>> case fail and warn users that they are doing something that doesn't
>>> make too much sense.
>>
>> Yeah, what's suspicious is: if the size is smaller than the page size we
>> error out, but if it's larger (but still not aligned) we accept that.
>> I'm failing to see the reasoning there. Looks like the ROUND_UP() was
>> added in v0.13.0-rc0~1201^2~4 (though it's done with some bit magic) and
>> the check itself was added in v2.8.0-rc0~30^2~23. So it's a bit late, yes.
>
> Yeah.
>
>> OTOH - if users want to waste resources, should we stop them? For
>
> It's all inconsistent, including memfd handling or what you noted above.
>
> For example, having a 1025 MiB guest on gigantic pages, consuming 2 GiB,
> really is just absolutely stupid.
>
> Likely the user wants to know about such mistakes instead of making QEMU
> silence all side effects of that. :)
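
To make the waste concrete, here is a tiny standalone illustration (plain C,
not QEMU code; the ROUND_UP macro just mirrors the helper quoted above) of
what the reproducer's 8193 MiB backend on 2 MiB pages, and a 1025 MiB backend
on 1 GiB gigantic pages, round up to:

#include <inttypes.h>
#include <stdio.h>

/* Same rounding the quoted QEMU snippet performs via ROUND_UP(). */
#define ROUND_UP(n, d) ((((n) + (d) - 1) / (d)) * (d))

static void show(uint64_t size, uint64_t page_size)
{
    uint64_t rounded = ROUND_UP(size, page_size);

    printf("requested %" PRIu64 ", file becomes %" PRIu64 ", %" PRIu64
           " bytes never visible to the guest\n",
           size, rounded, rounded - size);
}

int main(void)
{
    show(8590983168ULL, 2ULL << 20);  /* 8193 MiB on 2 MiB pages */
    show(1025ULL << 20, 1ULL << 30);  /* 1025 MiB on 1 GiB pages */
    return 0;
}

In the first case only the 1 MiB tail of the last 2 MiB page is wasted; in
the gigantic-page case almost half of the allocated 2 GiB is.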

Agreed. As I said, consistency should win here.

>
>> instance, when user requests more vCPUs than physical CPUs a warning is
>> printed:
>>
>> $ ./build/qemu-system-x86_64 -accel kvm -smp cpus=128
>> qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested
>> (128) exceeds the recommended cpus supported by KVM (8)
>
> But that case is still reasonable for testing guest behavior with many
> vCPUs, or migrating from a machine with more vCPUs.
>
> Here, the guest will actually see all vCPUs. In comparison, the memory
> waste here will never ever be consumable by the VM.

Good point.

>>
>> but that's about it. So maybe the error can be demoted to just a warning?
>
> The question is what we want to do, for example, with the
> qemu_madvise(QEMU_MADV_DONTDUMP). It will similarly simply fail.

It will. But the retval of qemu_madvise() is not checked here, and in
qemu_ram_setup_dump() a failure only results in a warning.

> I'm curious, are there real customers running into that?

No, I haven't seen any bug report from a customer; it was just our QE that
ran into this issue: https://issues.redhat.com/browse/RHEL-1127 (I've asked
to make the issue public).

> We could fix it all that but always warn when something like that is
> being done.

Fair enough. After all of this, I'm inclined to turn this into a proper
error and reject sizes that are not aligned to the page size. There's no
real benefit in allowing them, and on top of that the original bug report
is about the cryptic error message.

Michal
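
For illustration, a minimal sketch of the stricter check discussed above,
sitting next to the existing "must be equal to or larger than page size"
check quoted earlier in the thread. QEMU_IS_ALIGNED() and error_setg() are
existing helpers, but the exact placement and message here are only an
assumption, not a tested patch:

    /* Sketch only: reject sizes that are not a multiple of the backing
     * page size instead of silently rounding the file up. Variable names
     * follow the snippet quoted above. */
    if (!QEMU_IS_ALIGNED(memory, block->page_size)) {
        error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be a multiple "
                   "of page size 0x%zx", memory, block->page_size);
        return NULL;
    }

With something like this in place, the reproducer would fail up front with a
clear message instead of hitting the cryptic "cannot bind memory to host NUMA
nodes: Invalid argument" much later.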