On Wed, Oct 30, 2013 at 05:49:49PM +0100, Igor Mammedov wrote:
> On Tue, 29 Oct 2013 19:38:44 -0200
> Marcelo Tosatti <mtosa...@redhat.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 07:18:49PM +0100, Igor Mammedov wrote:
> > > Otherwise 1GB TLBs cannot be cached for the range.
> > 
> > This fails to back non-1GB-aligned gpas, but 2MB aligned, with 2MB large
> > pages.
> With the current command line only one hugetlbfs mount point is possible, so it
> will back RAM with whatever alignment the specified hugetlbfs mount point has.
> Anything that doesn't fit into the hugepage-aligned region goes to the tail,
> using the non-hugepage-backed phys_mem_set_alloc()=qemu_anon_ram_alloc()
> allocator.
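
[To make the split being discussed concrete, here is a minimal sketch of backing the hugepage-aligned head of a block from a hugetlbfs file and falling back to anonymous memory for the unaligned tail, roughly what the qemu_anon_ram_alloc() fallback amounts to. This is an illustration only, not the actual patch: the function name, parameters, and error handling are invented.]

#include <stddef.h>
#include <sys/mman.h>

/* Illustration only: split an allocation of 'size' bytes into a
 * hugepage-aligned head backed by a hugetlbfs file and an anonymous tail. */
static void *alloc_hugepage_head_anon_tail(int hugetlbfs_fd, size_t hpagesize,
                                           size_t size, void **tail)
{
    size_t aligned = size & ~(hpagesize - 1); /* largest multiple of hpagesize */
    void *head = NULL;

    if (aligned) {
        /* head: mapped from the hugetlbfs mount point, so it is backed by
         * hugepages of that mount point's size */
        head = mmap(NULL, aligned, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE, hugetlbfs_fd, 0);
        if (head == MAP_FAILED) {
            return NULL;
        }
    }

    if (size > aligned) {
        /* tail: whatever does not fit the hugepage size goes to plain
         * anonymous memory (the qemu_anon_ram_alloc() fallback above) */
        *tail = mmap(NULL, size - aligned, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (*tail == MAP_FAILED) {
            *tail = NULL;
        }
    } else {
        *tail = NULL;
    }
    return head;
}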
The patch you propose allocates the non-1GB-aligned tail of RAM with 4k pages.
As mentioned, this is not acceptable: 2MB pages should be used whenever 1GB
alignment is not possible. I believe it's easier for the user to allocate
enough 1GB pages to back all of guest RAM, since allocation is static, than to
allocate mixed 1GB/2MB pages in hugetlbfs.

> > Since hugetlbfs allocation is static, it requires the user to inform
> > different 1GB and 2MB sized hugetlbfs mount points (with the proper number
> > of corresponding hugetlbfs pages allocated). This is incompatible with
> > the current command line, and I'd like to see this problem handled in a
> > way that is command line backwards compatible.
> The patch doesn't change that; it uses the provided hugetlbfs and falls back
> (hunk 2) to phys_mem_alloc if the requested memory region is not hugepage-size
> aligned. So there is no CLI change, only a memory leak fix.
> 
> > Also, if the argument for one-to-one mapping between dimms and linear host
> > virtual address sections holds, it means virtual DIMMs must be
> > partitioned into whatever hugepage alignment is necessary (and in
> > that case, why can't they be partitioned similarly with the memory
> > region aliases?).
> Because during hotplug a new memory region of the desired size is allocated,
> and it can be mapped directly without any aliasing. And if some day we
> convert the ad hoc initial memory allocation to dimm devices, there is no
> reason to allocate one huge block and then invent a means to alias the hole
> somewhere else; we could just reuse memdev/dimm and allocate several memory
> regions with the desired properties, each represented by a memdev/dimm pair.
> 
> One-to-one mapping simplifies the design and the interface with the ACPI part
> during memory hotplug.
> 
> For the hotplug case the flow could look like:
> 
> memdev_add id=x1,size=1Gb,mem-path=/hugetlbfs/1gb,other-host-related-stuff-options
> # memdev could enforce size to be backend aligned
> device_add dimm,id=y1,backend=x1,addr=xxxxxx
> # dimm could get alignment from the associated memdev, or fail if addr
> # doesn't meet the alignment of the memdev backend
> 
> memdev_add id=x2,size=2mb,mem-path=/hugetlbfs/2mb
> device_add dimm,id=y2,backend=x2,addr=yyyyyyy
> 
> memdev_add id=x3,size=1mb
> device_add dimm,id=y3,backend=x3,addr=xxxxxxx
> 
> A linear memory block is allocated at runtime (the user has to make sure that
> enough hugepages are available) by each memdev_add command, and that RAM
> memory region is mapped into GPA space by the virtual DIMM as is, so there
> wouldn't be any need for aliasing.
> 
> Now back to initial memory and the bright future we are looking forward to
> (i.e. the ability to create a machine from a configuration file without ad hoc
> coding like pc_memory_init()):
> 
> The legacy cmdline "-m 4512 -mem-path /hugetlbfs/1gb" could be automatically
> translated into:
> 
> -memdev id=x1,size=3g,mem-path=/hugetlbfs/1gb -device dimm,backend=x1,addr=0
> -memdev id=x2,size=1g,mem-path=/hugetlbfs/1gb -device dimm,backend=x2,addr=4g
> -memdev id=x3,size=512m -device dimm,backend=x3,addr=5g
> 
> Or the user could drop the legacy CLI and assume fine-grained control over the
> memory configuration:
> 
> -memdev id=x1,size=3g,mem-path=/hugetlbfs/1gb -device dimm,backend=x1,addr=0
> -memdev id=x2,size=1g,mem-path=/hugetlbfs/1gb -device dimm,backend=x2,addr=4g
> -memdev id=x3,size=512m,mem-path=/hugetlbfs/2mb -device dimm,backend=x3,addr=5g
> 
> So if we are going to break migration compatibility for the new machine type,
> let's do it in a way that can be painlessly changed to memdev/device in the
> future.
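
[The quoted flow above says a dimm could take its alignment from the associated memdev and fail if addr doesn't meet it. As a rough sketch of that check only, assuming a hypothetical memdev descriptor that records the page size of its backing (1GB, 2MB or 4KB); none of these type or field names exist in QEMU today:]

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical backend descriptor: size would be enforced to be a multiple
 * of page_size when the memdev is created. */
typedef struct MemDev {
    uint64_t size;
    uint64_t page_size;   /* 1G, 2M or 4K depending on mem-path */
} MemDev;

/* A dimm derives its alignment from the memdev backend and rejects a GPA
 * that does not meet it. */
static bool dimm_check_addr(const MemDev *backend, uint64_t addr)
{
    if (addr & (backend->page_size - 1)) {
        fprintf(stderr, "dimm: addr 0x%" PRIx64 " not aligned to backend "
                "page size 0x%" PRIx64 "\n", addr, backend->page_size);
        return false;
    }
    return true;
}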
OK, then please improve your proposal to allow for multiple hugetlbfs mount
points.

> > > PS:
> > > as a side effect we are not wasting ~1Gb of memory if
> > > 1Gb hugepages are used and -m "hpagesize(in Mb)*n + 1"
> > 
> > This is how hugetlbfs works. You waste a 1GB hugepage if an extra
> > byte is requested.
> It looks more like a bug than a feature;
> why do it if the leak can be avoided as shown below?

Because IMO it is confusing for the user, since hugetlbfs allocation is static.
But if you have a necessity for the one-to-one relationship, feel free to
support mixed hugetlbfs page sizes.
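
[For the record, the ~1GB of waste mentioned in the PS is just the round-up to the next hugepage boundary when a single 1GB hugetlbfs mount point backs all of RAM. A standalone arithmetic example; the numbers are picked for illustration and not taken from QEMU code:]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t hpagesize = 1ULL << 30;                    /* 1GB hugepage */
    uint64_t ram_size  = 4ULL * hpagesize + (1 << 20);  /* e.g. -m 4097 (MB) */

    /* round the backing file up to the next hugepage boundary */
    uint64_t backed = (ram_size + hpagesize - 1) & ~(hpagesize - 1);

    printf("requested %llu MB, hugetlbfs-backed %llu MB, wasted %llu MB\n",
           (unsigned long long)(ram_size >> 20),
           (unsigned long long)(backed >> 20),
           (unsigned long long)((backed - ram_size) >> 20));
    /* prints: requested 4097 MB, hugetlbfs-backed 5120 MB, wasted 1023 MB */
    return 0;
}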