On Wed, Oct 30, 2013 at 05:49:49PM +0100, Igor Mammedov wrote:
> On Tue, 29 Oct 2013 19:38:44 -0200
> Marcelo Tosatti <mtosa...@redhat.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 07:18:49PM +0100, Igor Mammedov wrote:
> > > Otherwise 1GB TLBs cannot be cached for the range.
> > 
> > This fails to back non-1GB-aligned gpas, but 2MB aligned, with 2MB large
> > pages.
> With the current command line only one hugetlbfs mount point is possible, so it
> will back RAM with whatever alignment the specified hugetlbfs mount point has.
> Anything that doesn't fit into the hugepage-aligned region goes to the tail,
> using the non-hugepage-backed phys_mem_set_alloc()=qemu_anon_ram_alloc()
> allocator.
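
[To make the split being discussed concrete, here is a minimal sketch of backing the hugepage-aligned head of a block from a hugetlbfs file and falling back to anonymous memory for the unaligned tail, roughly what the qemu_anon_ram_alloc() fallback amounts to. This is an illustration only, not the actual patch: the function name, parameters, and error handling are invented.]

#include <stddef.h>
#include <sys/mman.h>

/* Illustration only: split an allocation of 'size' bytes into a
 * hugepage-aligned head backed by a hugetlbfs file and an anonymous tail. */
static void *alloc_hugepage_head_anon_tail(int hugetlbfs_fd, size_t hpagesize,
                                           size_t size, void **tail)
{
    size_t aligned = size & ~(hpagesize - 1); /* largest multiple of hpagesize */
    void *head = NULL;

    if (aligned) {
        /* head: mapped from the hugetlbfs mount point, so it is backed by
         * hugepages of that mount point's size */
        head = mmap(NULL, aligned, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE, hugetlbfs_fd, 0);
        if (head == MAP_FAILED) {
            return NULL;
        }
    }

    if (size > aligned) {
        /* tail: whatever does not fit the hugepage size goes to plain
         * anonymous memory (the qemu_anon_ram_alloc() fallback above) */
        *tail = mmap(NULL, size - aligned, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (*tail == MAP_FAILED) {
            *tail = NULL;
        }
    } else {
        *tail = NULL;
    }
    return head;
}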
The patch you propose allocates the non-1GB-aligned tail of RAM with 4k pages.
As mentioned, this is not acceptable: 2MB pages should be used whenever 1GB
alignment is not possible. I believe it's easier for the user to allocate
enough 1GB pages to back all of guest RAM, since allocation is static, than to
allocate mixed 1GB/2MB pages in hugetlbfs.

> > Since hugetlbfs allocation is static, it requires the user to inform
> > different 1GB and 2MB sized hugetlbfs mount points (with the proper number
> > of corresponding hugetlbfs pages allocated). This is incompatible with
> > the current command line, and I'd like to see this problem handled in a
> > way that is command line backwards compatible.
> The patch doesn't change that; it uses the provided hugetlbfs and falls back
> (hunk 2) to phys_mem_alloc if the requested memory region is not hugepage-size
> aligned. So there is no CLI change, only a memory leak fix.
> 
> > Also, if the argument for one-to-one mapping between dimms and linear host
> > virtual address sections holds, it means virtual DIMMs must be
> > partitioned into whatever hugepage alignment is necessary (and in
> > that case, why can't they be partitioned similarly with the memory
> > region aliases?).
> Because during hotplug a new memory region of the desired size is allocated,
> and it can be mapped directly without any aliasing. And if some day we
> convert the ad hoc initial memory allocation to dimm devices, there is no
> reason to allocate one huge block and then invent a means to alias the hole
> somewhere else; we could just reuse memdev/dimm and allocate several memory
> regions with the desired properties, each represented by a memdev/dimm pair.
> 
> One-to-one mapping simplifies the design and the interface with the ACPI part
> during memory hotplug.
> 
> For the hotplug case the flow could look like:
> 
> memdev_add id=x1,size=1Gb,mem-path=/hugetlbfs/1gb,other-host-related-stuff-options
> # memdev could enforce size to be backend aligned
> device_add dimm,id=y1,backend=x1,addr=xxxxxx
> # dimm could get alignment from the associated memdev, or fail if addr
> # doesn't meet the alignment of the memdev backend
> 
> memdev_add id=x2,size=2mb,mem-path=/hugetlbfs/2mb
> device_add dimm,id=y2,backend=x2,addr=yyyyyyy
> 
> memdev_add id=x3,size=1mb
> device_add dimm,id=y3,backend=x3,addr=xxxxxxx
> 
> A linear memory block is allocated at runtime (the user has to make sure that
> enough hugepages are available) by each memdev_add command, and that RAM
> memory region is mapped into GPA space by the virtual DIMM as is, so there
> wouldn't be any need for aliasing.
> 
> Now back to initial memory and the bright future we are looking forward to
> (i.e. the ability to create a machine from a configuration file without ad hoc
> coding like pc_memory_init()):
> 
> The legacy cmdline "-m 4512 -mem-path /hugetlbfs/1gb" could be automatically
> translated into:
> 
> -memdev id=x1,size=3g,mem-path=/hugetlbfs/1gb -device dimm,backend=x1,addr=0
> -memdev id=x2,size=1g,mem-path=/hugetlbfs/1gb -device dimm,backend=x2,addr=4g
> -memdev id=x3,size=512m -device dimm,backend=x3,addr=5g
> 
> Or the user could drop the legacy CLI and assume fine-grained control over the
> memory configuration:
> 
> -memdev id=x1,size=3g,mem-path=/hugetlbfs/1gb -device dimm,backend=x1,addr=0
> -memdev id=x2,size=1g,mem-path=/hugetlbfs/1gb -device dimm,backend=x2,addr=4g
> -memdev id=x3,size=512m,mem-path=/hugetlbfs/2mb -device dimm,backend=x3,addr=5g
> 
> So if we are going to break migration compatibility for the new machine type,
> let's do it in a way that can be painlessly changed to memdev/device in the
> future.
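
[The quoted flow above says a dimm could take its alignment from the associated memdev and fail if addr doesn't meet it. As a rough sketch of that check only, assuming a hypothetical memdev descriptor that records the page size of its backing (1GB, 2MB or 4KB); none of these type or field names exist in QEMU today:]

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical backend descriptor: size would be enforced to be a multiple
 * of page_size when the memdev is created. */
typedef struct MemDev {
    uint64_t size;
    uint64_t page_size;   /* 1G, 2M or 4K depending on mem-path */
} MemDev;

/* A dimm derives its alignment from the memdev backend and rejects a GPA
 * that does not meet it. */
static bool dimm_check_addr(const MemDev *backend, uint64_t addr)
{
    if (addr & (backend->page_size - 1)) {
        fprintf(stderr, "dimm: addr 0x%" PRIx64 " not aligned to backend "
                "page size 0x%" PRIx64 "\n", addr, backend->page_size);
        return false;
    }
    return true;
}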
OK, then please improve your proposal to allow for multiple hugetlbfs mount
points.

> > > PS:
> > > as a side effect we are not wasting ~1Gb of memory if
> > > 1Gb hugepages are used and -m "hpagesize(in Mb)*n + 1"
> > 
> > This is how hugetlbfs works. You waste a 1GB hugepage if an extra
> > byte is requested.
> It looks more like a bug than a feature;
> why do it if the leak can be avoided as shown below?

Because IMO it is confusing for the user, since hugetlbfs allocation is static.
But if you have a necessity for the one-to-one relationship, feel free to
support mixed hugetlbfs page sizes.
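
[For the record, the ~1GB of waste mentioned in the PS is just the round-up to the next hugepage boundary when a single 1GB hugetlbfs mount point backs all of RAM. A standalone arithmetic example; the numbers are picked for illustration and not taken from QEMU code:]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t hpagesize = 1ULL << 30;                    /* 1GB hugepage */
    uint64_t ram_size  = 4ULL * hpagesize + (1 << 20);  /* e.g. -m 4097 (MB) */

    /* round the backing file up to the next hugepage boundary */
    uint64_t backed = (ram_size + hpagesize - 1) & ~(hpagesize - 1);

    printf("requested %llu MB, hugetlbfs-backed %llu MB, wasted %llu MB\n",
           (unsigned long long)(ram_size >> 20),
           (unsigned long long)(backed >> 20),
           (unsigned long long)((backed - ram_size) >> 20));
    /* prints: requested 4097 MB, hugetlbfs-backed 5120 MB, wasted 1023 MB */
    return 0;
}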