On Fri, Mar 12, 2010 at 04:24:24PM +0000, Paul Brook wrote:
> > On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > > 2097152
> > >
> > > Hmm, ok. I'm guessing linux doesn't support anything other than "huge"
> > > and "normal" page sizes now, so it's a question of whether we want it to
> > > expose current implementation details, or say "Align big in-memory things
> > > this much for optimal TLB behavior".
> >
> > hugetlbfs already exposes the implementation detail. So if you want
> > that it's already available. The whole point of going the extra mile
> > with a transparent solution is to avoid userland to increase in
> > complexity and to keep it as unaware of hugepages as possible. The
> > madvise hint basically means "this won't risk to waste memory if you
> > use large tlb on this mapping" and also "this mapping is more
> > important than others to be backed by hugepages". It's up to the
> > kernel what to do next. For example right now khugepaged doesn't
> > prioritize scanning the madvise regions first, it basically doesn't
> > matter for hypervisor solutions in the cloud (all anon memory in the
> > system is only allocated by kvm...). But later we may prioritize it
> > and try to be smarter from the hint given by userland.
>
> So shouldn't [the name of] the value the kernel provides for recommended
> alignment be equally implementation agnostic?
Is the /sys/kernel/mm/transparent_hugepage directory implementation agnostic in the first place? This is not a black-and-white issue: the idea of transparency is to have userland know as little as possible, without actually losing any feature (in fact gaining _more_!) compared to hugetlbfs, which requires userland to set up the whole thing, lose paging, lose KSM (well, right now transparent hugepages lose KSM too, but we'll fix that with transparent hugepage support later), etc...

If we want to take full advantage of the feature (i.e. NPT and the first 2M of qemu guest physical RAM, where the kernel usually resides), userspace has to know the alignment size the kernel recommends, so this particular piece of information can't be implementation agnostic. In short we do everything possible to avoid changing userland, and the result is indeed just a few-liner change, but that few-liner change is required, be it a hint asking the kernel to align or the use of posix_memalign (which is more efficient, as virtual memory is cheaper than vmas IMHO).

The only thing I'm undecided about is whether this should be called hpage_pmd_size or just hpage_size. Suppose AMD/Intel add 64k pages too next year and the kernel decides to use them as well when it fails to allocate a 2M page, escalating the fallback from 2M -> 64k -> 4k, and HPAGE_PMD_SIZE becomes 64k. qemu would still have to align to the maximum possible hpage_size provided by transparent hugepage support. So with this new reasoning I think hpage_size or max_hpage_size would be a better sysfs name for this. What do you think, hpage_size or max_hpage_size?
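To make it concrete, the qemu side could look more or less like this (untested sketch, the helper names are made up and not the actual qemu code; MADV_HUGEPAGE is the hint added by the transparent hugepage patches):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Read the kernel's recommended alignment from sysfs; returns 0 if
   transparent hugepages aren't available. The filename would change
   if we rename it to hpage_size/max_hpage_size. */
static size_t recommended_hpage_size(void)
{
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
			"r");
	unsigned long size = 0;

	if (f) {
		if (fscanf(f, "%lu", &size) != 1)
			size = 0;
		fclose(f);
	}
	return size;
}

/* Allocate guest RAM aligned so the first 2M of guest physical
   memory can be backed by a hugepage. */
static void *alloc_guest_ram(size_t ram_size)
{
	size_t align = recommended_hpage_size();
	void *ptr;

	if (!align)
		align = getpagesize();	/* no THP: plain page alignment */

	/* posix_memalign over-allocates virtual memory to get the
	   alignment, which is cheaper than splitting vmas. */
	if (posix_memalign(&ptr, align, ram_size))
		return NULL;
#ifdef MADV_HUGEPAGE
	/* Hint that this mapping is worth backing with hugepages. */
	madvise(ptr, ram_size, MADV_HUGEPAGE);
#endif
	return ptr;
}

That's really all the hugepage awareness userland would ever need: one sysfs read for the alignment plus the madvise hint.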