On Fri, Mar 12, 2010 at 04:24:24PM +0000, Paul Brook wrote:
> > On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > > 2097152
> > >
> > > Hmm, ok. I'm guessing linux doesn't support anything other than "huge"
> > > and "normal" page sizes now, so it's a question of whether we want it to
> > > expose current implementation details, or say "Align big in-memory things
> > > this much for optimal TLB behavior".
> >
> > hugetlbfs already exposes the implementation detail. So if you want
> > that it's already available. The whole point of going the extra mile
> > with a transparent solution is to avoid userland to increase in
> > complexity and to keep it as unaware of hugepages as possible. The
> > madvise hint basically means "this won't risk to waste memory if you
> > use large tlb on this mapping" and also "this mapping is more
> > important than others to be backed by hugepages". It's up to the
> > kernel what to do next. For example right now khugepaged doesn't
> > prioritize scanning the madvise regions first, it basically doesn't
> > matter for hypervisor solutions in the cloud (all anon memory in the
> > system is only allocated by kvm...). But later we may prioritize it
> > and try to be smarter from the hint given by userland.
>
> So shouldn't [the name of] the value the kernel provides for recommended
> alignment be equally implementation agnostic?
Is the /sys/kernel/mm/transparent_hugepage directory implementation agnostic in the first place? This is not a black-and-white issue: the idea of transparency is to have userland know as little as possible, without actually losing any feature (in fact gaining _more_!) compared to hugetlbfs, which requires userland to set up the whole thing, lose paging, lose KSM (well, right now transparent hugepages lose KSM too, but we'll fix that with transparent hugepage support later), etc...

If we want to take full advantage of the feature (i.e. NPT and the first 2M of qemu guest physical RAM, where the kernel usually resides), userspace has to know the alignment size the kernel recommends, so this particular piece of information can't be implementation agnostic. In short we do everything possible to avoid changing userland, and the result is indeed just a few-liner change, but that few-liner change is required, be it a hint asking the kernel to align or the use of posix_memalign (which is more efficient, as virtual memory is cheaper than vmas IMHO).

The only thing I'm undecided about is whether this should be called hpage_pmd_size or just hpage_size. Suppose AMD/Intel add 64k pages too next year and the kernel decides to use them as well when it fails to allocate a 2M page, escalating the fallback from 2M -> 64k -> 4k, and HPAGE_PMD_SIZE becomes 64k. qemu would still have to align to the maximum possible hpage_size provided by transparent hugepage support. So with this new reasoning I think hpage_size or max_hpage_size would be a better sysfs name for this. What do you think, hpage_size or max_hpage_size?
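To make it concrete, the qemu side could look more or less like this (untested sketch, the helper names are made up and not the actual qemu code; MADV_HUGEPAGE is the hint added by the transparent hugepage patches):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Read the kernel's recommended alignment from sysfs; returns 0 if
   transparent hugepages aren't available. The filename would change
   if we rename it to hpage_size/max_hpage_size. */
static size_t recommended_hpage_size(void)
{
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
			"r");
	unsigned long size = 0;

	if (f) {
		if (fscanf(f, "%lu", &size) != 1)
			size = 0;
		fclose(f);
	}
	return size;
}

/* Allocate guest RAM aligned so the first 2M of guest physical
   memory can be backed by a hugepage. */
static void *alloc_guest_ram(size_t ram_size)
{
	size_t align = recommended_hpage_size();
	void *ptr;

	if (!align)
		align = getpagesize();	/* no THP: plain page alignment */

	/* posix_memalign over-allocates virtual memory to get the
	   alignment, which is cheaper than splitting vmas. */
	if (posix_memalign(&ptr, align, ram_size))
		return NULL;
#ifdef MADV_HUGEPAGE
	/* Hint that this mapping is worth backing with hugepages. */
	madvise(ptr, ram_size, MADV_HUGEPAGE);
#endif
	return ptr;
}

That's really all the hugepage awareness userland would ever need: one sysfs read for the alignment plus the madvise hint.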