> The page is allocated at an uninteresting point in time. For example,
> the boot loader allocates a bunch of pages.
The vast majority of pages are allocated when a process wants them or
the kernel uses them for file cache.

>
> >>executes. First access happens somewhat later, but still we cannot
> >>count on the majority of accesses to come from the same cpu as the first
> >>access.
> >>
> >
> >It is a reasonable heuristic. It's just like the rather
> >successful default local allocation heuristic the native kernel uses.
> >
>
> It's very different. The kernel expects an application that touched
> page X on node Y to continue using page X on node Y. Because
> applications know this, they are written to this assumption. However,

The vast majority of applications do not actually know where memory is.
What matters is that you get local accesses most of the time for the
memory that is touched on a specific CPU. Even the applications that do
know won't break if the memory ends up somewhere else, because it's only
an optimization. As long as you're faster on average (or in the worst
case not significantly worse) than not having it, you're fine.

Also, the Linux first touch policy is a heuristic that can be wrong
later, and I don't see too much difference in having another heuristic
level on top of it.

The scheme I described is an approximate heuristic to get local memory
access in many cases without pinning anything to CPUs. It is certainly
not perfect and has holes (like any heuristic), but it has the advantage
of being fully dynamic.
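To make the first-touch behaviour concrete, here is a minimal user-space
sketch; it assumes libnuma (link with -lnuma) and a glibc that provides
sched_getcpu(), and error handling is mostly omitted:

/*
 * Minimal sketch of first-touch local allocation: anonymous memory is
 * physically allocated on the node of the CPU that first writes to it.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED || numa_available() < 0)
		return 1;

	memset(p, 1, len);	/* first touch: the page is allocated here */

	int page_node = -1;
	/* ask the kernel which node actually backs the page */
	get_mempolicy(&page_node, NULL, 0, p, MPOL_F_NODE | MPOL_F_ADDR);

	int cpu_node = numa_node_of_cpu(sched_getcpu());
	printf("page on node %d, cpu on node %d\n", page_node, cpu_node);

	/*
	 * With the default policy the two normally match -- unless the
	 * scheduler moved the thread in between, which is exactly how
	 * the heuristic can be wrong later.
	 */
	return 0;
}

If the thread migrates to another node afterwards, the page stays where
it was first touched; that is the imprecision being argued about here.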
> in a virtualization context, the guest kernel expects that page X
> belongs to whatever node the SRAT table points at, without regard to
> the first access.
>
> Guest kernels behave differently from applications, because real
> hardware doesn't allocate pages dynamically like the kernel can for
> applications.

Again, the kernel just wants local memory access most of the time for
the allocations where it matters. Also, NUMA is always an optimization;
it's not a big issue if you're wrong occasionally, because that doesn't
affect correctness.

>
> (btw, what do you do with cpu-less nodes? I think some sgi hardware
> has them)

You assign them to a nearby node, but it's really a totally unimportant
corner case.

>
> >>>The alternative is to keep your own pools and allocate from the
> >>>correct pool, but then you either need pinning or getcpu()
> >>>
> >>>
> >>This is meaningless in kvm context. Other than small bits of memory
> >>needed for I/O and shadow page tables, the bulk of memory is allocated
> >>once.
> >>
> >
> >Mapped once. Anyways that could be changed too if there was need.
> >
> >
>
> Mapped once and allocated once (not at the same time, but fairly close).

That seems dubious to me.

> No. Linux will assume a page belongs to the node the SRAT table says it
> belongs to. Whether first access will be from the local node depends on
> the workload. If the first application running accesses all memory from
> a single cpu, we will allocate all memory from one node, but this is
> wrong.

Sorry, I don't get your point. "Wrong" doesn't make sense in this
context. You seem to be saying that an allocation that is not local on a
native kernel wouldn't be local in the approximate heuristic either. But
that's a triviality that is of course true and likely not what you meant
anyway.

>
> >>(2) even without npt/ept, we have no idea how often mappings are used
> >>and by which cpu. finding out is expensive.
> >>
> >
> >You see a fault on the first mapping. That fault is on the CPU that
> >did the access. Therefore you know which one it was.
> >
>
> It's meaningless information. First access means nothing.

And again, at least in Linux the first access to the majority of memory
is either through a process page allocation or through a file cache
(page cache) allocation. Yes, there are a few boot loader and temporary
kernel pages for which this is not true, but they are a small,
insignificant fraction of the total memory in a reasonably sized guest.
I'm just ignoring them. This can often be observed in that if you have a
broken DIMM you only get problems after running some program that uses
most of your memory.

> the guest doesn't expect the page to move to the node where it touched
> it.

The only thing the guest cares about is getting good performance on the
memory access, and even there only if the page is actually used often
(i.e. mapped to a process). How that is achieved is not its concern.

>
> (we also see first access with ept)

Great.

>
> >>(3) for many workloads, there are no unused pages. the guest
> >>application allocates all memory and manages memory by itself.
> >>
> >
> >First, a common case of a guest using all memory is file cache,
> >but for NUMA purposes file cache locality typically doesn't
> >matter because it's not accessed frequently enough that
> >non-locality is a problem. It really only matters for mappings
> >that are used often by the CPU.
> >
> >When a single application allocates everything and keeps it, that is
> >fine too, because you'll give it approximately local memory on the
> >initial set up (assuming the application has reasonable NUMA behaviour
> >by itself on a first touch local allocation policy)
> >
>
> Sure, for the simple cases it works. But consider your first example
> followed by the second (you can even reboot the guest in the middle,
> but the bad assignment sticks).

If you have a heuristic to detect remapping you'll recover on each
remapping.

>
> And if the vcpu moves for some reason, things get screwed up
> permanently.

Yes, that's a basic problem of anything that doesn't migrate or pin. No
way around it. Native Linux has it too. You just hope you typically
don't move, and in many cases it works reasonably well.

>
> We should try to be predictable,

NUMA is unfortunately somewhat unpredictable even on native kernels.
There are always situations where a hot page can end up on the wrong
node. That tends to make benchmarkers unhappy, but so far no good
general way to avoid it is known. Migration would be more predictable,
but it has the drawback of very bad worst cases because the costs are so
high.

> not depend on behavior the guest has no
> real reason to follow, if it follows hardware specs.

Sorry, Avi, I suspect you have a somewhat unrealistic mental model of
the NUMA knowledge in applications and OSes. At least Linux's behaviour
(and, I assume, that of most NUMA-optimized OSes) would be handled
reasonably well by this scheme, I think.

It was just one proposal. Anyway, it might still not work well in
practice -- the only way to find out would be to implement and try --
but I think it should not be dismissed out of hand.

-Andi

--
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
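For what it's worth, the "keep your own pools and allocate from the
correct pool" alternative quoted earlier in the thread would, without
pinning, look roughly like the sketch below. The pool_* helpers are
hypothetical and the whole thing assumes libnuma (link with -lnuma) plus
sched_getcpu():

/*
 * Rough sketch of per-node pools selected with getcpu() instead of
 * pinning.  One pre-allocated region per NUMA node, handed out based
 * on whichever node the caller happens to be running on.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_BYTES (16 << 20)

static void **pool;	/* one pre-allocated region per NUMA node */

static void pool_init(void)
{
	int nodes = numa_max_node() + 1;

	pool = calloc(nodes, sizeof(*pool));
	for (int n = 0; n < nodes; n++)
		pool[n] = numa_alloc_onnode(POOL_BYTES, n);
}

/*
 * Hand out memory from the pool of the node we are currently running
 * on.  Without pinning this is only a hint: the scheduler may move us
 * to another node right after sched_getcpu() returns.
 */
static void *pool_get(void)
{
	return pool[numa_node_of_cpu(sched_getcpu())];
}

int main(void)
{
	if (numa_available() < 0)
		return 1;
	pool_init();
	printf("buffer %p for cpu %d\n", pool_get(), sched_getcpu());
	return 0;
}

The window between sched_getcpu() and the actual memory accesses is why
this, too, can only ever be a heuristic, just like first touch.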
