André Przywara wrote:
I was partly wrong: the code is in BOCHS CVS, but not in qemu. It wasn't
in the BOCHS 2.3.7 release, which qemu is currently based on. Could you pull
the latest BIOS code from BOCHS CVS to qemu? This would give us the
firmware interface for free and I could more easily port my patches.
Working on that right now. BOCHS CVS has diverged a fair bit from what
we have, so I'm adjusting our current patches and doing regression testing.
What's actually bothering you about the libnuma dependency? I
could directly use the Linux mbind syscall, but I think using a library
is more sane (and probably more portable).
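For what it's worth, here is a rough sketch (not the actual patch code) of
what the two options look like side by side; bind_region_*(), addr and
node are made up for illustration:

/* Sketch only: bind a chunk of guest RAM to one NUMA node, either via
 * libnuma or via the raw mbind() syscall. */
#include <numa.h>     /* numa_available(), numa_tonode_memory() */
#include <numaif.h>   /* mbind(), MPOL_BIND */
#include <stdio.h>

static void bind_region_libnuma(void *addr, size_t len, int node)
{
    if (numa_available() == -1)
        return;                          /* no NUMA policy support here */
    numa_tonode_memory(addr, len, node); /* nodemask handling is hidden */
}

static void bind_region_mbind(void *addr, size_t len, int node)
{
    unsigned long nodemask = 1UL << node;   /* hand-rolled, single node */
    if (mbind(addr, len, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) < 0)
        perror("mbind");
}

The libnuma variant also keeps working once the node count no longer fits
into a single machine word, which is exactly where hand-rolling the
nodemask for mbind() gets ugly.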
You're making a default policy decision (pin nodes and pin CPUs). You're
assuming that Linux will do the wrong thing by default and that the
decision we'll be making is better.
That policy decision requires more validation. We need benchmarks
showing what performance looks like with and without pinning, and we need
to understand whether any bad performance is a Linux bug that can be fixed
or whether it's something fundamental.
What I'm concerned about is that it'll make the default situation
worse. I advocated punting to management tools because that at least
gives users the ability to make their own decisions, which means you
don't have to prove that this is the correct default decision.
I don't care about a libnuma dependency. Library dependencies are fine
as long as they're optional.
Almost right, but simply calling qemu-system-x86_64 can lead to bad
situations. I recently saw VCPU #0 scheduled on one node and
VCPU #1 on another. This leads to random (probably excessive) remote
accesses from the VCPUs, since the guest assumes uniform memory access.
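For reference, this is roughly how one can check from inside each VCPU
thread where the scheduler has put it; purely illustrative, not from the
patches, and it assumes glibc's sched_getcpu() and libnuma v2's
numa_node_of_cpu():

#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() */
#include <numa.h>    /* numa_node_of_cpu() */
#include <stdio.h>

static void report_vcpu_placement(int vcpu_index)
{
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    fprintf(stderr, "VCPU #%d is running on CPU %d (node %d)\n",
            vcpu_index, cpu, node);
}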
That seems like Linux is behaving badly, no? Can you describe the
situation more?
That is just my observation. I have to do more research to get a decent
explanation, but I think the problem is that at this early stage the
threads barely touch any memory, so Linux tries to spread them out as
evenly as possible. Here is just a quick run on a quad-node machine with
16 cores in total:
How does memory migration fit into all of this though? Statistically
speaking, if your NUMA guest is behaving well, it should be easy to
recognize the groupings and perform the appropriate page migration. I
would think even the most naive page migration tool would be able to do
the right thing.
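To make that concrete, with libnuma v2 even a simple external tool could
move an already-running guest's pages between nodes after the fact; a
minimal sketch, where move_guest_memory() and its arguments are made up
and error handling is trimmed:

#include <numa.h>
#include <stdio.h>

static int move_guest_memory(int pid, int from_node, int to_node)
{
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    int ret;

    numa_bitmask_setbit(from, from_node);
    numa_bitmask_setbit(to, to_node);

    ret = numa_migrate_pages(pid, from, to);  /* wraps migrate_pages(2) */
    if (ret < 0)
        perror("numa_migrate_pages");

    numa_bitmask_free(from);
    numa_bitmask_free(to);
    return ret;
}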
NUMA systems are expensive. If a customer cares about performance
(as opposed to just getting more memory), then I think tools like
numactl are pretty well known.
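For instance, whole-VM pinning is already a one-liner today without any
new qemu options (node 0 and the qemu arguments below are arbitrary):

numactl --cpunodebind=0 --membind=0 qemu-system-x86_64 -smp 2 -m 1024 ...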
Well, "expensive" is relative, especially if I think of your employer ;-) In
fact every AMD dual-socket server is NUMA, and Intel will join the
game next year.
But the NUMA characteristics of an AMD system are relatively minor. I
doubt that static pinning is what most users want, since it
could reduce overall system performance noticeably.
Even with more traditional NUMA systems, the cost of remote memory
access is often outweighed by the opportunity cost of leaving a CPU idle.
That's what pinning does: it potentially leaves CPUs idle.
Additionally, one could use some kind of home node, so one could
temporarily change a VCPU's affinity and later return to the optimal
affinity (where its memory is located) without specifying it again.
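A rough sketch of that home-node idea, assuming libnuma's
numa_run_on_node(); the helper names and the per-VCPU variable are
made up:

#include <numa.h>
#include <stdio.h>

static int vcpu_home_node = -1;

static void vcpu_set_home_node(int node)
{
    vcpu_home_node = node;
    if (numa_run_on_node(node) < 0)   /* bind this thread to the node's CPUs */
        perror("numa_run_on_node");
}

static void vcpu_relax_affinity(void)
{
    numa_run_on_node(-1);             /* -1: allow running on all nodes again */
}

static void vcpu_return_home(void)
{
    if (vcpu_home_node >= 0)
        numa_run_on_node(vcpu_home_node);  /* back to where the memory lives */
}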
Please resubmit with the first three patches at the front. I don't
think exposing NUMA attributes to a guest is at all controversial, so
that's relatively easy to apply.
I'm not saying that the last patch can't be applied, but I don't think
it's as obvious that it's going to be a win when you start doing
performance tests.
Regards,
Anthony Liguori