André Przywara wrote:
I was partly wrong: the code is in BOCHS CVS, but not in qemu. It wasn't
in the BOCHS 2.3.7 release, which qemu is currently based on. Could you pull
the latest BIOS code from BOCHS CVS into qemu? This would give us the
firmware interface for free, and I could more easily port my patches.

Working on that right now. BOCHS CVS has diverged a fair bit from what we have, so I'm adjusting our current patches and doing regression testing.

What's actually bothering you about the libnuma dependency? I
could use the Linux mbind syscall directly, but I think using a library
is saner (and probably more portable).
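Just to make the comparison concrete, here is a rough sketch of the two options (illustrative only; the function names are made up and this is not the actual patch code):

/*
 * Illustrative sketch only: binding a (page-aligned) guest RAM region
 * to a node, once with libnuma and once with the raw mbind syscall.
 */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <numa.h>           /* only needed for the libnuma variant */

#ifndef MPOL_BIND
#define MPOL_BIND 2         /* kernel ABI value, normally from <numaif.h> */
#endif

/* With libnuma: one call, the library hides the nodemask handling. */
static void bind_region_libnuma(void *ram, size_t len, int node)
{
    if (numa_available() < 0)
        return;                          /* kernel without NUMA support */
    numa_tonode_memory(ram, len, node);
}

/* Without libnuma: the raw syscall, no extra build dependency. */
static void bind_region_raw(void *ram, unsigned long len, int node)
{
    unsigned long nodemask = 1UL << node;

    /* mbind(addr, len, mode, nodemask, maxnode, flags) */
    syscall(SYS_mbind, ram, len, MPOL_BIND,
            &nodemask, sizeof(nodemask) * 8, 0);
}

The raw syscall is only a handful of extra lines, so either way the dependency could easily stay optional.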

You're making a default policy decision (pin nodes and pin CPUs). You're assuming that Linux will do the wrong thing by default and that the decision we'll be making is better.

That policy decision requires more validation. We need benchmarks showing what performance looks like with and without pinning, and we need to understand whether the bad performance is a Linux bug that can be fixed or whether it's something fundamental.

What I'm concerned about is that it'll make the default situation worse. I advocated punting to management tools because that at least gives the user the ability to make their own decisions, which means you don't have to prove that this is the correct default decision.

I don't care about a libnuma dependency. Library dependencies are fine as long as they're optional.

Almost right, but simply calling qemu-system-x86_64 can lead to bad situations. I recently saw VCPU #0 scheduled on one node and VCPU #1 on another. This leads to random (and probably excessive) remote accesses from the VCPUs, since the guest assumes uniform memory access.

That seems like Linux is behaving badly, no? Can you describe the situation more?

That is just my observation. I have to do more research to get a decent
explanation, but I think the problem is that at this early stage the
threads barely touch any memory, so Linux tries to spread them out as
evenly as possible. Just a quick run on a quad-node machine with 16 cores
in total:

How does memory migration fit into all of this though? Statistically speaking, if your NUMA guest is behaving well, it should be easy to recognize the groupings and perform the appropriate page migration. I would think even the most naive page migration tool would be able to do the right thing.
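As a rough illustration (the function below is invented, not taken from any existing tool), even a very naive migration step could be as small as this, using libnuma's wrapper around the migrate_pages(2) syscall:

/*
 * Illustrative sketch of a naive page-migration step: once a grouping
 * has been recognized, move every page of the qemu process that sits
 * on from_node over to to_node.  numa_migrate_pages() wraps the
 * migrate_pages(2) syscall (libnuma 2.x bitmask API assumed).
 */
#include <sys/types.h>
#include <stdio.h>
#include <numa.h>

static int migrate_guest_pages(pid_t qemu_pid, int from_node, int to_node)
{
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    int ret;

    numa_bitmask_setbit(from, from_node);
    numa_bitmask_setbit(to, to_node);

    ret = numa_migrate_pages(qemu_pid, from, to);
    if (ret < 0)
        perror("numa_migrate_pages");

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return ret;
}

The mechanism itself is cheap to drive once the grouping is known.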

NUMA systems are expensive. If a customer cares about performance (as opposed to just getting more memory), then I think tools like numactl are pretty well known.

Well, whether they're expensive depends, especially if I think of your employer ;-) In
fact every AMD dual-socket server is NUMA, and Intel will join the game next year.

But the NUMA characteristics of an AMD system are relatively minor. I doubt that static pinning would be what most users want, since it could reduce overall system performance noticeably.

Even with more traditional NUMA systems, the cost of remote memory access is often outweighed by the opportunity cost of leaving a CPU idle. That's what pinning does: it potentially leaves CPUs idle.

Additionally, one could use some kind of home node: one could temporarily change a VCPU's affinity and later return it to the optimal affinity (where its memory is located) without specifying it again.
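A minimal sketch of what I mean (all names invented, this is not part of the patches): remember the home node's CPU set once per VCPU thread, and restoring it later needs no further input from the user.

/*
 * Illustrative "home node" sketch: record the CPU set of the node that
 * holds the VCPU's memory once; the affinity can then be loosened and
 * restored at will without the user specifying it again.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

struct vcpu_home {
    pid_t     tid;       /* VCPU thread id */
    cpu_set_t home_set;  /* CPUs of the node where its memory lives */
};

/* Temporarily let the scheduler place the VCPU on any of ncpus CPUs. */
static int vcpu_unpin(const struct vcpu_home *v, int ncpus)
{
    cpu_set_t all;
    int i;

    CPU_ZERO(&all);
    for (i = 0; i < ncpus; i++)
        CPU_SET(i, &all);
    return sched_setaffinity(v->tid, sizeof(all), &all);
}

/* Later: return the VCPU to the node where its memory is located. */
static int vcpu_return_home(const struct vcpu_home *v)
{
    return sched_setaffinity(v->tid, sizeof(v->home_set), &v->home_set);
}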

Please resubmit with the first three patches at the front. I don't think exposing NUMA attributes to a guest is at all controversial, so that part is relatively easy to apply.

I'm not saying that the last patch can't be applied, but I don't think it's as obvious that it's going to be a win when you start doing performance tests.

Regards,

Anthony Liguori
