"Wil Reichert" <[EMAIL PROTECTED]> posted
[EMAIL PROTECTED], excerpted
below, on Wed, 29 Aug 2007 07:49:23 -0400:
> I often have long running cpu-intensive tasks running on my dual core
> machines. Will maintaining cpu / core affinity result in noticeably
> better performance? Or was that answered in the previous thread you
> referenced that I missed =)
Short version: In general, the automatic balancing does a reasonable
job. There's little reason to mess with things unless you have special
scheduling needs due to the hardware or software you are running.
Medium version: Core bounces cause a slight loss of efficiency due to the
cache bouncing, but the kernel has an affinity metric that should keep
that from happening too much, so unless you have special scheduling
needs, it's generally not worth worrying about. If you want to encourage
stronger "automatic" core affinity, tell the kernel to use SMP without
enabling multi-core. That gives it more resistance to switching cores,
because you are telling it to use the socket scheduling code, instead of
the core scheduling code, and the cost to switch between sockets is
higher so it tries harder not to do it. Note also the difference between
true multi-core and the Intel trick of gluing multiple CPUs (each of
which may be multiple cores) together in the same socket. The latter
is, for all intents and purposes, the same as separate sockets. There
are also NUMA differences: AMD has the concept of socket-local memory,
using a NUMA (non-uniform memory access) model, while most Intel
chipsets use a global memory model, where all memory is the same and is
accessed over the front-side bus.
In somewhat more detail: The kernel has a multi-layer CPU scheduling
model based on the expense of transferring threads and accessing memory
across CPUs/cores. Simplified layout (there may be more layers, the
number of nodes at each level doesn't have to be balanced, and speeds and
access to resources may differ vastly between machines in a compute
cluster, for instance, but you get the idea):
Machines/boards
    machine0
        sockets
            m0/socket0
                cores
                    m0/s0/core0
                    m0/s0/core1
            m0/socket1
                    m0/s1/core0
                    m0/s1/core1
    machine1
            m1/socket0
                    m1/s0/core0
                    m1/s0/core1
    ...
Note that I didn't model Intel's SMT technology. It'd be another level
beyond core, basically no cost to switch.
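
To see how a particular Linux box maps onto the socket/core layers above, something like the following might do. It is only a sketch: it assumes the kernel exposes the usual sysfs topology files (physical_package_id and core_id under /sys/devices/system/cpu/cpuN/topology/), and the function name read_topology is made up for illustration.

```python
import os

def read_topology(sysfs_root="/sys/devices/system/cpu"):
    """Group logical CPU numbers by socket and core, as reported by sysfs.

    Returns {socket_id: {core_id: [cpu numbers]}}.
    """
    topology = {}
    for entry in os.listdir(sysfs_root):
        # Only directories named cpu<N> describe logical CPUs.
        if not (entry.startswith("cpu") and entry[3:].isdigit()):
            continue
        topo_dir = os.path.join(sysfs_root, entry, "topology")
        try:
            with open(os.path.join(topo_dir, "physical_package_id")) as f:
                socket_id = int(f.read())
            with open(os.path.join(topo_dir, "core_id")) as f:
                core_id = int(f.read())
        except (OSError, ValueError):
            continue  # offline CPU, or no topology information exposed
        topology.setdefault(socket_id, {}).setdefault(core_id, []).append(int(entry[3:]))
    # Sort the CPU lists so the result does not depend on directory order.
    for cores in topology.values():
        for cpus in cores.values():
            cpus.sort()
    return topology
```

Running print(read_topology()) shows which CPU numbers share a socket, which is what you need to know before handing numbers to taskset. More than one CPU number listed under the same core would be the SMT level that the layout above leaves out.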
Transferring execution threads between cores is relatively cheap.
Depending on the specific architecture, often all you lose is L1 and
possibly L2 cache.
Transferring threads between sockets is somewhat more expensive.
Depending on the architecture, you may be losing local memory-controller
memory access, and you definitely lose L2 and possibly L3 cache (if the
arch includes L3). The kernel can either continue to access memory where
it is or transfer it to locally addressed memory. I believe it uses need-
based memory transfer. That is, the memory blocks are faulted into local
memory as they are accessed by the thread.
Transferring between machines is comparatively VERY expensive. You
definitely lose local memory access and often lose local permanent
storage access, if the task is accessing files. Memory transfer tends to
be everything at once. The effect on storage access depends on the site
architecture; if everything is stored on a SAN anyway, there's not much
effect, but if one is now accessing the formerly local storage remotely,
or has to transfer it to the new machine, that's further cost.
The kernel's scheduling (and the clustering software's, where
applicable) accounts for all this variation, the differing levels of cost
to transfer between cores vs. sockets vs. machines. It normally does a
fairly good job of weighing the transfer cost against the payback, so
there's a lot less resistance to transferring between cores than between
sockets, and between sockets than between machines. The higher up the
hierarchy you go, the greater the imbalance between an overscheduled
current location and an idle potential target has to be in order to
overcome the switching resistance, because the cost is so much higher.
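
As a toy illustration of that switching resistance (emphatically not the kernel's actual algorithm), one can picture a per-level threshold that the load imbalance has to exceed before a move pays off. The level names, the threshold numbers, and the should_migrate function are all invented for this sketch.

```python
# Illustrative only: these thresholds are made-up numbers, not values
# taken from the Linux scheduler or any clustering software.
MIGRATION_THRESHOLD = {
    "core": 1,      # other core, same socket: cheap, little cache lost
    "socket": 3,    # other socket: more cache lost, maybe local memory too
    "machine": 10,  # other machine: memory and perhaps storage must move
}

def should_migrate(level, source_load, target_load):
    """Return True if the imbalance is large enough to justify a move.

    Loads are counts of runnable tasks on the current location and on
    the candidate target at the given level of the hierarchy.
    """
    imbalance = source_load - target_load
    return imbalance > MIGRATION_THRESHOLD[level]
```

With these numbers, an imbalance of two tasks would justify hopping to the neighbouring core but not to the other socket, which is the general shape of the behaviour described above.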
So what's an example of an exception, where one would want to manually
interfere using taskset or the like?
Here's mine (desktop/workstation scenario). I have a dual CPU (socket)
machine, dual Opteron 242s at present, to be replaced with Opteron 290s
as soon as the quad-cores come online and bump down the prices (dual
Opteron 290s run $1200 plus ATM, $600x2). On this chipset (original AMD
8000 series), socket0 connects directly to all the peripherals: AGP video
(pre-PCI-E), multi-channel PCI-X (pre-PCI-E), Gigabit Ethernet on the
PCI-X, and the southbridge with everything else. Socket1 has only two
direct connections: to the memory hanging off its own controller, and the
HyperTransport link to socket0.
So here, tasks such as X that need fast access to the video, and to a
lesser extent (since the speed is lower) tasks that heavily access the
RAID array, networking, and components on the southbridge, work much more
efficiently if bound to Socket0 (currently CPU0, since I'm only running
single cores, eventually CPUs 0 and 1 when I upgrade to dual-cores). CPU
intensive stuff like emerging, or running emulated games (the single
closed source app remaining here is the original Master of Orion (MOO),
copyright 1993, which I run using DOSBox emulation) can run just fine on
Socket1 (currently CPU1, CPUs 2 and 3 once I upgrade to dual-cores) with
virtually no loss of efficiency.
So if I'm going to be gaming MOO I set X and amarok (which I often have
running in the background) to CPU0 and run DOSBox set to CPU1, where it
gets nearly 100% of the CPU all to itself, no X or anything else taking
their share.
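
For those who would rather script this than call taskset by hand, Python's os module exposes the same Linux affinity call (sched_setaffinity) that taskset is built on. A minimal sketch follows, assuming Linux and Python 3.3 or later. The pin_to_cpus helper and the PIDs in the comments are hypothetical.

```python
import os

def pin_to_cpus(pid, cpus):
    """Restrict process `pid` to the given set of CPU numbers.

    A pid of 0 means the calling process. Returns the affinity set the
    kernel reports afterwards, so the caller can check that it took.
    """
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)

# Hypothetical usage for the setup described above; the PIDs are placeholders:
#   pin_to_cpus(x_server_pid, {0})   # X on CPU0, next to the video hardware
#   pin_to_cpus(dosbox_pid, {1})     # DOSBox gets CPU1 to itself
```

Changing the affinity of a process owned by another user (X often runs as root) needs the matching privileges, just as it does with taskset.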
Similarly, I could do my emerges exclusively on CPU1 and not have them
bother X and my regular activity on CPU0 at all, almost as if the
compiling were happening on another machine entirely. I don't generally
do so, since setting PORTAGE_NICENESS=19 in make.conf tends to be
sufficient and I can then let emerge run on both CPUs. (Note that since I
have PORTAGE_TMPDIR on tmpfs, it doesn't do disk I/O for all those
temporary files created during compilation, only writing the final
package to the main filesystem. I run ccache so there's a bit of writing
for it, and when it needs something not already in disk cache there's a
bit of read access, but nothing major.)
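
If one did want both the niceness and the CPU restriction for a build, a small launcher could apply them to the child process before the command starts. This is only a sketch under assumptions: Linux, Python 3, and a hypothetical helper name run_pinned. It uses subprocess's preexec_fn hook, which is fine in a simple single-threaded script but is not recommended in threaded programs.

```python
import os
import subprocess

def run_pinned(argv, cpus, niceness=19):
    """Run a command restricted to `cpus`, with its niceness raised.

    The setup function runs in the child after fork and before exec, so
    the command and everything it spawns inherit both settings.
    """
    def setup():
        os.sched_setaffinity(0, cpus)  # 0 = this (child) process
        os.nice(niceness)              # raise niceness, i.e. lower priority
    return subprocess.run(argv, preexec_fn=setup)

# Hypothetical usage: keep a build on CPU1 only.
#   run_pinned(["emerge", "--update", "world"], {1})
```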
I'm really looking forward to getting the dual-cores, so I have two cores
directly connected to video and I/O, and two for doing primarily CPU
intensive stuff. That'll give me a lot more flexibility in scheduling
since I'll be able to schedule X on one direct-connect core and
everything else interactive on the other, while still having the two
socket1 cores to do CPU intensive stuff like merges. I expect I WILL
take advantage of taskset for emerging at that point. MOO in DOSBox is
CPU intensive enough that it takes a full CPU core for itself. Since X
and my other tasks take cycles on what's currently the only other CPU
core, I can't do anything else intensive like merging stuff while running
MOO. The 2x2-way will let me place MOO on socket0/core1, X etc. on
socket0/core0, and still let emerges go full steam on the two socket1
cores, which will be /very/ nice.
Another example (commercial server scenario), all the rage ATM, is
virtualization. On a multi-way system, it's quite useful to be able to
dedicate a core or two to specific heavier use VMs, while a number of
other VMs get to share a core or two, and the host system gets a core of
its own as well.
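
The bookkeeping for that kind of partitioning can be sketched in a few lines. This only plans which cores go where; actually applying the plan would be up to the virtualization tool or an affinity call. The function name and the VM names are hypothetical.

```python
def partition_cores(cores, dedicated_vms, host_cores=1):
    """Split cores into host, per-VM dedicated sets, and a shared pool.

    cores:         iterable of core numbers, in the order to hand out
    dedicated_vms: dict mapping VM name -> number of dedicated cores
    host_cores:    how many cores the host system keeps for itself
    """
    cores = list(cores)
    needed = host_cores + sum(dedicated_vms.values())
    if needed >= len(cores):
        raise ValueError("not enough cores left over for the shared pool")
    plan = {"host": set(cores[:host_cores])}
    next_free = host_cores
    for name, count in dedicated_vms.items():
        plan[name] = set(cores[next_free:next_free + count])
        next_free += count
    # Whatever remains is shared by all the lighter-use VMs.
    plan["shared"] = set(cores[next_free:])
    return plan
```

On an 8-way box, partition_cores(range(8), {"db": 2, "web": 1}) would give the host core 0, the "db" VM cores 1 and 2, the "web" VM core 3, and leave cores 4 through 7 as the shared pool.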
That sort of resource partitioning has been big iron (IBM s390s, Sun's
multicores) territory for some time, and the 4 and 8 socket Opteron
systems allowed some of that, although the tools to control it are only now
getting mature, but with quad-core now just hitting mainstream and dual-
socket dual-core already reasonably priced mainstream, such scenarios are
now actually within the reach of "ordinary humans", for both mainstream
business server use and for workstation and now desktop and even limited
laptop use. =8^)
They say ordinary users don't need multi-core, but while Gentoo's
certainly a bit beyond the ordinary user, it's not /that/ far out. This
dual-CPU machine, my first dual-CPU system, met all my expectations, but
it is now feeling just as cramped as that old 1.2 GHz Athlon system did
when I finally upgraded to this dual Opteron 242 system. So while I can't say
for sure I'll have use for 8-way or greater systems, I can certainly use
a four-way system, which means with a bit of education if necessary,
ordinary users should now be finding at least two-way systems useful,
just as I did several years ago.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman