"Wil Reichert" <[EMAIL PROTECTED]> posted
[EMAIL PROTECTED], excerpted
below, on  Wed, 29 Aug 2007 07:49:23 -0400:

> I often have long running cpu-intensive tasks running on my dual core
> machines.  Will maintaining cpu / core affinity result in noticeably
> better performance?  Or was that answered in the previous thread you
> referenced that I missed =)

Short version: In general, the automatic balancing does a reasonable 
job.  There's little reason to mess with things unless you have special 
scheduling needs due to the hardware or software you are running.

Medium version: Core bounces cause a slight loss of efficiency due to the 
cache bouncing, but the kernel has an affinity metric that should keep 
that from happening too much, so unless you have special scheduling 
needs, it's generally not worth worrying about.  If you want to encourage 
stronger "automatic" core affinity, tell the kernel to use SMP without 
enabling multi-core.  That gives it more resistance to switching cores, 
because you are telling it to use the socket scheduling code, instead of 
the core scheduling code, and the cost to switch between sockets is 
higher so it tries harder not to do it.  Note also the difference between 
true multi-core and the Intel trick of gluing multiple CPUs (each of 
which may be multiple cores) together using the same socket.  The latter 
is, for all intents and purposes, the same as separate sockets.  There 
are also NUMA differences: AMD has the concept of socket-local memory, 
using a NUMA (non-uniform memory access) model, while most Intel 
chipsets use a global memory model, where all memory is the same, 
accessed over the front-side bus.
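
To check which memory layout a given Linux box reports, the kernel lists 
its NUMA nodes under /sys/devices/system/node.  What follows is a minimal 
Python sketch, not part of the original post; the helper names 
parse_cpulist and numa_nodes are made up for illustration.  It only 
handles the plain "0-3,8" list format.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(base="/sys/devices/system/node"):
    """Return {node_number: [cpu, ...]}; empty dict if no nodes are exposed."""
    nodes = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        cpulist = os.path.join(path, "cpulist")
        if match and os.path.isfile(cpulist):
            with open(cpulist) as fh:
                nodes[int(match.group(1))] = parse_cpulist(fh.read())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print("node%d: CPUs %s" % (node, cpus))
```

A machine that reports a single node (or none) is using the global 
memory model as far as the scheduler is concerned; two or more nodes 
means memory is local to particular sockets.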

In somewhat more detail:  The kernel has a multi-layer CPU scheduling 
model based on the expense of transferring threads and accessing memory 
across CPUs/cores.  Here's a simplified layout.  (There may be more 
layers, and the number of nodes at each level doesn't have to be 
balanced; speeds and access to resources may be vastly different between 
machines in a compute cluster, for instance.  But you get the idea.)

Machines/boards
  machine0
    sockets
      m0/socket0
        cores
          m0/s0/core0
          m0/s0/core1
      m0/socket1
        cores
          m0/s1/core0
          m0/s1/core1
  machine1
    sockets
      m1/socket0
        cores
          m1/s0/core0
          m1/s0/core1
...

Note that I didn't model SMT (Intel's Hyper-Threading).  It'd be another 
level beyond core, with basically no cost to switch.
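
The layout above can be sketched in a few lines of Python.  This is only 
an illustration of the hierarchy, not how the kernel stores it; the Cpu 
record and the migration_level function are hypothetical names.

```python
from collections import namedtuple

# Hypothetical record: which machine, socket and core a thread is running on.
Cpu = namedtuple("Cpu", "machine socket core")


def migration_level(src, dst):
    """Return the highest level of the hierarchy a move from src to dst crosses."""
    if src.machine != dst.machine:
        return "machine"   # most expensive: memory and maybe storage move too
    if src.socket != dst.socket:
        return "socket"    # loses cache and possibly local memory access
    if src.core != dst.core:
        return "core"      # cheapest real move: mostly just cache
    return "same"          # no move at all


if __name__ == "__main__":
    print(migration_level(Cpu(0, 0, 0), Cpu(0, 1, 0)))
```

The higher the level returned, the more a move costs, which is what the 
next few paragraphs walk through.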

Transferring execution threads between cores is relatively cheap.  
Depending on the specific architecture, often all you lose is L1 and 
possibly L2 cache.

Transferring threads between sockets is somewhat more expensive.  
Depending on the architecture, you may be losing local memory-controller 
memory access, and you definitely lose L2 and possibly L3 cache (if the 
arch includes L3).  The kernel can either continue to access memory where 
it is or transfer it to locally addressed memory.  I believe it uses need-
based memory transfer.  That is, the memory blocks are faulted into local 
memory as they are accessed by the thread.

Transferring between machines is comparatively VERY expensive.  You 
definitely lose local memory access and often lose local permanent 
storage access, if the task is accessing files.  Memory transfer tends to 
be everything at once.  The effect on storage access depends on the site 
architecture; if everything is stored on a SAN anyway, there's not much effect 
on it, but if one is now accessing the formerly local storage remotely, 
or has to transfer it to the new machine, that's further cost.

The kernel's scheduling (and the clustering software's, if appropriate) 
accounts for all this variation: the differing levels of cost to transfer 
between cores vs. sockets vs. machines.  It normally does a fairly good job of 
calculating the transfer cost vs. payback, so there's a lot less 
resistance to transferring between cores as opposed to sockets, and 
sockets as opposed to machines.  The higher up the hierarchy you go, the 
greater the difference has to be between how overscheduled the current 
CPU is and how idle the potential target is, in order to overcome the 
switching resistance, because the cost is so much higher.
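
As a rough illustration of that "switching resistance", here is a toy 
Python model.  It is emphatically not the kernel's algorithm: the 
REQUIRED_GAP numbers and the should_migrate function are invented for the 
example, and "load" is simply a count of runnable tasks.

```python
# Invented numbers: how much busier the source must be than the target
# (in runnable tasks) before a move across that level is worth it.
REQUIRED_GAP = {"core": 2, "socket": 3, "machine": 8}


def should_migrate(level, src_runnable, dst_runnable):
    """Move a task only if the load gap is at least the level's required gap."""
    return (src_runnable - dst_runnable) >= REQUIRED_GAP[level]


if __name__ == "__main__":
    # Two runnable tasks here, none there: fine across cores, not across sockets.
    for level in ("core", "socket", "machine"):
        print(level, should_migrate(level, 2, 0))
```

The only point of the sketch is the shape of the table: the required gap 
grows as you go up the hierarchy.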

So what's an example of an exception, where one would want to manually 
interfere using taskset or the like?

Here's mine (desktop/workstation scenario).  I have a dual CPU (socket) 
machine, dual Opteron 242s at present, to be replaced with Opteron 290s 
as soon as the quad-cores come online and bump down the prices (dual 
Opteron 290s run $1200 plus ATM, $600x2).  On this chipset (original AMD 
8000 series), socket0 connects directly to all the peripherals: AGP video 
(pre-PCI-E), multi-channel PCI-X (pre-PCI-E), Gigabit Ethernet on the 
PCI-X, and the southbridge with everything else.  Socket1 has only two 
direct connections: to the memory hanging off its own controller, and the 
HyperTransport link to Socket0.

So here, tasks such as X that need fast access to the video, and to a 
lesser extent (since the speed is lower) tasks that heavily access the 
RAID array, networking, and components on the southbridge, work much more 
efficiently if bound to Socket0 (currently CPU0, since I'm only running 
single cores, eventually CPUs 0 and 1 when I upgrade to dual-cores).  CPU 
intensive stuff like emerging, or running emulated games (the single 
closed source app remaining here is the original Master of Orion (MOO), 
copyright 1993, which I run using DOSBox emulation) can run just fine on 
Socket1 (currently CPU1, CPUs 2 and 3 once I upgrade to dual-cores) with 
virtually no loss of efficiency.

So if I'm going to be gaming MOO I set X and amarok (which I often have 
running in the background) to CPU0 and run DOSBox set to CPU1, where it 
gets nearly 100% of the CPU all to itself, no X or anything else taking 
their share.
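
For already-running processes, the same pinning can be done from Python 
on Linux with os.sched_setaffinity, which is what taskset uses under the 
hood.  A minimal sketch follows; the pin helper is a made-up name, and 
the PIDs in the usage note are placeholders.

```python
import os


def pin(pid, cpus):
    """Restrict pid (0 means the calling process) to the given CPU numbers.

    Linux only.  Returns the affinity set the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)
```

With x_pid and dosbox_pid standing in for PIDs you'd look up yourself, 
pin(x_pid, [0]) followed by pin(dosbox_pid, [1]) gives the split 
described above.  From a shell, "taskset -pc 0 <pid>" does the same job.  
Changing the affinity of another user's process needs root.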

Similarly, if I wanted to, I could do my emerges exclusively on CPU1, and 
not have them bother X and my regular activity on CPU0 at all, almost as 
if the compiling were happening on another machine entirely.  I don't 
generally do so, as setting PORTAGE_NICENESS=19 in make.conf tends to be 
sufficient and I can then let it run on both CPUs.  (Note that since I 
have PORTAGE_TMPDIR on tmpfs, it doesn't do disk I/O for all those 
temporary files created during compilation, only writing the final 
package to the main filesystem.  I run ccache so there's a bit of writing 
for it, and when it needs something not already in disk cache there's a 
bit of read access, but nothing major.)
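
To start a command already confined to a CPU, rather than pinning it 
afterwards, taskset can wrap the command line.  A small Python sketch 
that builds such a command; taskset_argv is a hypothetical helper and the 
emerge arguments are just an example.

```python
def taskset_argv(cpus, command):
    """Build an argv list that runs command on the given CPUs via taskset -c."""
    cpu_list = ",".join(str(c) for c in sorted(set(cpus)))
    return ["taskset", "-c", cpu_list] + list(command)


# Example only: confine an emerge run to CPU1.
argv = taskset_argv([1], ["emerge", "--update", "world"])
print(" ".join(argv))
# To actually run it, pass argv to subprocess.run().
```

Child processes inherit the affinity, so the compiler jobs emerge spawns 
stay on the same CPU.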

I'm really looking forward to getting the dual-cores, so I have two cores 
directly connected to video and I/O, and two for doing primarily CPU 
intensive stuff.  That'll give me a lot more flexibility in scheduling 
since I'll be able to schedule X on one direct-connect core and 
everything else interactive on the other, while still having the two 
socket1 cores to do CPU intensive stuff like merges.  I expect I WILL 
take advantage of taskset for emerging at that point.  MOO in DOSBox is 
CPU intensive enough that it takes a full CPU core for itself.  Since X 
and my other tasks take cycles on what's currently the only other CPU 
core, I can't do anything else intensive like merging stuff while running 
MOO.  The 2x2-way will let me place MOO on socket0/core1, X etc. on 
socket0/core0, and still let emerges go full steam on the two socket1 
cores, which will be /very/ nice.

Another example (commercial server scenario), all the rage ATM, is 
virtualization.  On a multi-way system, it's quite useful to be able to 
dedicate a core or two to specific heavier use VMs, while a number of 
other VMs get to share a core or two, and the host system gets a core of 
its own as well.  
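
A sketch of that kind of split, in Python.  It only computes a plan 
(which core numbers go where); applying it is up to whatever VM tooling 
is in use.  The partition_cores function and the VM names are invented 
for the example.

```python
def partition_cores(cores, dedicated, host_cores=1):
    """Split cores into a host set, per-VM dedicated sets and a shared pool.

    cores:     CPU numbers available on the box
    dedicated: {vm_name: number_of_cores} for the heavier VMs
    Whatever is left over becomes the pool the remaining VMs share.
    """
    remaining = list(cores)
    needed = host_cores + sum(dedicated.values())
    if needed > len(remaining):
        raise ValueError("need %d cores, only %d available"
                         % (needed, len(remaining)))
    plan = {"host": remaining[:host_cores]}
    remaining = remaining[host_cores:]
    for vm, count in dedicated.items():
        plan[vm] = remaining[:count]
        remaining = remaining[count:]
    plan["shared"] = remaining
    return plan


if __name__ == "__main__":
    # Hypothetical 8-core box with two heavier VMs.
    print(partition_cores(range(8), {"db": 2, "web": 1}))
```

On the 8-core example, the host gets core 0, "db" gets cores 1 and 2, 
"web" gets core 3, and the lighter VMs share cores 4 through 7.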

That sort of resource partitioning has been big iron (IBM s390s, Sun's 
multicores) territory for some time, and the 4 and 8 socket Opteron 
systems allowed some of that, although the tools to control it are only 
now maturing.  But with quad-core now just hitting mainstream and dual-
socket dual-core already reasonably priced, such scenarios are now 
actually within the reach of "ordinary humans", for mainstream business 
server use, for workstation and desktop use, and even for limited laptop 
use. =8^)

They say ordinary users don't need multi-core.  But while Gentoo's 
certainly a bit beyond ordinary user, it's not /that/ far out.  This 
dual-CPU system, my first, met all my expectations, but it is now 
feeling just as cramped as that old 1.2 GHz Athlon system did when I 
finally upgraded to this dual Opteron 242 system.  So while I can't say 
for sure I'll have use for 8-way or greater systems, I can certainly use 
a four-way system, which means that, with a bit of education if 
necessary, ordinary users should now be finding at least two-way systems 
useful, just as I did several years ago.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
