Dmitri Pogosyan posted
<[EMAIL PROTECTED]>, excerpted below,  on
Wed, 22 Jun 2005 22:40:29 -0600:

> I would be very interested to check if somebody could benchmark if this
> does not hurt the memory access performance.
> 
> Celestica is quite adamant that it will (but it is out of opteron business,
> so its opinion may be old)

There are a couple factors that affect memory performance as available
physical memory goes up.  First, beyond some level (not a significant
factor for 64-bit, or even the 40-bit hardware memory limits of amd64 at
this point, but /very/ significant for 32-bit), there's a hit as the
system increases the level of indirection needed to address the memory. 
With 32-bit, there are thresholds at 2GB and 4GB (the physical threshold
is 4GB, but since traditionally that was split into a 2GB kernel address
space, and a 2GB user address space, there's a 2GB threshold as well),
with anything beyond that requiring "hacks" and further levels of paging
tables and indirection.  I believe the current total addressable system
limit is 64GB on 32-bit x86 hardware (36-bit physical addresses with PAE,
which adds a level of page tables), altho individual programs have limits
below that.

The current 40-bit physical memory address space limit of the amd64 arch
works out to 2^40 bytes, that is 1TB, and the virtual memory isn't limited
even to that (current chips implement 48-bit virtual addresses, 256TB, and
the arch leaves room to widen that toward the full 64 bits), so as I said,
our arch isn't practically affected by this limit at the moment.  By the
time the 40-bit physical memory constraint is reached, they will likely
have upped that to 48-bit, and it can eventually be upped to 64-bit, if
and when that becomes necessary, before another bitness switch like that
of 32-bit to 64-bit is needed.  We'll be dealing with the next Y2K-like
issue, the rollover of the 32-bit Unix clock in 2038, first, and well
before the 64-bit space runs out.
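
For anyone who does want the numbers handy, here is a minimal Python sketch
of the arithmetic behind those limits.  The helper names
(addressable_bytes, human) are just mine for illustration, and the rollover
date assumes a signed 32-bit count of seconds since 1970-01-01 UTC.

```python
from datetime import datetime, timedelta, timezone

def addressable_bytes(bits):
    """Bytes addressable with an address of the given width."""
    return 2 ** bits

def human(n):
    """Format a power-of-two byte count with binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"

# 32-bit, 36-bit (PAE), 40-bit and 48-bit (amd64), and the full 64 bits.
for bits in (32, 36, 40, 48, 64):
    print(f"{bits}-bit: {human(addressable_bytes(bits))}")

# Last second representable in a signed 32-bit Unix clock.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
rollover = epoch + timedelta(seconds=2 ** 31 - 1)
print("32-bit clock rolls over after", rollover.isoformat())
```

So 40 bits is exactly 1 TiB, and the 32-bit clock runs out in January 2038.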

(Do note that the entire memory hole just under 4GB issue is related to
32-bit legacy issues as well, in this case, the originally 32-bit specced
PCI bus.  Certain legacy 32-bit PCI devices, and certain legacy drivers
even where the hardware can handle it, aren't prepared to address more
than 4GB of memory, so that address hole must be left below 4GB in order
for them to be able to place their I/O addresses there and successfully
function within even a 64-bit system.)

The other and more pressing issue for 64-bit systems is the BIOS memory
arch thing.  Filling beyond the lower pair of memory slots addressed by an
amd64 MMU requires doubling the command queue length from one command to
two.
That decreases performance somewhat.  However, I don't understand the
details of the hows and whys, so don't expect me to explain them. <g>

I /do/ know this, however -- the effect is there regardless of whether you
are using more than a pair of 1/4 gig sticks, or a pair of 2 gig sticks,
the biggest the amd64 MMUs are prepared to deal with at this point.  Thus,
performance-wise, if one were to need 4 gig addressable per CPU/MMU, it'd
be better to get two 2-gig sticks than to fill up the slots with four 1
gig sticks.  Of course, 2-gig sticks still cost big money at this point,
but the same applies to lower memory requirements as well -- a pair of gig
sticks will be more efficient than a quad of half-gig sticks, while the
cost difference is generally trivial.

Also note that this is per mmu, with the mmu residing on the CPU chip for
amd64.  Thus my dual Opteron Tyan mobo has eight slots, four assigned to
each CPU, and I could (and shortly will) install 4 1-gig sticks, only
filling up the first pair of slots assigned to each CPU.  (The dual-core
chips still have a single mmu, shared by the cores, so if I upgraded to
them, and yes, the mobo is dual-core upgradable, I'd still have two mmus,
one for each chip, not 4, one for each core.)  Since I'm only filling the
first pair of slots assigned to each mmu, I can still use the more
efficient single command queue length.
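
To make the per-mmu bookkeeping concrete, here is a small Python sketch of
the rule as I described it above.  It is only a model of my description
(first pair of slots means a queue length of 1, anything more means 2), not
something read out of the hardware, and the function names and the
4-slots-per-mmu figure are assumptions taken from my own board.

```python
SLOTS_PER_MMU = 4  # assumption: four slots per CPU/mmu, as on my Tyan board

def command_queue_length(sticks_on_mmu):
    """Queue length per the rule above: 1 for the first pair, else 2."""
    if not 0 < sticks_on_mmu <= SLOTS_PER_MMU:
        raise ValueError("need between 1 and 4 sticks per mmu")
    return 1 if sticks_on_mmu <= 2 else 2

def describe(mmus, sticks_per_mmu, stick_gb):
    """Summarize a layout that populates every mmu identically."""
    return {
        "total_gb": mmus * sticks_per_mmu * stick_gb,
        "per_mmu_gb": sticks_per_mmu * stick_gb,
        "queue_length": command_queue_length(sticks_per_mmu),
    }

# Dual Opteron, a pair of 1-gig sticks on each mmu.
print(describe(mmus=2, sticks_per_mmu=2, stick_gb=1))
# Same 4 gig per mmu two ways: two 2-gig sticks vs four 1-gig sticks.
print(describe(mmus=1, sticks_per_mmu=2, stick_gb=2))
print(describe(mmus=1, sticks_per_mmu=4, stick_gb=1))
```

The first two layouts stay at a queue length of 1; only the four-stick
layout pays the 2x penalty.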

All that said, however, note that ANY physical RAM is still going to be
VASTLY faster than swap (which I currently have disabled, with just a gig
of memory, so I'll definitely keep it that way when upgrading), even if it
were still PC2100 DDR SDRAM instead of the higher PC3200 that's the max
speed the Opterons are rated for, and even with all slots full
necessitating a double command queue length.  Someone that's seriously
considering spending the $$ on 4 gig or more of memory, instead of going
dual-core, for instance, most likely is USING enough memory that they'll
see improvements from the additional memory, even if it DOES mean 2x
command queue lengths and nominally less efficient memory access.  The
extra memory serves accesses that /would/ otherwise have gone to swap, or
keeps data in the in-memory cache that would otherwise have been dumped
from cache and grabbed from disk again.  That memory access is still a
good two orders of magnitude faster than the access to disk would have
been, so until the frequency of access to the additional memory area, that
/would/ have otherwise gone to disk, drops below that two orders of
magnitude threshold, it's STILL more efficient to have the extra memory,
even at the expense of a slight drop in memory access efficiency.
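
One way to read that break-even argument is as a simple average-cost
model, sketched below in Python.  Everything numeric here is an
assumption for illustration: disk_factor=100 is my "two orders of
magnitude" figure, and the 5% penalty for the longer command queue is a
made-up placeholder, not a measurement.

```python
def avg_cost_without_extra(f, disk_factor=100.0, t_ram=1.0):
    """Average access cost when a fraction f of accesses falls to disk."""
    return (1.0 - f) * t_ram + f * disk_factor * t_ram

def avg_cost_with_extra(penalty, t_ram=1.0):
    """Average access cost when everything fits in RAM, but every access
    pays a fractional slowdown from the 2x command queue length."""
    return (1.0 + penalty) * t_ram

def break_even_fraction(penalty, disk_factor=100.0):
    """Fraction of disk-bound accesses above which the extra RAM wins.

    Solves (1 + penalty) = (1 - f) + f * disk_factor for f.
    """
    return penalty / (disk_factor - 1.0)

penalty = 0.05  # assumed 5% slowdown, purely illustrative
f_star = break_even_fraction(penalty)
print(f"extra RAM wins once more than {f_star:.4%} of accesses hit disk")
print("1% to disk, no extra RAM:", avg_cost_without_extra(0.01))
print("all in RAM, slower queue:", avg_cost_with_extra(penalty))
```

Under those assumed numbers the extra memory pays off long before even one
access in a hundred would have gone to disk.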

Practically speaking, that access frequency threshold is probably going to
be between 1 and 4 GB, for an amd64 Gentoo workstation system, updating
and compiling from source "on the fly", depending on what other uses the
system is put to as well.  (Developers and those doing a lot of
multi-media work will likely be closer to 4GB, ordinary
desktop/office/gamer use, single game running at a time in gamer mode,
will likely be closer to the 1GB end.)  Again, note for a dual Opteron
(therefore dual mmu) system, 4GB still fits within the performance and
memory envelope, as 4 1-gig sticks, a pair each in the low slots assigned
to each mmu/cpu.  With 1-gig sticks now roughly twice the cost of 1/2-gig
sticks, that's not a big issue, tho it would be if going to 2-gig sticks,
which are on the other side of the pricing "knee", and still MORE than
twice as expensive as the 1-gig sticks.

Also practically speaking, that means that it's not a big issue for most
users, because most users will naturally find other things to do with the
money that /could/ be upgrading them to more than 4GB in the first place. 
The only ones that might find themselves making this mistake are gamers with a
$10K budget to blow on their gaming machine.

Of course, servers are a rather different ballgame.  Here, actual memory
performance doesn't rate high on the priority scale at all, while often
the more GB of memory that can be stuffed into the system, the better
performance gets, due to a working data set often tens of GBs in size, the
more of which can be cached in memory, the better.  It's HERE that a
quad Opteron system (8 cores if dual-cored, tho that doesn't affect
memory), with four chips and therefore four mmus, each of which can access
8GB in 4 slots of memory off the local controller, therefore totalling
32GB of physical memory <WHAT a system! drool, drool>, can /still/ be
minimally adequate to do the job.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

