Dmitri Pogosyan posted <[EMAIL PROTECTED]>, excerpted below, on Wed, 22 Jun 2005 22:40:29 -0600:
> I would be very interested to check if somebody could benchmark if this
> does not hurt the memory access performance.
>
> Celestica is quite adamant that it will (but it is out of opteron business,
> so its opinion may be old)

There are a couple of factors that affect memory performance as available physical memory goes up.

First, beyond some level (not a significant factor for 64-bit, or even for the 40-bit hardware memory limit of amd64 at this point, but /very/ significant for 32-bit), there's a hit as the system increases the level of indirection needed to address the memory. With 32-bit, there are thresholds at 2GB and 4GB (the physical threshold is 4GB, but since traditionally that was split into a 2GB kernel address space and a 2GB user address space, there's a 2GB threshold as well), with anything beyond that requiring "hacks" and further levels of paging tables and indirection. I believe the current total addressable system limit on 32-bit hardware is 64GB (36-bit physical addresses via PAE, which adds a third page-table level), altho individual programs have limits below that.

The current 40-bit physical address space limit of the amd64 arch works out to 2^40 bytes, or 1TB, and virtual memory isn't limited even to that: current chips implement 48-bit virtual addresses (256TB), with room in the arch to grow toward the full 64 bits. So as I said, our arch isn't practically affected by this limit at the moment. By the time the 40-bit physical memory constraint is reached, they will likely have upped it to 48 bit, and the arch leaves room to go further if and when that becomes necessary, before another bitness switch like that of 32-bit to 64-bit is needed. We'll be dealing with the next Y2K-like issue, the rollover of the 32-bit Unix clock in IIRC 2038, first, and well before the 64-bit space runs out.
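For reference, the limits mentioned above can be worked out directly from the address widths. This is just a quick arithmetic sketch, not anything from the original thread: the helper names `addressable_bytes` and `human` are made up for illustration, and the 36-bit row is included on the assumption that the 64GB figure refers to PAE on 32-bit x86.

```python
from datetime import datetime, timezone

def addressable_bytes(bits: int) -> int:
    """Bytes reachable with a flat address of the given width."""
    return 2 ** bits

def human(n: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"

# Widths discussed in the post: the 2GB and 4GB thresholds on 32-bit,
# 36-bit (assumed: PAE), 40-bit amd64 physical, 48-bit, and full 64-bit.
for bits in (31, 32, 36, 40, 48, 64):
    print(f"{bits}-bit: {human(addressable_bytes(bits))}")

# A signed 32-bit Unix clock runs out 2**31 - 1 seconds after the epoch.
rollover = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)
print("32-bit time_t rollover (UTC):", rollover.isoformat())
```

The units are binary ones, so the 40-bit row is exactly 2^40 bytes.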
(Do note that the whole issue of the memory hole just under 4GB is related to 32-bit legacy issues as well, in this case the originally 32-bit specced PCI bus. Certain legacy 32-bit PCI devices, and certain legacy drivers even where the hardware can handle it, aren't prepared to address more than 4GB of memory, so that address hole must be left below 4GB in order for them to be able to place their I/O addresses there and function within even a 64-bit system.)

The other and more pressing issue for 64-bit systems is the BIOS memory arch thing. Filling beyond the lower pair of memory slots addressed by an amd64 MMU requires doubling the command queue length from one command to two. That decreases performance somewhat. However, I don't understand the details of the hows and whys, so don't expect me to explain them. <g>

I /do/ know this, however: the effect is there regardless of whether you are using more than a pair of 1/4 gig sticks or more than a pair of 2 gig sticks, the biggest the amd64 MMUs are prepared to deal with at this point. Thus, performance-wise, if one were to need 4 gig addressable per CPU/MMU, it'd be better to get two 2-gig sticks than to fill up the slots with four 1-gig sticks. Of course, 2-gig sticks still cost big money at this point, but the same applies to lower memory requirements as well: a pair of gig sticks will be more efficient than a quad of half-gig sticks, while the cost difference is generally trivial.

Also note that this is per mmu, with the mmu residing on the CPU chip for amd64. Thus my dual Opteron Tyan mobo has eight slots, four assigned to each CPU, and I could (and shortly will) install four 1-gig sticks, only filling up the first pair of slots assigned to each CPU. (The dual-core chips still have a single mmu, shared by the cores, so if I upgraded to them, and yes, the mobo is dual-core upgradable, I'd still have two mmus, one for each chip, not four, one for each core.)
Since I'm only filling the first pair of slots assigned to each mmu, I can still use the more efficient single command queue length.

All that said, however, note that ANY physical RAM is still going to be VASTLY faster than swap (which I currently have disabled, with just a gig of memory, so I'll definitely keep it that way when upgrading), even if it were still PC2100 DDR SDRAM instead of the PC3200 that's the max speed the Opterons are rated for, and even with all slots full, necessitating a double command queue length.

Someone that's seriously considering spending the $$ on 4 gig or more of memory, instead of going dual-core, for instance, most likely is USING enough memory that they'll see improvements from the additional memory, even if it DOES mean 2x command queue lengths and nominally less efficient memory access. The extra memory serves accesses that /would/ otherwise have gone to swap, or it holds cached data that would otherwise have been dumped from cache and fetched from disk again. That memory access is still a good two orders of magnitude faster than the access to disk would have been. So as long as the accesses that /would/ have gone to disk save more time than the slight across-the-board slowdown costs, it's STILL more efficient to have the extra memory, even at the expense of a slight drop in memory access efficiency.

Practically speaking, the point where the extra memory stops paying off is probably going to be between 1 and 4 GB for an amd64 Gentoo workstation system, updating and compiling from source "on the fly", depending on what other uses the system is put to as well. (Developers and those doing a lot of multi-media work will likely be closer to 4GB; ordinary desktop/office/gamer use, with a single game running at a time in gamer mode, will likely be closer to the 1GB end.)
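The trade-off above can be put into rough numbers. The following is only a back-of-the-envelope sketch under stated assumptions: the 5% slowdown from the doubled command queue is a made-up placeholder (the post gives no figure), the factor of 100 is the "two orders of magnitude" from the text, and the function names are hypothetical. Costs are expressed in units of one normal memory access.

```python
def avg_cost_without_extra(disk_fraction: float, disk_ratio: float) -> float:
    """Average access cost when a fraction of accesses has to go to disk."""
    return (1.0 - disk_fraction) + disk_fraction * disk_ratio

def avg_cost_with_extra(slowdown: float) -> float:
    """Average access cost when everything fits in (slightly slower) RAM."""
    return slowdown

def break_even_fraction(slowdown: float, disk_ratio: float) -> float:
    """Disk-bound fraction above which the extra memory wins.

    Solves slowdown = (1 - f) + f * disk_ratio for f.
    """
    return (slowdown - 1.0) / (disk_ratio - 1.0)

SLOWDOWN = 1.05     # ASSUMPTION: 5% penalty with all slots filled
DISK_RATIO = 100.0  # from the post: disk is ~two orders of magnitude slower

f = break_even_fraction(SLOWDOWN, DISK_RATIO)
print(f"break-even disk-bound fraction: {f:.5f}")
print("without extra RAM, 1% to disk:", avg_cost_without_extra(0.01, DISK_RATIO))
print("with extra RAM:", avg_cost_with_extra(SLOWDOWN))
```

With these placeholder numbers the extra memory wins as soon as more than about one access in two thousand would otherwise have gone to disk. Plug in measured figures for a real system before drawing conclusions.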
Again, note that for a dual Opteron (therefore dual mmu) system, 4GB still fits within the performance and memory envelope, as four 1-gig sticks, a pair each in the low slots assigned to each mmu/cpu. With 1-gig sticks now roughly twice the cost of 1/2-gig sticks, that's not a big issue, tho it would be if going to 2-gig sticks, which are on the other side of the pricing "knee" and still MORE than twice as expensive as the 1-gig sticks.

Also practically speaking, that means it's not a big issue for most users, because most users will naturally find other things to do with the money that /could/ be upgrading them to more than 4GB in the first place. The only ones that might find themselves making this mistake are gamers with a $10K budget to blow on their gaming machine.

Of course, servers are a rather different ballgame. Here, actual memory performance doesn't rate high on the priority scale at all, while often the more GB of memory that can be stuffed into the system, the better performance gets, due to a working data set often tens of GBs in size, the more of which can be cached in memory, the better. It's HERE that a quad Opteron system (8 cores if dual-cored, tho that doesn't affect memory), with quad chips and therefore quad mmu, each of which can access 8GB in 4 slots of memory off the local controller, therefore totalling 32GB of physical memory, <WHAT a system! drool, drool>, can /still/ be minimally adequate to do the job.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
