On Mon, 21 Apr 2008, amjad ali wrote:

> Hello PETSc team (especially Satish and Barry).
>
> YOU SAID: for better performance
>
> (1) high per-CPU memory performance. Each CPU (core in dual-core systems)
> needs to have its own memory bandwidth of roughly 2 or more gigabytes.
This 2GB/core number is a rabbit out of a hat. We just put out a reference point a few years back for SMP machines [when the age of multi-core chips hadn't yet begun]. Now Intel has chipsets that can deliver 25GB/s, and they put 4 or 8 cores on such a machine [i.e. about 6GB/s per core for the 4-core machine and 3GB/s per core for the 8-core machine]. But the trend now is to cram in more and more cores - so expect the number of cores to increase faster than the chipset memory bandwidth [i.e. bandwidth per core is likely to get smaller and smaller].

> (2) MEMORY BANDWIDTH PER CORE, the higher that is the better performance
> you get.
>
> From these points I started to look for RAM sticks with higher MHz rates
> (and obviously CPUs and motherboards supporting this speed).
>
> But you also referred to:
>
> http://www.intel.com/performance/server/xeon/hpc_ansys.htm
> http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm
>
> On these pages you pointed out that systems with CPUs of 20% higher FSB
> speed perform 20% better. But the RAM speed is also 20% higher on the
> better-performing system (i.e. 800MHz vs 667MHz).
>
> So my question is: which is the actual indicator of "memory bandwidth"
> per core? Is it
> (1) the CPU's FSB speed
> (2) the RAM speed
> (3) the motherboard's system bus speed?

The answer is a bit complicated here. It depends upon the system architecture:

CPU chip[s] <-----> chipset <-----> memory [banks]

- Is the bandwidth on the CPU-chip side the same as on the memory side? [There are machines where these differ, but most machines use *synchronous* buses - so that the memory chipset does not have to do translation/buffering.]

For e.g., on an Intel Xeon machine with DDR2-800 you have [on the memory bus side]:

bandwidth = 2 (banks) * 2 (DDR) * 8 (bytes bus width) * 800 MHz = 25.6 GByte/s

On the CPU side, this is balanced by FSB1600:

bandwidth = 1600 MHz * 8 (bytes bus width) * 2 (CPU chips) = 25.6 GByte/s

So generally all three things you've listed have to *match* correctly.
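The peak-bandwidth arithmetic above can be sketched in a few lines of Python. This is just an illustration of the formula; the bank count, bus widths, and clock rates are the assumptions from the example, not measured values for any particular board:

```python
# Peak memory bandwidth estimate for the DDR2-800 / FSB1600 example above.
# All figures are the example's assumptions, not measurements.

def mem_bandwidth_gbs(banks, ddr_factor, bus_bytes, mhz):
    """Peak memory-side bandwidth in GByte/s: banks * DDR factor * bus width * clock."""
    return banks * ddr_factor * bus_bytes * mhz / 1000.0

def fsb_bandwidth_gbs(fsb_mhz, bus_bytes, cpu_chips):
    """Peak CPU-side (front-side bus) bandwidth in GByte/s across all chips."""
    return fsb_mhz * bus_bytes * cpu_chips / 1000.0

mem = mem_bandwidth_gbs(banks=2, ddr_factor=2, bus_bytes=8, mhz=800)
fsb = fsb_bandwidth_gbs(fsb_mhz=1600, bus_bytes=8, cpu_chips=2)
print(mem, fsb)  # both sides come out to 25.6 GByte/s - the buses "match"
```

The point of computing both sides is exactly the matching condition: if one side came out lower, that side would be the bottleneck regardless of how fast the other is.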
[Some CPUs and chipsets support multiple FSB frequencies - so you have to check what frequency is set on the machine you are buying.] This choice can have *cost* implications. Is it worth spending 20% more to get 20% more bandwidth? Perhaps yes for sparse-matrix applications - but not for others.

> How could we ensure "memory bandwidth of roughly 2 or more gigabytes" per
> CPU core? (Higher CPU FSB speed, or RAM speed, or motherboard system bus
> speed?)

As mentioned, 2GB/core is an approximate number we thought of a few years back - when there were no multi-core machines [just SMP chipsets]. All we can do is evaluate the memory-bandwidth number for a given machine. We can't *ensure* it - this is a choice made by the chip designers [Intel, AMD, IBM, etc.], and the choice for the currently available products was probably made a few years back.

There is another component to this memory-bandwidth debate. Which of the following do we want?

1. The best-scaling chip [when comparing performance from 1 to N cores]?
2. The overall best performance on 1 core, or on N cores [i.e. the whole node]?

And from the system-architecture issues mentioned above, there are a couple of other questions that influence this:

- Are the CPU chips sharing bandwidth or splitting bandwidth?
- Within the [multi-core] CPU chip, is the memory bus shared or split?

The first goal can be achieved by the hardware splitting off 1/Nth of the total available bandwidth per core. That shows scalable results, but the 1-core performance can be low. The second can happen by sharing, rather than splitting, at the core level.

For e.g. on Intel machines, memory bandwidth is divided at the CPU-chip level. For the example case, MatMult from ex2 on an 8-core Intel machine had the following performance on 1, 2, 4, 8 cores:

397, 632, 724, 749 [MFlop/s]

To me it's not clear which architecture is better. For publishing scalability results, the above numbers don't look good.
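To make the scalability point concrete, here is a small sketch (my own illustration, using only the MFlop/s figures quoted above) that computes the speedup and parallel efficiency implied by those numbers:

```python
# Speedup and parallel efficiency for the MatMult rates quoted above
# (397, 632, 724, 749 MFlop/s on 1, 2, 4, 8 cores of the Intel machine).

mflops = {1: 397, 2: 632, 4: 724, 8: 749}
base = mflops[1]

for cores, rate in sorted(mflops.items()):
    speedup = rate / base            # relative to the 1-core rate
    efficiency = speedup / cores     # fraction of ideal linear scaling
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

The speedup saturates below 2x: parallel efficiency falls to roughly a quarter at 8 cores, which is why these numbers look poor on a scalability plot even though the absolute per-node throughput may be the best available - the cores are sharing, and saturating, one memory bus.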
[But it could be the best performance you can squeeze out of any sequential job - or out of any 8-core architecture.]

Satish
