One more comment regarding single-core vs dual-core Opteron: there are two ways to evaluate the performance - performance per core, or performance for the price [of the machine].
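For illustration [with made-up numbers]: if the dual-core box costs 1.2x the single-core box and a given run is, say, 1.5x faster on it, then performance per price is 1.5/1.2 = 1.25x better - even though performance per core has dropped to 1.5/2 = 0.75x.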
Ideally we'd like the performance per core to be scalable [for publishing
pretty graphs]. However the dual-core machine does not cost twice as much as
the single-core machine [it probably costs 10-30% more]. So realistically -
if one can get the same factor of improvement in performance with 16 nodes
vs 8 nodes, one can consider the dual-core machine as providing reasonable
performance.

Satish

On Mon, 5 Feb 2007, Satish Balay wrote:

> A couple of comments:
>
> - With the dual-core Opteron, the memory bandwidth per core is now
>   reduced by half - so the performance suffers. However, memory
>   bandwidth across CPUs is scalable. [6.4 GB/s per node, or 3.2 GB/s
>   per core]
>
> - The current-generation Intel Core 2 Duo appears to claim sufficient
>   bandwidth [15.3 GB/s per node = 7.6 GB/s per core?], so going by
>   this bandwidth number the chip might do better than the AMD chip.
>   However, I'm not sure there is an SMP with this chip that has a
>   scalable memory system [across say 8 nodes - as you currently
>   have..]
>
> - Older Intel SMP boxes had a single memory bank shared across all
>   the CPUs [so the effective bandwidth per CPU was pretty small. The
>   Opterons' scalable architecture looked much better than the older
>   Intel SMPs]
>
> - From the previous -log_summary, part of the inefficiency of the SMP
>   box [when compared to the cluster] was in the MPI performance. Do
>   you still see this effect in the '-np 8' runs? If so, this could be
>   [part of] the reason for this 30% reduction in performance.
>
> Satish
>
> On Mon, 5 Feb 2007, Shi Jin wrote:
>
> > Hi there,
> >
> > I have made some new progress on the issue of SMP performance. My
> > shared-memory machine is an 8-CPU dual-core Opteron machine, and I
> > think the two cores on a single CPU chip share the memory
> > bandwidth. Therefore, if I can avoid using both cores on a chip, I
> > should get some performance improvement. Indeed, I am able to do
> > this with the Linux command taskset. Here is what I did:
> >
> > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF
> >
> > This way, I specifically ask the processes to be run on the first
> > core of each CPU. By doing this, my performance is doubled compared
> > with the simple petscmpirun -n 8 ../spAF
> >
> > So this test shows that we do suffer from the competition for
> > resources among multiple processes, especially when we use 16
> > processes.
> >
> > However, I should point out that even with the help of taskset, the
> > shared-memory performance is still 30% less than that on the
> > cluster.
> >
> > I am not sure whether this problem exists specifically for AMD
> > machines or whether it applies to any shared-memory architecture.
> >
> > Thanks.
> > Shi
> >
> > --- Shi Jin <jinzishuai at yahoo.com> wrote:
> >
> > > Hi there,
> > >
> > > I am fairly new to PETSc but have 5 years of MPI programming
> > > experience already. I recently took on a project of analyzing a
> > > finite element code written in C with PETSc.
> > > I found that on a shared-memory machine (60GB RAM, 16 CPUs), the
> > > code runs around 4 times slower than on a distributed-memory
> > > cluster (4GB RAM, 4 CPUs/node), although they yield identical
> > > results.
> > > There are 1.6 million finite elements in the problem, so it is a
> > > fairly large calculation. The total memory used is 3GB x 16 =
> > > 48GB.
> > >
> > > Both systems run Linux as the OS, and the same code is compiled
> > > against the same version of MPICH-2 and PETSc.
> > >
> > > The shared-memory machine is actually a little faster than the
> > > cluster machines in terms of single-process runs.
> > >
> > > I am surprised at this result since we usually tend to think that
> > > shared memory would be much faster, since in-memory operations are
> > > much faster than network communication.
> > >
> > > However, I read the PETSc FAQ and found that "the speed of sparse
> > > matrix computations is almost totally determined by the speed of
> > > the memory, not the speed of the CPU".
> > > This makes me wonder whether the poor performance of my code on a
> > > shared-memory machine is due to the competition of different
> > > processes for the same memory bus. Since the code is still MPI
> > > based, a lot of data are moving around inside the memory. Is this
> > > a reasonable explanation of what I observed?
> > >
> > > Thank you very much.
> > >
> > > Shi
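PS: one rough way to see the memory-bandwidth contention directly is to run
a small bandwidth probe pinned to specific cores and compare the rate when
one copy runs alone vs when two copies share a chip. Below is only a minimal
sketch [not the official STREAM benchmark; the file name bw.c, the compile
line, and the traffic estimate are just illustrative]:

/*
 * bw.c - a rough, minimal memory-bandwidth probe (not the official
 * STREAM benchmark). Compile with something like: gcc -O2 -o bw bw.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N      (20*1000*1000)   /* 20M doubles per array, ~160 MB each */
#define NTIMES 10               /* keep the best (fastest) sweep       */

static double wtime(void)
{
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + 1.0e-6*tv.tv_usec;
}

int main(void)
{
  double *a = malloc(N*sizeof(double));
  double *b = malloc(N*sizeof(double));
  double *c = malloc(N*sizeof(double));
  double  t, best = 1.0e30;
  int     i, k;

  if (!a || !b || !c) { fprintf(stderr, "malloc failed\n"); return 1; }
  for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

  for (k = 0; k < NTIMES; k++) {
    t = wtime();
    for (i = 0; i < N; i++) c[i] = a[i] + 3.0*b[i];  /* triad-like kernel */
    t = wtime() - t;
    if (t < best) best = t;
  }

  /* roughly 3 arrays of N doubles cross the memory bus per sweep
     (ignoring write-allocate traffic), so this is only an estimate */
  printf("check %g, approx rate %.2f GB/s\n",
         c[N/2], 3.0*N*sizeof(double)/best/1.0e9);
  return 0;
}

For example, run "taskset -c 0 ./bw" by itself, and then run
"taskset -c 0 ./bw" and "taskset -c 1 ./bw" at the same time [assuming, as
your 0,2,4,... taskset line suggests, that cores 0 and 1 sit on the same
chip]. If the per-copy rate roughly halves in the second case, that is the
same contention the -np 16 runs are hitting.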