Hi Aron, can you please give a link to Barry's talk about multigrid memory access patterns (which you just mentioned)?
Thanks

On Wed, Nov 18, 2009 at 10:26 AM, Aron Ahmadia <aron.ahmadia at kaust.edu.sa> wrote:

> Does anybody have good references in the literature analyzing the memory
> access patterns for sparse solvers and how they scale? I remember seeing
> Barry's talk about multigrid memory access patterns, but I'm not sure if
> I've ever seen a good paper reference.
>
> Cheers,
> Aron
>
>
> On Wed, Nov 18, 2009 at 6:14 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>
>> Just want to add one more point to this.
>>
>> Most multicore machines do not provide scalable hardware. [Yeah, the
>> FPU cores are scalable, but the memory subsystem is not.] So one
>> should not expect scalable performance out of them. You should take
>> the 'max' performance you can get out of them, and then look for
>> scalability with multiple nodes.
>>
>> Satish
>>
>> On Wed, 18 Nov 2009, Jed Brown wrote:
>>
>> > jarunan at ascomp.ch wrote:
>> > >
>> > > Hello,
>> > >
>> > > I have read the topic about the performance of a machine with 2 dual-core
>> > > chips, and it is written that with -np 2 it should scale the best. I
>> > > would like to ask about a 4-core machine.
>> > >
>> > > I ran the test on a quad-core machine with mpiexec -n 1, 2 and 4 to see
>> > > the parallel scaling. The CPU times of the test are:
>> > >
>> > > Solver/Precond/Sub_Precond
>> > >
>> > > gmres/bjacobi/ilu
>> > >
>> > > -n 1, 1917.5730 sec
>> > > -n 2, 1699.9490 sec, efficiency = 56.40%
>> > > -n 4, 1661.6810 sec, efficiency = 28.86%
>> > >
>> > > bicgstab/asm/ilu
>> > >
>> > > -n 1, 1800.8380 sec
>> > > -n 2, 1415.0170 sec, efficiency = 63.63%
>> > > -n 4, 1119.3480 sec, efficiency = 40.22%
>> >
>> > These numbers are worthless without at least knowing iteration counts.
>> >
>> > > Why is the scaling so low, especially with option -n 4?
>> > > Would it be expected to be better running with 4 real CPUs instead of a
>> > > quad-core chip?
>> >
>> > 4 sockets using a single core each (4x1) will generally do better than
>> > 2x2 or 1x4, but 4x4 costs about the same as 4x1 these days. This is a
>> > very common question; the answer is that a single floating-point unit is
>> > about 10 times faster than memory for the sort of operations that we do
>> > when solving PDEs. You don't get another memory bus every time you add a
>> > core, so the ratio becomes worse. More cores are not a complete loss,
>> > because at least you get an extra L1 cache for each core, but sparse
>> > matrix and vector kernels are atrocious at reusing cache (there's not
>> > much to reuse, because most values are only needed to perform one
>> > operation).
>> >
>> > Getting better multicore performance requires changing the algorithms to
>> > better reuse L1 cache. This means moving away from assembled matrices
>> > where possible, and of course finding good preconditioners. High-order
>> > and fast multipole methods are good for this. But it's very much an
>> > open problem, and unless you want to do research in the field, you have
>> > to live with poor multicore performance.
>> >
>> > When buying hardware, remember that you are buying memory bandwidth (and
>> > a low-latency network) rather than floating point units.
>> >
>> > Jed
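[Editorial note: the efficiency figures quoted in the thread follow the usual strong-scaling definition E(n) = T(1) / (n * T(n)). A minimal sketch, using the gmres/bjacobi/ilu timings from the message above:]

```python
def parallel_efficiency(t1, tn, n):
    """Strong-scaling efficiency: speedup T(1)/T(n) divided by process count n."""
    return t1 / (n * tn)

# Timings (seconds) from the gmres/bjacobi/ilu runs quoted in the thread.
t1, t2, t4 = 1917.5730, 1699.9490, 1661.6810

# Reproduces the ~56% (2 processes) and ~29% (4 processes) figures above.
print(f"n=2: {parallel_efficiency(t1, t2, 2):.2%}")
print(f"n=4: {parallel_efficiency(t1, t4, 4):.2%}")
```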
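[Editorial note: Jed's "FPU is ~10x faster than memory" point can be made concrete with a back-of-the-envelope roofline estimate for sparse matrix-vector multiply. The machine numbers below are illustrative assumptions, not measurements from any system in this thread:]

```python
def csr_matvec_intensity(bytes_per_value=8.0, bytes_per_index=4.0):
    """Arithmetic intensity (flops/byte) of y += A[i,j] * x[j] over CSR
    nonzeros: 2 flops (multiply + add) per nonzero, while each nonzero
    streams one 8-byte matrix value and one 4-byte column index from
    memory. Vector and row-pointer traffic are ignored, which flatters
    the kernel."""
    return 2.0 / (bytes_per_value + bytes_per_index)

# Assumed machine balance for illustration: 10 Gflop/s peak, 10 GB/s bandwidth.
peak_flops = 10e9
bandwidth = 10e9

intensity = csr_matvec_intensity()                 # ~0.17 flops/byte
attainable = min(peak_flops, bandwidth * intensity)
print(f"fraction of peak: {attainable / peak_flops:.0%}")  # prints ~17%
```

Because the kernel's intensity (~0.17 flops/byte) is far below the assumed machine balance (1 flop/byte), the matvec is bandwidth-bound, and adding cores without adding memory buses cannot help, which is Satish's and Jed's point.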
