On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>
>> What I see in your results is about 7x speedup by using 16 threads. I
>> think you should get better results by running 8 threads with 2
>> processes because the memory can be allocated on separate memory
>> controllers, and the memory will be physically closer to the cores.
>> I'm surprised that you get worse results.
>
>
> Our intent is for the threads to use an explicit first-touch policy so that
> they get local memory even when you have threads across multiple NUMA zones.

Great. I still think the performance using Jacobi (as Dave does) should
be no worse with 2x(MPI) and 8x(thread) than it is with 1x(MPI) and
16x(thread).

>>
>> It doesn't surprise me that an explicit code gets much better speedup.
>
>
> The explicit code is much less dependent on memory bandwidth relative to
> floating point.
>
>>
>>
>> > I also get about the same performance results on the ex2 problem when
>> > running it with just
>> > mpi alone i.e. with 16 mpi processes.
>> >
>> > So from my perspective, the new pthreads/openmp support is looking
>> > pretty good assuming
>> > the issue with the MKL/external packages interaction can be fixed.
>> >
>> > I was just using jacobi preconditioning for ex2. I'm wondering if there
>> > are any other preconditioners
>> > that might be multi-threaded. Or maybe a polynomial preconditioner
>> > could work well for the
>> > new pthreads/openmp support.
>>
>> GAMG with SOR smoothing seems like a prime candidate for threading. I
>> wonder if anybody has worked on this yet?
>
>
> SOR is not great because it's sequential.

For structured grids we have multi-color schemes and temporally blocked
schemes, as in this paper:

http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf

For unstructured grids, could we do some analogous decomposition using
e.g. ParMETIS?

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764

Regards,
John

> A block Jacobi/SOR parallelizes
> fine, but does not guarantee stability without additional
> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
> with threads (but not all the kernels are ready).
>
> Coarsening and the Galerkin triple product is more difficult to thread.
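
To make the multi-color idea for structured grids concrete, here is a minimal sketch of a two-color (red-black) Gauss-Seidel sweep for the 5-point Laplacian on a square grid with zero Dirichlet boundary values. It is not PETSc code and not the scheme from the paper linked above; the function names, the model problem, and the grid size are all assumptions chosen for illustration. The point is only that every update of one color reads neighbors of the other color, so the loop over a single color has no internal dependencies.

```python
# Red-black (two-color) Gauss-Seidel for -Laplace(u) = f on the unit square,
# u = 0 on the boundary. Hypothetical illustration only, not PETSc code.


def red_black_sweep(u, f, h):
    """One sweep: update all points of color 0, then all points of color 1.

    u and f are (n+2) x (n+2) lists of lists that include the boundary layer.
    A point (i, j) has color (i + j) % 2, and its four neighbors all have the
    other color, so updates within one color are independent of each other.
    """
    n = len(u) - 2
    for color in (0, 1):
        # This double loop over a single color is the part that could be
        # divided among threads without changing the result.
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == color:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      + h * h * f[i][j])


def residual_norm(u, f, h):
    """Euclidean norm of f - A u over the interior points."""
    n = len(u) - 2
    total = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            au = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                  - u[i][j - 1] - u[i][j + 1]) / (h * h)
            r = f[i][j] - au
            total += r * r
    return total ** 0.5


def solve_demo(n=8, sweeps=50):
    """Run a fixed number of sweeps; return initial and final residual norms."""
    h = 1.0 / (n + 1)
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    f = [[1.0] * (n + 2) for _ in range(n + 2)]
    r0 = residual_norm(u, f, h)
    for _ in range(sweeps):
        red_black_sweep(u, f, h)
    return r0, residual_norm(u, f, h)


if __name__ == "__main__":
    r0, r1 = solve_demo()
    print("initial residual:", r0)
    print("final residual:  ", r1)
```

Two colors suffice here only because of the 5-point stencil. For an unstructured grid one would first need a coloring or partition of the matrix graph, which is where a graph partitioner might come in.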
