On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
> What I see in your results is about 7x speedup by using 16 threads. I
> think you should get better results by running 8 threads with 2
> processes because the memory can be allocated on separate memory
> controllers, and the memory will be physically closer to the cores.
> I'm surprised that you get worse results.

Our intent is for the threads to use an explicit first-touch policy so
that they get local memory even when you have threads across multiple
NUMA zones.

> It doesn't surprise me that an explicit code gets much better speedup.

The explicit code is much less dependent on memory bandwidth relative to
floating point.

> > I also get about the same performance results on the ex2 problem when
> > running it with just mpi alone i.e. with 16 mpi processes.
> >
> > So from my perspective, the new pthreads/openmp support is looking
> > pretty good assuming the issue with the MKL/external packages
> > interaction can be fixed.
> >
> > I was just using jacobi preconditioning for ex2. I'm wondering if there
> > are any other preconditioners that might be multi-threaded. Or maybe a
> > polynomial preconditioner could work well for the new pthreads/openmp
> > support.
>
> GAMG with SOR smoothing seems like a prime candidate for threading. I
> wonder if anybody has worked on this yet?

SOR is not great because it's sequential. A block Jacobi/SOR parallelizes
fine, but does not guarantee stability without additional
(operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
with threads (but not all the kernels are ready).

Coarsening and the Galerkin triple product is more difficult to thread.
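
To make the "SOR is sequential" point concrete, here is a minimal sketch in plain Python. It is not PETSc code: the names jacobi_sweep, sor_sweep and laplacian_1d are hypothetical, the matrix is an assumed 1D Laplacian test case, and dense lists are used only for brevity. A Jacobi sweep reads only the old iterate, so rows can be processed in any order (or by different threads) with the same result. An SOR sweep reads entries already updated in the same sweep, so the result depends on the row order.

```python
def laplacian_1d(n):
    """Dense tridiagonal [-1, 2, -1] matrix (assumed test problem)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A


def jacobi_sweep(A, b, x, omega=1.0, order=None):
    """One damped Jacobi sweep: every row reads only the old iterate x."""
    n = len(b)
    rows = range(n) if order is None else order
    x_new = list(x)
    for i in rows:  # rows are independent of each other
        r = b[i] - sum(A[i][j] * x[j] for j in range(n))
        x_new[i] = x[i] + omega * r / A[i][i]
    return x_new


def sor_sweep(A, b, x, omega=1.0, order=None):
    """One SOR sweep (omega=1.0 is Gauss-Seidel), updating x in place on a copy."""
    n = len(b)
    rows = range(n) if order is None else order
    x = list(x)
    for i in rows:  # row i sees values updated earlier in this sweep
        r = b[i] - sum(A[i][j] * x[j] for j in range(n))
        x[i] += omega * r / A[i][i]
    return x


if __name__ == "__main__":
    A = laplacian_1d(4)
    b = [1.0] * 4
    x0 = [0.0] * 4
    backward = [3, 2, 1, 0]
    print("Jacobi forward :", jacobi_sweep(A, b, x0))
    print("Jacobi backward:", jacobi_sweep(A, b, x0, order=backward))
    print("SOR forward    :", sor_sweep(A, b, x0))
    print("SOR backward   :", sor_sweep(A, b, x0, order=backward))
```

The omega argument stands in for the damping mentioned above. Its appropriate value depends on the operator, and this sketch does not try to choose it.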
