Are there any PETSc examples that do cache blocking that would work with the
new threads support? I was initially investigating DMDA, but that looks like
it only works with MPI processes. I was also looking at ex34.c and ex45.c in
petsc-dev/src/ksp/ksp/examples/tutorials.

Thanks,

Dave
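[For reference, a generic illustration of what cache blocking means for
bandwidth-limited vector kernels -- this is a hand-rolled sketch, not taken
from any PETSc example, and the function name and block size are made up.
Two vector operations are fused and swept block by block, so the block of y
touched by the first loop is still in cache for the second:]

#include <stddef.h>

/* Blocked, fused y = a*x + y followed by dot(y,y).  Sweeping in
   cache-sized blocks lets the second loop reuse the block of y the
   first loop just wrote, instead of streaming both vectors from main
   memory twice. */
double fused_axpy_dot(size_t n, double a, const double *x, double *y)
{
  const size_t B = 4096;          /* block size; tune to fit in L1/L2 */
  double sum = 0.0;
  size_t start, end, i;
  for (start = 0; start < n; start += B) {
    end = start + B < n ? start + B : n;
    for (i = start; i < end; i++) y[i] += a*x[i];
    for (i = start; i < end; i++) sum += y[i]*y[i];
  }
  return sum;
}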
________________________________________
From: Nystrom, William D
Sent: Friday, October 26, 2012 10:53 AM
To: Karl Rupp
Cc: For users of the development version of PETSc; Nystrom, William D
Subject: RE: [petsc-dev] Status of pthreads and OpenMP support

Karli,

Thanks. Sounds like I need to actually do the memory bandwidth calculation
to get more quantitative.

Thanks again,

Dave

________________________________________
From: Karl Rupp [rupp at mcs.anl.gov]
Sent: Friday, October 26, 2012 10:47 AM
To: Nystrom, William D
Cc: For users of the development version of PETSc
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support

Hi,

> Thanks for your reply. Doing the memory bandwidth calculation seems like
> a useful exercise. I'll give that a try. I was also trying to think of
> this from a higher-level perspective. Does this seem reasonable?
>
>   T_vec_op = T_vec_compute + T_vec_memory
>
> where these are times, but using multiple threads only speeds up the
> T_vec_compute part, while T_vec_memory is relatively constant whether I
> am doing memory operations with a single thread or multiple threads.

Yes and no :-) Due to possible multiple physical memory links and NUMA,
T_vec_memory shows a dependence on the number and affinity of threads.
Also,

  T_vec_op = max(T_vec_compute, T_vec_memory)

can be a better approximation, as memory transfers and actual arithmetic
may overlap ('prefetching'). Still, the main speed-up when using threads
(or multiple processes) is in T_vec_compute. However, hardware processing
speed has evolved such that T_vec_memory is now often dominant (exceptions
are mostly BLAS level 3 algorithms), making proper data layout and affinity
even more important.

Best regards,
Karli
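[A minimal sketch of the effective-bandwidth measurement Karli suggests:
time repeated VecAXPY calls and divide the bytes moved by the elapsed time.
Error checking is omitted, and the vector length and repetition count are
arbitrary; on a threaded build the same AXPY would exercise the threaded
kernel:]

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec      x, y;
  PetscInt n = 10000000, reps = 50, i;
  double   t0, t1, bytes;

  PetscInitialize(&argc, &argv, 0, 0);
  VecCreateSeq(PETSC_COMM_SELF, n, &x);
  VecDuplicate(x, &y);
  VecSet(x, 1.0);
  VecSet(y, 2.0);
  VecAXPY(y, 3.0, x);                       /* warm-up pass */
  t0 = MPI_Wtime();
  for (i = 0; i < reps; i++) VecAXPY(y, 3.0, x);
  t1 = MPI_Wtime();
  /* y += a*x reads x and y and writes y: 3 accesses per entry */
  bytes = 3.0*n*sizeof(PetscScalar)*reps;
  PetscPrintf(PETSC_COMM_SELF, "effective bandwidth: %g GB/s\n",
              bytes/(t1 - t0)/1.0e9);
  VecDestroy(&x);
  VecDestroy(&y);
  PetscFinalize();
  return 0;
}

[Comparing the printed figure against the node's theoretical peak shows how
close to saturation the 16-thread runs already are.]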
> ________________________________________
> From: Karl Rupp [rupp at mcs.anl.gov]
> Sent: Friday, October 26, 2012 10:20 AM
> To: For users of the development version of PETSc
> Cc: Nystrom, William D
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>
> Hi Dave,
>
> let me just comment on the expected speed-up: As the arithmetic intensity
> of vector operations is small, you are in a memory-bandwidth-limited
> regime. If you use smaller vectors in order to stay in cache, you may
> still not obtain the expected speedup, because then thread management
> overhead becomes more of an issue. I suggest you compute the effective
> memory bandwidth of your vector operations, because I suspect you are
> pretty close to bandwidth saturation already.
>
> Best regards,
> Karli
>
> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>> Jed or Shri,
>>
>> Are there other preconditioners I could use/try now with the PETSc
>> thread support besides jacobi? I looked around in the documentation for
>> something like the least squares polynomial preconditioning that is
>> referenced in a paper by Li and Saad titled "GPU-Accelerated
>> Preconditioned Iterative Linear Solvers" but did not find anything like
>> that. Would block jacobi with lu/cholesky for the block solves work with
>> the current thread support?
>>
>> Regarding the performance of my recent runs, I was surprised that I was
>> not getting closer to a 16x speedup for the purely vector operations
>> when using 16 threads compared to 1 thread. I'm running on a single node
>> of a cluster where the nodes are dual-socket Sandy Bridge CPUs and the
>> OS is TOSS 2 Linux from Livermore. So I'm assuming that is not really an
>> "unknown" sort of system.
>>
>> One thing I am wondering is whether there is an issue with my thread
>> affinities. I am setting them, but am wondering if there could be issues
>> with which chunk of a vector a given thread gets. For instance, assuming
>> a single MPI process on a single node and using 16 threads, I would
>> assume that the vector occupies a contiguous chunk of memory and that it
>> will get divided into 16 chunks. If thread 13 is the first to launch,
>> does it get the first chunk of the vector or the 13th chunk of the
>> vector? If the latter, then I would think my assignment of thread
>> affinities is optimal. If my thread assignment is optimal, then is the
>> less-than-16x speedup in the vector operations because of memory
>> bandwidth limitations or cache effects?
>>
>> What profiling tools do you recommend for use with PETSc? I have
>> investigated and tried OpenSpeedShop, HPCToolkit, and TAU, but have not
>> tried any with PETSc. I was told that there were some issues with using
>> TAU with PETSc. Not sure what they are. So far, I have liked TAU best.
>>
>> Dave
>>
>> ________________________________________
>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov]
>> on behalf of John Fettig [john.fettig at gmail.com]
>> Sent: Friday, October 26, 2012 7:47 AM
>> To: For users of the development version of PETSc
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>
>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com>
>>> wrote:
>>>>
>>>> What I see in your results is about 7x speedup by using 16 threads. I
>>>> think you should get better results by running 8 threads with 2
>>>> processes, because the memory can be allocated on separate memory
>>>> controllers, and the memory will be physically closer to the cores.
>>>> I'm surprised that you get worse results.
>>>
>>> Our intent is for the threads to use an explicit first-touch policy so
>>> that they get local memory even when you have threads across multiple
>>> NUMA zones.
>>
>> Great. I still think the performance using jacobi (as Dave does) should
>> be no worse using 2x(MPI) and 8x(thread) than it is with 1x(MPI) and
>> 16x(thread).
>>
>>>> It doesn't surprise me that an explicit code gets much better speedup.
>>>
>>> The explicit code is much less dependent on memory bandwidth relative
>>> to floating point.
>>>
>>>>> I also get about the same performance results on the ex2 problem when
>>>>> running it with just MPI alone, i.e. with 16 MPI processes.
>>>>>
>>>>> So from my perspective, the new pthreads/OpenMP support is looking
>>>>> pretty good, assuming the issue with the MKL/external packages
>>>>> interaction can be fixed.
>>>>>
>>>>> I was just using jacobi preconditioning for ex2. I'm wondering if
>>>>> there are any other preconditioners that might be multi-threaded. Or
>>>>> maybe a polynomial preconditioner could work well with the new
>>>>> pthreads/OpenMP support.
>>>>
>>>> GAMG with SOR smoothing seems like a prime candidate for threading. I
>>>> wonder if anybody has worked on this yet?
>>>
>>> SOR is not great because it's sequential.
>>
>> For structured grids we have multi-color schemes and temporally blocked
>> schemes as in this paper,
>>
>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
>>
>> For unstructured grids, could we do some analogous decomposition using
>> e.g. parmetis?
>>
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>>
>> Regards,
>> John
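[To make the multi-color idea concrete, a hand-rolled sketch of one
red-black Gauss-Seidel sweep for the 2D 5-point Laplacian -- not PETSc
code; rbgs_sweep and the array layout are made up. All points of one color
are mutually independent, so each color's loop can be split among threads,
e.g. with an OpenMP pragma on the j loop:]

/* One red-black Gauss-Seidel sweep on an n-by-n grid (row-major u and f,
   boundary rows/columns j,i = 0 and n-1 held fixed); h2 is the squared
   mesh width. */
void rbgs_sweep(int n, double h2, double *u, const double *f)
{
  int color, i, j;
  for (color = 0; color < 2; color++) {
    /* points of one color are independent: this loop can be threaded */
    for (j = 1; j < n - 1; j++) {
      for (i = 1 + ((j + 1 + color) & 1); i < n - 1; i += 2) {
        u[j*n + i] = 0.25*(u[j*n + i - 1] + u[j*n + i + 1]
                         + u[(j - 1)*n + i] + u[(j + 1)*n + i]
                         - h2*f[j*n + i]);
      }
    }
  }
}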
>>> A block Jacobi/SOR parallelizes fine, but does not guarantee stability
>>> without additional (operator-dependent) damping. Chebyshev/Jacobi
>>> smoothing will perform well with threads (but not all the kernels are
>>> ready).
>>>
>>> Coarsening and the Galerkin triple product is more difficult to thread.
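[For anyone wanting to try the Chebyshev/Jacobi combination Jed mentions, a
minimal setup sketch, assuming a KSP named ksp has already been created and
given its operators; whether all of the smoother kernels are threaded yet
is exactly the open question above:]

PC pc;
KSPGetPC(ksp, &pc);
PCSetType(pc, PCGAMG);      /* algebraic multigrid */
KSPSetFromOptions(ksp);     /* then pick the level smoother at run time:
                               -mg_levels_ksp_type chebyshev
                               -mg_levels_pc_type jacobi */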
