Shri,

Have you had a chance to investigate the issues related to the new PETSc threads package and MKL?
Dave

________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Shri [[email protected]]
Sent: Friday, October 26, 2012 5:35 PM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support

On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:

> Are there any petsc examples that do cache blocking that would work for
> the new threads support?

I don't think there are any examples that can do cache blocking using threads.

> I was initially investigating DMDA but that looks like it only works
> for mpi processes. I was looking at ex34.c and ex45.c located in
> petsc-dev/src/ksp/ksp/examples/tutorials.
>
> Thanks,
>
> Dave
>
> ________________________________________
> From: Nystrom, William D
> Sent: Friday, October 26, 2012 10:53 AM
> To: Karl Rupp
> Cc: For users of the development version of PETSc; Nystrom, William D
> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
>
> Karli,
>
> Thanks. Sounds like I need to actually do the memory bandwidth
> calculation to get more quantitative.
>
> Thanks again,
>
> Dave
>
> ________________________________________
> From: Karl Rupp [rupp at mcs.anl.gov]
> Sent: Friday, October 26, 2012 10:47 AM
> To: Nystrom, William D
> Cc: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>
> Hi,
>
>> Thanks for your reply. Doing the memory bandwidth calculation seems
>> like a useful exercise. I'll give that a try. I was also trying to
>> think of this from a higher-level perspective. Does this seem
>> reasonable?
>>
>> T_vec_op = T_vec_compute + T_vec_memory
>>
>> where these are times, but using multiple threads only speeds up the
>> T_vec_compute part, while T_vec_memory is relatively constant whether
>> I am doing memory operations with a single thread or multiple threads.
>
> Yes and no :-)
> Due to possible multiple physical memory links and NUMA, T_vec_memory
> shows a dependence on the number and affinity of threads. Also,
>
> T_vec_op = max(T_vec_compute, T_vec_memory)
>
> can be a better approximation, as memory transfers and actual
> arithmetic may overlap ('prefetching').
>
> Still, the main speed-up when using threads (or multiple processes) is
> in T_vec_compute. However, hardware processing speed has evolved such
> that T_vec_memory is now often dominant (exceptions are mostly BLAS
> level 3 algorithms), making proper data layout and affinity even more
> important.
>
> Best regards,
> Karli
>
>
>> ________________________________________
>> From: Karl Rupp [rupp at mcs.anl.gov]
>> Sent: Friday, October 26, 2012 10:20 AM
>> To: For users of the development version of PETSc
>> Cc: Nystrom, William D
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>
>> Hi Dave,
>>
>> let me just comment on the expected speed-up: as the arithmetic
>> intensity of vector operations is small, you are in a memory-bandwidth
>> limited regime. If you use smaller vectors in order to stay in cache,
>> you may still not obtain the expected speedup, because thread
>> management overhead then becomes more of an issue. I suggest you
>> compute the effective memory bandwidth of your vector operations,
>> because I suspect you are pretty close to bandwidth saturation
>> already.
>>
>> Best regards,
>> Karli
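To make Karl's suggestion concrete, here is a minimal sketch of such a bandwidth measurement (my own illustration, not code from this thread): it times repeated VecAXPY calls and converts the elapsed time to an effective bandwidth. The vector length and repetition count are arbitrary choices, error checking is omitted for brevity, and the exact signature of PETSc's timing routine has varied across versions, so treat this as a template rather than something known to compile against petsc-dev as of this thread.

#include <petscvec.h>
#include <petsctime.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscInt       n = 10000000, i, reps = 50;   /* large enough to exceed cache */
  PetscLogDouble t0, t1;

  PetscInitialize(&argc, &argv, NULL, NULL);
  VecCreateSeq(PETSC_COMM_SELF, n, &x);
  VecDuplicate(x, &y);
  VecSet(x, 1.0);
  VecSet(y, 2.0);
  VecAXPY(y, 3.14, x);                         /* warm-up: fault in the pages */

  PetscTime(&t0);
  for (i = 0; i < reps; i++) VecAXPY(y, 3.14, x);  /* y <- y + alpha*x */
  PetscTime(&t1);

  /* VecAXPY streams 2 reads + 1 write of 8 bytes per entry, i.e. 24*n
     bytes per call; compare the result to the node's measured STREAM
     bandwidth rather than the theoretical peak. */
  PetscPrintf(PETSC_COMM_SELF, "effective bandwidth: %g GB/s\n",
              24.0 * n * reps / (t1 - t0) / 1.0e9);

  VecDestroy(&x);
  VecDestroy(&y);
  PetscFinalize();
  return 0;
}

If the reported number is already near the STREAM figure with one thread per socket, adding more threads cannot speed up the vector operations much, which is exactly the saturation Karl suspects.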
>>
>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>>> Jed or Shri,
>>>
>>> Are there other preconditioners I could use/try now with the petsc
>>> thread support besides jacobi? I looked around in the documentation
>>> for something like the least-squares polynomial preconditioning that
>>> is referenced in the paper by Li and Saad titled "GPU-Accelerated
>>> Preconditioned Iterative Linear Solvers", but did not find anything
>>> like that. Would block jacobi with lu/cholesky for the block solves
>>> work with the current thread support?
>>>
>>> Regarding the performance of my recent runs, I was surprised that I
>>> was not getting closer to a 16x speedup for the purely vector
>>> operations when using 16 threads compared to 1 thread. I'm running on
>>> a single node of a cluster where the nodes have dual-socket Sandy
>>> Bridge CPUs and the OS is TOSS 2 Linux from Livermore. So I'm
>>> assuming that is not really an "unknown" sort of system. One thing I
>>> am wondering is whether there is an issue with my thread affinities.
>>> I am setting them, but am wondering if there could be issues with
>>> which chunk of a vector a given thread gets. For instance, assuming a
>>> single mpi process on a single node and using 16 threads, I would
>>> assume that the vector occupies a contiguous chunk of memory and that
>>> it will get divided into 16 chunks. If thread 13 is the first to
>>> launch, does it get the first chunk of the vector or the 13th chunk
>>> of the vector? If the latter, then I would think my assignment of
>>> thread affinities is optimal. If my thread assignment is optimal,
>>> then is the less than 16x speedup in the vector operations because of
>>> memory bandwidth limitations or cache effects?
>>>
>>> What profiling tools do you recommend to use with petsc? I have
>>> investigated and tried OpenSpeedShop, HPCToolkit and TAU, but have
>>> not tried any with petsc. I was told that there were some issues with
>>> using TAU with petsc. Not sure what they are. So far, I have liked
>>> TAU best.
>>>
>>> Dave
>>>
>>> ________________________________________
>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov]
>>> on behalf of John Fettig [john.fettig at gmail.com]
>>> Sent: Friday, October 26, 2012 7:47 AM
>>> To: For users of the development version of PETSc
>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>>
>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>>>
>>>>> What I see in your results is about 7x speedup by using 16 threads.
>>>>> I think you should get better results by running 8 threads with 2
>>>>> processes, because the memory can be allocated on separate memory
>>>>> controllers and the memory will be physically closer to the cores.
>>>>> I'm surprised that you get worse results.
>>>>
>>>> Our intent is for the threads to use an explicit first-touch policy
>>>> so that they get local memory even when you have threads across
>>>> multiple NUMA zones.
>>>
>>> Great. I still think the performance using jacobi (as Dave does)
>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
>>> 1x(MPI) and 16x(thread).
>>>
>>>>> It doesn't surprise me that an explicit code gets much better
>>>>> speedup.
>>>>
>>>> The explicit code is much less dependent on memory bandwidth
>>>> relative to floating point.
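As background on Jed's "first touch" remark: on Linux, a page is mapped to the NUMA node of the thread that first writes to it. A minimal illustration in C with OpenMP follows (a sketch of the general idea, not PETSc's actual implementation; the function name is invented). It also bears on Dave's chunk question: what matters is that the thread that initializes chunk t is the same thread, pinned to the same core, that later computes on chunk t.

#include <stdlib.h>

/* First-touch allocation sketch: initialize with the same static
   schedule the compute kernels will use, so each thread's slice of the
   vector lands in that thread's local NUMA memory. */
double *alloc_first_touch(size_t n)
{
  double *v = malloc(n * sizeof(double));
  long   i;
  if (!v) return NULL;
#pragma omp parallel for schedule(static)
  for (i = 0; i < (long)n; i++) v[i] = 0.0;
  return v;
}

If the vector is instead initialized serially (or by a different thread layout than the compute loops), all pages end up on one socket and the other socket's threads pay for remote accesses, which is one plausible source of the missing speedup.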
>>>>
>>>>>> I also get about the same performance results on the ex2 problem
>>>>>> when running it with just mpi alone, i.e. with 16 mpi processes.
>>>>>>
>>>>>> So from my perspective, the new pthreads/openmp support is looking
>>>>>> pretty good, assuming the issue with the MKL/external packages
>>>>>> interaction can be fixed.
>>>>>>
>>>>>> I was just using jacobi preconditioning for ex2. I'm wondering if
>>>>>> there are any other preconditioners that might be multi-threaded.
>>>>>> Or maybe a polynomial preconditioner could work well for the new
>>>>>> pthreads/openmp support.
>>>>>
>>>>> GAMG with SOR smoothing seems like a prime candidate for threading.
>>>>> I wonder if anybody has worked on this yet?
>>>>
>>>> SOR is not great because it's sequential.
>>>
>>> For structured grids we have multi-color schemes and temporally
>>> blocked schemes as in this paper:
>>>
>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
>>>
>>> For unstructured grids, could we do some analogous decomposition
>>> using e.g. parmetis?
>>>
>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>>>
>>> Regards,
>>> John
>>>
>>>> A block Jacobi/SOR parallelizes fine, but does not guarantee
>>>> stability without additional (operator-dependent) damping.
>>>> Chebyshev/Jacobi smoothing will perform well with threads (but not
>>>> all the kernels are ready).
>>>>
>>>> Coarsening and the Galerkin triple product is more difficult to
>>>> thread.
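To make John's multi-color suggestion for structured grids concrete, here is a red-black Gauss-Seidel sweep in C with OpenMP (an illustrative sketch for the standard 5-point discretization of -Laplace(u) = f, not PETSc code; the function name and conventions are invented). Points of one color depend only on points of the other color, so each half-sweep is fully parallel and avoids the sequential dependence Jed points out for plain SOR.

/* One red-black Gauss-Seidel sweep on an nx-by-ny grid with Dirichlet
   boundary values held in u; h2 is the squared mesh spacing. */
void rbgs_sweep(int nx, int ny, double h2, const double *f, double *u)
{
  int color, i, j;
  for (color = 0; color < 2; color++) {
#pragma omp parallel for private(i) schedule(static)
    for (j = 1; j < ny - 1; j++) {
      /* (j + color) parity selects every other point in each row */
      for (i = 1 + (j + color) % 2; i < nx - 1; i += 2) {
        int k = j * nx + i;
        u[k] = 0.25 * (u[k-1] + u[k+1] + u[k-nx] + u[k+nx] + h2 * f[k]);
      }
    }
  }
}

The same coloring idea generalizes to unstructured grids, where a graph coloring (e.g. computed alongside a parmetis partition, as John suggests) partitions the unknowns into independent sets that can each be swept in parallel.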
