Shri,

Have you had a chance to investigate the issues related to the new PETSc threads package and MKL?
Dave

________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Shri [[email protected]]
Sent: Friday, October 26, 2012 5:35 PM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support

On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:

> Are there any petsc examples that do cache blocking that would work for
> the new threads support?

I don't think there are any examples that can do cache blocking using threads.

> I was initially investigating DMDA but that looks like it only works
> for mpi processes. I was looking at ex34.c and ex45.c located in
> petsc-dev/src/ksp/ksp/examples/tutorials.
>
> Thanks,
>
> Dave
>
> ________________________________________
> From: Nystrom, William D
> Sent: Friday, October 26, 2012 10:53 AM
> To: Karl Rupp
> Cc: For users of the development version of PETSc; Nystrom, William D
> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
>
> Karli,
>
> Thanks. Sounds like I need to actually do the memory bandwidth
> calculation to get more quantitative.
>
> Thanks again,
>
> Dave
>
> ________________________________________
> From: Karl Rupp [rupp at mcs.anl.gov]
> Sent: Friday, October 26, 2012 10:47 AM
> To: Nystrom, William D
> Cc: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>
> Hi,
>
>> Thanks for your reply. Doing the memory bandwidth calculation seems
>> like a useful exercise. I'll give that a try. I was also trying to
>> think of this from a higher-level perspective. Does this seem
>> reasonable?
>>
>> T_vec_op = T_vec_compute + T_vec_memory
>>
>> where these are times, but using multiple threads only speeds up the
>> T_vec_compute part, while T_vec_memory is relatively constant whether
>> I am doing memory operations with a single thread or multiple threads.
>
> Yes and no :-)
> Due to possible multiple physical memory links and NUMA, T_vec_memory
> shows a dependence on the number and affinity of threads. Also,
>
> T_vec_op = max(T_vec_compute, T_vec_memory)
>
> can be a better approximation, as memory transfers and actual
> arithmetic may overlap ('prefetching').
>
> Still, the main speed-up when using threads (or multiple processes) is
> in T_vec_compute. However, hardware processing speed has evolved such
> that T_vec_memory is now often dominant (exceptions are mostly BLAS
> level 3 algorithms), making proper data layout and affinity even more
> important.
>
> Best regards,
> Karli
>
>
>> ________________________________________
>> From: Karl Rupp [rupp at mcs.anl.gov]
>> Sent: Friday, October 26, 2012 10:20 AM
>> To: For users of the development version of PETSc
>> Cc: Nystrom, William D
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>
>> Hi Dave,
>>
>> let me just comment on the expected speed-up: as the arithmetic
>> intensity of vector operations is small, you are in a memory-bandwidth
>> limited regime. If you use smaller vectors in order to stay in cache,
>> you may still not obtain the expected speedup, because thread
>> management overhead then becomes more of an issue. I suggest you
>> compute the effective memory bandwidth of your vector operations,
>> because I suspect you are pretty close to bandwidth saturation
>> already.
>>
>> Best regards,
>> Karli
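To make Karl's suggestion concrete, here is a minimal sketch of such a bandwidth measurement (my own illustration, not code from this thread): it times repeated VecAXPY calls and converts the elapsed time to an effective bandwidth. The vector length and repetition count are arbitrary choices, error checking is omitted for brevity, and the exact signature of PETSc's timing routine has varied across versions, so treat this as a template rather than something known to compile against petsc-dev as of this thread.

#include <petscvec.h>
#include <petsctime.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscInt       n = 10000000, i, reps = 50;   /* large enough to exceed cache */
  PetscLogDouble t0, t1;

  PetscInitialize(&argc, &argv, NULL, NULL);
  VecCreateSeq(PETSC_COMM_SELF, n, &x);
  VecDuplicate(x, &y);
  VecSet(x, 1.0);
  VecSet(y, 2.0);
  VecAXPY(y, 3.14, x);                         /* warm-up: fault in the pages */

  PetscTime(&t0);
  for (i = 0; i < reps; i++) VecAXPY(y, 3.14, x);  /* y <- y + alpha*x */
  PetscTime(&t1);

  /* VecAXPY streams 2 reads + 1 write of 8 bytes per entry, i.e. 24*n
     bytes per call; compare the result to the node's measured STREAM
     bandwidth rather than the theoretical peak. */
  PetscPrintf(PETSC_COMM_SELF, "effective bandwidth: %g GB/s\n",
              24.0 * n * reps / (t1 - t0) / 1.0e9);

  VecDestroy(&x);
  VecDestroy(&y);
  PetscFinalize();
  return 0;
}

If the reported number is already near the STREAM figure with one thread per socket, adding more threads cannot speed up the vector operations much, which is exactly the saturation Karl suspects.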
>>
>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>>> Jed or Shri,
>>>
>>> Are there other preconditioners I could use/try now with the petsc
>>> thread support besides jacobi? I looked around in the documentation
>>> for something like the least-squares polynomial preconditioning that
>>> is referenced in the paper by Li and Saad titled "GPU-Accelerated
>>> Preconditioned Iterative Linear Solvers", but did not find anything
>>> like that. Would block jacobi with lu/cholesky for the block solves
>>> work with the current thread support?
>>>
>>> Regarding the performance of my recent runs, I was surprised that I
>>> was not getting closer to a 16x speedup for the purely vector
>>> operations when using 16 threads compared to 1 thread. I'm running on
>>> a single node of a cluster where the nodes have dual-socket Sandy
>>> Bridge CPUs and the OS is TOSS 2 Linux from Livermore. So I'm
>>> assuming that is not really an "unknown" sort of system. One thing I
>>> am wondering is whether there is an issue with my thread affinities.
>>> I am setting them, but am wondering if there could be issues with
>>> which chunk of a vector a given thread gets. For instance, assuming a
>>> single mpi process on a single node and using 16 threads, I would
>>> assume that the vector occupies a contiguous chunk of memory and that
>>> it will get divided into 16 chunks. If thread 13 is the first to
>>> launch, does it get the first chunk of the vector or the 13th chunk
>>> of the vector? If the latter, then I would think my assignment of
>>> thread affinities is optimal. If my thread assignment is optimal,
>>> then is the less than 16x speedup in the vector operations because of
>>> memory bandwidth limitations or cache effects?
>>>
>>> What profiling tools do you recommend to use with petsc? I have
>>> investigated and tried OpenSpeedShop, HPCToolkit and TAU, but have
>>> not tried any with petsc. I was told that there were some issues with
>>> using TAU with petsc. Not sure what they are. So far, I have liked
>>> TAU best.
>>>
>>> Dave
>>>
>>> ________________________________________
>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov]
>>> on behalf of John Fettig [john.fettig at gmail.com]
>>> Sent: Friday, October 26, 2012 7:47 AM
>>> To: For users of the development version of PETSc
>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>>
>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>>>
>>>>> What I see in your results is about 7x speedup by using 16 threads.
>>>>> I think you should get better results by running 8 threads with 2
>>>>> processes, because the memory can be allocated on separate memory
>>>>> controllers and the memory will be physically closer to the cores.
>>>>> I'm surprised that you get worse results.
>>>>
>>>> Our intent is for the threads to use an explicit first-touch policy
>>>> so that they get local memory even when you have threads across
>>>> multiple NUMA zones.
>>>
>>> Great. I still think the performance using jacobi (as Dave does)
>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
>>> 1x(MPI) and 16x(thread).
>>>
>>>>> It doesn't surprise me that an explicit code gets much better
>>>>> speedup.
>>>>
>>>> The explicit code is much less dependent on memory bandwidth
>>>> relative to floating point.
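As background on Jed's "first touch" remark: on Linux, a page is mapped to the NUMA node of the thread that first writes to it. A minimal illustration in C with OpenMP follows (a sketch of the general idea, not PETSc's actual implementation; the function name is invented). It also bears on Dave's chunk question: what matters is that the thread that initializes chunk t is the same thread, pinned to the same core, that later computes on chunk t.

#include <stdlib.h>

/* First-touch allocation sketch: initialize with the same static
   schedule the compute kernels will use, so each thread's slice of the
   vector lands in that thread's local NUMA memory. */
double *alloc_first_touch(size_t n)
{
  double *v = malloc(n * sizeof(double));
  long   i;
  if (!v) return NULL;
#pragma omp parallel for schedule(static)
  for (i = 0; i < (long)n; i++) v[i] = 0.0;
  return v;
}

If the vector is instead initialized serially (or by a different thread layout than the compute loops), all pages end up on one socket and the other socket's threads pay for remote accesses, which is one plausible source of the missing speedup.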
>>>>
>>>>>> I also get about the same performance results on the ex2 problem
>>>>>> when running it with just mpi alone, i.e. with 16 mpi processes.
>>>>>>
>>>>>> So from my perspective, the new pthreads/openmp support is looking
>>>>>> pretty good, assuming the issue with the MKL/external packages
>>>>>> interaction can be fixed.
>>>>>>
>>>>>> I was just using jacobi preconditioning for ex2. I'm wondering if
>>>>>> there are any other preconditioners that might be multi-threaded.
>>>>>> Or maybe a polynomial preconditioner could work well for the new
>>>>>> pthreads/openmp support.
>>>>>
>>>>> GAMG with SOR smoothing seems like a prime candidate for threading.
>>>>> I wonder if anybody has worked on this yet?
>>>>
>>>> SOR is not great because it's sequential.
>>>
>>> For structured grids we have multi-color schemes and temporally
>>> blocked schemes as in this paper:
>>>
>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
>>>
>>> For unstructured grids, could we do some analogous decomposition
>>> using e.g. parmetis?
>>>
>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>>>
>>> Regards,
>>> John
>>>
>>>> A block Jacobi/SOR parallelizes fine, but does not guarantee
>>>> stability without additional (operator-dependent) damping.
>>>> Chebyshev/Jacobi smoothing will perform well with threads (but not
>>>> all the kernels are ready).
>>>>
>>>> Coarsening and the Galerkin triple product is more difficult to
>>>> thread.
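To make John's multi-color suggestion for structured grids concrete, here is a red-black Gauss-Seidel sweep in C with OpenMP (an illustrative sketch for the standard 5-point discretization of -Laplace(u) = f, not PETSc code; the function name and conventions are invented). Points of one color depend only on points of the other color, so each half-sweep is fully parallel and avoids the sequential dependence Jed points out for plain SOR.

/* One red-black Gauss-Seidel sweep on an nx-by-ny grid with Dirichlet
   boundary values held in u; h2 is the squared mesh spacing. */
void rbgs_sweep(int nx, int ny, double h2, const double *f, double *u)
{
  int color, i, j;
  for (color = 0; color < 2; color++) {
#pragma omp parallel for private(i) schedule(static)
    for (j = 1; j < ny - 1; j++) {
      /* (j + color) parity selects every other point in each row */
      for (i = 1 + (j + color) % 2; i < nx - 1; i += 2) {
        int k = j * nx + i;
        u[k] = 0.25 * (u[k-1] + u[k+1] + u[k-nx] + u[k+nx] + h2 * f[k]);
      }
    }
  }
}

The same coloring idea generalizes to unstructured grids, where a graph coloring (e.g. computed alongside a parmetis partition, as John suggests) partitions the unknowns into independent sets that can each be swept in parallel.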
