On Thu, Apr 16, 2020 at 10:11 AM Mark Adams <[email protected]> wrote:

> On Thu, Apr 16, 2020 at 9:31 AM Matthew Knepley <[email protected]> wrote:
>
>> On Thu, Apr 16, 2020 at 8:42 AM Mark Adams <[email protected]> wrote:
>>
>>> Yea, GPU assembly would be great. I was figuring OMP might be simpler.
>>>
>>> As far as the interface, I am flexible. The simplest way to do it would
>>> be to take an array of element matrices and a DMPlex and call
>>> DMPlexMatSetClosure. You can see this code in
>>> mark/feature-xgc-interface-rebase, at the bottom of
>>> src/vec/vec/impls/seq/seqcuda/landau.cu.
>>>
>>> I was shy about putting a version of DMPlexMatSetClosure in CUDA, but
>>> maybe that is easier: just plow through it and cut the stuff that we
>>> don't need. OMP broke because there are some temp arrays that Matt
>>> caches that need to be made "private" or dealt with in some way.
>>
>> We should refactor so that all temp arrays are sized and constructed up
>> front, and then the work is done in an internal function which is passed
>> those arrays. I tried to do this, but might have crapped out here. Then
>> you can just call the internal function directly with your arrays.
>
> Just to be clear, we have to copy the code to a .cu file and declare
> everything as a device method (__device__/__global__), right?
>
> Or we write a batched version of DMPlexMatSetClosure, and when it gets
> down to some kernel like MatSetValues we, for example, move the Mat
> pointers to the GPU, copy the element matrices to the device in a .cu
> file, launch a kernel that calls a __global__ version of
> MatSetValues_SeqAIJ (with mallocs stripped out), then copy the Mat
> pointers back to the CPU. All this copy stuff is usually done with a
> shadow copy of the object, but that is complicated by cusparse matrices
> which, from what I can tell, have a different, cusparse-friendly, device
> Mat object.
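For concreteness, here is a minimal sketch of the batched device-side
insertion described above. Everything in it is an illustrative assumption
rather than existing PETSc API: the kernel name, the flattened index
layout, the raw CSR arrays pulled out of the Mat, and real
double-precision scalars (device atomicAdd on doubles needs sm_60+).
Mallocs and constrained (negative) indices are stripped out, as suggested.

  #include <petscsys.h> /* PetscInt, PetscScalar */

  /* One thread block per element: scatter a dense nidx x nidx element
     matrix into device-resident CSR arrays (ai, aj, aa) with ADD_VALUES
     semantics. A stripped-down stand-in for a __global__
     MatSetValues_SeqAIJ, not the real thing. */
  __global__ void MatSetValuesBatch_Sketch(PetscInt nidx, const PetscInt *idx,
                                           const PetscScalar *elemMats,
                                           const PetscInt *ai, const PetscInt *aj,
                                           PetscScalar *aa)
  {
    const PetscInt     e    = blockIdx.x;        /* element (cell) index */
    const PetscInt    *rows = idx + e*nidx;      /* global indices for this closure */
    const PetscScalar *Ae   = elemMats + e*nidx*nidx;

    for (PetscInt ij = threadIdx.x; ij < nidx*nidx; ij += blockDim.x) {
      const PetscInt row = rows[ij/nidx], col = rows[ij%nidx];
      for (PetscInt k = ai[row]; k < ai[row+1]; ++k) { /* linear search; real code would bisect */
        if (aj[k] == col) { atomicAdd(&aa[k], Ae[ij]); break; }
      }
    }
  }

It would be launched as MatSetValuesBatch_Sketch<<<nelem, 128>>>(...)
after cudaMemcpy'ing idx and elemMats down; where ai/aj/aa come from is
exactly the shadow-copy/cusparse question raised above.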
I would just get the closure indices for that batch of cells, push them to
the GPU, and call MatSetValues() from the GPU. Here is the way I am
thinking: you are not going to push the Plex to the GPU, so you have to do
Plex->indices on the CPU. Just do that, push the indices down, and use the
PETSc Mat from there.

  Matt

>
>> Matt
>>
>>> Coloring is not attractive to me because GPUs demand a lot of
>>> parallelism, and this serial (velocity space) solver would be embedded
>>> in a full 3D code that does not use a huge amount of MPI parallelism.
>>> For instance, if the app code were to use 6 (or 7 max on SUMMIT) cores
>>> per GPU (or even 4x that with hardware threads), then *I could imagine*
>>> there would be enough parallelism, with coloring, to fuse the element
>>> construction and assembly, that is, assembling each element matrix
>>> right after it is created. That would be great in terms of not storing
>>> all these matrices and then assembling them all at once. The app that I
>>> am targeting does not use that much MPI parallelism, though. But we
>>> could explore that coloring space, and my mental model could be
>>> inaccurate. (Note, I did recently add 8x more parallelism to my code
>>> this week and got a 25% speedup, using one whole GPU.)
>>>
>>> Or if you have some sort of lower-level synchronization that could
>>> allow for fusing the assembly with the element creation, then, by all
>>> means, we can explore that.
>>>
>>> I'd be happy to work with you on this.
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Mon, Apr 13, 2020 at 7:08 PM Junchao Zhang <[email protected]> wrote:
>>>
>>>> Probably matrix assembly on GPU is more important. Do you have an
>>>> example for me to play with, to see what GPU interface we should have?
>>>> --Junchao Zhang
>>>>
>>>> On Mon, Apr 13, 2020 at 5:44 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> I was looking into assembling matrices with threads. I have a
>>>>> coloring to avoid conflicts.
>>>>>
>>>>> Turning off all the logging seems like overkill; for methods that can
>>>>> get called in a thread, we could use PETSC_HAVE_THREADSAFETY to
>>>>> protect the logging functions. One can then still get timings for the
>>>>> whole assembly process, just not for MatSetValues. Few people are
>>>>> going to do this. I don't think it will be a time sink, and if it is
>>>>> we just revert to saying 'turn logging off'. I don't see a good
>>>>> argument for insisting that logging be turned off (it is pretty
>>>>> important) if we just say that we are going to protect methods as
>>>>> needed.
>>>>>
>>>>> It is not a big deal, I am just exploring this idea. It is such a
>>>>> basic concept in shared-memory sparse linear algebra that it seems
>>>>> like a good thing to be able to support, and to have an example that
>>>>> says we can assemble matrices in threads (not that it is a great
>>>>> idea). We have all the tools (eg, coloring methods), so it is just a
>>>>> matter of protecting a few methods. I use DMPlexMatSetClosure instead
>>>>> of MatSetValues, and this is where I die now with non-thread-safe
>>>>> code. We have an idea, from Jed, on how to fix it.
>>>>>
>>>>> Anyway, thanks for your help, but I think we should hold off on doing
>>>>> anything until we have some consensus that it would be a good idea to
>>>>> put some effort into getting a thread-safe PETSc that can support OMP
>>>>> matrix assembly, with a nice compact example.
>>>>>
>>>>> Thanks again,
>>>>> Mark
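For concreteness, a minimal sketch of the colored OMP assembly described
above, under stated assumptions: a thread-safe PETSc build
(--with-threadsafety), a fixed closure size NIDX, and placeholder routines
GetCellsOfColor and ComputeElementMatrix standing in for the application's
coloring and element integration.

  #include <petscmat.h>
  #define NIDX 3 /* assumed fixed closure size, e.g. P1 triangles */

  /* placeholders for the application's coloring and element integration */
  extern void GetCellsOfColor(PetscInt color, PetscInt *ncells, const PetscInt **cells);
  extern void ComputeElementMatrix(PetscInt cell, PetscInt rows[], PetscScalar elemMat[]);

  PetscErrorCode AssembleByColor(Mat A, PetscInt ncolors)
  {
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    for (PetscInt c = 0; c < ncolors; ++c) {
      PetscInt        ncells;
      const PetscInt *cells;
      GetCellsOfColor(c, &ncells, &cells);
      /* cells of one color have disjoint closures, so the MatSetValues
         calls within a color cannot conflict; colors run one at a time */
  #pragma omp parallel for
      for (PetscInt k = 0; k < ncells; ++k) {
        PetscInt    rows[NIDX];
        PetscScalar elemMat[NIDX*NIDX];
        ComputeElementMatrix(cells[k], rows, elemMat);
        /* error code dropped inside the parallel loop for brevity */
        MatSetValues(A, NIDX, rows, NIDX, rows, elemMat, ADD_VALUES);
      }
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

Swapping DMPlexMatSetClosure in for MatSetValues here is exactly where the
cached temp arrays bite, per the discussion above.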
>>>>>
>>>>> On Mon, Apr 13, 2020 at 5:44 PM Junchao Zhang <[email protected]> wrote:
>>>>>
>>>>>> Mark,
>>>>>> I saw you had "--with-threadsafety --with-log=0". Do you really want
>>>>>> to call petsc from multiple threads (in contrast to letting petsc
>>>>>> call other libraries, e.g., BLAS, that do multithreading)? If not,
>>>>>> you can drop --with-threadsafety.
>>>>>> I have https://gitlab.com/petsc/petsc/-/merge_requests/2714 that
>>>>>> should fix your original compilation errors.
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>> On Mon, Apr 13, 2020 at 2:07 PM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html
>>>>>>>
>>>>>>> and I see this on my Mac:
>>>>>>>
>>>>>>> 14:23 1 mark/feature-xgc-interface-rebase *= ~/Codes/petsc$ ../arch-macosx-gnu-O-omp.py
>>>>>>> ===============================================================================
>>>>>>>              Configuring PETSc to compile on your system
>>>>>>> ===============================================================================
>>>>>>> Warning: PETSC_ARCH from environment does not match command-line or name of script.
>>>>>>> Warning: Using from command-line or name of script: arch-macosx-gnu-O-omp,
>>>>>>>          ignoring environment: arch-macosx-gnu-g
>>>>>>> ===============================================================================
>>>>>>> TESTING: configureLibraryOptions from PETSc.options.libraryOptions(config/PETSc/options/libraryOptions.py:37)
>>>>>>> *******************************************************************************
>>>>>>>    UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
>>>>>>> -------------------------------------------------------------------------------
>>>>>>> Must use --with-log=0 with --with-threadsafety
>>>>>>> *******************************************************************************
>>>>>>>
>>>>>>> On Mon, Apr 13, 2020 at 2:54 PM Junchao Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Mon, Apr 13, 2020 at 12:06 PM Mark Adams <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> BTW, I can build on SUMMIT with logging and OMP, apparently. I
>>>>>>>>> also seem to be able to build with debugging. Both of which are
>>>>>>>>> not allowed according to the docs. I am puzzled.
>>>>>>>>
>>>>>>>> What are "the docs"?
>>>>>>>>
>>>>>>>>> On Mon, Apr 13, 2020 at 12:05 PM Mark Adams <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think the problem is that you have to turn off logging with
>>>>>>>>>> openmp, and the (newish) GPU timers did not protect their timers.
>>>>>>>>>>
>>>>>>>>>> I don't see a good reason to require that logging be turned off
>>>>>>>>>> with OMP. We could use PETSC_HAVE_THREADSAFETY to protect the
>>>>>>>>>> logs that we care about (eg, in MatSetValues), and as users
>>>>>>>>>> discover more things that they want to call in an OMP thread
>>>>>>>>>> block, tell them to turn logging off and we will fix it when we
>>>>>>>>>> can.
>>>>>>>>>>
>>>>>>>>>> Any thoughts on the idea of letting users keep logging with
>>>>>>>>>> openmp?
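For concreteness, a sketch of the PETSC_HAVE_THREADSAFETY protection
floated above, inside a method that may be called from a thread. This is a
proposal sketch, not how the logging macros are currently handled: it
simply compiles the fine-grained event out of thread-callable paths, so
coarser events (the whole assembly) still get timed with logging left on.

  /* Inside, e.g., MatSetValues_SeqAIJ: skip just this event when built
     thread-safe, instead of forcing --with-log=0 globally. MAT_SetValues
     is the existing PETSc log event for this routine. */
  #if !defined(PETSC_HAVE_THREADSAFETY)
    ierr = PetscLogEventBegin(MAT_SetValues, mat, 0, 0, 0);CHKERRQ(ierr);
  #endif
    /* ... insertion logic ... */
  #if !defined(PETSC_HAVE_THREADSAFETY)
    ierr = PetscLogEventEnd(MAT_SetValues, mat, 0, 0, 0);CHKERRQ(ierr);
  #endif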
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 13, 2020 at 11:40 AM Junchao Zhang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes. Looks like we need to include petsclog.h. I don't know why
>>>>>>>>>>> OMP triggered the error.
>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 13, 2020 at 9:59 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Should I do an MR to fix this?

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
