On Thu, Apr 16, 2020 at 8:42 AM Mark Adams <[email protected]> wrote:

> Yea, GPU assembly would be great. I was figuring OMP might be simpler.
>
> As far as the interface, I am flexible. The simplest way to do it would be
> to take an array of element matrices and a DMPlex and call
> DMPlexMatSetClosure. You can see this code in
> mark/feature-xgc-interface-rebase, at the bottom of
> src/vec/vec/impls/seq/seqcuda/landau.cu.
>
> I was shy about putting a version of DMPlexMatSetClosure in CUDA, but
> maybe that is easier: just plow through it and cut the stuff that we don't
> need. OMP broke because there are some temp arrays that Matt caches that
> need to be made "private" or dealt with in some way.
>
We should refactor so that all temp arrays are sized and constructed up
front, and then the work is done in an internal function which is passed
those arrays. I tried to do this, but might have crapped out here. Then you
can just call the internal function directly with your arrays.

  Matt

> Coloring is not attractive to me because GPUs demand a lot of parallelism,
> and this serial (velocity space) solver would be embedded in a full 3D code
> that does not use a huge amount of MPI parallelism. For instance, if the
> app code were to use 6 (or 7 max on SUMMIT) cores per GPU (or even 4x that
> with hardware threads), then *I could imagine* there would be enough
> parallelism, with coloring, to fuse the element construction and assembly,
> assembling each element matrix right after it is created. That would be
> great in terms of not storing all these matrices and then assembling them
> all at once. The app that I am targeting does not use that much MPI
> parallelism, though. But we could explore that coloring space, and my
> mental model could be inaccurate. (Note: I did recently add 8x more
> parallelism to my code this week and got a 25% speedup, using one whole
> GPU.)
>
> Or, if you have some sort of lower-level synchronization that could allow
> for fusing the assembly with the element creation, then by all means we
> can explore that.
>
> I'd be happy to work with you on this.
>
> Thanks,
> Mark
>
> On Mon, Apr 13, 2020 at 7:08 PM Junchao Zhang <[email protected]>
> wrote:
>
>> Probably matrix assembly on GPU is more important. Do you have an example
>> for me to play with to see what GPU interface we should have?
>> --Junchao Zhang
>>
>> On Mon, Apr 13, 2020 at 5:44 PM Mark Adams <[email protected]> wrote:
>>
>>> I was looking into assembling matrices with threads. I have a coloring
>>> to avoid conflicts.
>>>
>>> Turning off all the logging seems way overkill. For methods that can
>>> get called in a thread we could use PETSC_HAVE_THREADSAFETY to protect
>>> the logging functions. So one can still get timings for the whole
>>> assembly process, just not for MatSetValues. Few people are going to do
>>> this. I don't think it will be a time sink, and if it is we just revert
>>> back to saying 'turn logging off'. I don't see a good argument for
>>> insisting on turning off logging (it is pretty important) if we just
>>> say that we are going to protect methods as needed.
>>>
>>> It is not a big deal; I am just exploring this idea. It is such a basic
>>> concept in shared-memory sparse linear algebra that it seems like a good
>>> thing to be able to support, and to have an example to say we can
>>> assemble matrices in threads (not that it is a great idea). We have all
>>> the tools (e.g., coloring methods), so it is just a matter of protecting
>>> a few methods. I use DMPlexMatSetClosure instead of MatSetValues, and
>>> this is where I die now with non-thread-safe code. We have an idea, from
>>> Jed, on how to fix it.
>>>
>>> Anyway, thanks for your help, but I think we should hold off on doing
>>> anything until we have some consensus that it would be a good idea to
>>> put some effort into getting a thread-safe PETSc that can support OMP
>>> matrix assembly, with a nice compact example.
>>>
>>> Thanks again,
>>> Mark
>>>
>>> On Mon, Apr 13, 2020 at 5:44 PM Junchao Zhang <[email protected]>
>>> wrote:
>>>
>>>> Mark,
>>>> I saw you had "--with-threadsafety --with-log=0". Do you really want
>>>> to call PETSc from multiple threads (in contrast to letting PETSc call
>>>> other libraries, e.g., BLAS, that do multithreading)? If not, you can
>>>> drop --with-threadsafety.
>>>> I have https://gitlab.com/petsc/petsc/-/merge_requests/2714 that
>>>> should fix your original compilation errors.
>>>>
>>>> --Junchao Zhang
>>>>
>>>> On Mon, Apr 13, 2020 at 2:07 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html
>>>>>
>>>>> and I see this on my Mac:
>>>>>
>>>>> 14:23 1 mark/feature-xgc-interface-rebase *= ~/Codes/petsc$ ../arch-macosx-gnu-O-omp.py
>>>>> ===============================================================================
>>>>>              Configuring PETSc to compile on your system
>>>>> ===============================================================================
>>>>> ===============================================================================
>>>>> Warning: PETSC_ARCH from environment does not match command-line or name of script.
>>>>> Warning: Using from command-line or name of script: arch-macosx-gnu-O-omp,
>>>>> ignoring environment: arch-macosx-gnu-g
>>>>> ===============================================================================
>>>>> TESTING: configureLibraryOptions from
>>>>> PETSc.options.libraryOptions(config/PETSc/options/libraryOptions.py:37)
>>>>> *******************************************************************************
>>>>>    UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
>>>>> -------------------------------------------------------------------------------
>>>>> Must use --with-log=0 with --with-threadsafety
>>>>> *******************************************************************************
>>>>>
>>>>> On Mon, Apr 13, 2020 at 2:54 PM Junchao Zhang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On Mon, Apr 13, 2020 at 12:06 PM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> BTW, I can build on SUMMIT with logging and OMP, apparently. I also
>>>>>>> seem to be able to build with debugging, both of which are not
>>>>>>> allowed according to the docs. I am puzzled.
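[Editor's note: the constraint in the configure failure above amounts to lines like the following; the arch name is taken from Mark's script, and the exact option set of his arch-macosx-gnu-O-omp.py script is not shown in the thread, so this is a sketch.]

```shell
# Fails at configure time with:
#   "Must use --with-log=0 with --with-threadsafety"
./configure PETSC_ARCH=arch-macosx-gnu-O-omp --with-openmp --with-threadsafety

# What configure currently requires instead:
./configure PETSC_ARCH=arch-macosx-gnu-O-omp --with-openmp --with-threadsafety --with-log=0
```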
>>>>>>>
>>>>>> What are "the docs"?
>>>>>>
>>>>>>> On Mon, Apr 13, 2020 at 12:05 PM Mark Adams <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think the problem is that you have to turn off logging with
>>>>>>>> OpenMP, and the (newish) GPU timers did not protect their timers.
>>>>>>>>
>>>>>>>> I don't see a good reason to require that logging be turned off
>>>>>>>> with OMP. We could use PETSC_HAVE_THREADSAFETY to protect the logs
>>>>>>>> that we care about (e.g., in MatSetValues), and as users discover
>>>>>>>> more things that they want to call in an OMP thread block, tell
>>>>>>>> them to turn logging off and we will fix it when we can.
>>>>>>>>
>>>>>>>> Any thoughts on the idea of letting users keep logging with OpenMP?
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2020 at 11:40 AM Junchao Zhang <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes. Looks like we need to include petsclog.h. I don't know why
>>>>>>>>> OMP triggered the error.
>>>>>>>>> --Junchao Zhang
>>>>>>>>>
>>>>>>>>> On Mon, Apr 13, 2020 at 9:59 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Should I do an MR to fix this?

--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead. -- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
