On Thu, Apr 16, 2020 at 10:11 AM Mark Adams <[email protected]> wrote:

> On Thu, Apr 16, 2020 at 9:31 AM Matthew Knepley <[email protected]> wrote:
>
>> On Thu, Apr 16, 2020 at 8:42 AM Mark Adams <[email protected]> wrote:
>>
>>> Yea, GPU assembly would be great. I was figuring OMP might be simpler.
>>>
>>> As far as the interface, I am flexible. The simplest way to do it would
>>> be to take an array of element matrices and a DMPlex and call
>>> DMPlexMatSetClosure. You can see this code in
>>> mark/feature-xgc-interface-rebase, at the bottom of
>>> src/vec/vec/impls/seq/seqcuda/landau.cu.
>>>
>>> I was shy about putting a version of DMPlexMatSetClosure in CUDA, but
>>> maybe that is easier: just plow through it and cut the stuff that we
>>> don't need. OMP broke because there are some temp arrays that Matt
>>> caches that need to be made "private" or dealt with in some way.
>>
>> We should refactor so that all temp arrays are sized and constructed up
>> front, and then the work is done in an internal function which is passed
>> those arrays. I tried to do this, but might have crapped out here. Then
>> you can just call the internal function directly with your arrays.
>
> Just to be clear, we have to copy the code to a .cu file and declare
> everything as a device method (__device__/__global__), right?
>
> Or we write a batched version of DMPlexMatSetClosure, and when it gets
> down to some kernel like MatSetValues we, for example, move the Mat
> pointers to the GPU, copy the element matrices to the device in a .cu
> file, launch a kernel that calls a __global__ version of
> MatSetValues_SeqAIJ (with mallocs stripped out), then copy the Mat
> pointers back to the CPU. All this copy stuff is usually done with a
> shadow copy of the object, but that is complicated by cusparse matrices
> which, from what I can tell, have a different, cusparse-friendly, device
> Mat object.
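For concreteness, here is a minimal sketch of the batched device-side
insertion described above. Everything in it is an illustrative assumption
rather than existing PETSc API: the kernel name, the flattened index
layout, the raw CSR arrays pulled out of the Mat, and real
double-precision scalars (device atomicAdd on doubles needs sm_60+).
Mallocs and constrained (negative) indices are stripped out, as suggested.

  #include <petscsys.h> /* PetscInt, PetscScalar */

  /* One thread block per element: scatter a dense nidx x nidx element
     matrix into device-resident CSR arrays (ai, aj, aa) with ADD_VALUES
     semantics. A stripped-down stand-in for a __global__
     MatSetValues_SeqAIJ, not the real thing. */
  __global__ void MatSetValuesBatch_Sketch(PetscInt nidx, const PetscInt *idx,
                                           const PetscScalar *elemMats,
                                           const PetscInt *ai, const PetscInt *aj,
                                           PetscScalar *aa)
  {
    const PetscInt     e    = blockIdx.x;        /* element (cell) index */
    const PetscInt    *rows = idx + e*nidx;      /* global indices for this closure */
    const PetscScalar *Ae   = elemMats + e*nidx*nidx;

    for (PetscInt ij = threadIdx.x; ij < nidx*nidx; ij += blockDim.x) {
      const PetscInt row = rows[ij/nidx], col = rows[ij%nidx];
      for (PetscInt k = ai[row]; k < ai[row+1]; ++k) { /* linear search; real code would bisect */
        if (aj[k] == col) { atomicAdd(&aa[k], Ae[ij]); break; }
      }
    }
  }

It would be launched as MatSetValuesBatch_Sketch<<<nelem, 128>>>(...)
after cudaMemcpy'ing idx and elemMats down; where ai/aj/aa come from is
exactly the shadow-copy/cusparse question raised above.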
I would just get the closure indices for that batch of cells, push them to
the GPU, and call MatSetValues() from the GPU. Here is the way I am
thinking: you are not going to push the Plex to the GPU, so you have to do
Plex->indices on the CPU. Just do that, push the indices down, and use the
PETSc Mat from there.

  Matt

>
>> Matt
>>
>>> Coloring is not attractive to me because GPUs demand a lot of
>>> parallelism, and this serial (velocity space) solver would be embedded
>>> in a full 3D code that does not use a huge amount of MPI parallelism.
>>> For instance, if the app code were to use 6 (or 7 max on SUMMIT) cores
>>> per GPU (or even 4x that with hardware threads), then *I could imagine*
>>> there would be enough parallelism, with coloring, to fuse the element
>>> construction and assembly, that is, assembling each element matrix
>>> right after it is created. That would be great in terms of not storing
>>> all these matrices and then assembling them all at once. The app that I
>>> am targeting does not use that much MPI parallelism, though. But we
>>> could explore that coloring space, and my mental model could be
>>> inaccurate. (Note, I did recently add 8x more parallelism to my code
>>> this week and got a 25% speedup, using one whole GPU.)
>>>
>>> Or if you have some sort of lower-level synchronization that could
>>> allow for fusing the assembly with the element creation, then, by all
>>> means, we can explore that.
>>>
>>> I'd be happy to work with you on this.
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Mon, Apr 13, 2020 at 7:08 PM Junchao Zhang <[email protected]> wrote:
>>>
>>>> Probably matrix assembly on GPU is more important. Do you have an
>>>> example for me to play with, to see what GPU interface we should have?
>>>> --Junchao Zhang
>>>>
>>>> On Mon, Apr 13, 2020 at 5:44 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> I was looking into assembling matrices with threads. I have a
>>>>> coloring to avoid conflicts.
>>>>>
>>>>> Turning off all the logging seems like overkill; for methods that can
>>>>> get called in a thread, we could use PETSC_HAVE_THREADSAFETY to
>>>>> protect the logging functions. One can then still get timings for the
>>>>> whole assembly process, just not for MatSetValues. Few people are
>>>>> going to do this. I don't think it will be a time sink, and if it is
>>>>> we just revert to saying 'turn logging off'. I don't see a good
>>>>> argument for insisting that logging be turned off (it is pretty
>>>>> important) if we just say that we are going to protect methods as
>>>>> needed.
>>>>>
>>>>> It is not a big deal, I am just exploring this idea. It is such a
>>>>> basic concept in shared-memory sparse linear algebra that it seems
>>>>> like a good thing to be able to support, and to have an example that
>>>>> says we can assemble matrices in threads (not that it is a great
>>>>> idea). We have all the tools (eg, coloring methods), so it is just a
>>>>> matter of protecting a few methods. I use DMPlexMatSetClosure instead
>>>>> of MatSetValues, and this is where I die now with non-thread-safe
>>>>> code. We have an idea, from Jed, on how to fix it.
>>>>>
>>>>> Anyway, thanks for your help, but I think we should hold off on doing
>>>>> anything until we have some consensus that it would be a good idea to
>>>>> put some effort into getting a thread-safe PETSc that can support OMP
>>>>> matrix assembly, with a nice compact example.
>>>>>
>>>>> Thanks again,
>>>>> Mark
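For concreteness, a minimal sketch of the colored OMP assembly described
above, under stated assumptions: a thread-safe PETSc build
(--with-threadsafety), a fixed closure size NIDX, and placeholder routines
GetCellsOfColor and ComputeElementMatrix standing in for the application's
coloring and element integration.

  #include <petscmat.h>
  #define NIDX 3 /* assumed fixed closure size, e.g. P1 triangles */

  /* placeholders for the application's coloring and element integration */
  extern void GetCellsOfColor(PetscInt color, PetscInt *ncells, const PetscInt **cells);
  extern void ComputeElementMatrix(PetscInt cell, PetscInt rows[], PetscScalar elemMat[]);

  PetscErrorCode AssembleByColor(Mat A, PetscInt ncolors)
  {
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    for (PetscInt c = 0; c < ncolors; ++c) {
      PetscInt        ncells;
      const PetscInt *cells;
      GetCellsOfColor(c, &ncells, &cells);
      /* cells of one color have disjoint closures, so the MatSetValues
         calls within a color cannot conflict; colors run one at a time */
  #pragma omp parallel for
      for (PetscInt k = 0; k < ncells; ++k) {
        PetscInt    rows[NIDX];
        PetscScalar elemMat[NIDX*NIDX];
        ComputeElementMatrix(cells[k], rows, elemMat);
        /* error code dropped inside the parallel loop for brevity */
        MatSetValues(A, NIDX, rows, NIDX, rows, elemMat, ADD_VALUES);
      }
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

Swapping DMPlexMatSetClosure in for MatSetValues here is exactly where the
cached temp arrays bite, per the discussion above.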
>>>>>
>>>>> On Mon, Apr 13, 2020 at 5:44 PM Junchao Zhang <[email protected]> wrote:
>>>>>
>>>>>> Mark,
>>>>>> I saw you had "--with-threadsafety --with-log=0". Do you really want
>>>>>> to call petsc from multiple threads (in contrast to letting petsc
>>>>>> call other libraries, e.g., BLAS, that do multithreading)? If not,
>>>>>> you can drop --with-threadsafety.
>>>>>> I have https://gitlab.com/petsc/petsc/-/merge_requests/2714 that
>>>>>> should fix your original compilation errors.
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>> On Mon, Apr 13, 2020 at 2:07 PM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html
>>>>>>>
>>>>>>> and I see this on my Mac:
>>>>>>>
>>>>>>> 14:23 1 mark/feature-xgc-interface-rebase *= ~/Codes/petsc$ ../arch-macosx-gnu-O-omp.py
>>>>>>> ===============================================================================
>>>>>>>              Configuring PETSc to compile on your system
>>>>>>> ===============================================================================
>>>>>>> Warning: PETSC_ARCH from environment does not match command-line or name of script.
>>>>>>> Warning: Using from command-line or name of script: arch-macosx-gnu-O-omp,
>>>>>>>          ignoring environment: arch-macosx-gnu-g
>>>>>>> ===============================================================================
>>>>>>> TESTING: configureLibraryOptions from PETSc.options.libraryOptions(config/PETSc/options/libraryOptions.py:37)
>>>>>>> *******************************************************************************
>>>>>>>    UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
>>>>>>> -------------------------------------------------------------------------------
>>>>>>> Must use --with-log=0 with --with-threadsafety
>>>>>>> *******************************************************************************
>>>>>>>
>>>>>>> On Mon, Apr 13, 2020 at 2:54 PM Junchao Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Mon, Apr 13, 2020 at 12:06 PM Mark Adams <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> BTW, I can build on SUMMIT with logging and OMP, apparently. I
>>>>>>>>> also seem to be able to build with debugging. Both of which are
>>>>>>>>> not allowed according to the docs. I am puzzled.
>>>>>>>>
>>>>>>>> What are "the docs"?
>>>>>>>>
>>>>>>>>> On Mon, Apr 13, 2020 at 12:05 PM Mark Adams <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think the problem is that you have to turn off logging with
>>>>>>>>>> openmp, and the (newish) GPU timers did not protect their timers.
>>>>>>>>>>
>>>>>>>>>> I don't see a good reason to require that logging be turned off
>>>>>>>>>> with OMP. We could use PETSC_HAVE_THREADSAFETY to protect the
>>>>>>>>>> logs that we care about (eg, in MatSetValues), and as users
>>>>>>>>>> discover more things that they want to call in an OMP thread
>>>>>>>>>> block, tell them to turn logging off and we will fix it when we
>>>>>>>>>> can.
>>>>>>>>>>
>>>>>>>>>> Any thoughts on the idea of letting users keep logging with
>>>>>>>>>> openmp?
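For concreteness, a sketch of the PETSC_HAVE_THREADSAFETY protection
floated above, inside a method that may be called from a thread. This is a
proposal sketch, not how the logging macros are currently handled: it
simply compiles the fine-grained event out of thread-callable paths, so
coarser events (the whole assembly) still get timed with logging left on.

  /* Inside, e.g., MatSetValues_SeqAIJ: skip just this event when built
     thread-safe, instead of forcing --with-log=0 globally. MAT_SetValues
     is the existing PETSc log event for this routine. */
  #if !defined(PETSC_HAVE_THREADSAFETY)
    ierr = PetscLogEventBegin(MAT_SetValues, mat, 0, 0, 0);CHKERRQ(ierr);
  #endif
    /* ... insertion logic ... */
  #if !defined(PETSC_HAVE_THREADSAFETY)
    ierr = PetscLogEventEnd(MAT_SetValues, mat, 0, 0, 0);CHKERRQ(ierr);
  #endif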
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 13, 2020 at 11:40 AM Junchao Zhang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes. Looks like we need to include petsclog.h. I don't know why
>>>>>>>>>>> OMP triggered the error.
>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 13, 2020 at 9:59 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Should I do an MR to fix this?

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
