I see. Thanks a lot.

--Junchao Zhang
On Sat, Jan 7, 2023 at 6:15 AM Mark Lohry wrote:

> I've worked on a few different codes doing matrix assembly on GPU
> independently of petsc. In all instances, to plug into petsc all I need
> are the device CSR pointers and some guarantee they don't move around (on
> my first try, without SetPreallocation on CPU, I saw the value array
> pointer move after the first solve). It would also be nice to have a
> guarantee there aren't any unnecessary copies, since memory constraints
> are always a concern.
>
> Here I call
>   MatCreateSeqAIJCUSPARSE
>   MatSeqAIJSetPreallocationCSR (filled using a preexisting CSR on host,
>     with the correct index arrays and zeros for values)
>   MatSeqAIJGetCSRAndMemType (grab the allocated device CSR pointers and
>     use those directly)
>
> Then in the Jacobian evaluation routine I fill that CSR directly with no
> calls to MatSetValues, just
>
>   MatAssemblyBegin(J,MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(J,MAT_FINAL_ASSEMBLY);
>
> afterwards to put it in the correct state.
>
> In this code, to fill the CSR coefficients, each GPU thread gets one row
> and fills it. No race conditions to contend with. Technically I'm
> duplicating some computations (a given dof could fill its own row and
> column), but this is much faster than the linear solver anyway.
>
> Other mesh-based codes did GPU assembly using either coloring or mutexes,
> but still just need the CSR value array to fill.
>
> On Fri, Jan 6, 2023, 9:44 PM Junchao Zhang wrote:
>
>> On Fri, Jan 6, 2023 at 7:35 PM Mark Lohry wrote:
>>
>>> Well, I think it's a moderately crazy idea unless it's less painful to
>>> implement than I'm thinking. Is there a use case for a mixed device
>>> system where one petsc executable might be addressing both a HIP and
>>> CUDA device beyond some frankenstein test system somebody cooked up?
>>> In all my code I implicitly assume I either have one host with one
>>> device or one host with zero devices. I guess you can support these
>>> weird scenarios, but why? Life is hard enough supporting one device
>>> compiler with one host compiler.
>>>
>>> Many thanks Junchao -- with combinations of SetPreallocation I was
>>> able to grab allocated pointers out of petsc. Now I have all the
>>> jacobian construction on device with no copies.
>>>
>> Hi, Mark, could you say a few words about how you assemble matrices on
>> GPUs? We ported MatSetValues-like routines to GPUs but did not continue
>> this approach, since we have to resolve data races between GPU threads.
>>
>>> On Fri, Jan 6, 2023 at 12:27 AM Barry Smith wrote:
>>>
>>>> So Jed's "everyone" now consists of "no one" and Jed can stop
>>>> complaining that "everyone" thinks it is a bad idea.
>>>>
>>>> On Jan 5, 2023, at 11:50 PM, Junchao Zhang wrote:
>>>>
>>>> On Thu, Jan 5, 2023 at 10:32 PM Barry Smith wrote:
>>>>
>>>>> > On Jan 5, 2023, at 3:42 PM, Jed Brown wrote:
>>>>> >
>>>>> > Mark Adams writes:
>>>>> >
>>>>> >> Support of HIP and CUDA hardware together would be crazy,
>>>>> >
>>>>> > I don't think it's remotely crazy. libCEED supports both together
>>>>> > and it's very convenient when testing on a development machine
>>>>> > that has one of each brand GPU and simplifies binary distribution
>>>>> > for us and every package that uses us. Every day I wish PETSc
>>>>> > could build with both simultaneously, but everyone tells me it's
>>>>> > silly.
>>>>>
>>>>> Not everyone at all; just a subset of everyone. Junchao is really
>>>>> the hold-out :-)
>>>>>
>>>> I am not; instead I think we should try (I fully agree it can ease
>>>> binary distribution). But Satish needs to install such a machine
>>>> first :)
>>>> There are issues out of our control if we want to mix GPUs in
>>>> execution. For example, how to do VecAXPY on a cuda vector and a hip
>>>> vector? Shall we do it on the host? Also, there are no gpu-aware MPI
>>>> implementations supporting messages between cuda memory and hip
>>>> memory.
>>>>
>>>>> I just don't care about "binary packages" :-); I think they are an
>>>>> archaic and bad way of thinking about code distribution (but yes,
>>>>> the alternatives need lots of work to make them flawless, but I
>>>>> think that is where the work should go in the packaging world.)
>>>>>
>>>>> I go further and think one should be able to automatically use a
>>>>> CUDA vector on a HIP device as well; it is not hard in theory but
>>>>> requires thinking about how we handle classes and subclasses a
>>>>> little to make it straightforward; or perhaps Jacob has fixed that
>>>>> also?
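
The one-row-per-thread fill Mark describes at the top of the thread can be sketched on the CPU as follows. This is a minimal illustration, not his code. The function names fill_row and fill_csr_values are hypothetical, and the 1D Laplacian stencil (2 on the diagonal, -1 off it) is an assumed stand-in for a real Jacobian. In the PETSc setting, row_ptr, col_idx, and vals would presumably be the device pointers obtained from MatSeqAIJGetCSRAndMemType, and the outer loop would be a kernel launch with one thread per row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-row routine: on a GPU this would be the kernel body,
 * with `row` computed from the thread index. It writes only to the slice
 * vals[row_ptr[row]] .. vals[row_ptr[row + 1] - 1], so two rows never
 * touch the same entry and no atomics, coloring, or mutexes are needed. */
void fill_row(int row, const int *row_ptr, const int *col_idx, double *vals)
{
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
        /* Assumed stencil: 2 on the diagonal, -1 on every other nonzero. */
        vals[k] = (col_idx[k] == row) ? 2.0 : -1.0;
    }
}

/* CPU stand-in for the kernel launch: one loop iteration per "thread". */
void fill_csr_values(int nrows, const int *row_ptr, const int *col_idx,
                     double *vals)
{
    assert(nrows >= 0 && row_ptr != NULL && col_idx != NULL && vals != NULL);
    for (int row = 0; row < nrows; ++row) {
        fill_row(row, row_ptr, col_idx, vals);
    }
}
```

Because each row owns a disjoint range of the value array, the iterations are independent and can run in any order or in parallel. The cost is the one Mark mentions: an off-diagonal coupling between two dofs is computed once in each row instead of being computed once and scattered to both.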
