Hi Chang,

I did the work in the mumps interface. Gathering matrix rows to one process is easy for me to understand, but how do you gather blocks (submatrices) to form a large block? Can you draw a picture of that?

Thanks,
--Junchao Zhang
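One way to do the kind of block gather being asked about here is MatCreateSubMatrices(), which extracts submatrices defined by index sets and returns them as sequential matrices on the requesting ranks. The sketch below is only an illustration of that idea, not code from this thread and not what mumps.c actually does; the helper name GatherBlockToMaster and its calling convention (only the designated master rank passes a non-empty index list) are invented for the example.

    #include <petscmat.h>

    /* Gather the block of the parallel matrix A whose global row/column
       indices are rows[0..n-1] onto the calling rank as a sequential matrix.
       Collective: every rank must call this; non-master ranks pass n = 0 and
       receive an empty 0 x 0 matrix. */
    PetscErrorCode GatherBlockToMaster(Mat A, PetscInt n, const PetscInt rows[], Mat **block)
    {
      PetscErrorCode ierr;
      IS             is;

      PetscFunctionBeginUser;
      ierr = ISCreateGeneral(PETSC_COMM_SELF, n, rows, PETSC_COPY_VALUES, &is);CHKERRQ(ierr);
      ierr = ISSort(is);CHKERRQ(ierr);   /* submatrix extraction expects sorted indices */
      /* on the master rank, (*block)[0] is a SEQAIJ matrix holding the whole block */
      ierr = MatCreateSubMatrices(A, 1, &is, &is, MAT_INITIAL_MATRIX, block);CHKERRQ(ierr);
      ierr = ISDestroy(&is);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

When the gathered block is no longer needed, it should be released with MatDestroySubMatrices(1, block).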
On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:

> Hi Barry,
>
> I think the mumps solver in petsc does support that. You can check the documentation on "-mat_mumps_use_omp_threads" at
>
> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>
> and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in the functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in mumps.c.
>
> 1. I understand it is ideal to do one MPI rank per GPU. However, I am working on an existing code that was developed based on MPI, and the # of mpi ranks is typically equal to the # of cpu cores. We don't want to change the whole structure of the code.
>
> 2. What you have suggested has been coded in mumps.c. See the function MatMumpsSetUpDistRHSInfo.
>
> Regards,
>
> Chang
>
> On 10/13/21 7:53 PM, Barry Smith wrote:
>>
>>> On Oct 13, 2021, at 3:50 PM, Chang Liu <c...@pppl.gov> wrote:
>>>
>>> Hi Barry,
>>>
>>> That is exactly what I want.
>>>
>>> Back to my original question, I am looking for an approach to transfer matrix data from many MPI processes to "master" MPI processes, each of which takes care of one GPU, and then upload the data to the GPU to solve. One can just grab some code from mumps.c into aijcusparse.cu.
>>
>> mumps.c doesn't actually do that. It never needs to copy the entire matrix to a single MPI rank.
>>
>> It would be possible to write the code you suggest, but it is not clear that it makes sense:
>>
>> 1) For normal PETSc GPU usage there is one GPU per MPI rank, so while your one GPU per big domain is solving its systems, the other GPUs (with the other MPI ranks that share that domain) are doing nothing.
>>
>> 2) For each triangular solve you would have to gather the right-hand side from the multiple ranks to the single GPU to pass it to the GPU solver, and then scatter the resulting solution back to all of its subdomain ranks.
>>
>> What I was suggesting was to assign an entire subdomain to a single MPI rank, so that it does everything on one GPU and can use the GPU solver directly. If all the major computations of a subdomain can fit and be done on a single GPU, then you would be utilizing all the GPUs you are using effectively.
>>
>> Barry
>>
>>> Chang
>>>
>>> On 10/13/21 1:53 PM, Barry Smith wrote:
>>>> Chang,
>>>>
>>>> You are correct: there are no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited to having each individual triangular solve done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves.
>>>>
>>>> Barry
>>>>
>>>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse it will give an error.
>>>>>
>>>>> Yes, what I want is to have mumps or superlu do the factorization, and then do the rest, including the GMRES solver, on the gpu. Is that possible?
>>>>>
>>>>> I have tried to use aijcusparse with superlu_dist; it runs, but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on the GPU.
>>>>>
>>>>> Chang
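To make Barry's point (2) above concrete: the per-solve gather of the right-hand side and scatter of the solution could be written with VecScatterCreateToZero(), which collects a parallel vector onto rank 0 of its communicator (a per-subdomain subcommunicator in the setup discussed here). This is a sketch of mine, not code from the thread; it assumes the solution vector x has the same parallel layout as b, and the GPU solve itself is left as a placeholder.

    #include <petscvec.h>

    /* Sketch: gather b onto rank 0, let that rank solve on its GPU, then
       scatter the solution back to the distributed vector x. */
    PetscErrorCode GatherSolveScatter(Vec b, Vec x)
    {
      PetscErrorCode ierr;
      VecScatter     ctx;
      Vec            bseq;   /* full length on rank 0, length 0 elsewhere */

      PetscFunctionBeginUser;
      ierr = VecScatterCreateToZero(b, &ctx, &bseq);CHKERRQ(ierr);
      ierr = VecScatterBegin(ctx, b, bseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, b, bseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

      /* ... rank 0 uploads bseq to the GPU, performs the triangular solves,
         and overwrites bseq with the solution ... */

      ierr = VecScatterBegin(ctx, bseq, x, INSERT_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, bseq, x, INSERT_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);
      ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
      ierr = VecDestroy(&bseq);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

This pair of communications would be paid on every triangular solve, which is part of why Barry suggests assigning a whole subdomain to a single rank instead.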
>>>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>>>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <c...@pppl.gov> wrote:
>>>>>>
>>>>>>> Thank you Junchao for explaining this. I guess in my case the code is just calling a seq solver like superlu to do the factorization on GPUs.
>>>>>>>
>>>>>>> My idea is that I want to have a traditional MPI code utilize GPUs with cusparse. Right now cusparse does not support the mpiaij matrix,
>>>>>>
>>>>>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. (-mat_type mpiaijcusparse might also work with > 1 proc.)
>>>>>>
>>>>>> However, I see in grepping the repo that all the mumps and superlu tests use the aij or sell matrix type.
>>>>>>
>>>>>> MUMPS and SuperLU provide their own solves, I assume ... but you might want to do other matrix operations on the GPU. Is that the issue?
>>>>>>
>>>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a problem? (There is no test with it, so it probably does not work.)
>>>>>>
>>>>>> Thanks,
>>>>>> Mark
>>>>>>
>>>>>>> so I want the code to have a mpiaij matrix when adding all the matrix terms, and then transform the matrix to seqaij when doing the factorization and solve. This involves sending the data to the master process, and I think the petsc mumps solver has something similar already.
>>>>>>>
>>>>>>> Chang
>>>>>>>
>>>>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>>>>>>>> On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <c...@pppl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>>
>>>>>>>>>> The option I use is like
>>>>>>>>>>
>>>>>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>>>>>>>>>
>>>>>>>>> Note, if you use -log_view the last column (rows are the method, like MatFactorNumeric) has the percent of work on the GPU.
>>>>>>>>>
>>>>>>>>> Junchao: *This* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do.)
>>>>>>>>
>>>>>>>> No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead. So I don't understand Chang's idea. Do you want to make bigger blocks?
>>>>>>>>
>>>>>>>>>> I think this one does both the factorization and the solve on the gpu.
>>>>>>>>>>
>>>>>>>>>> You can check the runex72_aijcusparse.sh file in the petsc install directory and try it yourself (this is only the lu factorization, without the iterative solve).
>>>>>>>>>>
>>>>>>>>>> Chang
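The option line Chang quotes above can also be set up in code. The sketch below is my own illustration, not code from the thread: it assumes the KSP already has its operator attached (an aijcusparse matrix), and the helper name SetupBJacobiCusparse is invented. Note that, as Junchao and Mark point out, the cusparse solver type still performs the LU factorization on the CPU (MatLUFactorSymbolic_SeqAIJ is what gets called); the triangular solves are what run on the GPU.

    #include <petscksp.h>

    /* Mirror of: -ksp_type fgmres -ksp_rtol 1.e-300 -ksp_atol 1.e-300
       -ksp_max_it 2000 -pc_type bjacobi -pc_bjacobi_blocks 16
       -sub_ksp_type preonly -sub_pc_type lu
       -sub_pc_factor_mat_solver_type cusparse */
    PetscErrorCode SetupBJacobiCusparse(KSP ksp)
    {
      PetscErrorCode ierr;
      PC             pc, subpc;
      KSP           *subksp;
      PetscInt       i, nlocal;

      PetscFunctionBeginUser;
      ierr = KSPSetType(ksp, KSPFGMRES);CHKERRQ(ierr);
      ierr = KSPSetTolerances(ksp, 1.e-300, 1.e-300, PETSC_DEFAULT, 2000);CHKERRQ(ierr);
      ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
      ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);
      ierr = PCBJacobiSetTotalBlocks(pc, 16, NULL);CHKERRQ(ierr);
      ierr = KSPSetUp(ksp);CHKERRQ(ierr);                 /* required before querying the sub-KSPs */
      ierr = PCBJacobiGetSubKSP(pc, &nlocal, NULL, &subksp);CHKERRQ(ierr);
      for (i = 0; i < nlocal; i++) {
        ierr = KSPSetType(subksp[i], KSPPREONLY);CHKERRQ(ierr);
        ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
        ierr = PCSetType(subpc, PCLU);CHKERRQ(ierr);
        ierr = PCFactorSetMatSolverType(subpc, MATSOLVERCUSPARSE);CHKERRQ(ierr);
      }
      PetscFunctionReturn(0);
    }

In practice the same configuration is obtained by passing the options on the command line and calling KSPSetFromOptions().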
>>>>>>>>>> On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>>>>>> On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <c...@pppl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Junchao,
>>>>>>>>>>>>
>>>>>>>>>>>> No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
>>>>>>>>>>>>
>>>>>>>>>>>> It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
>>>>>>>>>>>
>>>>>>>>>>> Are we talking about the factorization, the solve, or both?
>>>>>>>>>>>
>>>>>>>>>>> We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).
>>>>>>>>>>>
>>>>>>>>>>> Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
>>>>>>>>>>>
>>>>>>>>>>>> Chang
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>>>>>>>>> Hi, Chang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mark, I remember you said the cusparse solve is slow and you would rather do it on the CPU. Is that right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, it is possible to use the mumps solver in PETSC with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wonder if someone can develop a similar option for the cusparse solver.
>>>>>>>>>>>>>> Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to the GPU to solve. In this way, one can use the cusparse solver for an MPI program.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Chang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Chang Liu
>>>>>>>>>>>>>> Staff Research Physicist
>>>>>>>>>>>>>> +1 609 243 3438
>>>>>>>>>>>>>> c...@pppl.gov
>>>>>>>>>>>>>> Princeton Plasma Physics Laboratory
>>>>>>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
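For reference, the MUMPS path described in the original question above is driven entirely by runtime options; a minimal sketch is below. The option names come from the thread and the MATSOLVERMUMPS documentation; the thread count of 4 is only an example, the feature requires a PETSc build with OpenMP support (PETSC_HAVE_OPENMP_SUPPORT), and the same options are more commonly given directly on the command line.

    #include <petscsys.h>

    /* Sketch: select a MUMPS LU factorization and enable the gather-to-master
       + OpenMP path.  Call after PetscInitialize() and before the solver's
       SetFromOptions() call.  The "4" (OpenMP threads per master rank) is
       only an example value. */
    PetscErrorCode RequestMumpsWithOmpThreads(void)
    {
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = PetscOptionsSetValue(NULL, "-pc_type", "lu");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL, "-pc_factor_mat_solver_type", "mumps");CHKERRQ(ierr);
      ierr = PetscOptionsSetValue(NULL, "-mat_mumps_use_omp_threads", "4");CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }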