Re: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver

Chang Liu via petsc-users Wed, 13 Oct 2021 18:32:56 -0700

Sorry, I am not familiar with the details either. Can you please check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?


Chang


On 10/13/21 9:24 PM, Junchao Zhang wrote:

Hi Chang,

I did the work in mumps. It is easy for me to understand gathering matrix rows to one process. But how do you gather blocks (submatrices) to form a large block? Can you draw a picture of that?
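
One possible picture of the block case, as a minimal sketch rather than a description of what mumps.c actually does: assume every rank keeps its entries as (row, column, value) triples with global indices, the coordinate format MUMPS accepts. Gathering blocks is then the same concatenation as gathering rows. The function names and the 4x4 example are hypothetical, and plain Python lists stand in for MPI ranks.

```python
# Hypothetical illustration: plain Python lists stand in for MPI ranks.

def gather_triples(per_rank_triples):
    """Concatenate (row, col, value) triples from all ranks, in rank order."""
    gathered = []
    for triples in per_rank_triples:
        gathered.extend(triples)
    return gathered


def to_dense(triples, n):
    """Build an n x n dense matrix from triples (used here only for checking)."""
    dense = [[0.0] * n for _ in range(n)]
    for i, j, v in triples:
        dense[i][j] += v
    return dense


# A 4x4 tridiagonal matrix split into 2x2 blocks. Each "rank" owns one block
# and stores its entries with GLOBAL row/column indices.
rank_blocks = [
    [(0, 0, 2.0), (0, 1, -1.0), (1, 0, -1.0), (1, 1, 2.0)],  # top-left block
    [(1, 2, -1.0)],                                           # top-right block
    [(2, 1, -1.0)],                                           # bottom-left block
    [(2, 2, 2.0), (2, 3, -1.0), (3, 2, -1.0), (3, 3, 2.0)],  # bottom-right block
]

master_triples = gather_triples(rank_blocks)
A = to_dense(master_triples, 4)
```

Because the indices are global, the master does not need to know the shape of each rank's piece; it only needs the triples.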

   Thanks
--Junchao Zhang

On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:


    Hi Barry,

    I think mumps solver in petsc does support that. You can check the
    documentation on "-mat_mumps_use_omp_threads" at

    https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html

    and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in
    functions MatMumpsSetUpDistRHSInfo and
    MatMumpsGatherNonzerosOnMaster in
    mumps.c

    1. I understand it is ideal to do one MPI rank per GPU. However, I am
    working on an existing code that was developed based on MPI, and the
    # of MPI ranks is typically equal to the # of CPU cores. We don't want to
    change the whole structure of the code.

    2. What you have suggested has been coded in mumps.c. See function
    MatMumpsSetUpDistRHSInfo.
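
To make the two points above concrete, here is a minimal sketch of the pattern under discussion: several ranks share one "master" rank per GPU, the right-hand side is gathered to the master, solved there, and the solution is scattered back. This is not the code in mumps.c. All function names are hypothetical, plain Python lists stand in for MPI ranks, and a diagonal solve stands in for the GPU triangular solves.

```python
def master_of(rank, ranks_per_gpu):
    """First rank of the group that `rank` belongs to (assumed contiguous groups)."""
    return (rank // ranks_per_gpu) * ranks_per_gpu


def gather_rhs(local_parts):
    """Concatenate per-rank pieces of the right-hand side; remember their sizes."""
    sizes = [len(part) for part in local_parts]
    full = [x for part in local_parts for x in part]
    return full, sizes


def scatter_solution(full, sizes):
    """Split the solution back into pieces with the original per-rank sizes."""
    parts, start = [], 0
    for n in sizes:
        parts.append(full[start:start + n])
        start += n
    return parts


def solve_on_master(rhs, diag):
    """Stand-in for the GPU solve: solves a diagonal system only."""
    return [b / d for b, d in zip(rhs, diag)]


# Three ranks in one group, each holding a piece of the right-hand side.
local_rhs = [[2.0, 4.0], [6.0], [8.0, 10.0, 12.0]]
full_rhs, sizes = gather_rhs(local_rhs)
full_x = solve_on_master(full_rhs, [2.0] * len(full_rhs))
local_x = scatter_solution(full_x, sizes)

# With 8 ranks and 4 ranks per GPU, ranks 0 and 4 act as masters.
masters = [master_of(r, 4) for r in range(8)]
```

The gather and scatter happen on every solve, which is the per-solve communication cost raised in the quoted reply below.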

    Regards,

    Chang

    On 10/13/21 7:53 PM, Barry Smith wrote:
     >
     >
     >> On Oct 13, 2021, at 3:50 PM, Chang Liu <c...@pppl.gov> wrote:
     >>
     >> Hi Barry,
     >>
     >> That is exactly what I want.
     >>
     >> Back to my original question, I am looking for an approach to transfer
     >> matrix data from many MPI processes to "master" MPI processes, each of
     >> which takes care of one GPU, and then upload the data to the GPU to
     >> solve.
     >> One can just reuse some code from mumps.c in aijcusparse.cu.
     >
     >    mumps.c doesn't actually do that. It never needs to copy the entire
     >    matrix to a single MPI rank.
     >
     >    It would be possible to write the code you suggest, but it is not
     >    clear that it makes sense:
     >
     > 1)  For normal PETSc GPU usage there is one GPU per MPI rank, so while
     >     your one GPU per big domain is solving its systems, the other GPUs
     >     (with the other MPI ranks that share that domain) are doing nothing.
     >
     > 2)  For each triangular solve you would have to gather the right-hand
     >     side from the multiple ranks to the single GPU to pass it to the GPU
     >     solver, and then scatter the resulting solution back to all of its
     >     subdomain ranks.
     >
     >    What I was suggesting was to assign an entire subdomain to a single
     >    MPI rank, so that it does everything on one GPU and can use the GPU
     >    solver directly. If all the major computations of a subdomain can fit
     >    and be done on a single GPU, then you would be utilizing all the GPUs
     >    you are using effectively.
     >
     >    Barry
     >
     >
     >
     >>
     >> Chang
     >>
     >> On 10/13/21 1:53 PM, Barry Smith wrote:
     >>>    Chang,
     >>>      You are correct: there are no MPI + GPU direct solvers that I am
     >>>      aware of that currently do the triangular solves with MPI + GPU
     >>>      parallelism. You are limited to doing each individual triangular
     >>>      solve on a single GPU. I can only suggest making each subdomain as
     >>>      big as possible to utilize each GPU as much as possible for the
     >>>      direct triangular solves.
     >>>     Barry
     >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
     >>>>
     >>>> Hi Mark,
     >>>>
     >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers,
     >>>> but with -pc_factor_mat_solver_type cusparse, it will give an error.
     >>>>
     >>>> Yes, what I want is to have mumps or superlu do the factorization, and
     >>>> then do the rest, including the GMRES solver, on the GPU. Is that
     >>>> possible?
     >>>>
     >>>> I have tried to use aijcusparse with superlu_dist. It runs, but the
     >>>> iterative solver is still running on CPUs. I have contacted the superlu
     >>>> group and they confirmed that is the case right now. But if I set
     >>>> -pc_factor_mat_solver_type cusparse, it seems that the iterative solver
     >>>> is running on the GPU.
     >>>>
     >>>> Chang
     >>>>
     >>>> On 10/13/21 12:03 PM, Mark Adams wrote:
     >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <c...@pppl.gov> wrote:
     >>>>>     Thank you Junchao for explaining this. I guess in my case the code is
     >>>>>     just calling a seq solver like superlu to do factorization on GPUs.
     >>>>>     My idea is that I want to have a traditional MPI code to utilize GPUs
     >>>>>     with cusparse. Right now cusparse does not support mpiaij matrix,
     >>>>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
     >>>>> matrix with > 1 processes.
     >>>>> (-mat_type mpiaijcusparse might also work with >1 proc).
     >>>>> However, I see in grepping the repo that all the mumps and superlu
     >>>>> tests use aij or sell matrix type.
     >>>>> MUMPS and SuperLU provide their own solves, I assume .... but you might
     >>>>> want to do other matrix operations on the GPU. Is that the issue?
     >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a
     >>>>> problem? (no test with it so it probably does not work)
     >>>>> Thanks,
     >>>>> Mark
     >>>>>     so I want the code to have a mpiaij matrix when adding all the matrix
     >>>>>     terms, and then transform the matrix to seqaij when doing the
     >>>>>     factorization and solve. This involves sending the data to the master
     >>>>>     process, and I think the petsc mumps solver has something similar
     >>>>>     already.
     >>>>>     Chang
     >>>>>     On 10/13/21 10:18 AM, Junchao Zhang wrote:
     >>>>>      >
     >>>>>      >
     >>>>>      >
     >>>>>      > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfad...@lbl.gov> wrote:
     >>>>>      >
     >>>>>      >
     >>>>>      >
     >>>>>      >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <c...@pppl.gov> wrote:
     >>>>>      >
     >>>>>      >         Hi Mark,
     >>>>>      >
     >>>>>      >         The option I use is like
     >>>>>      >
     >>>>>      >         -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
     >>>>>      >         -mat_type aijcusparse
     >>>>>      >         *-sub_pc_factor_mat_solver_type cusparse*
     >>>>>      >         -sub_ksp_type preonly *-sub_pc_type lu* -ksp_max_it 2000
     >>>>>      >         -ksp_rtol 1.e-300 -ksp_atol 1.e-300
     >>>>>      >
     >>>>>      >
     >>>>>      >     Note, if you use -log_view, the last column (rows are the
     >>>>>      >     methods, like MatFactorNumeric) has the percent of work
     >>>>>      >     done on the GPU.
     >>>>>      >
     >>>>>      >     Junchao: *This* implies that we have a cuSparse LU
     >>>>>      >     factorization. Is that correct? (I don't think we do)
     >>>>>      >
     >>>>>      > No, we don't have cuSparse LU factorization.  If you check
     >>>>>      > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
     >>>>>      > MatLUFactorSymbolic_SeqAIJ() instead.
     >>>>>      > So I don't understand Chang's idea. Do you want to make bigger
     >>>>>      > blocks?
     >>>>>      >
     >>>>>      >
     >>>>>      >         I think this one does both factorization and solve on
     >>>>>      >         the GPU.
     >>>>>      >
     >>>>>      >         You can check the runex72_aijcusparse.sh file in the
     >>>>>      >         petsc install directory, and try it yourself (this is
     >>>>>      >         only lu factorization without iterative solve).
     >>>>>      >
     >>>>>      >         Chang
     >>>>>      >
     >>>>>      >         On 10/12/21 1:17 PM, Mark Adams wrote:
     >>>>>      >          >
     >>>>>      >          >
     >>>>>      >          > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <c...@pppl.gov> wrote:
     >>>>>      >          >
     >>>>>      >          >     Hi Junchao,
     >>>>>      >          >
     >>>>>      >          >     No, I only need it to be transferred within a
     >>>>>      >          >     node. I use the block-Jacobi method and GMRES to
     >>>>>      >          >     solve the sparse matrix, so each direct solver
     >>>>>      >          >     will take care of a sub-block of the whole
     >>>>>      >          >     matrix. In this way, I can use one GPU to solve
     >>>>>      >          >     one sub-block, which is stored within one node.
     >>>>>      >          >
     >>>>>      >          >     It was stated in the documentation that the
     >>>>>      >          >     cusparse solver is slow. However, in my test
     >>>>>      >          >     using ex72.c, the cusparse solver is faster than
     >>>>>      >          >     mumps or superlu_dist on CPUs.
     >>>>>      >          >
     >>>>>      >          >
     >>>>>      >          > Are we talking about the factorization, the solve, or
     >>>>>      >          > both?
     >>>>>      >          >
     >>>>>      >          > We do not have an interface to cuSparse's LU
     >>>>>      >          > factorization (I just learned that it exists a few
     >>>>>      >          > weeks ago).
     >>>>>      >          > Perhaps your fast "cusparse solver" is '-pc_type lu
     >>>>>      >          > -mat_type aijcusparse'? This would be the CPU
     >>>>>      >          > factorization, which is the dominant cost.
     >>>>>      >          >
     >>>>>      >          >
     >>>>>      >          >     Chang
     >>>>>      >          >
     >>>>>      >          >     On 10/12/21 10:24 AM, Junchao Zhang wrote:
     >>>>>      >          >      > Hi, Chang,
     >>>>>      >          >      >     For the mumps solver, we usually transfer
     >>>>>      >          >      > matrix and vector data within a compute node.
     >>>>>      >          >      > For the idea you propose, it looks like we
     >>>>>      >          >      > need to gather data within MPI_COMM_WORLD,
     >>>>>      >          >      > right?
     >>>>>      >          >      >
     >>>>>      >          >      >     Mark, I remember you said cusparse solve
     >>>>>      >          >      > is slow and you would rather do it on CPU. Is
     >>>>>      >          >      > that right?
     >>>>>      >          >      >
     >>>>>      >          >      > --Junchao Zhang
     >>>>>      >          >      >
     >>>>>      >          >      >
     >>>>>      >          >      > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
     >>>>>      >          >      >
     >>>>>      >          >      >     Hi,
     >>>>>      >          >      >
     >>>>>      >          >      >     Currently, it is possible to use the mumps
     >>>>>      >          >      >     solver in PETSc with the
     >>>>>      >          >      >     -mat_mumps_use_omp_threads option, so that
     >>>>>      >          >      >     multiple MPI processes will transfer the
     >>>>>      >          >      >     matrix and rhs data to the master rank, and
     >>>>>      >          >      >     then the master rank will call mumps with
     >>>>>      >          >      >     OpenMP to solve the matrix.
     >>>>>      >          >      >
     >>>>>      >          >      >     I wonder if someone can develop a similar
     >>>>>      >          >      >     option for the cusparse solver. Right now,
     >>>>>      >          >      >     this solver does not work with
     >>>>>      >          >      >     mpiaijcusparse. I think a possible
     >>>>>      >          >      >     workaround is to transfer all the matrix
     >>>>>      >          >      >     data to one MPI process, and then upload
     >>>>>      >          >      >     the data to the GPU to solve. In this way,
     >>>>>      >          >      >     one can use the cusparse solver for an MPI
     >>>>>      >          >      >     program.
     >>>>>      >          >      >
     >>>>>      >          >      >     Chang
     >>>>>      >          >      >     --
     >>>>>      >          >      >     Chang Liu
     >>>>>      >          >      >     Staff Research Physicist
     >>>>>      >          >      >     +1 609 243 3438
     >>>>>      >          >      >     c...@pppl.gov
     >>>>>      >          >      >     Princeton Plasma Physics Laboratory
     >>>>>      >          >      >     100 Stellarator Rd, Princeton NJ 08540, USA
     >>>>>      >          >      >
     >>>>>      >          >
     >>>>>      >          >
     >>>>>      >
     >>>>>      >
     >>>>
     >>
     >

--Chang Liu

    Staff Research Physicist
    +1 609 243 3438
    c...@pppl.gov <mailto:c...@pppl.gov>
    Princeton Plasma Physics Laboratory
    100 Stellarator Rd, Princeton NJ 08540, USA


--
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA
