You need to use PCTELESCOPE inside the block Jacobi, not outside it. So 
something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu
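
To spell out the nesting (a sketch only, untested): the prefix chain is -sub_ 
for the block Jacobi inner solver and -sub_telescope_ for the solver inside the 
telescope. With the options from your large-matrix run, that gives something like

mpiexec -n 16 ./ex7 -m 400 -ksp_monitor_short -ksp_type fgmres \
  -mat_type aijcusparse -pc_type bjacobi -pc_bjacobi_blocks 4 \
  -sub_pc_type telescope -sub_pc_telescope_reduction_factor 4 \
  -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu \
  -sub_telescope_pc_factor_mat_solver_type cusparse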

> On Oct 14, 2021, at 4:14 PM, Chang Liu <c...@pppl.gov> wrote:
> 
> Hi Pierre,
> 
> I wonder if the trick of PCTELESCOPE only works for the preconditioner and 
> not for the solver. I have done some tests and found that, for solving a small 
> matrix using -telescope_ksp_type preonly, it does work on the GPU with multiple 
> MPI processes. However, for bjacobi and gmres, it does not work.
> 
> The command line options I used for the small matrix are like
> 
> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type 
> telescope -mat_type aijcusparse -telescope_pc_type lu 
> -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly 
> -pc_telescope_reduction_factor 4
> 
> which gives the correct output. For the iterative solver, I tried
> 
> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type 
> bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse 
> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly 
> -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse 
> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 
> -ksp_atol 1.e-20
> 
> for the large matrix. The output is like
> 
>  0 KSP Residual norm 40.1497
>  1 KSP Residual norm < 1.e-11
> Norm of error 400.999 iterations 1
> 
> So it seems to call a direct solver instead of an iterative one.
> 
> Can you please help check these options?
> 
> Chang
> 
> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
>>> On 14 Oct 2021, at 3:50 PM, Chang Liu <c...@pppl.gov> wrote:
>>> 
>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like 
>>> exactly what I need. I wonder if PCTELESCOPE can transform an mpiaijcusparse 
>>> matrix to seqaijcusparse? Or do I have to do it manually?
>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
>> 1) I’m not sure this is implemented for cuSparse matrices, but it should be;
>> 2) at least for the implementations 
>> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and 
>> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ 
>> (resp. MATAIJ). Constructors are usually “smart” enough to detect if the MPI 
>> communicator on which the Mat lives is of size 1 (your case), and then the 
>> resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need 
>> to worry about the transformation you are mentioning.
>> If you try this out and this does not work, please provide the backtrace 
>> (probably something like “Operation XYZ not implemented for MatType ABC”), 
>> and hopefully someone can add the missing plumbing.
>> I do not claim that this will be efficient, but I think this goes in the 
>> direction of what you want to achieve.
>> Thanks,
>> Pierre
>>> Chang
>>> 
>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
>>>> Maybe I’m missing something, but can’t you use PCTELESCOPE as a subdomain 
>>>> solver, with a reduction factor equal to the number of MPI processes you 
>>>> have per block?
>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X 
>>>> -sub_telescope_pc_type lu
>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only 
>>>> does the Mat need to be redistributed, but the secondary processes also need 
>>>> to be “converted” to OpenMP threads.
>>>> Thus the need for specific code in mumps.c.
>>>> Thanks,
>>>> Pierre
>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users 
>>>>> <petsc-users@mcs.anl.gov> wrote:
>>>>> 
>>>>> Hi Junchao,
>>>>> 
>>>>> Yes that is what I want.
>>>>> 
>>>>> Chang
>>>>> 
>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote:
>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>       Junchao,
>>>>>>          If I understand correctly Chang is using the block Jacobi
>>>>>>    method with a single block for a number of MPI ranks and a direct
>>>>>>    solver for each block so it uses PCSetUp_BJacobi_Multiproc() which
>>>>>>    is code Hong Zhang wrote a number of years ago for CPUs. For their
>>>>>>    particular problems this preconditioner works well, but using an
>>>>>>    iterative solver on the blocks does not work well.
>>>>>>          If we had complete MPI-GPU direct solvers he could just use
>>>>>>    the current code with MPIAIJCUSPARSE on each block but since we do
>>>>>>    not he would like to use a single GPU for each block, this means
>>>>>>    that diagonal blocks of the global parallel MPI matrix need to be
>>>>>>    sent to a subset of the GPUs (one GPU per block, which has multiple
>>>>>>    MPI ranks associated with the blocks). Similarly for the triangular
>>>>>>    solves, the blocks of the right hand side need to be shipped to the
>>>>>>    appropriate GPU and the resulting solution shipped back to the
>>>>>>    multiple GPUs. So Chang is absolutely correct, this is somewhat like
>>>>>>    your code for MUMPS with OpenMP.
>>>>>> OK, I now understand the background.
>>>>>>    One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the
>>>>>>    MPI ranks and then shrink each block down to a single GPU but this
>>>>>>    would be pretty inefficient, ideally one would go directly from the
>>>>>>    big MPI matrix on all the GPUs to the sub matrices on the subset of
>>>>>>    GPUs. But this may be a large coding project.
>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, 
>>>>>> we just need to move each block (submatrix) living over multiple MPI 
>>>>>> ranks to one of them and solve directly there.  In other words, we keep 
>>>>>> blocks' size, no shrinking or expanding.
>>>>>> As mentioned before, cusparse does not provide LU factorization. So the 
>>>>>> LU factorization would be done on CPU, and the solve be done on GPU. I 
>>>>>> assume Chang wants to gain from the (potential) faster solve (instead of 
>>>>>> factorization) on GPU.
>>>>>>       Barry
>>>>>>    Since the matrices being factored and solved directly are relatively
>>>>>>    large it is possible that the cusparse code could be reasonably
>>>>>>    efficient (they are not the tiny problems one gets at the coarse
>>>>>>    level of multigrid). Of course, this is speculation, I don't
>>>>>>    actually know how much better the cusparse code would be on the
>>>>>>    direct solver than a good CPU direct sparse solver.
>>>>>>     > On Oct 13, 2021, at 9:32 PM, Chang Liu <c...@pppl.gov> wrote:
>>>>>>     >
>>>>>>     > Sorry I am not familiar with the details either. Can you please
>>>>>>    check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
>>>>>>     >
>>>>>>     > Chang
>>>>>>     >
>>>>>>     > On 10/13/21 9:24 PM, Junchao Zhang wrote:
>>>>>>     >> Hi Chang,
>>>>>>     >>   I did the work in mumps. It is easy for me to understand
>>>>>>    gathering matrix rows to one process.
>>>>>>     >>   But how to gather blocks (submatrices) to form a large block?   
>>>>>>   Can you draw a picture of that?
>>>>>>     >>   Thanks
>>>>>>     >> --Junchao Zhang
>>>>>>     >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users
>>>>>>     >>    <petsc-users@mcs.anl.gov> wrote:
>>>>>>     >>    Hi Barry,
>>>>>>     >>    I think mumps solver in petsc does support that. You can
>>>>>>    check the
>>>>>>     >>    documentation on "-mat_mumps_use_omp_threads" at
>>>>>>     >>    https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>>>>>>     >>    and the code enclosed by #if
>>>>>>    defined(PETSC_HAVE_OPENMP_SUPPORT) in
>>>>>>     >>    functions MatMumpsSetUpDistRHSInfo and
>>>>>>     >>    MatMumpsGatherNonzerosOnMaster in
>>>>>>     >>    mumps.c
>>>>>>     >>    1. I understand it is ideal to do one MPI rank per GPU.
>>>>>>    However, I am
>>>>>>     >>    working on an existing code that was developed based on MPI
>>>>>>    and the
>>>>>>     >>    # of mpi ranks is typically equal to # of cpu cores. We don't
>>>>>>    want to
>>>>>>     >>    change the whole structure of the code.
>>>>>>     >>    2. What you have suggested has been coded in mumps.c. See
>>>>>>    function
>>>>>>     >>    MatMumpsSetUpDistRHSInfo.
>>>>>>     >>    Regards,
>>>>>>     >>    Chang
>>>>>>     >>    On 10/13/21 7:53 PM, Barry Smith wrote:
>>>>>>     >>     >
>>>>>>     >>     >
>>>>>>     >>     >> On Oct 13, 2021, at 3:50 PM, Chang Liu <c...@pppl.gov> wrote:
>>>>>>     >>     >>
>>>>>>     >>     >> Hi Barry,
>>>>>>     >>     >>
>>>>>>     >>     >> That is exactly what I want.
>>>>>>     >>     >>
>>>>>>     >>     >> Back to my original question, I am looking for an approach 
>>>>>> to
>>>>>>     >>    transfer
>>>>>>     >>     >> matrix
>>>>>>     >>     >> data from many MPI processes to "master" MPI
>>>>>>     >>     >> processes, each of which taking care of one GPU, and then
>>>>>>    upload
>>>>>>     >>    the data to GPU to
>>>>>>     >>     >> solve.
>>>>>>     >>     >> One can just grab some codes from mumps.c to aijcusparse.cu.
>>>>>>     >>     >
>>>>>>     >>     >    mumps.c doesn't actually do that. It never needs to
>>>>>>    copy the
>>>>>>     >>    entire matrix to a single MPI rank.
>>>>>>     >>     >
>>>>>>     >>     >    It would be possible to write such a code that you
>>>>>>    suggest but
>>>>>>     >>    it is not clear that it makes sense
>>>>>>     >>     >
>>>>>>     >>     > 1)  For normal PETSc GPU usage there is one GPU per MPI
>>>>>>    rank, so
>>>>>>     >>    while your one GPU per big domain is solving its systems the
>>>>>>    other
>>>>>>     >>    GPUs (with the other MPI ranks that share that domain) are 
>>>>>> doing
>>>>>>     >>    nothing.
>>>>>>     >>     >
>>>>>>     >>     > 2) For each triangular solve you would have to gather the
>>>>>>    right
>>>>>>     >>    hand side from the multiple ranks to the single GPU to pass it 
>>>>>> to
>>>>>>     >>    the GPU solver and then scatter the resulting solution back
>>>>>>    to all
>>>>>>     >>    of its subdomain ranks.
>>>>>>     >>     >
>>>>>>     >>     >    What I was suggesting was assign an entire subdomain to a
>>>>>>     >>    single MPI rank, thus it does everything on one GPU and can
>>>>>>    use the
>>>>>>     >>    GPU solver directly. If all the major computations of a 
>>>>>> subdomain
>>>>>>     >>    can fit and be done on a single GPU then you would be
>>>>>>    utilizing all
>>>>>>     >>    the GPUs you are using effectively.
>>>>>>     >>     >
>>>>>>     >>     >    Barry
>>>>>>     >>     >
>>>>>>     >>     >
>>>>>>     >>     >
>>>>>>     >>     >>
>>>>>>     >>     >> Chang
>>>>>>     >>     >>
>>>>>>     >>     >> On 10/13/21 1:53 PM, Barry Smith wrote:
>>>>>>     >>     >>>    Chang,
>>>>>>     >>     >>>      You are correct there are no MPI + GPU direct
>>>>>>    solvers that
>>>>>>     >>    currently do the triangular solves with MPI + GPU parallelism
>>>>>>    that I
>>>>>>     >>    am aware of. You are limited that individual triangular solves 
>>>>>> be
>>>>>>     >>    done on a single GPU. I can only suggest making each subdomain 
>>>>>> as
>>>>>>     >>    big as possible to utilize each GPU as much as possible for the
>>>>>>     >>    direct triangular solves.
>>>>>>     >>     >>>     Barry
>>>>>>     >>     >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users
>>>>>>     >>     >>>> <petsc-users@mcs.anl.gov> wrote:
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> Hi Mark,
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> '-mat_type aijcusparse' works with mpiaijcusparse with
>>>>>>    other
>>>>>>     >>    solvers, but with -pc_factor_mat_solver_type cusparse, it
>>>>>>    will give
>>>>>>     >>    an error.
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> Yes what I want is to have mumps or superlu to do the
>>>>>>     >>    factorization, and then do the rest, including GMRES solver,
>>>>>>    on gpu.
>>>>>>     >>    Is that possible?
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> I have tried to use aijcusparse with superlu_dist, it
>>>>>>    runs but
>>>>>>     >>    the iterative solver is still running on CPUs. I have
>>>>>>    contacted the
>>>>>>     >>    superlu group and they confirmed that is the case right now.
>>>>>>    But if
>>>>>>     >>    I set -pc_factor_mat_solver_type cusparse, it seems that the
>>>>>>     >>    iterative solver is running on GPU.
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> Chang
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>>>>>>     >>     >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu
>>>>>>     >>     >>>>> <c...@pppl.gov> wrote:
>>>>>>     >>     >>>>>     Thank you Junchao for explaining this. I guess in
>>>>>>    my case
>>>>>>     >>    the code is
>>>>>>     >>     >>>>>     just calling a seq solver like superlu to do
>>>>>>     >>    factorization on GPUs.
>>>>>>     >>     >>>>>     My idea is that I want to have a traditional MPI
>>>>>>    code to
>>>>>>     >>    utilize GPUs
>>>>>>     >>     >>>>>     with cusparse. Right now cusparse does not support
>>>>>>    mpiaij
>>>>>>     >>    matrix, Sure it does: '-mat_type aijcusparse' will give you an
>>>>>>     >>    mpiaijcusparse matrix with > 1 processes.
>>>>>>     >>     >>>>> (-mat_type mpiaijcusparse might also work with >1 proc).
>>>>>>     >>     >>>>> However, I see in grepping the repo that all the mumps 
>>>>>> and
>>>>>>     >>    superlu tests use aij or sell matrix type.
>>>>>>     >>     >>>>> MUMPS and SuperLU provide their own solves, I assume
>>>>>>    .... but
>>>>>>     >>    you might want to do other matrix operations on the GPU. Is
>>>>>>    that the
>>>>>>     >>    issue?
>>>>>>     >>     >>>>> Did you try -mat_type aijcusparse with MUMPS and/or
>>>>>>    SuperLU
>>>>>>     >>    have a problem? (no test with it so it probably does not work)
>>>>>>     >>     >>>>> Thanks,
>>>>>>     >>     >>>>> Mark
>>>>>>     >>     >>>>>     so I
>>>>>>     >>     >>>>>     want the code to have a mpiaij matrix when adding
>>>>>>    all the
>>>>>>     >>    matrix terms,
>>>>>>     >>     >>>>>     and then transform the matrix to seqaij when doing 
>>>>>> the
>>>>>>     >>    factorization
>>>>>>     >>     >>>>>     and
>>>>>>     >>     >>>>>     solve. This involves sending the data to the master
>>>>>>     >>    process, and I
>>>>>>     >>     >>>>>     think
>>>>>>     >>     >>>>>     the petsc mumps solver have something similar 
>>>>>> already.
>>>>>>     >>     >>>>>     Chang
>>>>>>     >>     >>>>>     On 10/13/21 10:18 AM, Junchao Zhang wrote:
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams
>>>>>>     >>     >>>>>      > <mfad...@lbl.gov> wrote:
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu
>>>>>>     >>     >>>>>      >     <c...@pppl.gov> wrote:
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         Hi Mark,
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         The option I use is like
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         -pc_type bjacobi -pc_bjacobi_blocks 16
>>>>>>     >>    -ksp_type fgmres
>>>>>>     >>     >>>>>     -mat_type
>>>>>>     >>     >>>>>      >         aijcusparse 
>>>>>> *-sub_pc_factor_mat_solver_type
>>>>>>     >>    cusparse
>>>>>>     >>     >>>>>     *-sub_ksp_type
>>>>>>     >>     >>>>>      >         preonly *-sub_pc_type lu* -ksp_max_it 
>>>>>> 2000
>>>>>>     >>    -ksp_rtol 1.e-300
>>>>>>     >>     >>>>>      >         -ksp_atol 1.e-300
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >     Note, If you use -log_view the last column
>>>>>>    (rows
>>>>>>     >>    are the
>>>>>>     >>     >>>>>     method like
>>>>>>     >>     >>>>>      >     MatFactorNumeric) has the percent of work
>>>>>>    in the GPU.
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >     Junchao: *This* implies that we have a
>>>>>>    cuSparse LU
>>>>>>     >>     >>>>>     factorization. Is
>>>>>>     >>     >>>>>      >     that correct? (I don't think we do)
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      > No, we don't have cuSparse LU factorization.     
>>>>>> If you check
>>>>>>     >>     >>>>>      > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will
>>>>>>    find it
>>>>>>     >>    calls
>>>>>>     >>     >>>>>      > MatLUFactorSymbolic_SeqAIJ() instead.
>>>>>>     >>     >>>>>      > So I don't understand Chang's idea. Do you want 
>>>>>> to
>>>>>>     >>    make bigger
>>>>>>     >>     >>>>>     blocks?
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         I think this one do both factorization 
>>>>>> and
>>>>>>     >>    solve on gpu.
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         You can check the
>>>>>>    runex72_aijcusparse.sh file
>>>>>>     >>    in petsc
>>>>>>     >>     >>>>>     install
>>>>>>     >>     >>>>>      >         directory, and try it your self (this
>>>>>>    is only lu
>>>>>>     >>     >>>>>     factorization
>>>>>>     >>     >>>>>      >         without
>>>>>>     >>     >>>>>      >         iterative solve).
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         Chang
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          > On Tue, Oct 12, 2021 at 11:19 AM
>>>>>>     >>     >>>>>      >          > Chang Liu <c...@pppl.gov> wrote:
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     Hi Junchao,
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     No I only needs it to be 
>>>>>> transferred
>>>>>>     >>    within a
>>>>>>     >>     >>>>>     node. I use
>>>>>>     >>     >>>>>      >         block-Jacobi
>>>>>>     >>     >>>>>      >          >     method and GMRES to solve the 
>>>>>> sparse
>>>>>>     >>    matrix, so each
>>>>>>     >>     >>>>>      >         direct solver will
>>>>>>     >>     >>>>>      >          >     take care of a sub-block of the
>>>>>>    whole
>>>>>>     >>    matrix. In this
>>>>>>     >>     >>>>>      >         way, I can use
>>>>>>     >>     >>>>>      >          >     one
>>>>>>     >>     >>>>>      >          >     GPU to solve one sub-block, which 
>>>>>> is
>>>>>>     >>    stored within
>>>>>>     >>     >>>>>     one node.
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     It was stated in the
>>>>>>    documentation that
>>>>>>     >>    cusparse
>>>>>>     >>     >>>>>     solver
>>>>>>     >>     >>>>>      >         is slow.
>>>>>>     >>     >>>>>      >          >     However, in my test using
>>>>>>    ex72.c, the
>>>>>>     >>    cusparse
>>>>>>     >>     >>>>>     solver is
>>>>>>     >>     >>>>>      >         faster than
>>>>>>     >>     >>>>>      >          >     mumps or superlu_dist on CPUs.
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          > Are we talking about the
>>>>>>    factorization, the
>>>>>>     >>    solve, or
>>>>>>     >>     >>>>>     both?
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          > We do not have an interface to
>>>>>>    cuSparse's LU
>>>>>>     >>     >>>>>     factorization (I
>>>>>>     >>     >>>>>      >         just
>>>>>>     >>     >>>>>      >          > learned that it exists a few weeks 
>>>>>> ago).
>>>>>>     >>     >>>>>      >          > Perhaps your fast "cusparse solver" is
>>>>>>     >>    '-pc_type lu
>>>>>>     >>     >>>>>     -mat_type
>>>>>>     >>     >>>>>      >          > aijcusparse' ? This would be the CPU
>>>>>>     >>    factorization,
>>>>>>     >>     >>>>>     which is the
>>>>>>     >>     >>>>>      >          > dominant cost.
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     Chang
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     On 10/12/21 10:24 AM, Junchao
>>>>>>    Zhang wrote:
>>>>>>     >>     >>>>>      >          >      > Hi, Chang,
>>>>>>     >>     >>>>>      >          >      >     For the mumps solver, we
>>>>>>    usually
>>>>>>     >>    transfers
>>>>>>     >>     >>>>>     matrix
>>>>>>     >>     >>>>>      >         and vector
>>>>>>     >>     >>>>>      >          >     data
>>>>>>     >>     >>>>>      >          >      > within a compute node.  For
>>>>>>    the idea you
>>>>>>     >>     >>>>>     propose, it
>>>>>>     >>     >>>>>      >         looks like
>>>>>>     >>     >>>>>      >          >     we need
>>>>>>     >>     >>>>>      >          >      > to gather data within
>>>>>>     >>    MPI_COMM_WORLD, right?
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Mark, I remember you said
>>>>>>     >>    cusparse solve is
>>>>>>     >>     >>>>>     slow
>>>>>>     >>     >>>>>      >         and you would
>>>>>>     >>     >>>>>      >          >      > rather do it on CPU. Is it 
>>>>>> right?
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      > --Junchao Zhang
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      > On Mon, Oct 11, 2021 at 10:25 PM
>>>>>>     >>     >>>>>      >          >      > Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Hi,
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Currently, it is possible
>>>>>>    to use
>>>>>>     >>    mumps
>>>>>>     >>     >>>>>     solver in
>>>>>>     >>     >>>>>      >         PETSC with
>>>>>>     >>     >>>>>      >          >      >     -mat_mumps_use_omp_threads
>>>>>>     >>    option, so that
>>>>>>     >>     >>>>>      >         multiple MPI
>>>>>>     >>     >>>>>      >          >     processes will
>>>>>>     >>     >>>>>      >          >      >     transfer the matrix and
>>>>>>    rhs data
>>>>>>     >>    to the master
>>>>>>     >>     >>>>>      >         rank, and then
>>>>>>     >>     >>>>>      >          >     master
>>>>>>     >>     >>>>>      >          >      >     rank will call mumps with
>>>>>>    OpenMP
>>>>>>     >>    to solve
>>>>>>     >>     >>>>>     the matrix.
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     I wonder if someone can
>>>>>>    develop
>>>>>>     >>    similar
>>>>>>     >>     >>>>>     option for
>>>>>>     >>     >>>>>      >         cusparse
>>>>>>     >>     >>>>>      >          >     solver.
>>>>>>     >>     >>>>>      >          >      >     Right now, this solver
>>>>>>    does not
>>>>>>     >>    work with
>>>>>>     >>     >>>>>      >         mpiaijcusparse. I
>>>>>>     >>     >>>>>      >          >     think a
>>>>>>     >>     >>>>>      >          >      >     possible workaround is to
>>>>>>     >>    transfer all the
>>>>>>     >>     >>>>>     matrix
>>>>>>     >>     >>>>>      >         data to one MPI
>>>>>>     >>     >>>>>      >          >      >     process, and then upload 
>>>>>> the
>>>>>>     >>    data to GPU to
>>>>>>     >>     >>>>>     solve.
>>>>>>     >>     >>>>>      >         In this
>>>>>>     >>     >>>>>      >          >     way, one can
>>>>>>     >>     >>>>>      >          >      >     use cusparse solver for a 
>>>>>> MPI
>>>>>>     >>    program.
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Chang
>>>>>>     >>     >>>>>      >          >      >     --
>>>>>>     >>     >>>>>      >          >      >     Chang Liu
>>>>>>     >>     >>>>>      >          >      >     Staff Research Physicist
>>>>>>     >>     >>>>>      >          >      >     +1 609 243 3438
>>>>>>     >>     >>>>>      >          >      > c...@pppl.gov
>>>>>>     >>     >>>>>      >          >      >     Princeton Plasma Physics
>>>>>>    Laboratory
>>>>>>     >>     >>>>>      >          >      >     100 Stellarator Rd,
>>>>>>    Princeton NJ
>>>>>>     >>    08540, USA
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     --
>>>>>>     >>     >>>>>      >          >     Chang Liu
>>>>>>     >>     >>>>>      >          >     Staff Research Physicist
>>>>>>     >>     >>>>>      >          >     +1 609 243 3438
>>>>>>     >>     >>>>>      >          > c...@pppl.gov
>>>>>>     >>     >>>>>      >          >     Princeton Plasma Physics 
>>>>>> Laboratory
>>>>>>     >>     >>>>>      >          >     100 Stellarator Rd, Princeton NJ
>>>>>>    08540, USA
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         --
>>>>>>     >>     >>>>>      >         Chang Liu
>>>>>>     >>     >>>>>      >         Staff Research Physicist
>>>>>>     >>     >>>>>      >         +1 609 243 3438
>>>>>>     >>     >>>>>      > c...@pppl.gov
>>>>>>     >>     >>>>>      >         Princeton Plasma Physics Laboratory
>>>>>>     >>     >>>>>      >         100 Stellarator Rd, Princeton NJ 08540, 
>>>>>> USA
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>     --     Chang Liu
>>>>>>     >>     >>>>>     Staff Research Physicist
>>>>>>     >>     >>>>>     +1 609 243 3438
>>>>>>     >>     >>>>> c...@pppl.gov
>>>>>>     >>     >>>>>     Princeton Plasma Physics Laboratory
>>>>>>     >>     >>>>>     100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> --
>>>>>>     >>     >>>> Chang Liu
>>>>>>     >>     >>>> Staff Research Physicist
>>>>>>     >>     >>>> +1 609 243 3438
>>>>>>     >>     >>>> c...@pppl.gov
>>>>>>     >>     >>>> Princeton Plasma Physics Laboratory
>>>>>>     >>     >>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>     >>     >>
>>>>>>     >>     >> --
>>>>>>     >>     >> Chang Liu
>>>>>>     >>     >> Staff Research Physicist
>>>>>>     >>     >> +1 609 243 3438
>>>>>>     >>     >> c...@pppl.gov
>>>>>>     >>     >> Princeton Plasma Physics Laboratory
>>>>>>     >>     >> 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>     >>     >
>>>>>>     >>    --     Chang Liu
>>>>>>     >>    Staff Research Physicist
>>>>>>     >>    +1 609 243 3438
>>>>>>     >> c...@pppl.gov
>>>>>>     >>    Princeton Plasma Physics Laboratory
>>>>>>     >>    100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>     >
>>>>>>     > --
>>>>>>     > Chang Liu
>>>>>>     > Staff Research Physicist
>>>>>>     > +1 609 243 3438
>>>>>>     > c...@pppl.gov
>>>>>>     > Princeton Plasma Physics Laboratory
>>>>>>     > 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>> 
>>>>> -- 
>>>>> Chang Liu
>>>>> Staff Research Physicist
>>>>> +1 609 243 3438
>>>>> c...@pppl.gov
>>>>> Princeton Plasma Physics Laboratory
>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>>> 
>>> -- 
>>> Chang Liu
>>> Staff Research Physicist
>>> +1 609 243 3438
>>> c...@pppl.gov
>>> Princeton Plasma Physics Laboratory
>>> 100 Stellarator Rd, Princeton NJ 08540, USA
> 
> -- 
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
