You need to use PCTELESCOPE inside the block Jacobi, not outside it. So something like

  -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu
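As a quick sanity check (a sketch only; I have not run this on your problem), you can append -ksp_view and -options_left to the large-matrix run you already have, e.g.

  mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -ksp_type fgmres -mat_type aijcusparse \
      -pc_type bjacobi -pc_bjacobi_blocks 4 \
      -sub_ksp_type preonly -sub_pc_type telescope -sub_pc_telescope_reduction_factor 4 \
      -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse \
      -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 \
      -ksp_view -options_left

-ksp_view should show the outer fgmres KSP with a bjacobi PC, each block holding a preonly KSP with a telescope PC, and inside the telescope a preonly KSP with an lu PC using cusparse; -options_left will warn about any -sub_telescope_* option that was never consumed, which is usually the first sign that an option is attached at the wrong level of the hierarchy.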
> On Oct 14, 2021, at 4:14 PM, Chang Liu <c...@pppl.gov> wrote:
>
> Hi Pierre,
>
> I wonder if the trick of PCTELESCOPE only works for the preconditioner and not for the solver. I have done some tests and found that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work.
>
> The command line options I used for the small matrix are
>
> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
>
> which gives the correct output. For the iterative solver, I tried
>
> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
>
> for the large matrix. The output is
>
> 0 KSP Residual norm 40.1497
> 1 KSP Residual norm < 1.e-11
> Norm of error 400.999 iterations 1
>
> So it seems to call a direct solver instead of an iterative one.
>
> Can you please help check these options?
>
> Chang
>
> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
>>> On 14 Oct 2021, at 3:50 PM, Chang Liu <c...@pppl.gov> wrote:
>>>
>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse matrix to seqaijcusparse? Or do I have to do it manually?
>>
>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
>> 1) I'm not sure this is implemented for cuSparse matrices, but it should be;
>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect whether the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
>> If you try this out and it does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
>> I do not claim that this will be efficient, but I think it goes in the direction of what you want to achieve.
>> Thanks,
>> Pierre
>>
>>> Chang
>>>
>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
>>>> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
>>>> This does not work with MUMPS -mat_mumps_use_omp_threads, because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
>>>> Thus the need for specific code in mumps.c.
>>>> Thanks,
>>>> Pierre
>>>>
>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>
>>>>> Hi Junchao,
>>>>>
>>>>> Yes, that is what I want.
>>>>>
>>>>> Chang
>>>>>
>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote:
>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>
>>>>>> Junchao,
>>>>>>
>>>>>> If I understand correctly, Chang is using the block Jacobi method with a single block for a number of MPI ranks and a direct solver for each block, so it uses PCSetUp_BJacobi_Multiproc(), which is code Hong Zhang wrote a number of years ago for CPUs. For their particular problems this preconditioner works well, but using an iterative solver on the blocks does not work well.
>>>>>>
>>>>>> If we had complete MPI-GPU direct solvers he could just use the current code with MPIAIJCUSPARSE on each block, but since we do not, he would like to use a single GPU for each block. This means that the diagonal blocks of the global parallel MPI matrix need to be sent to a subset of the GPUs (one GPU per block, where each block has multiple MPI ranks associated with it). Similarly, for the triangular solves the blocks of the right-hand side need to be shipped to the appropriate GPU and the resulting solution shipped back to the multiple GPUs. So Chang is absolutely correct, this is somewhat like your code for MUMPS with OpenMP.
>>>>>>
>>>>>> OK, I now understand the background.
>>>>>>
>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the MPI ranks and then shrink each block down to a single GPU, but this would be pretty inefficient; ideally one would go directly from the big MPI matrix on all the GPUs to the sub matrices on the subset of GPUs. But this may be a large coding project.
>>>>>>
>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep the blocks' size, no shrinking or expanding.
>>>>>>
>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on the CPU, and the solve would be done on the GPU. I assume Chang wants to gain from the (potentially) faster solve (instead of factorization) on the GPU.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> Since the matrices being factored and solved directly are relatively large, it is possible that the cusparse code could be reasonably efficient (they are not the tiny problems one gets at the coarse level of multigrid). Of course, this is speculation; I don't actually know how much better the cusparse code would be on the direct solver than a good CPU direct sparse solver.
>>>>>>
>>>>>>> On Oct 13, 2021, at 9:32 PM, Chang Liu <c...@pppl.gov> wrote:
>>>>>>>
>>>>>>> Sorry, I am not familiar with the details either. Can you please check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
>>>>>>>
>>>>>>> Chang
>>>>>>>
>>>>>>> On 10/13/21 9:24 PM, Junchao Zhang wrote:
>>>>>>>> Hi Chang,
>>>>>>>> I did the work in mumps. It is easy for me to understand gathering matrix rows to one process.
>>>>>>>> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that?
>>>>>>>> Thanks
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>> Hi Barry,
>>>>>>>>
>>>>>>>> I think the mumps solver in petsc does support that. You can check the documentation on "-mat_mumps_use_omp_threads" at
>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>>>>>>>> and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in the functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in mumps.c.
>>>>>>>>
>>>>>>>> 1. I understand it is ideal to do one MPI rank per GPU. However, I am working on an existing code that was developed based on MPI, and the # of MPI ranks is typically equal to the # of CPU cores. We don't want to change the whole structure of the code.
>>>>>>>>
>>>>>>>> 2. What you have suggested has been coded in mumps.c. See the function MatMumpsSetUpDistRHSInfo.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Chang
>>>>>>>>
>>>>>>>> On 10/13/21 7:53 PM, Barry Smith wrote:
>>>>>>>>>
>>>>>>>>>> On Oct 13, 2021, at 3:50 PM, Chang Liu <c...@pppl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Barry,
>>>>>>>>>>
>>>>>>>>>> That is exactly what I want.
>>>>>>>>>>
>>>>>>>>>> Back to my original question, I am looking for an approach to transfer matrix data from many MPI processes to "master" MPI processes, each of which takes care of one GPU, and then upload the data to the GPU to solve. One can just grab some code from mumps.c to aijcusparse.cu.
>>>>>>>>>
>>>>>>>>> mumps.c doesn't actually do that. It never needs to copy the entire matrix to a single MPI rank.
>>>>>>>>>
>>>>>>>>> It would be possible to write such a code as you suggest, but it is not clear that it makes sense:
>>>>>>>>>
>>>>>>>>> 1) For normal PETSc GPU usage there is one GPU per MPI rank, so while your one GPU per big domain is solving its systems, the other GPUs (with the other MPI ranks that share that domain) are doing nothing.
>>>>>>>>>
>>>>>>>>> 2) For each triangular solve you would have to gather the right-hand side from the multiple ranks to the single GPU to pass it to the GPU solver, and then scatter the resulting solution back to all of its subdomain ranks.
>>>>>>>>>
>>>>>>>>> What I was suggesting was to assign an entire subdomain to a single MPI rank, so it does everything on one GPU and can use the GPU solver directly. If all the major computations of a subdomain can fit and be done on a single GPU, then you would be utilizing all the GPUs you are using effectively.
>>>>>>>>>
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>>> Chang
>>>>>>>>>>
>>>>>>>>>> On 10/13/21 1:53 PM, Barry Smith wrote:
>>>>>>>>>>> Chang,
>>>>>>>>>>>
>>>>>>>>>>> You are correct, there are no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited to doing each individual triangular solve on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves.
>>>>>>>>>>>
>>>>>>>>>>> Barry
>>>>>>>>>>>
>>>>>>>>>>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Mark,
>>>>>>>>>>>>
>>>>>>>>>>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse it will give an error.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, what I want is to have mumps or superlu do the factorization, and then do the rest, including the GMRES solver, on the GPU. Is that possible?
>>>>>>>>>>>>
>>>>>>>>>>>> I have tried to use aijcusparse with superlu_dist; it runs, but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on the GPU.
>>>>>>>>>>>>
>>>>>>>>>>>> Chang
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <c...@pppl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Junchao for explaining this. I guess in my case the code is just calling a seq solver like superlu to do the factorization on GPUs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My idea is that I want to have a traditional MPI code utilize GPUs with cusparse. Right now cusparse does not support the mpiaij matrix,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. (-mat_type mpiaijcusparse might also work with > 1 proc.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I see in grepping the repo that all the mumps and superlu tests use the aij or sell matrix type. MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue? Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a problem? (There is no test with it, so it probably does not work.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>> so I want the code to have a mpiaij matrix when adding all the matrix terms, and then transform the matrix to seqaij when doing the factorization and solve.
>>>>>>>>>>>>> This involves sending the data to the master process, and I think the petsc mumps solver has something similar already.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Chang
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>>>>>>>>>>>>>> On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <c...@pppl.gov> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Mark,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The option I use is like
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note, if you use -log_view, the last column (the rows are the methods, like MatFactorNumeric) has the percent of work done on the GPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Junchao: *This* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I don't understand Chang's idea. Do you want to make bigger blocks?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this one does both the factorization and the solve on the GPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can check the runex72_aijcusparse.sh file in the petsc install directory and try it yourself (this is only the lu factorization, without the iterative solve).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Chang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>>>>>>>>>> On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <c...@pppl.gov> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Junchao,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are we talking about the factorization, the solve, or both?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago). Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Chang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>>>>>>>>>>>> Hi, Chang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mark, I remember you said the cusparse solve is slow and you would rather do it on the CPU. Is that right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to the GPU to solve. In this way, one can use the cusparse solver for an MPI program.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Chang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Chang Liu
>>>>>>>>>>>>>>>> Staff Research Physicist
>>>>>>>>>>>>>>>> +1 609 243 3438
>>>>>>>>>>>>>>>> c...@pppl.gov
>>>>>>>>>>>>>>>> Princeton Plasma Physics Laboratory
>>>>>>>>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA