On Thu, Dec 21, 2023 at 5:54 AM Matthew Knepley <[email protected]> wrote:
> On Thu, Dec 21, 2023 at 6:46 AM Sreeram R Venkat <[email protected]> wrote:
>
>> Ok, I think the error I'm getting has something to do with how the multiple solves are being done in succession. I'll try to see if there's anything I'm doing wrong there.
>>
>> One question about the -pc_type lu -ksp_type preonly method: do you know which parts of the solve (factorization/triangular solves) are done on host and which are done on device?
>
> For SEQDENSE, I believe both the factorization and the solve are on device. It is hard to see, but I believe the dispatch code is here:

Yes, it is correct.

> https://gitlab.com/petsc/petsc/-/blob/main/src/mat/impls/dense/seq/cupm/matseqdensecupm.hpp?ref_type=heads#L368
>
> Thanks,
>
>    Matt
>
>> Thanks,
>> Sreeram
>>
>> On Sat, Dec 16, 2023 at 10:56 PM Pierre Jolivet <[email protected]> wrote:
>>
>>> Unfortunately, I am not able to reproduce such a failure with your input matrix.
>>> I've used ex79 that I linked previously and the system is properly solved.
>>> $ ./ex79 -pc_type hypre -ksp_type hpddm -ksp_hpddm_type cg -ksp_converged_reason -ksp_view_mat ascii::ascii_info -ksp_view_rhs ascii::ascii_info
>>> Linear solve converged due to CONVERGED_RTOL iterations 6
>>> Mat Object: 1 MPI process
>>>   type: seqaijcusparse
>>>   rows=289, cols=289
>>>   total: nonzeros=2401, allocated nonzeros=2401
>>>   total number of mallocs used during MatSetValues calls=0
>>>     not using I-node routines
>>> Mat Object: 1 MPI process
>>>   type: seqdensecuda
>>>   rows=289, cols=10
>>>   total: nonzeros=2890, allocated nonzeros=2890
>>>   total number of mallocs used during MatSetValues calls=0
>>>
>>> You mentioned in a subsequent email that you are interested in systems with at most 1E6 unknowns and up to 1E4 right-hand sides.
>>> I'm not sure you can expect significant gains from using the GPU for such systems.
>>> Probably, the fastest approach would indeed be -pc_type lu -ksp_type preonly -ksp_matsolve_batch_size 100 or something, depending on the memory available on your host.
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On 15 Dec 2023, at 9:52 PM, Sreeram R Venkat <[email protected]> wrote:
>>>
>>> Here are the ksp_view files. I set the option -ksp_error_if_not_converged to try to get the vectors that caused the error. I noticed that some of the KSPMatSolves converge while others don't. In the code, the solves are called as:
>>>
>>> input vector v --> insert data of v into a dense mat --> KSPMatSolve() --> MatMatMult() --> KSPMatSolve() --> insert data of dense mat into output vector w --> output w
>>>
>>> The operator used in the KSP is a Laplacian-like operator, and the MatMatMult is with a mass matrix. The whole thing is supposed to be a solve with a biharmonic-like operator. I can also run it with only the first KSPMatSolve (i.e., just a Laplacian-like operator). In that case, the KSP reportedly converges after 0 iterations (see the next paragraph), but this causes problems in other parts of the code later on.
>>>
>>> I saw that sometimes the first KSPMatSolve "converges" after 0 iterations due to CONVERGED_RTOL. Then, the second KSPMatSolve produces a NaN/Inf. I tried setting -ksp_min_it, but that didn't seem to do anything.
>>>
>>> I'll keep trying different options and also try to get the MWE made (this KSPMatSolve is pretty performance-critical for us).
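
As a point of reference, a minimal host-side sketch of the pipeline described above: wrap the data of v as a dense matrix, KSPMatSolve() with the Laplacian-like operator, MatMatMult() with the mass matrix, a second KSPMatSolve(), then copy the result into w. The names ksp_lap, Mmass, n, and m are placeholders (this is not the actual code), and on the GPU the dense blocks would be created with MatCreateDenseCUDA() instead.

#include <petscksp.h>

/* v holds m right-hand sides of size n, stored contiguously in column-major order */
static PetscErrorCode BiharmonicLikeSolve(KSP ksp_lap, Mat Mmass, Vec v, Vec w, PetscInt n, PetscInt m)
{
  const PetscScalar *varr, *xarr;
  PetscScalar       *warr;
  Mat                V, X1, X2, X3;

  PetscFunctionBeginUser;
  /* view the n*m entries of v as an n x m dense matrix, without copying;
     the cast is safe because the right-hand side block is only read */
  PetscCall(VecGetArrayRead(v, &varr));
  PetscCall(MatCreateDense(PETSC_COMM_SELF, n, m, n, m, (PetscScalar *)varr, &V));
  PetscCall(MatDuplicate(V, MAT_DO_NOT_COPY_VALUES, &X1));
  PetscCall(KSPMatSolve(ksp_lap, V, X1));                                   /* first Laplacian-like solve  */
  PetscCall(MatMatMult(Mmass, X1, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &X2)); /* apply the mass matrix       */
  PetscCall(MatDuplicate(X2, MAT_DO_NOT_COPY_VALUES, &X3));
  PetscCall(KSPMatSolve(ksp_lap, X2, X3));                                  /* second Laplacian-like solve */
  /* copy the result (column-major, leading dimension n here) into the output vector */
  PetscCall(MatDenseGetArrayRead(X3, &xarr));
  PetscCall(VecGetArray(w, &warr));
  PetscCall(PetscArraycpy(warr, xarr, n * m));
  PetscCall(VecRestoreArray(w, &warr));
  PetscCall(MatDenseRestoreArrayRead(X3, &xarr));
  PetscCall(MatDestroy(&X3));
  PetscCall(MatDestroy(&X2));
  PetscCall(MatDestroy(&X1));
  PetscCall(MatDestroy(&V));
  PetscCall(VecRestoreArrayRead(v, &varr));
  PetscFunctionReturn(PETSC_SUCCESS);
}

The -ksp_matsolve_batch_size option suggested above would then control how many of the m columns each KSPMatSolve() processes at a time, if memory becomes a concern.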
>>> Thanks for all your help,
>>> Sreeram
>>>
>>> On Fri, Dec 15, 2023 at 1:01 AM Pierre Jolivet <[email protected]> wrote:
>>>
>>>> On 14 Dec 2023, at 11:45 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>
>>>> Thanks, I will try to create a minimal reproducible example. This may take me some time though, as I need to figure out how to extract only the relevant parts (the full program this solve is used in is getting quite complex).
>>>>
>>>> You could just do -ksp_view_mat binary:Amat.bin -ksp_view_pmat binary:Pmat.bin -ksp_view_rhs binary:rhs.bin and send me those three files (I'm guessing you are using double-precision scalars with 32-bit PetscInt).
>>>>
>>>> I'll also try out some of the BoomerAMG options to see if that helps.
>>>>
>>>> These should work (this is where all "PCMatApply()-ready" PCs are being tested): https://petsc.org/release/src/ksp/ksp/tutorials/ex79.c.html#line215
>>>> You can see it's also testing PCHYPRE + KSPHPDDM on device (but not with HIP).
>>>> I'm aware the performance should not be optimal (see your comment about host/device copies). I've money to hire someone to work on this, but: a) I need to find the correct engineer/post-doc, b) I currently don't have good use cases (of course, I could generate a synthetic benchmark, for science).
>>>> So even if you send me the three Mat, an MWE would be appreciated if the KSPMatSolve() is performance-critical for you (see point b) above).
>>>>
>>>> Thanks,
>>>> Pierre
>>>>
>>>> Thanks,
>>>> Sreeram
>>>>
>>>> On Thu, Dec 14, 2023, 1:12 PM Pierre Jolivet <[email protected]> wrote:
>>>>
>>>>> On 14 Dec 2023, at 8:02 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>
>>>>> Hello Pierre,
>>>>>
>>>>> Thank you for your reply. I tried out the HPDDM CG as you said, and it seems to be doing the batched solves, but the KSP is not converging due to a NaN or Inf being generated. I also noticed there are a lot of host-to-device and device-to-host copies of the matrices (the non-batched KSP solve did not have any memcopies). I have attached dump.0 again. Could you please take a look?
>>>>>
>>>>> Yes, but you'd need to send me something I can run with your set of options (if you are more comfortable doing this in private, you can remove the list from c/c).
>>>>> Not all BoomerAMG smoothers handle blocks of right-hand sides, and there is not much error checking, so instead of erroring out, this may be the reason why you are getting garbage.
>>>>>
>>>>> Thanks,
>>>>> Pierre
>>>>>
>>>>> Thanks,
>>>>> Sreeram
>>>>>
>>>>> On Thu, Dec 14, 2023 at 12:42 AM Pierre Jolivet <[email protected]> wrote:
>>>>>
>>>>>> Hello Sreeram,
>>>>>> KSPCG (the PETSc implementation of CG) does not handle solves with multiple columns at once.
>>>>>> There is only a single native PETSc KSP implementation which handles solves with multiple columns at once: KSPPREONLY.
>>>>>> If you use --download-hpddm, you can use a CG (or GMRES, or more advanced methods) implementation which handles solves with multiple columns at once (via -ksp_type hpddm -ksp_hpddm_type cg or KSPSetType(ksp, KSPHPDDM); KSPHPDDMSetType(ksp, KSP_HPDDM_TYPE_CG);).
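
Put together, a minimal sketch of that configuration (the communicator and the operator A are placeholders; everything can equally be set from the command line):

#include <petscksp.h>

/* configure a KSP that treats a block of right-hand sides in a single go */
static PetscErrorCode SetupBlockCG(MPI_Comm comm, Mat A, KSP *ksp)
{
  PetscFunctionBeginUser;
  PetscCall(KSPCreate(comm, ksp));
  PetscCall(KSPSetOperators(*ksp, A, A));
  PetscCall(KSPSetType(*ksp, KSPHPDDM));               /* requires a PETSc built with --download-hpddm */
  PetscCall(KSPHPDDMSetType(*ksp, KSP_HPDDM_TYPE_CG)); /* (pseudo-)block CG                            */
  PetscCall(KSPSetFromOptions(*ksp));                  /* or simply -ksp_type hpddm -ksp_hpddm_type cg */
  PetscFunctionReturn(PETSC_SUCCESS);
}

Pairing this with a "PCMatApply()-ready" preconditioner (e.g., -pc_type hypre, as in ex79 above) keeps the right-hand sides blocked throughout the solve.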
>>>>>> I'm the main author of HPDDM; there is preliminary support for device matrices, but if it's not working as intended/not faster than column by column, I'd be happy to have a deeper look (maybe in private), because most (if not all) of my users interested in (pseudo-)block Krylov solvers (i.e., solvers that treat right-hand sides in a single go) are using plain host matrices.
>>>>>>
>>>>>> Thanks,
>>>>>> Pierre
>>>>>>
>>>>>> PS: you could have a look at https://www.sciencedirect.com/science/article/abs/pii/S0898122121000055 to understand the philosophy behind block iterative methods in PETSc (and in HPDDM). src/mat/tests/ex237.c, the benchmark I mentioned earlier, was developed in the context of this paper to produce Figures 2-3. Note that this paper is now slightly outdated; since then, PCHYPRE and PCMG (among others) have been made "PCMatApply()-ready".
>>>>>>
>>>>>> On 13 Dec 2023, at 11:05 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>
>>>>>> Hello Pierre,
>>>>>>
>>>>>> I am trying out KSPMatSolve with the BoomerAMG preconditioner. However, I am noticing that it is still solving column by column (this is stated explicitly in the info dump attached). I looked at the code for KSPMatSolve_Private() and saw that as long as ksp->ops->matsolve is true, it should do the batched solve, though I'm not sure where that gets set.
>>>>>>
>>>>>> I am using the options -pc_type hypre -pc_hypre_type boomeramg when running the code.
>>>>>>
>>>>>> Can you please help me with this?
>>>>>>
>>>>>> Thanks,
>>>>>> Sreeram
>>>>>>
>>>>>> On Thu, Dec 7, 2023 at 4:04 PM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> N.B., the AMGX interface is a bit experimental.
>>>>>>> Mark
>>>>>>>
>>>>>>> On Thu, Dec 7, 2023 at 4:11 PM Sreeram R Venkat <[email protected]> wrote:
>>>>>>>
>>>>>>>> Oh, in that case I will try out BoomerAMG. Getting AMGX to build correctly was also tricky, so hopefully the HYPRE build will be easier.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sreeram
>>>>>>>>
>>>>>>>> On Thu, Dec 7, 2023, 3:03 PM Pierre Jolivet <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 7 Dec 2023, at 9:37 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Thank you Barry and Pierre; I will proceed with the first option.
>>>>>>>>>
>>>>>>>>> I want to use the AMGX preconditioner for the KSP. I will try it out and see how it performs.
>>>>>>>>>
>>>>>>>>> Just FYI, AMGX does not handle systems with multiple RHS, and thus has no PCMatApply() implementation.
>>>>>>>>> BoomerAMG does, and there is a PCMatApply_HYPRE_BoomerAMG() implementation.
>>>>>>>>> But let us know if you need assistance figuring things out.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Pierre
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sreeram
>>>>>>>>>
>>>>>>>>> On Thu, Dec 7, 2023 at 2:02 PM Pierre Jolivet <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> To expand on Barry's answer, we have observed repeatedly that MatMatMult with MatAIJ performs better than MatMult with MatMAIJ; you can reproduce this on your own with https://petsc.org/release/src/mat/tests/ex237.c.html.
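
For illustration, a rough sketch of the two routes being compared here; the function and variable names are placeholders, not code from ex237.c:

#include <petscmat.h>

/* Two ways of applying an n x n MatAIJ A to m vectors at once; (a) is the
   one reported above to perform better. */
static PetscErrorCode ApplyToManyVectors(Mat A, Mat V /* n x m dense */, Vec vi /* size n*m */, PetscInt m)
{
  Mat W, Am;
  Vec wi;

  PetscFunctionBeginUser;
  /* (a) block product on a dense matrix whose columns are the vectors */
  PetscCall(MatMatMult(A, V, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &W));
  /* (b) MAIJ wrapper acting on a single vector in which the m vectors are
     interlaced component by component (i.e., "row-major" storage of the block) */
  PetscCall(MatCreateMAIJ(A, m, &Am));
  PetscCall(VecDuplicate(vi, &wi));
  PetscCall(MatMult(Am, vi, wi));
  PetscCall(VecDestroy(&wi));
  PetscCall(MatDestroy(&Am));
  PetscCall(MatDestroy(&W));
  PetscFunctionReturn(PETSC_SUCCESS);
}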
>>>>>>>>>> Also, I'm guessing you are using some sort of preconditioner within your KSP.
>>>>>>>>>> Not all are "KSPMatSolve-ready", i.e., they may treat blocks of right-hand sides column by column, which is very inefficient.
>>>>>>>>>> You could run your code with -info dump and send us dump.0 to see what needs to be done on our end to make things more efficient, should you not be satisfied with the current performance of the code.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Pierre
>>>>>>>>>>
>>>>>>>>>> On 7 Dec 2023, at 8:34 PM, Barry Smith <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On Dec 7, 2023, at 1:17 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> I have 2 sequential matrices M and R (both MATSEQAIJCUSPARSE of size n x n) and a vector v of size n*m, v = [v_1, v_2, ..., v_m], where each v_i has size n. The data for v can be stored either in column-major or row-major order. Now, I want to do 2 types of operations:
>>>>>>>>>>
>>>>>>>>>> 1. Matvecs of the form M*v_i = w_i, for i = 1..m.
>>>>>>>>>> 2. KSPSolves of the form R*x_i = v_i, for i = 1..m.
>>>>>>>>>>
>>>>>>>>>> From what I have read in the documentation, I can think of 2 approaches.
>>>>>>>>>>
>>>>>>>>>> 1. Get the pointer to the data in v (column-major) and use it to create a dense matrix V. Then do a MatMatMult with M*V = W, and take the data pointer of W to create the vector w. For KSPSolves, use KSPMatSolve with R and V.
>>>>>>>>>>
>>>>>>>>>> 2. Create a MATMAIJ using M/R and use that for matvecs directly with the vector v. I don't know if KSPSolve with the MATMAIJ will know that it is a multiple-RHS system and act accordingly.
>>>>>>>>>>
>>>>>>>>>> Which would be the more efficient option?
>>>>>>>>>>
>>>>>>>>>> Use 1.
>>>>>>>>>>
>>>>>>>>>> As a side note, I am also wondering if there is a way to use row-major storage of the vector v.
>>>>>>>>>>
>>>>>>>>>> No.
>>>>>>>>>>
>>>>>>>>>> The reason is that this could allow for more coalesced memory access when doing matvecs.
>>>>>>>>>>
>>>>>>>>>> PETSc matrix-vector products use BLAS GEMV matrix-vector products for the computation, so in theory they should already be well-optimized.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sreeram
>>>>>>>>>
>>>>>>>>> <dump.0>
>>>>>>
>>>>>> <dump.0>
>>>>>
>>>> <Pmat.bin><Amat.bin><rhs.bin>
>>>
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
