> On 20 Dec 2023, at 8:42 AM, Sreeram R Venkat <[email protected]> wrote:
> 
> Ok, I think the error I'm getting has something to do with how the multiple 
> solves are being done in succession. I'll try to see if there's anything I'm 
> doing wrong there. 
> 
> One question about the -pc_type lu -ksp_type preonly method: do you know 
> which parts of the solve (factorization/triangular solves) are done on host 
> and which are done on device?

I think only the triangular solves can be done on device.
Since you have many right-hand sides, it may not be that bad.
GPU people will hopefully give you a more insightful answer.
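
For reference, a minimal (untested) sketch of such a solve, assuming A is your
MATSEQAIJCUSPARSE operator and the right-hand sides B and solutions X are stored
as dense matrices:

  KSP ksp;
  PC  pc;
  PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPPREONLY));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCLU));
  PetscCall(KSPSetFromOptions(ksp)); /* picks up -ksp_matsolve_batch_size and friends */
  PetscCall(KSPMatSolve(ksp, B, X)); /* factorization on host, triangular solves (possibly) on device */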

Thanks,
Pierre

> Thanks,
> Sreeram
> 
> On Sat, Dec 16, 2023 at 10:56 PM Pierre Jolivet <[email protected]> wrote:
>> Unfortunately, I am not able to reproduce such a failure with your input 
>> matrix.
>> I’ve used ex79 that I linked previously and the system is properly solved.
>> $ ./ex79 -pc_type hypre -ksp_type hpddm -ksp_hpddm_type cg 
>> -ksp_converged_reason -ksp_view_mat ascii::ascii_info -ksp_view_rhs 
>> ascii::ascii_info
>> Linear solve converged due to CONVERGED_RTOL iterations 6
>> Mat Object: 1 MPI process
>>   type: seqaijcusparse
>>   rows=289, cols=289
>>   total: nonzeros=2401, allocated nonzeros=2401
>>   total number of mallocs used during MatSetValues calls=0
>>     not using I-node routines
>> Mat Object: 1 MPI process
>>   type: seqdensecuda
>>   rows=289, cols=10
>>   total: nonzeros=2890, allocated nonzeros=2890
>>   total number of mallocs used during MatSetValues calls=0
>> 
>> You mentioned in a subsequent email that you are interested in systems with 
>> at most 1E6 unknowns, and up to 1E4 right-hand sides.
>> I’m not sure you can expect significant gains from using a GPU for such 
>> systems.
>> Probably, the fastest approach would indeed be -pc_type lu -ksp_type preonly 
>> -ksp_matsolve_batch_size 100 or something, depending on the memory available 
>> on your host.
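>> 
>> (Back-of-envelope, assuming double-precision scalars: a dense block of 1E6
>> unknowns by 1E4 right-hand sides alone takes 1E6 x 1E4 x 8 bytes = 8E10 bytes,
>> i.e., about 80 GB, hence the batching.)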
>> 
>> Thanks,
>> Pierre
>> 
>>> On 15 Dec 2023, at 9:52 PM, Sreeram R Venkat <[email protected]> wrote:
>>> 
>>> Here are the ksp_view files. I set the option -ksp_error_if_not_converged 
>>> to try to get the vectors that caused the error. I noticed that some of the 
>>> KSPMatSolves converge while others don't. In the code, the solves are 
>>> called as:
>>> 
>>> input vector v --> insert data of v into a dense mat --> KSPMatSolve() --> 
>>> MatMatMult() --> KSPMatSolve() --> insert data of dense mat into output 
>>> vector w -- output w
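>>> 
>>> Roughly, in simplified (untested) code with placeholder names (V, Y, Z, W are
>>> dense Mats created elsewhere, M is the mass matrix):
>>> 
>>>   PetscScalar *varr;
>>>   PetscCall(MatDenseGetArrayWrite(V, &varr));      /* insert the data of v into V */
>>>   /* ... copy v into varr column by column ... */
>>>   PetscCall(MatDenseRestoreArrayWrite(V, &varr));
>>>   PetscCall(KSPMatSolve(ksp, V, Y));               /* first Laplacian-like solve  */
>>>   PetscCall(MatMatMult(M, Y, MAT_REUSE_MATRIX, PETSC_DEFAULT, &Z));
>>>   PetscCall(KSPMatSolve(ksp, Z, W));               /* second Laplacian-like solve */
>>>   /* ... copy W into the output vector w ... */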
>>> 
>>> The operator used in the KSP is a Laplacian-like operator, and the 
>>> MatMatMult is with a Mass Matrix. The whole thing is supposed to be a solve 
>>> with a biharmonic-like operator. I can also run it with only the first 
>>> KSPMatSolve (i.e. just a Laplacian-like operator). In that case, the KSP 
>>> reportedly converges after 0 iterations (see the next line), but this 
>>> causes problems in other parts of the code later on. 
>>> 
>>> I saw that sometimes the first KSPMatSolve "converges" after 0 iterations 
>>> due to CONVERGED_RTOL. Then, the second KSPMatSolve produces a NaN/Inf. I 
>>> tried setting -ksp_min_it, but that didn't seem to do anything. 
>>> 
>>> I'll keep trying different options and also try to get the MWE made (this 
>>> KSPMatSolve is pretty performance-critical for us). 
>>> 
>>> Thanks for all your help,
>>> Sreeram
>>> 
>>> On Fri, Dec 15, 2023 at 1:01 AM Pierre Jolivet <[email protected]> wrote:
>>>> 
>>>>> On 14 Dec 2023, at 11:45 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>> 
>>>>> Thanks, I will try to create a minimal reproducible example. This may 
>>>>> take me some time though, as I need to figure out how to extract only the 
>>>>> relevant parts (the full program this solve is used in is getting quite 
>>>>> complex).
>>>> 
>>>> You could just do -ksp_view_mat binary:Amat.bin -ksp_view_pmat 
>>>> binary:Pmat.bin -ksp_view_rhs binary:rhs.bin and send me those three files 
>>>> (I’m guessing you are using double-precision scalars with 32-bit 
>>>> PetscInt).
>>>> 
>>>>> I'll also try out some of the BoomerAMG options to see if that helps.
>>>> 
>>>> These should work (this is where all the “PCMatApply()-ready” PCs are 
>>>> tested): 
>>>> https://petsc.org/release/src/ksp/ksp/tutorials/ex79.c.html#line215
>>>> You can see it’s also testing PCHYPRE + KSPHPDDM on device (but not with 
>>>> HIP).
>>>> I’m aware the performance should not be optimal (see your comment about 
>>>> host/device copies). I have money to hire someone to work on this, but: a) I 
>>>> need to find the right engineer/post-doc, b) I currently don’t have good 
>>>> use cases (of course, I could generate a synthetic benchmark, for science).
>>>> So even if you send me the three Mat, an MWE would be appreciated if the 
>>>> KSPMatSolve() is performance-critical for you (see point b) above).
>>>> 
>>>> Thanks,
>>>> Pierre
>>>> 
>>>>> Thanks,
>>>>> Sreeram
>>>>> 
>>>>> On Thu, Dec 14, 2023, 1:12 PM Pierre Jolivet <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 14 Dec 2023, at 8:02 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hello Pierre,
>>>>>>> 
>>>>>>> Thank you for your reply. I tried out the HPDDM CG as you said, and it 
>>>>>>> seems to be doing the batched solves, but the KSP is not converging due 
>>>>>>> to a NaN or Inf being generated. I also noticed there are a lot of 
>>>>>>> host-to-device and device-to-host copies of the matrices (the 
>>>>>>> non-batched KSP solve did not have any memcopies). I have attached 
>>>>>>> dump.0 again. Could you please take a look?
>>>>>> 
>>>>>> Yes, but you’d need to send me something I can run with your set of 
>>>>>> options (if you are more comfortable doing this in private, you can remove 
>>>>>> the list from the c/c).
>>>>>> Not all BoomerAMG smoothers handle blocks of right-hand sides, and there 
>>>>>> is not much error checking, so instead of erroring out, they may silently 
>>>>>> produce garbage, which may be what you are seeing.
>>>>>> 
>>>>>> Thanks,
>>>>>> Pierre
>>>>>> 
>>>>>>> Thanks,
>>>>>>> Sreeram
>>>>>>> 
>>>>>>> On Thu, Dec 14, 2023 at 12:42 AM Pierre Jolivet <[email protected]> wrote:
>>>>>>>> Hello Sreeram,
>>>>>>>> KSPCG (PETSc implementation of CG) does not handle solves with 
>>>>>>>> multiple columns at once.
>>>>>>>> There is only a single native PETSc KSP implementation which handles 
>>>>>>>> solves with multiple columns at once: KSPPREONLY.
>>>>>>>> If you use --download-hpddm, you can use a CG (or GMRES, or more 
>>>>>>>> advanced methods) implementation which handles solves with multiple 
>>>>>>>> columns at once (via -ksp_type hpddm -ksp_hpddm_type cg or 
>>>>>>>> KSPSetType(ksp, KSPHPDDM); KSPHPDDMSetType(ksp, KSP_HPDDM_TYPE_CG);).
>>>>>>>> I’m the main author of HPDDM. There is preliminary support for device 
>>>>>>>> matrices, but if it’s not working as intended or not faster than column 
>>>>>>>> by column, I’d be happy to take a deeper look (maybe in private), 
>>>>>>>> because most (if not all) of my users interested in (pseudo-)block 
>>>>>>>> Krylov solvers (i.e., solvers that treat all right-hand sides in a 
>>>>>>>> single go) are using plain host matrices.
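>>>>>>>> 
>>>>>>>> For reference, a minimal (untested) sketch of the HPDDM path, assuming 
>>>>>>>> ksp and its operators are already set up and B/X hold the right-hand 
>>>>>>>> sides/solutions as dense matrices:
>>>>>>>> 
>>>>>>>>   PetscCall(KSPSetType(ksp, KSPHPDDM));
>>>>>>>>   PetscCall(KSPHPDDMSetType(ksp, KSP_HPDDM_TYPE_CG));
>>>>>>>>   PetscCall(KSPSetFromOptions(ksp));
>>>>>>>>   PetscCall(KSPMatSolve(ksp, B, X)); /* all columns of B are treated in a single go */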
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Pierre
>>>>>>>> 
>>>>>>>> PS: you could have a look at 
>>>>>>>> https://www.sciencedirect.com/science/article/abs/pii/S0898122121000055
>>>>>>>> to understand the philosophy behind block iterative methods in PETSc 
>>>>>>>> (and in HPDDM). src/mat/tests/ex237.c, the benchmark I mentioned 
>>>>>>>> earlier, was developed in the context of this paper to produce Figures 
>>>>>>>> 2-3. Note that this paper is now slightly outdated: since then, 
>>>>>>>> PCHYPRE and PCMG (among others) have been made “PCMatApply()-ready”.
>>>>>>>> 
>>>>>>>>> On 13 Dec 2023, at 11:05 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hello Pierre,
>>>>>>>>> 
>>>>>>>>> I am trying out the KSPMatSolve with the BoomerAMG preconditioner. 
>>>>>>>>> However, I am noticing that it is still solving column by column 
>>>>>>>>> (this is stated explicitly in the info dump attached). I looked at 
>>>>>>>>> the code for KSPMatSolve_Private() and saw that as long as 
>>>>>>>>> ksp->ops->matsolve is true, it should do the batched solve, though 
>>>>>>>>> I'm not sure where that gets set. 
>>>>>>>>> 
>>>>>>>>> I am using the options -pc_type hypre -pc_hypre_type boomeramg when 
>>>>>>>>> running the code.
>>>>>>>>> 
>>>>>>>>> Can you please help me with this?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Sreeram
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 7, 2023 at 4:04 PM Mark Adams <[email protected]> wrote:
>>>>>>>>>> N.B., the AMGX interface is a bit experimental.
>>>>>>>>>> Mark
>>>>>>>>>> 
>>>>>>>>>> On Thu, Dec 7, 2023 at 4:11 PM Sreeram R Venkat <[email protected]> wrote:
>>>>>>>>>>> Oh, in that case I will try out BoomerAMG. Getting AMGX to build 
>>>>>>>>>>> correctly was also tricky, so hopefully the HYPRE build will be 
>>>>>>>>>>> easier.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sreeram
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Dec 7, 2023, 3:03 PM Pierre Jolivet <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 7 Dec 2023, at 9:37 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you Barry and Pierre; I will proceed with the first option. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I want to use the AMGX preconditioner for the KSP. I will try it 
>>>>>>>>>>>>> out and see how it performs.
>>>>>>>>>>>> 
>>>>>>>>>>>> Just FYI, AMGX does not handle systems with multiple RHS, and thus 
>>>>>>>>>>>> has no PCMatApply() implementation.
>>>>>>>>>>>> BoomerAMG does, and there is a PCMatApply_HYPRE_BoomerAMG() 
>>>>>>>>>>>> implementation.
>>>>>>>>>>>> But let us know if you need assistance figuring things out.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pierre
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Sreeram
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Dec 7, 2023 at 2:02 PM Pierre Jolivet <[email protected]> wrote:
>>>>>>>>>>>>>> To expand on Barry’s answer, we have observed repeatedly that 
>>>>>>>>>>>>>> MatMatMult with MatAIJ performs better than MatMult with 
>>>>>>>>>>>>>> MatMAIJ; you can reproduce this on your own with 
>>>>>>>>>>>>>> https://petsc.org/release/src/mat/tests/ex237.c.html.
>>>>>>>>>>>>>> Also, I’m guessing you are using some sort of preconditioner 
>>>>>>>>>>>>>> within your KSP.
>>>>>>>>>>>>>> Not all are “KSPMatSolve-ready”, i.e., they may treat blocks of 
>>>>>>>>>>>>>> right-hand sides column by column, which is very inefficient.
>>>>>>>>>>>>>> You could run your code with -info dump and send us dump.0 to 
>>>>>>>>>>>>>> see what needs to be done on our end to make things more 
>>>>>>>>>>>>>> efficient, should you not be satisfied with the current 
>>>>>>>>>>>>>> performance of the code.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Pierre
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 7 Dec 2023, at 8:34 PM, Barry Smith <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Dec 7, 2023, at 1:17 PM, Sreeram R Venkat <[email protected]> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have 2 sequential matrices M and R (both MATSEQAIJCUSPARSE 
>>>>>>>>>>>>>>>> of size n x n) and a vector v of size n*m. v = [v_1 , v_2 ,... 
>>>>>>>>>>>>>>>> , v_m] where v_i has size n. The data for v can be stored 
>>>>>>>>>>>>>>>> either in column-major or row-major order.  Now, I want to do 
>>>>>>>>>>>>>>>> 2 types of operations:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. Matvecs of the form M*v_i = w_i, for i = 1..m. 
>>>>>>>>>>>>>>>> 2. KSPSolves of the form R*x_i = v_i, for i = 1..m.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> From what I have read in the documentation, I can think of 2 
>>>>>>>>>>>>>>>> approaches. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. Get the pointer to the data in v (column-major) and use it 
>>>>>>>>>>>>>>>> to create a dense matrix V. Then do a MatMatMult with M*V = W, 
>>>>>>>>>>>>>>>> and take the data pointer of W to create the vector w. For 
>>>>>>>>>>>>>>>> KSPSolves, use KSPMatSolve with R and V.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2. Create a MATMAIJ using M/R and use that for matvecs 
>>>>>>>>>>>>>>>> directly with the vector v. I don't know if KSPSolve with the 
>>>>>>>>>>>>>>>> MATMAIJ will know that it is a multiple RHS system and act 
>>>>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Which would be the more efficient option?
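>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For approach 1, I have in mind something like this rough (untested) sketch, 
>>>>>>>>>>>>>>>> with placeholder names, assuming v stores the v_i contiguously in 
>>>>>>>>>>>>>>>> column-major order and ksp uses R as its operator:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>   PetscScalar *varr;
>>>>>>>>>>>>>>>>   PetscCall(VecCUDAGetArray(v, &varr));
>>>>>>>>>>>>>>>>   PetscCall(MatCreateDenseCUDA(PETSC_COMM_SELF, n, m, n, m, varr, &V));
>>>>>>>>>>>>>>>>   PetscCall(MatMatMult(M, V, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &W)); /* operation 1 */
>>>>>>>>>>>>>>>>   PetscCall(KSPMatSolve(ksp, V, X));                                  /* operation 2 */
>>>>>>>>>>>>>>>>   /* (and VecCUDARestoreArray(v, &varr) once done with V) */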
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Use 1. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As a side-note, I am also wondering if there is a way to use 
>>>>>>>>>>>>>>>> row-major storage of the vector v.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> No
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The reason is that this could allow for more coalesced memory 
>>>>>>>>>>>>>>>> access when doing matvecs.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   PETSc matrix-vector products use BLAS GEMV for the 
>>>>>>>>>>>>>>> computation, so in theory they should already be well-optimized.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Sreeram
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> <dump.0>
>>>>>>>> 
>>>>>>> <dump.0>
>>>>>> 
>>>> 
>>> <Pmat.bin><Amat.bin><rhs.bin>
>> 
