> On Nov 5, 2025, at 6:01 PM, Angus, Justin Ray <[email protected]> wrote:
> 
> Hi Barry,
> 
> Thanks for the explanation, though I’m not sure I understand exactly what 
> it’s doing. I at least understand now that the overlap doesn’t simply include 
> degrees of freedom from neighboring cells within the range of the specified 
> overlap for each side and direction (which is how ChatGPT told me it worked).

  It more or less does. When you use PCASM and set a requested overlap, it 
"determines" the neighboring cells from the nonzeros in the matrix, not from 
the geometry.
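
  For reference, the setup you are describing corresponds to options along 
these lines (this just restates the pieces from your message as command-line 
options):

      -ksp_type gmres -pc_type asm -pc_asm_overlap 16 -sub_pc_type lu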
> 
> For this problem, ASM with overlap 16 + LU for the sub pc gave the best 
> performance. A lower value for the overlap results in too many GMRES 
> iterations. I’m actually surprised that such a large overlap was needed, 
> because the code I’m using is nearly identical to another code in which the 
> GMRES iterations stay low with an overlap of 4. Perhaps there is something 
> different in the setup that is not clear.

   It could be, but good preconditioners are very dependent on the exact PDE 
model and the discretization you are using, so a slight change can 
dramatically change what makes a good preconditioner for that problem.

> 
> The PETSc preconditioners seem very powerful, but it is difficult and 
> time-consuming to find the optimal setup for a given problem given my nascent 
> understanding of how the preconditioners work.

   This is a universal problem; I am hoping that someday LLMs will be able to 
provide good advice for each problem where a preconditioner is needed :-).

  
> Is there someone from PETSc (perhaps yourself) with advanced knowledge of the 
> preconditioners that I could schedule a Zoom/WebEx meeting with to discuss?

   Sorry, but even from the little you have told me about your problem (the 
electric field on a Yee grid), it is clear that I have no expert advice on 
preconditioner selection to offer you.

   I suggest looking at the HPDDM preconditioners that you can use from PETSc: 
https://petsc.org/release/manualpages/PC/PCHPDDM/#pchpddm
   These preconditioners are like the Mona Lisa while basic ASM is like a 
kindergartener's drawing; they can be quite robust for difficult problems.
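
   A minimal sketch of trying it from the command line (this assumes PETSc was 
configured with --download-hpddm and --download-slepc; the level-1 options 
shown are just one common starting point, not advice tuned to your problem):

      -pc_type hpddm -pc_hpddm_levels_1_sub_pc_type lu -pc_hpddm_levels_1_eps_nev 20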

  Barry

> 
> Thanks!
> 
> -Justin
> 
> 
> From: Barry Smith <[email protected]>
> Date: Wednesday, November 5, 2025 at 2:48 PM
> To: Angus, Justin Ray <[email protected]>
> Cc: [email protected] <[email protected]>, Matthew Knepley <[email protected]>, 
> [email protected] <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
> 
> 
>   An overlap of 16 is huge and would rarely, if ever, be done in practice. It 
> is not surprising that the subproblems become as large as they do with such a 
> large overlap. 
> 
>   How the overlap is used:
> 
>   while (overlap--) {
>      /* add to the subproblem every degree of freedom that is coupled,
>         through a nonzero in the matrix, to a degree of freedom already
>         in the subproblem */
>   }
> 
>   So it is grabbing all the neighbors for 16 rounds of grabbing.
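> 
>   For illustration only: the same expansion can be done by hand with 
> MatIncreaseOverlap(), which is the routine PCASM calls internally (here the 
> matrix A and the initial index set are assumed to already exist):
> 
>     IS       is;          /* initially: the dofs owned by this rank      */
>     PetscInt overlap = 4; /* the number of rounds of "grabbing"          */
>     PetscCall(MatIncreaseOverlap(A, 1, &is, overlap));
>     /* "is" now also contains every dof reachable from the original set */
>     /* through at most "overlap" hops in the nonzero structure of A     */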
> 
> 
> 
> On Nov 5, 2025, at 3:05 PM, Angus, Justin Ray via petsc-dev 
> <[email protected]> wrote:
> 
> I think the issue is my overlap is too large. Perhaps I don’t fully 
> understand how the overlap parameter is used. Let me explain my setup below.
> 
> My vector of unknowns is the electric field on a Yee grid in a 2D geometry. 
> I’m using 4x4 grid cells per rank. This gives 4*5 = 20 degrees of freedom for 
> each of the two in-plane components of E, and 5*5 = 25 for the out-of-plane 
> component. The total is 65 degrees of freedom per rank. My global problem 
> size is 224x16 on 224 ranks for one case, and 224x32 on 448 ranks for 
> another. Using ASM overlap 4, I get the following for PC sub blocks on a rank:
> PC Object: (sub_) 1 MPI process
>       type: lu
>         out-of-place factorization
>         Reusing fill from past factorization
>         Reusing reordering from past factorization
>         tolerance for zero pivot 2.22045e-14
>         matrix ordering: nd
>         factor fill ratio given 5., needed 4.02544
>           Factored matrix follows:
>             Mat Object: (sub_) 1 MPI process
>               type: seqaij
>               rows=402, cols=402
>               package used to perform factorization: petsc
>               total: nonzeros=16295, allocated nonzeros=16295
>                 not using I-node routines
> 
> The above is for a 224x16 size domain on 224 total ranks, but I get the same 
> thing for a 224x32 size domain on 448 ranks, which is what I expected to get.
> 
> However, if I set the overlap to 16 (which is larger than my box size on a 
> given rank), I get the following
> 224x16 grid on 112 ranks: 
> PC Object: (sub_) 1 MPI process
>       type: lu
>         out-of-place factorization
>         Reusing fill from past factorization
>         Reusing reordering from past factorization
>         tolerance for zero pivot 2.22045e-14
>         matrix ordering: nd
>         factor fill ratio given 5., needed 6.52557
>           Factored matrix follows:
>             Mat Object: (sub_) 1 MPI process
>               type: seqaij
>               rows=1316, cols=1316
>               package used to perform factorization: petsc
>               total: nonzeros=95195, allocated nonzeros=95195
>                 not using I-node routines
> 
> 224x32 grid (problem size doubled at fixed work per rank): 
> PC Object: (sub_) 1 MPI process
>       type: lu
>         out-of-place factorization
>         Reusing fill from past factorization
>         Reusing reordering from past factorization
>         tolerance for zero pivot 2.22045e-14
>         matrix ordering: nd
>         factor fill ratio given 5., needed 8.59182
>           Factored matrix follows:
>             Mat Object: (sub_) 1 MPI process
>               type: seqaij
>               rows=2632, cols=2632
>               package used to perform factorization: petsc
>               total: nonzeros=250675, allocated nonzeros=250675
>                 not using I-node routines
> 
> In this case, with an overlap much larger than the box size, the rows/cols 
> per rank go up by a factor of 2 when doubling the problem size at fixed work 
> per rank. 
> 
> Why is this?
> How exactly is the overlap parameter used?
> 
> Thank you.
> 
> -Justin
> 
> From: Angus, Justin Ray <[email protected]>
> Date: Wednesday, November 5, 2025 at 8:17 AM
> To: [email protected] <[email protected]>, Matthew Knepley <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
> 
> Thanks for the reply.
> 
> The work per block should be the same for the weak scaling. I know LU is not 
> scalable with respect to the block size.
> 
> Perhaps our setup is not doing what we think it is doing. I’ll look into it 
> further.
> 
> -Justin
> 
> From: Mark Adams <[email protected]>
> Date: Wednesday, November 5, 2025 at 6:14 AM
> To: Matthew Knepley <[email protected]>
> Cc: Angus, Justin Ray <[email protected]>, [email protected] 
> <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
> 
> And we do not have sparse LU on GPUs so that is done on the CPU.
> 
> And I don't know why it would not weak scale well. 
> Your results are consistent with just using one process with one domain (re 
> Matt) while you double the problem size.
> 
> On Tue, Nov 4, 2025 at 2:27 PM Matthew Knepley <[email protected]> wrote:
> On Tue, Nov 4, 2025 at 1:25 PM Angus, Justin Ray via petsc-dev 
> <[email protected]> wrote:
> Hi Junchao,
> 
> We have recently been using ASM + LU for 2D problems on both CPU and GPU. 
> However, I found that this method has very bad weak scaling. I find that the 
> cost of PCApply increases by about a factor of 4 each time I increase the 
> problem size in 1 dimension by a factor of 2 while keeping the load per 
> core/gpu the same. The total number of GMRES iterations does not increase, 
> just the cost of PCApply (and PCSetup). Is this scaling behavior expected? 
> Any ideas of how to optimize the preconditioner?
> 
> The cost of PCApply for ASM is dominated by the cost of process-local block 
> solves. You are using LU for the block solve. (Sparse) LU has cost roughly 
> O(N^2) for the apply (depending on the structure of the matrix). So, if you 
> double the size of a local block, your runtime should increase by about 4x. 
> Thus LU is not a scalable method.
> 
>   Thanks,
> 
>      Matt
>  
> Thank you.
> 
> -Justin
> 
> From: Junchao Zhang <[email protected]>
> Date: Monday, April 14, 2025 at 7:35 PM
> To: Angus, Justin Ray <[email protected]>
> Cc: [email protected] <[email protected]>, Ghosh, Debojyoti 
> <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
> 
> PETSc supports ILU(0)/ICC(0) numeric factorization (without reordering) and 
> the subsequent triangular solves on GPUs. This is done by calling vendor 
> libraries (e.g., cuSPARSE).
> We have options -pc_factor_mat_factor_on_host <bool>  
> -pc_factor_mat_solve_on_host <bool> to force doing the factorization and 
> MatSolve on the host for device matrix types.
> 
> You can try to see if it works for your case.
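> 
>   For example, an ASM + ILU(0) run with CUDA might use options along these 
> lines (the sub_ prefix on the factorization options and the cusparse matrix 
> type are illustrative assumptions, not a tested setup):
> 
>     -mat_type aijcusparse -vec_type cuda -pc_type asm -sub_pc_type ilu \
>       -sub_pc_factor_mat_factor_on_host 0 -sub_pc_factor_mat_solve_on_host 0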
> 
> --Junchao Zhang
> 
> 
> On Mon, Apr 14, 2025 at 4:39 PM Angus, Justin Ray via petsc-dev 
> <[email protected]> wrote:
> Hello,
> 
>  
> A project I work on uses GMRES via PETSc. In particular, we have had good 
> success using the Additive Schwarz Method + ILU preconditioner setup in a 
> CPU-based code. I found it stated online that “Parts of most preconditioners 
> run directly on the GPU” (https://petsc.org/release/faq/). Is ASM + ILU also 
> available for GPU platforms?
> 
>  
> -Justin
> 
> 
> 
> --
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
> 
