I think the issue is my overlap is too large. Perhaps I don’t fully understand
how the overlap parameter is used. Let me explain my setup below.
My vector of unknowns is the electric field on a Yee grid in a 2D geometry. I’m
using 4x4 grid cells per rank. This gives 4*5 = 20 degrees of freedom for each
of the two in-plane components of E, and 5*5 = 25 for the out-of-plane
component. The total is 65 degrees of freedom per rank. My global problem size
is 224x16 on 224 ranks for one case, and 224x32 on 448 ranks for another. Using
ASM overlap 4, I get the following for PC sub blocks on a rank:
PC Object: (sub_) 1 MPI process
type: lu
out-of-place factorization
Reusing fill from past factorization
Reusing reordering from past factorization
tolerance for zero pivot 2.22045e-14
matrix ordering: nd
factor fill ratio given 5., needed 4.02544
Factored matrix follows:
Mat Object: (sub_) 1 MPI process
type: seqaij
rows=402, cols=402
package used to perform factorization: petsc
total: nonzeros=16295, allocated nonzeros=16295
not using I-node routines
The above is for a 224x16 domain on 224 total ranks, and I get the same
thing for a 224x32 domain on 448 ranks, which is what I expect.
However, if I set the overlap to 16 (which is larger than my box size on a
given rank), I get the following:
224x16 grid on 112 ranks:
PC Object: (sub_) 1 MPI process
type: lu
out-of-place factorization
Reusing fill from past factorization
Reusing reordering from past factorization
tolerance for zero pivot 2.22045e-14
matrix ordering: nd
factor fill ratio given 5., needed 6.52557
Factored matrix follows:
Mat Object: (sub_) 1 MPI process
type: seqaij
rows=1316, cols=1316
package used to perform factorization: petsc
total: nonzeros=95195, allocated nonzeros=95195
not using I-node routines
224x16 grid on 112 ranks:
PC Object: (sub_) 1 MPI process
type: lu
out-of-place factorization
Reusing fill from past factorization
Reusing reordering from past factorization
tolerance for zero pivot 2.22045e-14
matrix ordering: nd
factor fill ratio given 5., needed 8.59182
Factored matrix follows:
Mat Object: (sub_) 1 MPI process
type: seqaij
rows=2632, cols=2632
package used to perform factorization: petsc
total: nonzeros=250675, allocated nonzeros=250675
not using I-node routines
In this case, with an overlap much larger than the box size, the rows/cols per
rank go up by a factor of 2 when doubling the problem size at fixed work per
rank.
Why is this?
How exactly is the overlap parameter used?
Thank you.
-Justin
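
For reference, PETSc's PCASM grows each subdomain through MatIncreaseOverlap, which adds one layer of graph neighbors (taken from the matrix nonzero pattern) per unit of overlap. The sketch below is a crude cell-counting model, not PETSc's algorithm. It assumes that one overlap layer corresponds to roughly one grid cell, that the box grows on both sides, and that growth stops once the whole domain extent is covered. The function names are invented for illustration.

```python
def extent(box, overlap, domain):
    """Cells covered along one axis by a box grown by `overlap` cells
    on each side, capped at the full domain extent (assumed model)."""
    return min(box + 2 * overlap, domain)


def subdomain_cells(box, overlap, nx, ny):
    """Approximate cell count of one overlapped subdomain on an nx-by-ny grid."""
    return extent(box, overlap, nx) * extent(box, overlap, ny)


if __name__ == "__main__":
    box = 4  # 4x4 cells per rank, as in the setup described above
    for overlap in (4, 16):
        small = subdomain_cells(box, overlap, 224, 16)
        large = subdomain_cells(box, overlap, 224, 32)
        print(f"overlap {overlap}: {small} cells on 224x16, "
              f"{large} cells on 224x32, ratio {large / small}")
```

In this model, overlap 4 gives the same cell count on both grids, while overlap 16 covers the whole short dimension, so the count doubles when that dimension goes from 16 to 32. The absolute counts are not expected to match the row counts in the logs above, since the real overlap follows the matrix graph rather than grid cells.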
From: Angus, Justin Ray <[email protected]>
Date: Wednesday, November 5, 2025 at 8:17 AM
To: [email protected] <[email protected]>, Matthew Knepley <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
Thanks for the reply.
The work per block should be the same for the weak scaling. I know LU is not
scalable with respect to the block size.
Perhaps our setup is not doing what we think it is doing. I’ll look into it
further.
-Justin
From: Mark Adams <[email protected]>
Date: Wednesday, November 5, 2025 at 6:14 AM
To: Matthew Knepley <[email protected]>
Cc: Angus, Justin Ray <[email protected]>, [email protected]
<[email protected]>
Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
And we do not have sparse LU on GPUs so that is done on the CPU.
And I don't know why it would not weak scale well.
Your results are consistent with using just one process with one domain (re
Matt) while you double the problem size.
On Tue, Nov 4, 2025 at 2:27 PM Matthew Knepley <[email protected]> wrote:
On Tue, Nov 4, 2025 at 1:25 PM Angus, Justin Ray via petsc-dev
<[email protected]> wrote:
Hi Junchao,
We have recently been using ASM + LU for 2D problems on both CPU and GPU.
However, I found that this method has very bad weak scaling. I find that the
cost of PCApply increases by about a factor of 4 each time I increase the
problem size in 1 dimension by a factor of 2 while keeping the load per
core/gpu the same. The total number of GMRES iterations does not increase, just
the cost of PCApply (and PCSetup). Is this scaling behavior expected? Any ideas
of how to optimize the preconditioner?
The cost of PCApply for ASM is dominated by the cost of process-local block
solves. You are using LU for the block solve. (Sparse) LU has cost roughly
O(N^2) for the apply (depending on the structure of the matrix). So, if you
double the size of a local block, your runtime should increase by about 4x.
Thus LU is not a scalable method.
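
As a rough check of this argument against the factorization logs at the top of this thread, the sketch below compares the growth in block size with the growth in factor nonzeros for the two overlap-16 blocks. It assumes that the triangular-solve work is roughly proportional to the stored nonzeros of the factors; the variable names are illustrative only.

```python
# (rows, factor nonzeros) copied from the "Factored matrix follows" logs
blocks = {
    "overlap 4": (402, 16295),
    "overlap 16, first case": (1316, 95195),
    "overlap 16, second case": (2632, 250675),
}

rows_a, nnz_a = blocks["overlap 16, first case"]
rows_b, nnz_b = blocks["overlap 16, second case"]

row_ratio = rows_b / rows_a    # growth in local block size N
nnz_ratio = nnz_b / nnz_a      # growth in stored factor entries
quadratic = row_ratio ** 2     # growth predicted by an O(N^2) apply

if __name__ == "__main__":
    print(f"block size grew by      {row_ratio:.2f}x")
    print(f"factor nonzeros grew by {nnz_ratio:.2f}x")
    print(f"O(N^2) estimate         {quadratic:.2f}x")
```

The block size doubles between the two cases, so the O(N^2) estimate is 4x, while the stored factor entries grow by somewhat less than 3x. Either way the local solve cost grows even though the work per rank was meant to stay fixed.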
Thanks,
Matt
Thank you.
-Justin
From: Junchao Zhang <[email protected]>
Date: Monday, April 14, 2025 at 7:35 PM
To: Angus, Justin Ray <[email protected]>
Cc: [email protected] <[email protected]>, Ghosh, Debojyoti
<[email protected]>
Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
PETSc supports ILU0/ICC0 numeric factorization (without reordering) and then
triangular solve on GPUs. It is done by calling vendor libraries (e.g., cuSPARSE).
We have options -pc_factor_mat_factor_on_host <bool>
-pc_factor_mat_solve_on_host <bool> to force doing the factorization and
MatSolve on the host for device matrix types.
You can try to see if it works for your case.
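
A minimal sketch of how these options might be combined for GMRES with ASM + ILU(0) on a CUDA build is given below. It only assembles a list of command-line option strings. The `sub_` prefix on the factor options is an assumption based on the usual prefix for ASM block solvers, `-mat_type aijcusparse` and `-vec_type cuda` take effect only if the application sets its Mat and Vec types from options, and the helper name and the `./app` executable are hypothetical.

```python
def asm_ilu_gpu_options(overlap=1, factor_on_host=False, solve_on_host=False):
    """Build a PETSc option list for GMRES + ASM with ILU(0) block solves.

    Hypothetical helper: it only formats strings and does not call PETSc.
    """
    opts = [
        "-ksp_type", "gmres",
        "-pc_type", "asm",
        "-pc_asm_overlap", str(overlap),
        "-sub_pc_type", "ilu",
        "-sub_pc_factor_levels", "0",   # ILU(0)
        "-mat_type", "aijcusparse",     # device matrix type (CUDA builds)
        "-vec_type", "cuda",
    ]
    # Assumed sub_-prefixed forms of the two options named in the message
    if factor_on_host:
        opts += ["-sub_pc_factor_mat_factor_on_host", "true"]
    if solve_on_host:
        opts += ["-sub_pc_factor_mat_solve_on_host", "true"]
    return opts


if __name__ == "__main__":
    # "./app" stands in for the application executable
    print(" ".join(["./app"] + asm_ilu_gpu_options(overlap=1)))
```

Adding `-ksp_view` to such a run prints the sub-block PC and Mat types, which is one way to confirm what was actually selected.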
--Junchao Zhang
On Mon, Apr 14, 2025 at 4:39 PM Angus, Justin Ray via petsc-dev
<[email protected]> wrote:
Hello,
A project I work on uses GMRES via PETSc. In particular, we have had good
success with the Additive Schwarz Method + ILU preconditioner setup in a
CPU-based code. I found it stated online that “Parts of most
preconditioners run directly on the GPU”
(https://petsc.org/release/faq/).
Is ASM + ILU also available for GPU platforms?
-Justin
--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/