An overlap of 16 is huge and would rarely, if ever, be done in practice. It is
not surprising that the subproblems become as large as they do with such a
large overlap.

Here is how the overlap is used:
   while (overlap--) {
     /* add to the subproblem every degree of freedom that is coupled by a
        nonzero in the matrix to any degree of freedom already in the subproblem */
   }
So an overlap of 16 means 16 rounds of grabbing all the neighbors of the current set.
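
In PETSc that loop is essentially MatIncreaseOverlap(), which PCASM calls during
setup. A minimal sketch of driving it directly, assuming an already-assembled
parallel matrix A, a KSP ksp, and an IS holding this rank's unaugmented degrees
of freedom (the function and variable names below are just placeholders):

   #include <petsc.h>

   /* Sketch only: A, ksp, and *owned are assumed to already exist. */
   static PetscErrorCode GrowSubdomain(Mat A, IS *owned, KSP ksp)
   {
     PC pc;

     PetscFunctionBeginUser;
     /* Each unit of overlap replaces the index set with one that also contains
        every dof coupled by a matrix nonzero to a dof already in the set. */
     PetscCall(MatIncreaseOverlap(A, 1, owned, 16)); /* 16 rounds of neighbor grabbing */

     /* At the PC level the same thing is requested with PCASMSetOverlap(),
        or -pc_asm_overlap 16 on the command line. */
     PetscCall(KSPGetPC(ksp, &pc));
     PetscCall(PCSetType(pc, PCASM));
     PetscCall(PCASMSetOverlap(pc, 16));
     PetscFunctionReturn(PETSC_SUCCESS);
   }

With the PCASM route you never touch the index sets yourself; the sketch is only
meant to show where the neighbor-grabbing happens.
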
> On Nov 5, 2025, at 3:05 PM, Angus, Justin Ray via petsc-dev
> <[email protected]> wrote:
>
> I think the issue is my overlap is too large. Perhaps I don’t fully
> understand how the overlap parameter is used. Let me explain my setup below.
>
> My vector of unknowns is the electric field on a Yee grid in a 2D geometry.
> I’m using 4x4 grid cells per rank. This gives 4*5 = 20 degrees of freedom for
> each of the two in-plane components of E, and 5*5 = 25 for the out-of-plane
> component. The total is 65 degrees of freedom per rank. My global problem
> size is 224x16 on 224 ranks for one case, and 224x32 on 448 ranks for
> another. Using ASM overlap 4, I get the following for PC sub blocks on a rank:
> PC Object: (sub_) 1 MPI process
> type: lu
> out-of-place factorization
> Reusing fill from past factorization
> Reusing reordering from past factorization
> tolerance for zero pivot 2.22045e-14
> matrix ordering: nd
> factor fill ratio given 5., needed 4.02544
> Factored matrix follows:
> Mat Object: (sub_) 1 MPI process
> type: seqaij
> rows=402, cols=402
> package used to perform factorization: petsc
> total: nonzeros=16295, allocated nonzeros=16295
> not using I-node routines
>
> The above is for a 224x16 domain on 224 total ranks, but I get the same
> thing for a 224x32 domain on 448 ranks, which is what I expected to get.
>
> However, if I set the overlap to 16 (which is larger than my box size on a
> given rank), I get the following:
> 224x16 grid on 112 ranks:
> PC Object: (sub_) 1 MPI process
> type: lu
> out-of-place factorization
> Reusing fill from past factorization
> Reusing reordering from past factorization
> tolerance for zero pivot 2.22045e-14
> matrix ordering: nd
> factor fill ratio given 5., needed 6.52557
> Factored matrix follows:
> Mat Object: (sub_) 1 MPI process
> type: seqaij
> rows=1316, cols=1316
> package used to perform factorization: petsc
> total: nonzeros=95195, allocated nonzeros=95195
> not using I-node routines
>
> 224x16 grid on 112 ranks:
> PC Object: (sub_) 1 MPI process
> type: lu
> out-of-place factorization
> Reusing fill from past factorization
> Reusing reordering from past factorization
> tolerance for zero pivot 2.22045e-14
> matrix ordering: nd
> factor fill ratio given 5., needed 8.59182
> Factored matrix follows:
> Mat Object: (sub_) 1 MPI process
> type: seqaij
> rows=2632, cols=2632
> package used to perform factorization: petsc
> total: nonzeros=250675, allocated nonzeros=250675
> not using I-node routines
>
> In this case, with an overlap much larger than the box size, the rows/cols
> per rank go up by a factor of 2 when doubling the problem size at fixed work
> per rank.
>
> Why is this?
> How exactly is the overlap parameter used?
>
> Thank you.
>
> -Justin
>
> From: Angus, Justin Ray <[email protected]>
> Date: Wednesday, November 5, 2025 at 8:17 AM
> To: [email protected] <[email protected]>, Matthew Knepley <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
>
> Thanks for the reply.
>
> The work per block should be the same for the weak scaling. I know LU is not
> scalable with respect to the block size.
>
> Perhaps our setup is not doing what we think it is doing. I’ll look into it
> further.
>
> -Justin
>
> From: Mark Adams <[email protected]>
> Date: Wednesday, November 5, 2025 at 6:14 AM
> To: Matthew Knepley <[email protected]>
> Cc: Angus, Justin Ray <[email protected]>, [email protected]
> <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
>
> And we do not have sparse LU on GPUs so that is done on the CPU.
>
> And I don't know why it would not weak scale well.
> Your results are consistent with just using one process with one domain (re
> Matt) while you double the problem size.
>
> On Tue, Nov 4, 2025 at 2:27 PM Matthew Knepley <[email protected]> wrote:
> On Tue, Nov 4, 2025 at 1:25 PM Angus, Justin Ray via petsc-dev
> <[email protected]> wrote:
> Hi Junchao,
>
> We have recently been using ASM + LU for 2D problems on both CPU and GPU.
> However, I found that this method has very bad weak scaling. I find that the
> cost of PCApply increases by about a factor of 4 each time I increase the
> problem size in 1 dimension by a factor of 2 while keeping the load per
> core/gpu the same. The total number of GMRES iterations does not increase,
> just the cost of PCApply (and PCSetup). Is this scaling behavior expected?
> Any ideas of how to optimize the preconditioner?
>
> The cost of PCApply for ASM is dominated by the cost of process-local block
> solves. You are using LU for the block solve. (Sparse) LU has cost roughly
> O(N^2) for the apply (depending on the structure of the matrix). So, if you
> double the size of a local block, your runtime should increase by about 4x.
> Thus LU is not a scalable method.
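
(Taking the rough O(N^2) model above at face value and plugging in the subdomain
sizes from the overlap-16 runs quoted earlier, purely as an illustrative check:

   (2632 / 1316)^2 = 2^2 = 4

i.e. doubling the local block predicts about 4x the apply cost, consistent with
the factor-of-4 growth in PCApply reported at the start of the thread.)
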
>
> Thanks,
>
> Matt
>
> Thank you.
>
> -Justin
>
> From: Junchao Zhang <[email protected]>
> Date: Monday, April 14, 2025 at 7:35 PM
> To: Angus, Justin Ray <[email protected]>
> Cc: [email protected] <[email protected]>, Ghosh, Debojyoti <[email protected]>
> Subject: Re: [petsc-dev] Additive Schwarz Method + ILU on GPU platforms
>
> PETSc supports ILU(0)/ICC(0) numeric factorization (without reordering) and
> then triangular solves on GPUs. This is done by calling vendor libraries
> (e.g., cuSPARSE).
> We have options -pc_factor_mat_factor_on_host <bool>
> -pc_factor_mat_solve_on_host <bool> to force doing the factorization and
> MatSolve on the host for device matrix types.
>
> You can try to see if it works for your case.
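
For the ASM + ILU case discussed in this thread, a hedged sketch of the runtime
options (assuming a CUDA build; the two *_on_host options are the ones Junchao
names above, and giving them the sub_ prefix for the ASM sub-solvers is my
assumption, matching the (sub_) prefix shown in the -ksp_view output in this
thread) might look like:

   -mat_type aijcusparse -vec_type cuda \
   -ksp_type gmres \
   -pc_type asm -pc_asm_overlap 1 \
   -sub_pc_type ilu \
   -sub_pc_factor_mat_factor_on_host 1 \
   -sub_pc_factor_mat_solve_on_host 1

Setting the two *_on_host flags to 1 forces the ILU factorization and MatSolve
onto the host as described above; omitting them (or setting them to 0) would
instead attempt the ILU(0) factorization and triangular solves on the GPU
through the vendor library.
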
>
> --Junchao Zhang
>
>
> On Mon, Apr 14, 2025 at 4:39 PM Angus, Justin Ray via petsc-dev
> <[email protected]> wrote:
> Hello,
>
>
> A project I work on uses GMRES via PETSc. In particular, we have had good
> success with the Additive Schwarz Method + ILU preconditioner setup in a
> CPU-based code. I found it stated online that “Parts of most preconditioners
> run directly on the GPU” (https://petsc.org/release/faq/).
> Is ASM + ILU also available for GPU platforms?
>
>
> -Justin
>
>
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/