Hello Prof. Wolfgang,
"I don't recall the exact interface of SparseDirectMUMPS from past
releases. SparseDirectUMFPACK allows you to compute a decomposition only
once, and then apply it repeatedly in vmult(). The interface of
SparseDirectMUMPS that's in the current developer version also allows
you to do that. If you can switch to the current developer version (or
check whether 9.7 can do that as well), you may want to try that. "
I used SparseDirectMUMPS as you suggested, i.e. I initialize it in the
constructor and then use it repeatedly in the solve function. I believe it
applies the stored factorization each time vmult() is called; a minimal
sketch of the pattern is included below.
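For concreteness, here is a minimal sketch of the decompose-once /
apply-many pattern, written with SparseDirectUMFPACK, whose
decompose-once/vmult() interface you mentioned; my SparseDirectMUMPS usage
follows the same shape. The class and member names below are placeholders,
not my actual code.

#include <deal.II/lac/sparse_direct.h>
#include <deal.II/lac/sparse_matrix.h>
#include <deal.II/lac/vector.h>

class DirectBlockInverse
{
public:
  DirectBlockInverse(const dealii::SparseMatrix<double> &A)
  {
    // The factorization is computed once, in the constructor.
    A_inverse.initialize(A);
  }

  void vmult(dealii::Vector<double>       &dst,
             const dealii::Vector<double> &src) const
  {
    // Each application only performs forward/backward substitution,
    // no refactorization.
    A_inverse.vmult(dst, src);
  }

private:
  dealii::SparseDirectUMFPACK A_inverse;
};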
" I can't see where you use the AMG preconditioners, but the same applies:
You should only set them up once and then re-use many times. That is,
the preconditioners need to live *outside* the place where you solve the
inner (1,1) block. "
I have added the code snippet that initializes and applies the CG+AMG
preconditioner below. As it happens, the inner (1,1) block arises from the
symmetric-gradient discretization of a Stokes system (as in step-22). Used
as is, BoomerAMG has no way of being told that multiple DoFs are associated
with each node; PETSc exposes such an option for BoomerAMG
( *-pc_hypre_boomeramg_nodal_coarsen* ), but it is not available through the
deal.II interface. The solutions I can think of are:
1. Add the option to the PETScWrappers source code and test it. I'm not
sure whether this has been tried before with the PETSc wrappers, or
whether a particular DoF re-ordering needs to be kept in mind to make
it work.
2. Switch to the Trilinos wrappers. However, my research group has more
experience with PETSc, and the switch would mean losing the linear
algebra library support we currently rely on.
3. Modify the formulation from a symmetric-gradient weak form to a
vector-Laplacian weak form. This requires modeling justification and
would be problem specific.
For now, I would prefer option 1, but I would like to know whether it has
been tried before and what caveats are associated with it; a rough sketch of
what I have in mind is below.
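To make option 1 concrete, here is a hypothetical sketch (untested) of what
I imagine: pushing the hypre option into PETSc's options database before
initializing the preconditioner. Whether PETScWrappers::PreconditionBoomerAMG
actually consults the options database (i.e. calls PCSetFromOptions) is
exactly the part I am unsure about, so please treat this as an illustration
of the idea rather than working code.

// Hypothetical sketch only: I have not verified that the deal.II BoomerAMG
// wrapper picks up entries from the PETSc options database.
#include <deal.II/base/exceptions.h>
#include <deal.II/lac/petsc_precondition.h>
#include <deal.II/lac/petsc_sparse_matrix.h>
#include <petscsys.h>

void setup_nodal_amg(const dealii::PETScWrappers::MPI::SparseMatrix &A00,
                     dealii::PETScWrappers::PreconditionBoomerAMG   &prec)
{
  // Ask BoomerAMG to coarsen nodally, i.e. treat the DoFs belonging to one
  // node as a block (scheme "1" chosen here only as an example value).
  const PetscErrorCode ierr =
    PetscOptionsSetValue(nullptr, "-pc_hypre_boomeramg_nodal_coarsen", "1");
  AssertThrow(ierr == 0, dealii::ExcMessage("PetscOptionsSetValue failed"));

  dealii::PETScWrappers::PreconditionBoomerAMG::AdditionalData data;
  data.symmetric_operator = true;
  prec.initialize(A00, data);
}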
The code for using CG+AMG as a preconditioner:

{
  using invCG = LinearSolver::InverseMatrix<LA::MPI::SparseMatrix,
                                            LA::MPI::PreconditionAMG>;

  // AMG preconditioner for the (0,0) block of A
  LA::MPI::PreconditionAMG precAd;
  {
    LA::MPI::PreconditionAMG::AdditionalData data;
#ifdef USE_PETSC_LA
    data.symmetric_operator = true;
    data.max_iter           = 4;
    data.output_details     = true;
#endif
    precAd.initialize(A.block(0, 0), data);
  }

  // AMG preconditioner for the (1,1) block of A
  LA::MPI::PreconditionAMG precAs;
  {
    LA::MPI::PreconditionAMG::AdditionalData data;
#ifdef USE_PETSC_LA
    data.symmetric_operator = true;
    data.max_iter           = 4;
    data.output_details     = true;
#endif
    precAs.initialize(A.block(1, 1), data);
  }

  // AMG preconditioner for the (2,2) block of Mp
  LA::MPI::PreconditionAMG precMp;
  {
    LA::MPI::PreconditionAMG::AdditionalData data;
#ifdef USE_PETSC_LA
    data.symmetric_operator = true;
#endif
    precMp.initialize(Mp.block(2, 2), data);
  }

  // Inner CG solvers wrapping each block with its AMG preconditioner
  const invCG inv_Ad(A.block(0, 0), precAd);
  const invCG inv_As(A.block(1, 1), precAs);
  const invCG inv_Mp(Mp.block(2, 2), precMp);

  const LinearSolver::BlockDiagonalPreconditioner<invCG, invCG, invCG>
    prec_M(inv_Ad, inv_As, inv_Mp);
with the inverse matrix class defined as:

template <class Matrix, class Preconditioner>
class InverseMatrix : public Subscriptor
{
public:
  InverseMatrix(const Matrix &K, const Preconditioner &M)
    : K(&K)
    , M(M)
    , pcout(std::cout,
            (Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0))
  {}

  template <typename VectorType>
  void vmult(VectorType &dst, const VectorType &src) const;

private:
  const SmartPointer<const Matrix> K;
  const Preconditioner            &M;
  ConditionalOStream               pcout;
};
template <class Matrix, class Preconditioner>
template <typename VectorType>
void InverseMatrix<Matrix, Preconditioner>::vmult(VectorType       &dst,
                                                  const VectorType &src) const
{
  SolverControl solver_control(src.size(), 1e-8 * src.l2_norm());
  solver_control.log_history(true);
  SolverCG<VectorType> cg(solver_control);

  dst = 0;
  try
    {
      cg.solve(*K, dst, src, M);
      pcout << "CG solved in " << solver_control.last_step() << " iterations"
            << std::endl;
    }
  catch (std::exception &e)
    {
      Assert(false, ExcMessage(e.what()));
    }
}
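Regarding your point that the preconditioners need to live outside the place
where the inner (1,1) block is solved: is the restructuring below roughly
what you have in mind? It is only a schematic sketch reusing the names from
my snippet above; the class name is hypothetical, and it assumes A, Mp and
the invCG alias are visible at class scope.

class StokesSolver  // hypothetical name
{
  // ...
  void setup_preconditioners();  // called once, after assembly
  void solve();                  // called in every outer iteration

  // The AMG hierarchies live with the class, not inside solve():
  LA::MPI::PreconditionAMG precAd, precAs, precMp;
};

void StokesSolver::setup_preconditioners()
{
  LA::MPI::PreconditionAMG::AdditionalData data;
#ifdef USE_PETSC_LA
  data.symmetric_operator = true;  // (other settings as in the snippet above)
#endif
  precAd.initialize(A.block(0, 0), data);
  precAs.initialize(A.block(1, 1), data);
  precMp.initialize(Mp.block(2, 2), data);
}

void StokesSolver::solve()
{
  // Only the cheap wrapper objects are rebuilt here; the AMG setup is reused.
  const invCG inv_Ad(A.block(0, 0), precAd);
  const invCG inv_As(A.block(1, 1), precAs);
  const invCG inv_Mp(Mp.block(2, 2), precMp);

  const LinearSolver::BlockDiagonalPreconditioner<invCG, invCG, invCG>
    prec_M(inv_Ad, inv_As, inv_Mp);

  // ... outer Krylov solve using prec_M ...
}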
Thank you for your help once again.
Regards,
Nihar
On Tuesday, 11 November 2025 at 10:58:33 UTC-8 Wolfgang Bangerth wrote:
>
> > My idea was to observe timing behaviour similar to step-55, as the
> > preconditioner template I have followed most closely emulates the
> > implementation in step-55. The number of degrees of freedom is not
> > significantly different from that at cycles 4 and 5. When I run step-55
> > with 2 processes and the same version of deal.II, I get the performance
> > shown below for cycles 4 and 5. This indicated to me that for 2 MPI
> > processes and this number of DoFs, the solution time should be quite
> > low. Once I had reasonable solution times, I was planning to scale the
> > code to a larger number of degrees of freedom.
>
> Ah, I see -- so the question you're really asking is why it takes 493
> seconds to solve a problem with 215,000 unknowns. That's likely because
> you do 19 outer iterations, and in each of them you call a direct solver
> that decomposes the same matrix anew -- that is, you factorize it 19 times.
>
>
> > Prof. Bangerth, would there be a way to do this when using
> > SparseDirectMUMPS? From reading the documentation, I only see a solve()
> > function. The alternative would be to use sparseILU. Do you recommend
> > using sparseILU instead?
>
> I don't recall the exact interface of SparseDirectMUMPS from past
> releases. SparseDirectUMFPACK allows you to compute a decomposition only
> once, and then apply it repeatedly in vmult(). The interface of
> SparseDirectMUMPS that's in the current developer version also allows
> you to do that. If you can switch to the current developer version (or
> check whether 9.7 can do that as well), you may want to try that.
>
> SparseILU works sometimes, but it typically does not scale well to large
> problems. (Sparse direct solvers often do not either, but at least they
> don't require endless fiddling with settings.)
>
>
> > Since there is clearly an inefficiency in using DirectInverseMatrix
> > objects as my preconditioner, I switched to using InverseMatrix as my
> > inner solver, with CG and an AMG preconditioner as shown in the code
> > below:
> > [...]
> > I don't observe any improvement in performance. From my observations,
> > the second CG solve with the (1,1) block takes around 70 iterations to
> > converge, which accounts for the bulk of the computation time. I would
> > most likely have to improve the performance of precAs here, which might
> > bring down the iteration count and speed things up. Do you think this
> > would be the right way to approach this problem?
>
> I can't see where you use the AMG preconditioners, but the same applies:
> You should only set them up once and then re-use many times. That is,
> the preconditioners need to live *outside* the place where you solve the
> inner (1,1) block.
>
> Perhaps as a general rule, people spend whole PhD theses to develop good
> parallel solvers and preconditioners. In industry, consultants are paid
> thousands or tens of thousands of dollars to figure out good solvers and
> preconditioners. You should expect that figuring this out is a long
> learning process that involves developing the skills to set up block
> preconditioners in the right place, and to find ways of timing the right
> places. This is not going to be an easy process; there is also no magic
> bullet that the good people on this mailing list have and that will
> magically make it work for you.
>
> Best
> W.
>