> I performed the tests with the very specific aim of demonstrating that
> for the type of problem I am dealing with, (1) the solver generally
> requires over 90% of the run time, and hence is the major area that
> should be optimized,

I'd say that's only true if you have a poor preconditioner (e.g. SSOR). With good preconditioners (e.g. AMG or an ILU) you should be able to get that down to 50% or so.


> Under these idealized conditions, the Trilinos CG solver with SSOR
> preconditioner performed very well in terms of the speed-up attained. The
> maximum deviation from a linear speed-up was 30% for up to 8
> processors. For processors 1 to 4 it was around 10%. (These were measured
> on a single 8-core Xeon chip. I am waiting for a job with two 4-core
> chips to run so that I can show the (expected) performance drop as
> off-chip communication affects the results. As I said, idealized
> conditions.)

> My surprise came when using the deal.II CG solver with SSOR as
> preconditioner. My results for a single processor took slightly less
> than half the time the Trilinos solver required when using one MPI
> process, which is great, but I found virtually no speed-up from 1
> thread to 8 threads.

Right, but I suspect that that is because deal.II and Trilinos disagree on what SSOR means. In deal.II, we apply SSOR to the entire matrix, i.e. it is a sequential algorithm because you need to have the result of the previous row's operation to substitute in the current row. I suspect that what Trilinos means is that it chops the matrix into a number of blocks and applies the SSOR algorithm to each of these blocks. An alternative viewpoint is that the matrix is subdivided into BxB blocks and that the SSOR method is only applied to the B diagonal blocks; since they don't couple with each other, this creates the potential for parallelization, at the cost that the preconditioner is worse than one that considers the entire matrix.
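To make the difference concrete, here is a minimal dense-matrix sketch, not deal.II or Trilinos code; the function names `ssor_apply` and `block_ssor_apply` are purely illustrative. The forward and backward sweeps of SSOR read values computed for earlier (resp. later) rows, which is the sequential dependency; the block variant applies the same sweeps to the diagonal blocks only, so the blocks could be handled by separate threads at the cost of ignoring the inter-block couplings:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One application of the SSOR preconditioner (relaxation omega = 1),
// z = M^{-1} r with M = (D+L) D^{-1} (D+U).  Note how each row of the
// forward sweep reads z[j] for j < i -- the rows cannot be processed
// independently, which is what makes this a sequential algorithm.
std::vector<double> ssor_apply(const std::vector<std::vector<double>> &A,
                               const std::vector<double> &r)
{
  const std::size_t n = r.size();
  std::vector<double> z(n, 0.0);

  // Forward sweep: solve (D + L) y = r
  for (std::size_t i = 0; i < n; ++i)
    {
      double s = r[i];
      for (std::size_t j = 0; j < i; ++j)
        s -= A[i][j] * z[j];          // needs results of earlier rows
      z[i] = s / A[i][i];
    }

  // Scale by D, then backward sweep: solve (D + U) z = D y
  for (std::size_t i = 0; i < n; ++i)
    z[i] *= A[i][i];

  for (std::size_t i = n; i-- > 0;)
    {
      double s = z[i];
      for (std::size_t j = i + 1; j < n; ++j)
        s -= A[i][j] * z[j];          // needs results of later rows
      z[i] = s / A[i][i];
    }
  return z;
}

// Block variant: apply the sweep to each diagonal block separately and
// ignore couplings between blocks.  The loop over blocks has no data
// dependencies, so it could run in parallel -- but the preconditioner is
// weaker than one that considers the entire matrix.
std::vector<double> block_ssor_apply(const std::vector<std::vector<double>> &A,
                                     const std::vector<double> &r,
                                     std::size_t n_blocks)
{
  const std::size_t n = r.size();
  const std::size_t block = n / n_blocks;
  std::vector<double> z(n, 0.0);

  for (std::size_t b = 0; b < n_blocks; ++b)   // iterations are independent
    {
      const std::size_t lo = b * block;
      const std::size_t hi = (b + 1 == n_blocks ? n : lo + block);

      // Extract the diagonal block and the matching slice of r
      std::vector<std::vector<double>> Ab(hi - lo,
                                          std::vector<double>(hi - lo));
      std::vector<double> rb(hi - lo);
      for (std::size_t i = lo; i < hi; ++i)
        {
          rb[i - lo] = r[i];
          for (std::size_t j = lo; j < hi; ++j)
            Ab[i - lo][j - lo] = A[i][j];
        }

      const std::vector<double> zb = ssor_apply(Ab, rb);
      for (std::size_t i = lo; i < hi; ++i)
        z[i] = zb[i - lo];
    }
  return z;
}
```

On a matrix with off-diagonal entries the two variants return different results, which is exactly the quality-for-parallelism trade-off described above.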


> I (mistakenly?) thought that the deal.II vmult
> method was threaded and should have shown at least some speed-up.

The SparseMatrix::vmult function is, but the PreconditionSSOR::vmult function isn't.


> For the record, I controlled the number of threads deal.II used by
> explicitly editing source/base/multithread_info.cc to set n_cpus to my
> desired value.

This is a rather outdated way of doing things, since most of the library has been converted to the tasks framework instead of explicit threads. The variable you set does not affect the parallelization of SparseMatrix::vmult, for example. However, you can control how many tasks are created at once using the method described at

http://www.dealii.org/developer/doxygen/deal.II/group__threads.html#MTTaskThreads

I think you've already found this. I suppose you set this to the same value as for threads?


> I also used UMFPACK to solve the system. On the 8-core Xeon chip, the
> deal.II CG solver + SSOR preconditioner beat UMFPACK by about 20% when
> I reinitialized UMFPACK every time I needed to solve the system. On my
> laptop the opposite occurs and UMFPACK beats CG by about 16%. I would
> expect the difference lies in the versions of BLAS I am using on the
> different machines. When I initialize the UMFPACK matrix only once and
> then reuse it for the remainder of the time steps (which my test case
> allows, but which in general cannot be done), UMFPACK is an order of
> magnitude faster than the rest, perhaps unsurprisingly.

Yes, I think this is a general observation. For problems with fewer than 100,000 DoFs, UMFPACK is generally the fastest method. Your problem falls into this category.
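The order-of-magnitude gain from factorizing only once can be illustrated with a toy LU decomposition (plain C++, no UMFPACK; the names `lu_factor` and `lu_solve` are purely illustrative). The expensive O(n^3) factorization corresponds to what re-initializing UMFPACK repeats at every time step; once the factors are stored, each subsequent solve is only an O(n^2) pair of triangular substitutions:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Factor A = L U in place (no pivoting -- adequate for the diagonally
// dominant matrix used here).  This O(n^3) step is the analogue of
// UMFPACK's factorization, which only needs to be done once as long as
// the matrix does not change between time steps.
void lu_factor(Matrix &A)
{
  const std::size_t n = A.size();
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t i = k + 1; i < n; ++i)
      {
        A[i][k] /= A[k][k];
        for (std::size_t j = k + 1; j < n; ++j)
          A[i][j] -= A[i][k] * A[k][j];
      }
}

// Solve L U x = b using the stored factors.  This O(n^2) step is all
// that has to be repeated for each new right-hand side.
std::vector<double> lu_solve(const Matrix &LU, std::vector<double> b)
{
  const std::size_t n = LU.size();
  for (std::size_t i = 0; i < n; ++i)        // forward: L y = b
    for (std::size_t j = 0; j < i; ++j)
      b[i] -= LU[i][j] * b[j];
  for (std::size_t i = n; i-- > 0;)          // backward: U x = y
    {
      for (std::size_t j = i + 1; j < n; ++j)
        b[i] -= LU[i][j] * b[j];
      b[i] /= LU[i][i];
    }
  return b;
}
```

Calling `lu_factor` once and then `lu_solve` once per time step mirrors the factor-once setup the test case allows; re-running `lu_factor` every step mirrors the re-initialized variant.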

Cheers
 W.

--
------------------------------------------------------------------------
Wolfgang Bangerth               email:            [email protected]
                                www: http://www.math.tamu.edu/~bangerth/

_______________________________________________
dealii mailing list http://poisson.dealii.org/mailman/listinfo/dealii
