> I performed the tests with the very specific aim of demonstrating that
> for the type of problem I am dealing with, (1) the solver generally
> requires over 90% of the run time, and hence is the major area that
> should be optimized,

I'd say that's only true if you have a poor preconditioner (e.g. SSOR). With good preconditioners (e.g. AMG or an ILU) you should be able to get that down to 50% or so.


> Under these idealized conditions, the Trilinos CG solver with SSOR
> preconditioner performed very well in terms of the speed-up attained. The
> maximum deviation from a linear speed-up was 30% for up to 8
> processors. For processors 1 to 4 it was around 10%. (These were measured
> on a single 8-core Xeon chip. I am waiting for a job with two 4-core
> chips to run so that I can show the (expected) performance drop as
> off-chip communication affects the results. As I said, idealized
> conditions.)

> My surprise came when using the deal.II CG solver with SSOR as
> preconditioner. My results for a single processor took slightly less
> than half the time the Trilinos solver required when using one MPI
> process, which is great, but I found virtually no speed-up from 1
> thread to 8 threads.

Right, but I suspect that that is because deal.II and Trilinos disagree on what SSOR means. In deal.II, we apply SSOR to the entire matrix, i.e. it is a sequential algorithm because you need to have the result of the previous row's operation to substitute in the current row. I suspect that what Trilinos means is that it chops the matrix into a number of blocks and applies the SSOR algorithm to each of these blocks. An alternative viewpoint is that the matrix is subdivided into BxB blocks and that the SSOR method is only applied to the B diagonal blocks; since they don't couple with each other, this creates the potential for parallelization, at the cost that the preconditioner is worse than one that considers the entire matrix.
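To make the difference concrete, here is a minimal dense-matrix sketch, not deal.II or Trilinos code; the function names `ssor_apply` and `block_ssor_apply` are purely illustrative. The forward and backward sweeps of SSOR read values computed for earlier (resp. later) rows, which is the sequential dependency; the block variant applies the same sweeps to the diagonal blocks only, so the blocks could be handled by separate threads at the cost of ignoring the inter-block couplings:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One application of the SSOR preconditioner (relaxation omega = 1),
// z = M^{-1} r with M = (D+L) D^{-1} (D+U).  Note how each row of the
// forward sweep reads z[j] for j < i -- the rows cannot be processed
// independently, which is what makes this a sequential algorithm.
std::vector<double> ssor_apply(const std::vector<std::vector<double>> &A,
                               const std::vector<double> &r)
{
  const std::size_t n = r.size();
  std::vector<double> z(n, 0.0);

  // Forward sweep: solve (D + L) y = r
  for (std::size_t i = 0; i < n; ++i)
    {
      double s = r[i];
      for (std::size_t j = 0; j < i; ++j)
        s -= A[i][j] * z[j];          // needs results of earlier rows
      z[i] = s / A[i][i];
    }

  // Scale by D, then backward sweep: solve (D + U) z = D y
  for (std::size_t i = 0; i < n; ++i)
    z[i] *= A[i][i];

  for (std::size_t i = n; i-- > 0;)
    {
      double s = z[i];
      for (std::size_t j = i + 1; j < n; ++j)
        s -= A[i][j] * z[j];          // needs results of later rows
      z[i] = s / A[i][i];
    }
  return z;
}

// Block variant: apply the sweep to each diagonal block separately and
// ignore couplings between blocks.  The loop over blocks has no data
// dependencies, so it could run in parallel -- but the preconditioner is
// weaker than one that considers the entire matrix.
std::vector<double> block_ssor_apply(const std::vector<std::vector<double>> &A,
                                     const std::vector<double> &r,
                                     std::size_t n_blocks)
{
  const std::size_t n = r.size();
  const std::size_t block = n / n_blocks;
  std::vector<double> z(n, 0.0);

  for (std::size_t b = 0; b < n_blocks; ++b)   // iterations are independent
    {
      const std::size_t lo = b * block;
      const std::size_t hi = (b + 1 == n_blocks ? n : lo + block);

      // Extract the diagonal block and the matching slice of r
      std::vector<std::vector<double>> Ab(hi - lo,
                                          std::vector<double>(hi - lo));
      std::vector<double> rb(hi - lo);
      for (std::size_t i = lo; i < hi; ++i)
        {
          rb[i - lo] = r[i];
          for (std::size_t j = lo; j < hi; ++j)
            Ab[i - lo][j - lo] = A[i][j];
        }

      const std::vector<double> zb = ssor_apply(Ab, rb);
      for (std::size_t i = lo; i < hi; ++i)
        z[i] = zb[i - lo];
    }
  return z;
}
```

On a matrix with off-diagonal entries the two variants return different results, which is exactly the quality-for-parallelism trade-off described above.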


> I (mistakenly?) thought that the deal.II vmult
> method was threaded and should have shown at least some speed-up.

The SparseMatrix::vmult function is, but the PreconditionSSOR::vmult function isn't.


> For the record, I controlled the number of threads deal.II used by
> explicitly editing source/base/multithread_info.cc to set n_cpus to my
> desired value.

This is a rather outdated way of doing things, since most of the library has been converted to the tasks framework instead of explicit threads. The variable you set does not affect the parallelization of SparseMatrix::vmult, for example. However, you can control how many tasks are created at once using the method described at

http://www.dealii.org/developer/doxygen/deal.II/group__threads.html#MTTaskThreads

I think you've already found this. I suppose you set this to the same value as for threads?


> I also used UMFPACK to solve the system. On the 8-core Xeon chip, the
> deal.II CG solver + SSOR preconditioner beat UMFPACK by about 20% when
> I reinitialized UMFPACK every time I needed to solve the system. On my
> laptop the opposite occurs and UMFPACK beats CG by about 16%. I would
> expect the difference lies in the versions of BLAS I am using on the
> different machines. When I initialize the UMFPACK matrix only once and
> then reuse it for the remainder of the time steps (which my test case
> allows, but which in general cannot be done), UMFPACK is an order of
> magnitude faster than the rest, perhaps unsurprisingly.

Yes, I think this is a general observation. For problems with fewer than 100,000 DoFs, UMFPACK is generally the fastest method. Your problem falls into this category.
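The order-of-magnitude gain from factorizing only once can be illustrated with a toy LU decomposition (plain C++, no UMFPACK; the names `lu_factor` and `lu_solve` are purely illustrative). The expensive O(n^3) factorization corresponds to what re-initializing UMFPACK repeats at every time step; once the factors are stored, each subsequent solve is only an O(n^2) pair of triangular substitutions:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Factor A = L U in place (no pivoting -- adequate for the diagonally
// dominant matrix used here).  This O(n^3) step is the analogue of
// UMFPACK's factorization, which only needs to be done once as long as
// the matrix does not change between time steps.
void lu_factor(Matrix &A)
{
  const std::size_t n = A.size();
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t i = k + 1; i < n; ++i)
      {
        A[i][k] /= A[k][k];
        for (std::size_t j = k + 1; j < n; ++j)
          A[i][j] -= A[i][k] * A[k][j];
      }
}

// Solve L U x = b using the stored factors.  This O(n^2) step is all
// that has to be repeated for each new right-hand side.
std::vector<double> lu_solve(const Matrix &LU, std::vector<double> b)
{
  const std::size_t n = LU.size();
  for (std::size_t i = 0; i < n; ++i)        // forward: L y = b
    for (std::size_t j = 0; j < i; ++j)
      b[i] -= LU[i][j] * b[j];
  for (std::size_t i = n; i-- > 0;)          // backward: U x = y
    {
      for (std::size_t j = i + 1; j < n; ++j)
        b[i] -= LU[i][j] * b[j];
      b[i] /= LU[i][i];
    }
  return b;
}
```

Calling `lu_factor` once and then `lu_solve` once per time step mirrors the factor-once setup the test case allows; re-running `lu_factor` every step mirrors the re-initialized variant.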

Cheers
 W.

--
------------------------------------------------------------------------
Wolfgang Bangerth               email:            [email protected]
                                www: http://www.math.tamu.edu/~bangerth/

_______________________________________________
dealii mailing list http://poisson.dealii.org/mailman/listinfo/dealii
