Dear Slawa,
> In my current simulations (with not more than 160k elements) system
> assembly takes around 90% of computational time and I am not
> happy with this fact. I am aware that my implementation (see below)
> is far from optimal, but it is quite readable and follows the
> tutorial style. In any case, could you share your
> thoughts on how to speed up the system assembly?
One way to get hints about _what_ to improve is to run the program
under valgrind's callgrind, e.g. as
$ valgrind --tool=callgrind ./step-xx
That creates a log file which you can inspect with KCachegrind, and
there you can see in which lines you spend most of the time when
executing the program.
As for your program: Usually the matrix assembly is the most expensive
part, compared to the vector assembly. One thing that certainly can be
improved is to rewrite this formula:
> //This includes anti-trapping term
> if (grad_phi_norm>params.machine_zero)
> cell_matrix(i,j) +=
> params.a_coef * params.W_coef * params.c_l0_coef *
> (1.0 - params.k_coef) * timestep_inv / grad_phi_norm *
> exp(u_func(phi_old,c_old)) *
> fe_values[phi].value (j, q_point) *
> fe_values[conc].gradient (i, q_point) *
> grad_solution[q_point][0]
> *
> fe_values.JxW(q_point);
> }
Here you calculate a product of many numbers (and an exponential)
at the innermost position of the loop (though I am not sure how often
this branch is actually taken). I expect it would be much faster
if you calculated the product
  Tensor<1,2> factors = params.a_coef * ... * exp(...) *
                        grad_solution[q_point] * fe_values.JxW(q_point);
before the loops over j and i, and then just used
  cell_matrix(i,j) += fe_values[phi].value(j, q_point) *
                      (fe_values[conc].gradient(i, q_point) * factors);
Theoretically, a good compiler could hoist that product out of the
loops and do just what I proposed, but I don't think compilers are that good...
Anyway, you might be able to hand-optimize some of the things like this.
If you're really into writing fast assembly routines, you would need to
- as you state in your e-mail - sacrifice readability. You can expect
the matrix assembly to be faster if you
1. do not use the FEValuesExtractors, but instead distinguish the cases
with fe.system_to_component_index(i).first. Then you do not have to
multiply a lot of zeros that are otherwise introduced, e.g., by the term
fe_val[phi].value(i,q) * fe_val[phi].value(j,q) when i is actually a
conc-dof on the cell.
2. combine 1. with rearranging the loops so that the q_point loop is
innermost, which reduces the impact of the if() statements you
introduce with suggestion #1
3. try to reuse data, i.e., pre-calculate the factors in the loop over
the quadrature points before you enter the loops over the dofs. On a
larger scale, you might even recognize that some parts of your matrix
are the same from one cell to the next if the cells have the same
geometry. E.g., the part of cell_matrix that is calculated from
fe_val[conc].value * fe_val[conc].value remains unchanged whenever
fe_values.get_cell_similarity() returns
CellSimilarity::translation. However, then you need to selectively zero
your matrix, which is not pretty.
A more general improvement strategy (though it gives even uglier code)
would be to replace the hand-written loops over q, i, j by a
matrix-matrix product. Especially if you use higher order methods in 3D,
combined with deal.II configured with an optimized BLAS, that can give
you a factor of 10 improvement. But that means that you have to
find matrices that do the same as the hand-written assembly code, which
is very technical. It took me weeks to get the assembly of the Jacobian
matrix for the Newton iteration in the Navier-Stokes equations into such
form and to find all the little errors you make when you try to do so -
but now it's really fast...
Best,
Martin
_______________________________________________
dealii mailing list http://poisson.dealii.org/mailman/listinfo/dealii