> How did you do the timing?

#include "boost/date_time/posix_time/posix_time.hpp"
using namespace boost::posix_time;

for () {
    time_start = microsec_clock::local_time();
    // ...code to time...
    time_end = microsec_clock::local_time();
    duration += time_end - time_start;
}
std::cout << duration << std::endl;
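
As a cross-check, DOLFIN's built-in timing could be used instead of
Boost; a minimal sketch, assuming the Timer/list_timings API of the
DOLFIN 1.x series:

#include <dolfin.h>

void timed_section()  // hypothetical: wraps the code to be timed
{
  dolfin::Timer timer("ufc.update");  // starts timing on construction
  // ...code to time...
}                     // stops and records when timer is destroyed

// after the run, e.g. at the end of main():
// dolfin::list_timings();  // prints a table of all recorded timers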

> Some of these calls involve so little work that they are hard to time.

Indeed, but in the case at hand they are not too small to be missed by
the microsecond timer. (The number of cells is 44976 here, so a single
ufc.update call, for example, comes out at 1.133166 s / 44976 ≈ 25 µs.)

> There are some optimisations for UFC::update(...) in
>
>     https://bitbucket.org/fenics-project/dolfin/pull-request/73/

I'll be happy to try those out and report figures. (Right now I'm
still getting a compilation error; see
<https://bitbucket.org/fenics-project/dolfin/pull-request/73/changes-for-ufc-cell-clean-up/diff>.)

> Preserving the sparsity structure makes a significant difference, but I
> believe that you'll see the difference when GenericMatrix::apply() is
> called.

That may be; right now, the computation time for the Krylov iteration
is much smaller than that for the Jacobian construction, which is why
I haven't taken a closer look yet.

> There is limited scope for speeding up this step (if you want to solve Ax=b).

Like I said, it's more that I'm solving a series of $A_i x = b_i$
(nonlinear time integration -- Navier-Stokes in fact).
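
What I plan to try here is Garth's earlier suggestion: the lower-level
NewtonSolver with a NonlinearProblem that hangs on to the Jacobian's
sparsity structure, along the lines of the Cahn-Hilliard demo. Just a
sketch -- the NavierStokesProblem class, its member layout, and the
assemble() overload with a reset_sparsity flag are assumptions on my
part:

#include <dolfin.h>
using namespace dolfin;

// Sketch of a NonlinearProblem that builds the Jacobian's sparsity
// pattern once and reuses it for every Newton iteration and time
// step. The forms a (Jacobian) and L (residual) are assumed to be
// defined elsewhere.
class NavierStokesProblem : public NonlinearProblem
{
public:
  NavierStokesProblem(const Form& a, const Form& L)
    : _a(a), _L(L), _reset_sparsity(true) {}

  void F(GenericVector& b, const GenericVector& x)
  { assemble(b, _L); }

  void J(GenericMatrix& A, const GenericVector& x)
  {
    // Only the first call rebuilds the sparsity pattern.
    assemble(A, _a, _reset_sparsity);
    _reset_sparsity = false;
  }

private:
  const Form& _a;
  const Form& _L;
  bool _reset_sparsity;
};

// Usage: one problem and one solver object for the whole time loop,
// so the matrix structure is set up only once, e.g.
//   NavierStokesProblem problem(a, L);
//   NewtonSolver newton;
//   for (std::size_t step = 0; step < num_steps; ++step)
//     newton.solve(problem, *u.vector());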

Cheers,
Nico


On Mon, Dec 16, 2013 at 3:08 PM, Garth N. Wells <[email protected]> wrote:
> On 2013-12-16 13:23, Nico Schlömer wrote:
>>
>> I did some timings here and found that whenever the Jacobian is
>> constructed, most of the computational time goes into the cell
>> iteration in Assembler::assemble_cells(...). Specifically (in units of
>> seconds)
>>
>
> Not surprising. It's where all the work is done.
>
>
>> ufc.get_cell_integral:          00:00:00.006977
>> ufc.update:                     00:00:01.133166
>> get local-to-global dof maps:   00:00:00.009772
>> integral->tabulate_tensor:      00:00:03.289413
>> add_to_global_tensor:           00:00:01.693635
>>
>
> How did you do the timing? Some of these calls involve so little work that
> they are hard to time.
>
>
>> I'm not entirely sure what the purpose of all of these calls is, but I
>> guess that tabulate_tensor does the actual heavy lifting, i.e., the
>> integration. Besides this, the ufc.update
>
>
> There are some optimisations for UFC::update(...) in
>
>     https://bitbucket.org/fenics-project/dolfin/pull-request/73/
>
> UFC::update(...) copies some data unnecessarily (e.g. topology data).
>
>
>> and the addition to the
>> global tensor take a significant amount of time.
>> Since I'm solving a series of linear problems (in fact, a (time)
>> series of nonlinear problems) very similar in structure, I think
>> one or the other call might be cached. The mere caching of the
>> sparsity structure as done in the Cahn-Hilliard demo doesn't do much.
>
>
> Preserving the sparsity structure makes a significant difference, but I
> believe that you'll see the difference when GenericMatrix::apply() is
> called.
>
>
>> Does anyone have more insight into what might be worth exploiting here
>> for speedup?
>>
>
> There is limited scope for speeding up this step (if you want to solve Ax=b).
> I have a plan for some mesh data reordering that can speed up insertion
> through improved data locality. One reason the insertion looks relatively
> costly is that the tabulate_tensor function is highly optimised. From
> numbers I've seen, it can be several times faster than in other
> codes.
>
> It is possible to assemble into a different sparse matrix data structure,
> but my experience is that this pushes the time cost further down the line,
> i.e. to the linear solver. You can try assembling into a dolfin::STLMatrix.
>
> Garth
>
>
>
>> Cheers,
>> Nico
>>
>>
>>
>> On Mon, Jun 3, 2013 at 8:54 PM, Garth N. Wells <[email protected]> wrote:
>>>
>>> On 3 June 2013 19:49, Nico Schlömer <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> when solving nonlinear problems, I simply went with
>>>>
>>>> # define F
>>>> J = derivative(F, u)
>>>> solve(F == 0, u, bcs, J=J)
>>>>
>>>> for now (which uses Newton's method).
>>>> I noticed, however, that the computation of the Jacobian,
>>>>
>>>> nonlinear_problem.J(*_A, x);
>>>>
>>>> takes by far the most time in the computation.
>>>> Is there some caching I could employ? I need to solve a similar
>>>> nonlinear
>>>> system in each time step.
>>>>
>>>
>>> Use the lower-level NewtonSolver class. You then have complete control
>>> over how J is computed/supplied/cached.
>>>
>>> The Cahn-Hilliard demo illustrates use of the NewtonSolver class.
>>>
>>> Garth
>>>
>>>
>>>> --Nico
>>>>
>
_______________________________________________
fenics-support mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics-support
