>> I'll be happy to try those out and report figures.

Here for some timings with the new UFC changes:

ufc.get_cell_integral:     00:00:00.007066
ufc.update:                00:00:01.092362
local-to-global dof maps:  00:00:00.008636
integral->tabulate_tensor: 00:00:03.379240
add_to_global_tensor:      00:00:01.634146

-- Well, the benefit for my application isn't that great.
Speeding up the assembly process for a single call isn't necessarily the
intention here anyway; I'm more concerned about a series of similar
linear problems.
Could it be, for example, that ufc.update(...) does the same thing
over and over when you call assemble_cells(...) for the same mesh and
function spaces (i.e., the same set of test and trial functions)?
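If update(...) really does recompute unchanged per-cell data across repeated
assemblies, one could imagine caching it keyed on the cell index and
invalidating only when the mesh or spaces change. A toy sketch of that idea in
Python (all names here are hypothetical, not DOLFIN/UFC API):

```python
# Hypothetical sketch: cache per-cell data that an update()-style call
# would otherwise recompute on every assembly pass over an unchanged mesh.

class CellDataCache:
    def __init__(self, compute):
        self._compute = compute  # expensive per-cell computation
        self._cache = {}

    def get(self, cell_index):
        # Recompute only on a cache miss; repeated assemblies hit the cache.
        if cell_index not in self._cache:
            self._cache[cell_index] = self._compute(cell_index)
        return self._cache[cell_index]

    def invalidate(self):
        # Call when the mesh or the function spaces change.
        self._cache.clear()

calls = []
def expensive(cell):
    calls.append(cell)
    return cell * 2.0  # stand-in for coordinate/topology extraction

cache = CellDataCache(expensive)
for _ in range(3):            # three "assembly passes"
    for c in range(4):        # four "cells"
        cache.get(c)
print(len(calls))  # the expensive computation ran once per cell: 4
```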

As for the add_to_global_tensor(...) call, I'm not sure why it takes
so long at all. All that seems to be going on is a bit of vector I/O. Hm.
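One plausible reason insertion is costlier than "a bit of vector I/O" is the
index lookup: each local entry's slot in the global sparse structure has to be
searched. If the sparsity pattern is fixed, those slots could be precomputed
once so that each assembly pass is a search-free accumulate. A rough sketch of
the idea (this is illustrative only, not DOLFIN's actual implementation):

```python
import bisect

# CSR-like storage: for each row, sorted column indices and a values array.
col_indices = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
values = {r: [0.0] * len(cols) for r, cols in col_indices.items()}

def position(row, col):
    # Binary search for the slot of (row, col) in the sparsity pattern.
    cols = col_indices[row]
    i = bisect.bisect_left(cols, col)
    assert i < len(cols) and cols[i] == col, "entry outside pattern"
    return i

# Precompute, once, the slot of every (row, col) pair a cell touches...
cell_dofs = [(0, 0), (0, 1), (1, 1)]
slots = [(r, position(r, c)) for r, c in cell_dofs]

# ...then every subsequent assembly pass is a tight, search-free loop.
local_tensor = [1.0, 2.0, 3.0]
for (r, i), v in zip(slots, local_tensor):
    values[r][i] += v

print(values[0])  # [1.0, 2.0]
```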

--Nico



On Mon, Dec 16, 2013 at 4:09 PM, Garth N. Wells <[email protected]> wrote:
> On 2013-12-16 14:58, Nico Schlömer wrote:
>>>
>>> How did you do the timing?
>>
>>
>> #include "boost/date_time/posix_time/posix_time.hpp"
>> using namespace boost::posix_time;
>>
>> time_duration duration;
>> ptime time_start, time_end;
>> for (...) {  // loop over the assembly calls
>>     time_start = microsec_clock::local_time();
>>     // ...code to time...
>>     time_end = microsec_clock::local_time();
>>     duration += time_end - time_start;
>> }
>> std::cout << duration << std::endl;
>>
>>> Some of these calls involve so little work that they are hard to time.
>>
>>
>> Indeed, but in this case they're not too small to be missed by the
>> microsecond timer. (The number of cells is 44976 here, so one
>> ufc.update call, for example, is around 25 µs.)
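For concreteness, the per-call figure follows directly from the totals: the
ufc.update total divided by the cell count gives a few tens of microseconds
per call.

```python
total_seconds = 1.133166  # ufc.update total from the timings quoted below
n_cells = 44976
per_call = total_seconds / n_cells
print(round(per_call * 1e6, 1))  # microseconds per call -> 25.2
```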
>>
>
> Some are too small, e.g. getting a reference to a dofmap. The creation and
> destruction of the timer will dominate.
>
>
>>> There are some optimisations for UFC::update(...) in
>>>
>>>     https://bitbucket.org/fenics-project/dolfin/pull-request/73/
>>
>>
>> I'll be happy to try those out and report figures. (Right now I'm
>> still getting a compilation error,
>>
>> <https://bitbucket.org/fenics-project/dolfin/pull-request/73/changes-for-ufc-cell-clean-up/diff>).
>>
>
> You need to get the UFC changes too.
>
>
>>> Preserving the sparsity structure makes a significant difference, but I
>>> believe that you'll see the difference when GenericMatrix::apply() is
>>> called.
>>
>>
>> That may be; right now, the computation time for the Krylov iteration
>> is much smaller than for the Jacobian construction which is why I
>> haven't taken a closer look yet.
>>
>>> There is limited scope for speeding up this step (if you want to
>>> solve Ax=b).
>>
>>
>
> If you can get away without a preconditioner, you can: (1) go matrix-free;
> or (2) implement an unassembled matrix data structure that can do mat-vec
> products. I'm not aware of PETSc supporting unassembled matrices.
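Option (1) can be sketched generically: a Krylov method such as CG only needs
a mat-vec callback, so the operator never has to be assembled at all. A
minimal unpreconditioned CG in plain Python, purely to illustrate the shape of
a matrix-free solve (this is not PETSc or DOLFIN code):

```python
def cg(matvec, b, x0=None, tol=1e-10, maxiter=1000):
    """Conjugate gradients using only a mat-vec callback (matrix-free)."""
    n = len(b)
    x = list(x0) if x0 else [0.0] * n
    r = [bi - axi for bi, axi in zip(b, matvec(x))]
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(maxiter):
        if rs ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# SPD example: A = [[4, 1], [1, 3]] applied without ever storing A.
def matvec(v):
    return [4 * v[0] + 1 * v[1], 1 * v[0] + 3 * v[1]]

x = cg(matvec, [1.0, 2.0])
print([round(xi, 4) for xi in x])  # [0.0909, 0.6364], i.e. [1/11, 7/11]
```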
>
> Garth
>
>
>> Like I said, it's more that I'm solving a series of $A_i x = b_i$
>> (nonlinear time integration -- Navier-Stokes in fact).
>>
>> Cheers,
>> Nico
>>
>>
>> On Mon, Dec 16, 2013 at 3:08 PM, Garth N. Wells <[email protected]> wrote:
>>>
>>> On 2013-12-16 13:23, Nico Schlömer wrote:
>>>>
>>>>
>>>> I did some timings here and found that whenever the Jacobian is
>>>> constructed, most of the computational time goes into the cell
>>>> iteration in Assembler::assemble_cells(...). Specifically (in units of
>>>> seconds)
>>>>
>>>
>>> Not surprising. It's where all the work is done.
>>>
>>>
>>>> ufc.get_cell_integral:        00:00:00.006977
>>>> ufc.update:                   00:00:01.133166
>>>> get local-to-global dof maps: 00:00:00.009772
>>>> integral->tabulate_tensor:    00:00:03.289413
>>>> add_to_global_tensor:         00:00:01.693635
>>>>
>>>
>>> How did you do the timing? Some of these calls involve so little work
>>> that
>>> they are hard to time.
>>>
>>>
>>>> I'm not entirely sure what the purpose of all of these calls is, but I
>>>> guess that tabulate_tensor does the actual heavy lifting, i.e., the
>>>> integration. Besides this, the ufc.update
>>>
>>>
>>>
>>> There are some optimisations for UFC::update(...) in
>>>
>>>     https://bitbucket.org/fenics-project/dolfin/pull-request/73/
>>>
>>> UFC::update(...) copies some data unnecessarily (e.g. topology data).
>>>
>>>
>>>> and the addition to the
>>>> global tensor take a significant amount of time.
>>>> Since I'm solving a series of linear problems (in fact, a (time)
>>>> series of nonlinear problems) very similar in structure, so I think
>>>> one or the other call might be cached. The mere caching of the
>>>> sparsity structure, as done in the Cahn-Hilliard demo, doesn't do much.
>>>
>>>
>>>
>>> Preserving the sparsity structure makes a significant difference, but I
>>> believe that you'll see the difference when GenericMatrix::apply() is
>>> called.
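The saving from a preserved sparsity pattern is essentially this: the index
arrays are kept and only the value array is zeroed, so reassembly triggers no
pattern recomputation or reallocation. A schematic CSR view of the idea (not
the DOLFIN/PETSc internals):

```python
# Schematic CSR matrix: the pattern (row_ptr, col_idx) is the expensive part.
row_ptr = [0, 2, 4]          # two rows
col_idx = [0, 1, 0, 1]       # columns of the stored entries
data = [1.0, 2.0, 3.0, 4.0]  # values from a previous assembly

def reset_for_reassembly(data):
    # Reuse the pattern: only flatten the values to zero, keep indices intact.
    for i in range(len(data)):
        data[i] = 0.0

reset_for_reassembly(data)
print(data)     # [0.0, 0.0, 0.0, 0.0]
print(col_idx)  # pattern untouched: [0, 1, 0, 1]
```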
>>>
>>>
>>>> Does anyone have more insight into what might be worth exploiting here
>>>> for speedup?
>>>>
>>>
>>> There is limited scope for speeding up this step (if you want to
>>> solve Ax=b).
>>> I have a plan for some mesh data reordering that can speed up insertion
>>> through improved data locality. One reason the insertion looks relatively
>>> costly is that the tabulate_tensor function is highly optimised. From
>>> numbers I've seen around, it can be several times faster than other
>>> codes.
>>>
>>> It is possible to assemble into different sparse matrix data
>>> structures, but my experience is that this pushes the time cost further
>>> down the line, i.e. to the linear solver. You can try assembling into
>>> a dolfin::STLMatrix.
>>>
>>> Garth
>>>
>>>
>>>
>>>> Cheers,
>>>> Nico
>>>>
>>>>
>>>>
>>>> On Mon, Jun 3, 2013 at 8:54 PM, Garth N. Wells <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 3 June 2013 19:49, Nico Schlömer <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> when solving nonlinear problems, I simply went with
>>>>>>
>>>>>> # define F
>>>>>> J = derivative(F, u)
>>>>>> solve(F == 0, u, bcs, J=J)
>>>>>>
>>>>>> for now (which uses Newton's method).
>>>>>> I noticed, however, that the computation of the Jacobian,
>>>>>>
>>>>>> nonlinear_problem.J(*_A, x);
>>>>>>
>>>>>> takes by far the most time in the computation.
>>>>>> Is there some caching I could employ? I need to solve a similar
>>>>>> nonlinear
>>>>>> system in each time step.
>>>>>>
>>>>>
>>>>> Use the lower-level NewtonSolver class. You then have complete control
>>>>> over how J is computed/supplied/cached.
>>>>>
>>>>> The Cahn-Hilliard demo illustrates use of the NewtonSolver class.
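One common economy that the lower-level control enables is a modified Newton
iteration: reassemble the Jacobian only every few steps and reuse it in
between. A toy scalar version of the idea (this is not the NewtonSolver API,
just the algorithm in miniature):

```python
def modified_newton(F, dF, x, reuse=3, tol=1e-12, maxiter=50):
    """Newton's method that recomputes the Jacobian only every `reuse` steps."""
    for k in range(maxiter):
        fx = F(x)
        if abs(fx) < tol:
            return x, k
        if k % reuse == 0:
            J = dF(x)      # refresh the (expensive) Jacobian
        x = x - fx / J     # reuse the stale J in between refreshes
    return x, maxiter

# Solve x**2 - 2 = 0 starting from x = 1.5.
root, its = modified_newton(lambda x: x * x - 2, lambda x: 2 * x, 1.5)
print(round(root, 6))  # 1.414214
```

Convergence drops from quadratic to roughly linear between refreshes, which is
the usual trade against the cost of reassembling (and refactorizing) J.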
>>>>>
>>>>> Garth
>>>>>
>>>>>
>>>>>> --Nico
>>>>>>
>>>>>> _______________________________________________
>>>>>> fenics-support mailing list
>>>>>> [email protected]
>>>>>> http://fenicsproject.org/mailman/listinfo/fenics-support
>>>>>>
>>>
>