>> I'll be happy to try those out and report figures.

Here are some timings with the new UFC changes:

  ufc.get_cell_integral:     00:00:00.007066
  ufc.update:                00:00:01.092362
  local-to-global dof maps:  00:00:00.008636
  integral->tabulate_tensor: 00:00:03.379240
  add_to_global_tensor:      00:00:01.634146

Well, the benefit for my application isn't that great. Speeding up the
assembly for a single call isn't necessarily the intention here anyway;
I'm more concerned about a series of similar linear problems. Could it
be, for example, that ufc.update(...) does the same thing over and over
if you call assemble_cells(...) for the same mesh and function spaces
(i.e., the same set of test and trial functions)?

As for the add_to_global_tensor(...) call, I'm not sure why it takes so
long at all. All that seems to be going on is a bit of vector I/O. Hm.

--Nico
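
P.S. For reference, a minimal sketch of the kind of repeated-solve loop
I mean -- not my actual code, and the names (a, L, bc, u, num_steps) are
placeholders. One Matrix/Vector pair is reused across all assembles, so,
if I read the Assembler correctly, the sparsity pattern should only be
built on the first call:

  #include <dolfin.h>
  using namespace dolfin;

  // One matrix and one vector, reused across all time/Newton steps.
  Matrix A;
  Vector b;

  for (std::size_t step = 0; step < num_steps; ++step)
  {
    // The first call initializes A (sparsity pattern included);
    // later calls should only overwrite the entries.
    assemble(A, a);  // 'a' is the bilinear (Jacobian) form
    assemble(b, L);  // 'L' is the linear form
    bc.apply(A, b);
    solve(A, *u.vector(), b);
  }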
On Mon, Dec 16, 2013 at 4:09 PM, Garth N. Wells <[email protected]> wrote:
> On 2013-12-16 14:58, Nico Schlömer wrote:
>>> How did you do the timing?
>>
>> #include "boost/date_time/posix_time/posix_time.hpp"
>> using namespace boost::posix_time;
>>
>> ptime time_start, time_end;
>> time_duration duration;
>> for (...) {
>>   time_start = microsec_clock::local_time();
>>   // ...code to time...
>>   time_end = microsec_clock::local_time();
>>   duration += time_end - time_start;
>> }
>> std::cout << duration << std::endl;
>>
>>> Some of these calls involve so little work that they are hard to time.
>>
>> Indeed, but in this case not so small as to be missed by the
>> microsecond timer. (The number of cells is 44976 here, so one
>> ufc.update call, for example, is around 25 µs.)
>
> Some are too small, e.g. getting a reference to a dofmap. The creation
> and destruction of the timer will dominate.
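
A rough, untested sketch of the same measurement using DOLFIN's built-in
Timer instead, constructed once outside the cell loop so that its
creation/destruction is not part of what gets measured ('mesh' and the
commented-out calls are placeholders):

  #include <dolfin.h>
  using namespace dolfin;

  // Construct the timers once; start()/stop() accumulate per task.
  Timer t_update("ufc.update");
  Timer t_tabulate("tabulate_tensor");

  for (CellIterator cell(mesh); !cell.end(); ++cell)
  {
    t_update.start();
    // ... ufc.update(*cell) ...
    t_update.stop();

    t_tabulate.start();
    // ... integral->tabulate_tensor(...) ...
    t_tabulate.stop();
  }

  // Print a summary of all registered timings.
  list_timings();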
>>> There are some optimisations for UFC::update(...) in
>>>
>>>   https://bitbucket.org/fenics-project/dolfin/pull-request/73/
>>
>> I'll be happy to try those out and report figures. (Right now I'm
>> still getting a compilation error:
>> <https://bitbucket.org/fenics-project/dolfin/pull-request/73/changes-for-ufc-cell-clean-up/diff>.)
>
> You need to get the UFC changes too.
>
>>> Preserving the sparsity structure makes a significant difference, but I
>>> believe that you'll see the difference when GenericMatrix::apply() is
>>> called.
>>
>> That may be; right now, the computation time for the Krylov iteration
>> is much smaller than for the Jacobian construction, which is why I
>> haven't taken a closer look yet.
>>
>>> There is limited scope for speeding up this step (if you want to
>>> solve Ax=b).
>
> If you can get away without a preconditioner, you can: (1) go
> matrix-free; or (2) implement an unassembled matrix data structure that
> can do mat-vec products. I'm not aware of PETSc supporting unassembled
> matrices.
>
> Garth
>
>> Like I said, it's more that I'm solving a series of $A_i x = b_i$
>> (nonlinear time integration -- Navier-Stokes, in fact).
>>
>> Cheers,
>> Nico
>>
>> On Mon, Dec 16, 2013 at 3:08 PM, Garth N. Wells <[email protected]> wrote:
>>> On 2013-12-16 13:23, Nico Schlömer wrote:
>>>> I did some timings here and found that whenever the Jacobian is
>>>> constructed, most of the computational time goes into the cell
>>>> iteration in Assembler::assemble_cells(...). Specifically (in units
>>>> of seconds):
>>>
>>> Not surprising. It's where all the work is done.
>>>
>>>>   ufc.get_cell_integral:        00:00:00.006977
>>>>   ufc.update:                   00:00:01.133166
>>>>   get local-to-global dof maps: 00:00:00.009772
>>>>   integral->tabulate_tensor:    00:00:03.289413
>>>>   add_to_global_tensor:         00:00:01.693635
>>>
>>> How did you do the timing? Some of these calls involve so little work
>>> that they are hard to time.
>>>
>>>> I'm not entirely sure what the purpose of all of these calls is, but
>>>> I guess that tabulate_tensor does the actual heavy lifting, i.e., the
>>>> integration. Besides this, the ufc.update
>>>
>>> There are some optimisations for UFC::update(...) in
>>>
>>>   https://bitbucket.org/fenics-project/dolfin/pull-request/73/
>>>
>>> UFC::update(...) copies some data unnecessarily (e.g. topology data).
>>>
>>>> and the addition to the global tensor take a significant amount of
>>>> time. Since I'm solving a series of linear problems (in fact, a
>>>> (time) series of nonlinear problems) very similar in structure, I
>>>> think one call or the other might be cached away. The mere caching of
>>>> the sparsity structure as done in the Cahn-Hilliard demo doesn't do
>>>> much.
>>>
>>> Preserving the sparsity structure makes a significant difference, but
>>> I believe that you'll see the difference when GenericMatrix::apply()
>>> is called.
>>>
>>>> Does anyone have more insight into what might be worth exploiting
>>>> here for speedup?
>>>
>>> There is limited scope for speeding up this step (if you want to solve
>>> Ax=b). I have a plan for some mesh data reordering that can speed up
>>> insertion through improved data locality. One reason the insertion
>>> looks relatively costly is that the tabulate_tensor function is highly
>>> optimised. From numbers I've seen around, it can be several times
>>> faster than in other codes.
>>>
>>> It is possible to assemble into different sparse matrix data
>>> structures, but my experience is that this pushes the time cost
>>> further down the line, i.e. to the linear solver. You can try
>>> assembling into a dolfin::STLMatrix.
>>>
>>> Garth
>>>
>>>> Cheers,
>>>> Nico
>>>>
>>>> On Mon, Jun 3, 2013 at 8:54 PM, Garth N. Wells <[email protected]> wrote:
>>>>> On 3 June 2013 19:49, Nico Schlömer <[email protected]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> when solving nonlinear problems, I simply went with
>>>>>>
>>>>>>   # define F
>>>>>>   J = derivative(F, u)
>>>>>>   solve(F == 0, u, bcs, J=J)
>>>>>>
>>>>>> for now (which uses Newton's method).
>>>>>> I noticed, however, that the computation of the Jacobian,
>>>>>>
>>>>>>   nonlinear_problem.J(*_A, x);
>>>>>>
>>>>>> takes by far the most time in the computation.
>>>>>> Is there some caching I could employ? I need to solve a similar
>>>>>> nonlinear system in each time step.
>>>>>
>>>>> Use the lower-level NewtonSolver class. You then have complete
>>>>> control over how J is computed/supplied/cached.
>>>>>
>>>>> The Cahn-Hilliard demo illustrates use of the NewtonSolver class.
>>>>>
>>>>> Garth
>>>>>
>>>>>> --Nico

_______________________________________________
fenics-support mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics-support
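
For anyone landing on this thread later: a minimal sketch of the
NewtonSolver route Garth describes, modeled loosely on the Cahn-Hilliard
demo. The class, the stored form references, and the Jacobian-freezing
policy are illustrative only (untested, not the demo's actual code):

  #include <dolfin.h>
  using namespace dolfin;

  // A NonlinearProblem that decides for itself when to reassemble J;
  // plain solve(F == 0, ...) does not expose this kind of control.
  class MyProblem : public NonlinearProblem
  {
  public:
    MyProblem(const Form& F, const Form& J)
      : reassemble(true), _F(F), _J(J) {}

    void F(GenericVector& b, const GenericVector& x)
    { assemble(b, _F); }  // residual: always reassembled

    void J(GenericMatrix& A, const GenericVector& x)
    {
      if (reassemble)
        assemble(A, _J);
      reassemble = false;  // keep the cached Jacobian afterwards
    }

    bool reassemble;

  private:
    const Form& _F;
    const Form& _J;
  };

  // Usage:
  //   MyProblem problem(F_form, J_form);
  //   NewtonSolver newton;
  //   newton.solve(problem, *u.vector());

Freezing J turns Newton into a modified-Newton (chord) iteration: it
trades Jacobian assembly time against possibly more iterations, so
resetting 'reassemble' every time step (or every few steps) is a
natural compromise.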
