>> It seems this thread got a bit derailed yesterday :( >> >> I've done some more careful profiling: >> *) Full assemble once >> *) Assemble matrix 30 times without reset >> in order to amortize the time for initialization. >> >> The call graph shows that std::lower_bound is called from add: >> dolfin::GenericMatrix::add -> >> dolfin::uBlasMatrix<>:add -> >> std::lower_bound >> >> In this assembly add+children takes 89% of the time, and tabulate_tensor >> taking roughly 9%. Full (gprof) profile attached. Murtazo is probably >> right that the performance numbers are virtually the same with PETSc. I >> will hook it up and try (and let you know if this is not the case). >> > > Yes, I got the same numbers with PETSc. I checked and it it the same > problem with uBlas, I am pretty sure that searching the elements in the > assembly takes a very long time. Is it possible to change the element > matrix A(indx) directly in uBlas? If it is possible we may get that the > speedup would be much. > > There is a MatSetOption in PETSc, MAT_USE_HASH_TABLE, to take care what > exactly I would like to have. But, that option does not work with AIJ > format we are using in dolfin.
Ok. Good. Why does this not work? With what matrix formats does it work? /Johan > murtazo > >> I do appreciate the complexity of inserting elements into a sparse >> matrix, and I do _not_ claim to know better when it comes to the >> assembler architecture. >> >> Still, as I vary the size of the mesh I get this performance metric >> virtually constant: >> Assembled 7.3e+05 non-zero matrix elements per second (first pass) >> Assembled 1.4e+06 non-zero matrix elements per second (re-assemble). >> >> Is this a sensible metric? If so, is it well understood how the DOLFIN >> assembler performs? If not, I could put together a test-suite for a >> range of forms (2/3D, simple/mixed element, simple/complicated >> expressions in the form etc). >> >> To restate my question: How should I verify that the assembler is >> performing as expected here? Am I looking at some unexpected overhead in >> this assembly (we all know how hard this can be to spot with C++)? >> >> Thanks! >> /Dag >> >> Garth N. Wells wrote: >>> >>> Anders Logg wrote: >>>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote: >>>>>> Hello! >>>>>> >>>>>> I'm looking at a "suspiciously slow" assembly and would like to >>>>>> determine what is going on. In general, what should one expect the >>>>>> most >>>>>> time-consuming step to be? >>>>>> >>>>>> This is what my gprof looks like: >>>>>> >>>>>> Time: >>>>>> 61.97% unsigned int const* std::lower_bound >>>>>> 25.84% dolfin::uBlasMatrix<...>::add >>>>>> 8.27% >>>>>> UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor >>>>>> 1.1% dolfin::uBlasMatrix<...>::init >>>> Where is lower_bound used? From within uBlasMatrix::add or is it in >>>> building the sparsity pattern? >>>> >>> >>> I suspect that it's either in building the sparsity pattern or >>> initialising the uBLAS matrix. The matrix structure is initialised by >>> running across rows and inserting a zero. uBLAS doesn't provide a >>> mechanism for initialising the underlying data structures directly for >>> a >>> sparse matrix. >>> >>> Dag: could you you run the same test using PETSc as the backend? >>> >>>>> I got these numbers also. I understand that it is very painful in >>>>> large >>>>> computations. >>>>> >>>>> I see what is a problem with adding into the stiffness matrix A. >>>>> Searching >>>>> the position of the element which needs to be added takes very long >>>>> time, >>>>> especially if you are solving big problems with thousands unknowns >>>>> and >>>>> repeating the assembling a lot of times! >>>> If you know a good way to avoid inserting entries into a sparse matrix >>>> during assembly, please tell me... :-) >>>> >>>> If the assembly is costly, you might want to try assembling the action >>>> of it instead and send that to a Krylov solver. Inserting into a >>>> vector is much easier than into a sparse matrix. >>>> >>>>> One way could be finding the global indices of the matrix A once, and >>>>> use >>>>> it in the assembly process. By this way we avoid of searching the >>>>> element >>>>> position and it makes the process significantly fast. But, there is a >>>>> problem: somehow I cannot get access to the global index of cell in >>>>> the A >>>>> and change it instead of using MatSetValues (in PETSc). >>>> I don't understand what you suggest here. We do precompute the >>>> sparsity pattern of the matrix and use that to preallocate, but I >>>> don't know of any other way to insert entries than MatSetValues. >>>> >>> >>> I doubt insertion is the real problem, especially as Dag noted that >>> subsequent assembly operations take only half the time since the matrix >>> is already initialised. >>> >>> PETSc (and no doubt Trilinos) do offer some assembly possibilities that >>> we haven't yet exploited because they require a reorganisation of the >>> dof map. >>> >>> Garth >>> >>>>> I am pretty sure that we may speed up the A.set() and A.get() >>>>> processes as >>>>> well by the above method. >>>> Please explain. >>>> >>>>> I am not sure how the dofmap to get rows and cols indices of the >>>>> cells >>>>> is >>>>> implemented. We could avoid repeating this operation as well. >>>> This is already implemented (but maybe not used). DofMap handles this. >>>> It wraps the generated ufc::dof_map code and may pretabulate (and >>>> possibly reorder) the dofs. >>>> >>>>> We did some comparison with another free fem toolbox, FemLego, the >>>>> assembly process in Dolfin is 3 times slower than FemLego in 2D. I >>>>> believe >>>>> this number will increase in 3D. FemLego uses quadrature rule for >>>>> computing integrals. >>>> Can you benchmark the various parts of the assembly to see what causes >>>> the slowdown: >>>> >>>> 1. Is it tabulate_tensor? >>>> 2. Is it tabulate_dofs? >>>> 3. Is it A.add()? >>>> 4. Something else? >>>> >>>>> I hope some PETSc guys will help us to do this improvements. Any >>>>> other >>>>> ideas are welcome! >>>> We are currently experimenting with collecting and preprocessing >>>> batches of entries before inserting into the global sparse matrix in >>>> hope of speeding up the assembly but we don't have any results yet. >>>> >>> _______________________________________________ >>> DOLFIN-dev mailing list >>> [email protected] >>> http://www.fenics.org/mailman/listinfo/dolfin-dev >> _______________________________________________ >> DOLFIN-dev mailing list >> [email protected] >> http://www.fenics.org/mailman/listinfo/dolfin-dev >> > > > _______________________________________________ > DOLFIN-dev mailing list > [email protected] > http://www.fenics.org/mailman/listinfo/dolfin-dev > _______________________________________________ DOLFIN-dev mailing list [email protected] http://www.fenics.org/mailman/listinfo/dolfin-dev
