On Wed, Nov 10, 2010 at 03:58:13PM +0000, Garth N. Wells wrote:
>
> On 10/11/10 15:47, Anders Logg wrote:
> > On Wed, Nov 10, 2010 at 02:47:30PM +0000, Garth N. Wells wrote:
> > > Nice to see multi-thread assembly being added. We should look at
> > > adding support for the multi-threaded version of SuperLU. What
> > > other multi-threaded solvers are out there?
> >
> > Yes, that would be good, but I don't know which solvers are
> > available.
> >
> > > I haven't looked at the code in great detail, but are element
> > > tensors being added to the global tensor in a thread-safe fashion?
> > > Neither PETSc nor Trilinos is thread-safe.
> >
> > Yes, they should be. That's the main point. It's a very simple
> > algorithm which just partitions the matrix row by row and makes
> > each process responsible for a chunk of rows.
>
> Would it be better to partition the mesh (using Metis) and then
> renumber the dofs? That way the 'all_in_range' case would be
> maximised and the 'some_in_range' case minimised. If the rows are
> distributed, mixed elements will be a mess because the global rows
> are far apart (using the FFC-generated dof map).
Renumbering is definitely important for getting good speedup. This
hasn't been added yet, but it was implemented in the prototype version
(which was stand-alone from DOLFIN).

> > During assembly, all processes iterate over the entire mesh, and
> > on each cell do one of three things:
> >
> >   1. all_in_range:  call tabulate_tensor as usual and add
> >   2. none_in_range: skip tabulate_tensor (continue)
> >   3. some_in_range: call tabulate_tensor and insert only rows in
> >      range
> >
> > Didem Unat (PhD student at UCLA/Simula) tried this in a simple
> > prototype code and got very good speedups (up to a factor of 7 on
> > an eight-core machine)
>
> I hope the test wasn't for Poisson on a dolfin::UnitFoo mesh.

That would have been much better! The best speedups are seen for
"complex" forms where tabulate_tensor takes a significant portion of
the time. The best speedup was obtained for the Navier-Stokes
nonlinear term in 3D (unit cube). For 3D elasticity the results were
almost as good, and a bit worse for Poisson. The speedup was reduced
slightly by going to higher order, but not by much (from 7 to 6 when
going to cubics).

--
Anders

> Garth
>
> > so it's just a matter of doing the same thing as part of DOLFIN
> > (which is a bit trickier since some of the data access is hidden).
> > The current implementation in DOLFIN seems to work and give some
> > small speedup, but I need to do some more testing.
> >
> > > Rather than having two assembly classes, would it be worth using
> > > OpenMP instead? I experimented with OpenMP some time ago, but
> > > never added it since at the time it required a very recent
> > > version of gcc. This shouldn't be a problem now.
> >
> > I don't think this would work with OpenMP since we need to control
> > how the rows are inserted.
> >
> > If this works out and we get good speedups, we could consider
> > replacing Assembler by MulticoreAssembler. It's not that much extra
> > code and it's pretty clean.
> > I haven't tried yet, but it should also work in combination with
> > MPI (each node has a part of the mesh and does multi-core
> > assembly).

_______________________________________________
Mailing list: https://launchpad.net/~dolfin
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~dolfin
More help   : https://help.launchpad.net/ListHelp

