Roy Stogner wrote: > > There's an optimization project proposal being put together at > UT-Austin, which is likely to end up getting some new eyes towards > optimizing libMesh; possibly PETSc later. (not to insinuate that > PETSc isn't already better optimized than we are, but since it ends up > being a big chunk of many libMesh apps' runtime then any possible > further improvements would be welcome)
The core PETSc kernels have had a bit of attention, but there is certainly more loop optimization that would be possible (and restrict annotation, many compilers support it even though C++ refuses to mandate it). But this will have very limited effect because almost nothing is CPU-bound (less every year), instead we're limited by memory bandwidth and L1 reuse. This paper is still very relevant ftp://info.mcs.anl.gov/pub/tech_reports/reports/P833.pdf My lightweight p-version FEM library (implemented as a PETSc DM) performs two memory "optimizations" that have significant effects on the solver: 1. Interlace fields in vector problems, so e.g. deformation dofs in elasticity would be ordered [u0,v0,w0,u1,v1,w1,...]. This allows better L1 reuse (better hardware prefetch because there are fewer pointers to track, you only pay for the latency of a cache miss once for all 3 fields rather than once per field). It also allows blocked matrix formats (BAIJ) which are smaller (one column index per block rather than one per entry, thus sizeof(MatScalar) + sizeof(PetscInt)/(bs*bs) per entry instead of sizeof(MatScalar) + sizeof(PetscInt) Furthermore, matrix kernels (multiplication, factorization, and triangular solves) are fully unrolled within blocks and SOR turns into block SOR (stronger). Insertion is also faster with blocked formats, note that you can use MatSetValuesBlocked() (and variants) even with scalar formats, you just need to call MatSetBlockSize(). Even when you are using a scalar format (AIJ), interlacing fields offers more than just better cache reuse because PETSc uses inodes (by default) for the critical kernels. Inodes identify consecutive rows with identical nonzero structure and store the column indices once, then matrix kernels will be unrolled over the identical rows meaning that you are only seeing sizeof(MatScalar) + sizeof(PetscInt)/bs per entry. Also SOR again becomes block SOR. 2. Reorder dofs within subdomains to reduce bandwidth. This is *not* to speed up direct solves, it is a myth that direct solvers try to minimize bandwidth. It is to improve L1 reuse in matrix kernels and to make relaxation and *incomplete* factorization stronger. Try running src/ksp/ksp/examples/tutorials/ex2 (or any other) with -pc_type lu -pc_factor_mat_ordering_type nd -mat_view_ordering_draw \ -draw_pause 2 -mat_view_draw -info |grep fill Nested dissection (nd) is the default ordering for LU, it is *not* low-bandwidth. Compare to a low-bandwidth ordering like reverse Cuthill-McKee (rcm) which will produce more fill. Now compare iteration count with -pc_type ilu and the ordering choices above, ND is a horrible ordering (much worse than the natural ordering in this case) for incomplete factorization. I reorder nodes within subdomains so that my natural ordering is approximately RCM (by default), thus incomplete factorization with the natural ordering is as strong (and fast) as possible. Beyond these, large improvements are mostly algorithmic (physics-based preconditioners, latency tolerant Krylov methods, improved globalization schemes, stiff time integration, etc). Jed
signature.asc
Description: OpenPGP digital signature
------------------------------------------------------------------------------ Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july
_______________________________________________ Libmesh-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/libmesh-devel
