Re: [Libmesh-devel] optimized function code (fwd)

Jed Brown Thu, 19 Nov 2009 03:53:58 -0800

Roy Stogner wrote:
> 
> There's an optimization project proposal being put together at
> UT-Austin, which is likely to end up getting some new eyes towards
> optimizing libMesh; possibly PETSc later.  (not to insinuate that
> PETSc isn't already better optimized than we are, but since it ends up
> being a big chunk of many libMesh apps' runtime then any possible
> further improvements would be welcome)


The core PETSc kernels have had a bit of attention, but there is
certainly more loop optimization that would be possible (and restrict
annotation, many compilers support it even though C++ refuses to mandate
it).  But this will have very limited effect because almost nothing is
CPU-bound (less every year), instead we're limited by memory bandwidth
and L1 reuse.

This paper is still very relevant

  ftp://info.mcs.anl.gov/pub/tech_reports/reports/P833.pdf


My lightweight p-version FEM library (implemented as a PETSc DM)
performs two memory "optimizations" that have significant effects on the
solver:

1. Interlace fields in vector problems, so e.g. deformation dofs in
elasticity would be ordered [u0,v0,w0,u1,v1,w1,...].  This allows better
L1 reuse (better hardware prefetch because there are fewer pointers to
track, you only pay for the latency of a cache miss once for all 3
fields rather than once per field).  It also allows blocked matrix
formats (BAIJ) which are smaller (one column index per block rather than
one per entry, thus

  sizeof(MatScalar) + sizeof(PetscInt)/(bs*bs)

per entry instead of

  sizeof(MatScalar) + sizeof(PetscInt)

Furthermore, matrix kernels (multiplication, factorization, and
triangular solves) are fully unrolled within blocks and SOR turns into
block SOR (stronger).  Insertion is also faster with blocked formats,
note that you can use MatSetValuesBlocked() (and variants) even with
scalar formats, you just need to call MatSetBlockSize().  Even when you
are using a scalar format (AIJ), interlacing fields offers more than
just better cache reuse because PETSc uses inodes (by default) for the
critical kernels.  Inodes identify consecutive rows with identical
nonzero structure and store the column indices once, then matrix kernels
will be unrolled over the identical rows meaning that you are only
seeing

  sizeof(MatScalar) + sizeof(PetscInt)/bs

per entry.  Also SOR again becomes block SOR.


2. Reorder dofs within subdomains to reduce bandwidth.  This is *not* to
speed up direct solves, it is a myth that direct solvers try to minimize
bandwidth.  It is to improve L1 reuse in matrix kernels and to make
relaxation and *incomplete* factorization stronger.  Try running
src/ksp/ksp/examples/tutorials/ex2 (or any other) with

  -pc_type lu -pc_factor_mat_ordering_type nd -mat_view_ordering_draw \
     -draw_pause 2 -mat_view_draw -info |grep fill

Nested dissection (nd) is the default ordering for LU, it is *not*
low-bandwidth.  Compare to a low-bandwidth ordering like reverse
Cuthill-McKee (rcm) which will produce more fill.

Now compare iteration count with -pc_type ilu and the ordering choices
above, ND is a horrible ordering (much worse than the natural ordering
in this case) for incomplete factorization.  I reorder nodes within
subdomains so that my natural ordering is approximately RCM (by
default), thus incomplete factorization with the natural ordering is as
strong (and fast) as possible.


Beyond these, large improvements are mostly algorithmic (physics-based
preconditioners, latency tolerant Krylov methods, improved globalization
schemes, stiff time integration, etc).

Jed

signature.asc
Description: OpenPGP digital signature

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july

_______________________________________________
Libmesh-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/libmesh-devel

Re: [Libmesh-devel] optimized function code (fwd)

Reply via email to