On Fri 2008-05-23 10:30, amjad ali wrote:
> Would you please explain a bit about "unassembled structures"?
> Does Discontinuous Galerkin Method falls into this category?
I'm doing some work on this, so I'll try to answer. There are two components which can be ``unassembled'': the matrix application and the preconditioner. In general, unassembled makes the most sense for semi-structured approximations where there is natural data granularity similar to L1 cache size.

A standard example is the spectral or p-version finite element method. In these methods, the element stiffness matrix is dense and can be very large, but it is possible to apply it without storing the entries. For instance, suppose we have polynomial order p-1 on a hexahedral element. Then there are p^3 element degrees of freedom and the global matrix will have p^6 nonzeros contributed by this element (if we assemble it). For p=10, this is already big and will make preconditioning very expensive. On the other hand, we can apply the element Jacobian in O(p^3) space and O(p^4) time if we exploit a tensor product basis. Even if we don't use tensor products, the space requirement can stay the same (only the time goes back up to O(p^6)). For high order p, this is far fewer operations, but the important point with regard to FLOP/s is that there is now more work per amount of memory touched.

If there are N elements, the total number of degrees of freedom is O(N p^3) and the assembled matrix has O(N p^6) entries, so multiplication costs O(N p^6) in both time and space. If applied locally (i.e. scatter to the local basis, apply there, scatter back), the space requirement is O(N p^3) and the time is O(N p^4) or O(N p^6) depending on the basis. In addition, the element operations can often be done entirely in L1 cache. Clearly this puts much less stress on the memory bus.

Of course, just applying the Jacobian fast won't cut it; we also need a preconditioner. One way is to assemble a sparser approximation to the global Jacobian and apply standard preconditioners to it. Another is to apply a domain decomposition preconditioner which exploits the data granularity. For instance, Schur complement preconditioners can be formed based on solves on a single element. These are most attractive when there is a `fast' way to solve the local problem on a single element, but they can be expected to increase the FLOP/s rate either way because the memory access pattern requires more work for a given amount of memory. (I don't know a lot about DD preconditioners; I'm using the former approach, assembling a sparser approximation of the Jacobian.)

Discontinuous Galerkin happens to be easy to implement for high order elements, and the scatter from the global to the local basis is trivial.

Jed
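[Editor's note: for concreteness, here is a minimal sketch of the sum-factorized element apply described above. The p^3 element coefficients are contracted against 1D reference matrices one direction at a time, giving O(p^4) work per element instead of the O(p^6) of a dense element matrix. The storage convention (u[(i*p+j)*p+k]) and all names are illustrative assumptions, not taken from the original post.]

/* Sum-factorized application of the tensor-product operator
 *   v = (A (x) B (x) C) u
 * on one hexahedral element with p coefficients per direction.
 * u and v hold p^3 values indexed as u[(i*p + j)*p + k]; A, B, C are
 * dense p-by-p matrices stored row-major.  Each of the three sweeps
 * costs O(p^4), versus O(p^6) for multiplying by the assembled
 * p^3-by-p^3 element matrix.  t1 and t2 are scratch arrays of size p^3. */
void tensor_apply(int p,
                  const double *A, const double *B, const double *C,
                  const double *u, double *v,
                  double *t1, double *t2)
{
  /* Contract the last index with C: t1[i,j,k] = sum_l C[k,l] u[i,j,l] */
  for (int i = 0; i < p; i++)
    for (int j = 0; j < p; j++)
      for (int k = 0; k < p; k++) {
        double s = 0.0;
        for (int l = 0; l < p; l++) s += C[k*p + l] * u[(i*p + j)*p + l];
        t1[(i*p + j)*p + k] = s;
      }
  /* Contract the middle index with B: t2[i,j,k] = sum_l B[j,l] t1[i,l,k] */
  for (int i = 0; i < p; i++)
    for (int j = 0; j < p; j++)
      for (int k = 0; k < p; k++) {
        double s = 0.0;
        for (int l = 0; l < p; l++) s += B[j*p + l] * t1[(i*p + l)*p + k];
        t2[(i*p + j)*p + k] = s;
      }
  /* Contract the first index with A: v[i,j,k] = sum_l A[i,l] t2[l,j,k] */
  for (int i = 0; i < p; i++)
    for (int j = 0; j < p; j++)
      for (int k = 0; k < p; k++) {
        double s = 0.0;
        for (int l = 0; l < p; l++) s += A[i*p + l] * t2[(l*p + j)*p + k];
        v[(i*p + j)*p + k] = s;
      }
}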
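[Editor's note: and here is a rough sketch of how the two pieces might be wired up in PETSc: the unassembled operator lives in a MatShell whose multiply scatters to the element basis, calls a per-element kernel such as tensor_apply() above, and scatters back, while a separately assembled, sparser approximation P is handed to the KSP as the preconditioning matrix. MatCreateShell, MatShellSetOperation, and KSPSetOperators are real PETSc calls, but the sketch assumes a recent PETSc API (PetscCall, two-argument KSPSetOperators), not the one current in 2008, and the AppCtx fields and the ApplyElement/ScatterToElement/ScatterAddFromElement helpers are placeholders for whatever the application provides.]

#include <petscksp.h>

/* Application context for the unassembled (matrix-free) operator. */
typedef struct {
  PetscInt nelem; /* number of elements */
  PetscInt p;     /* coefficients per direction on each element */
  /* ... element geometry, 1D reference matrices, scatter data, ... */
} AppCtx;

/* Placeholders to be supplied by the application. */
extern PetscErrorCode ScatterToElement(AppCtx *user, PetscInt e, Vec x, PetscScalar *xe);
extern PetscErrorCode ApplyElement(AppCtx *user, PetscInt e, const PetscScalar *xe, PetscScalar *ye);
extern PetscErrorCode ScatterAddFromElement(AppCtx *user, PetscInt e, const PetscScalar *ye, Vec y);

/* MATOP_MULT callback: y = A x applied element by element, never
 * forming the global matrix. */
static PetscErrorCode MatMult_Unassembled(Mat A, Vec x, Vec y)
{
  AppCtx      *user;
  PetscScalar *xe, *ye;
  PetscInt     nloc;

  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(A, &user));
  nloc = user->p * user->p * user->p;
  PetscCall(PetscMalloc2(nloc, &xe, nloc, &ye));
  PetscCall(VecZeroEntries(y));
  for (PetscInt e = 0; e < user->nelem; e++) {
    PetscCall(ScatterToElement(user, e, x, xe));      /* global -> element basis */
    PetscCall(ApplyElement(user, e, xe, ye));         /* O(p^4) work, fits in L1 */
    PetscCall(ScatterAddFromElement(user, e, ye, y)); /* element -> global */
  }
  PetscCall(PetscFree2(xe, ye));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Hook the shell operator and a sparser assembled approximation P into
 * a KSP: A is applied unassembled, P is what the preconditioner
 * (e.g. -pc_type ilu or -pc_type gamg) actually sees. */
static PetscErrorCode SetupSolver(MPI_Comm comm, AppCtx *user, PetscInt nglobal, Mat P, KSP *ksp)
{
  Mat A;

  PetscFunctionBeginUser;
  PetscCall(MatCreateShell(comm, PETSC_DECIDE, PETSC_DECIDE, nglobal, nglobal, user, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MatMult_Unassembled));
  PetscCall(KSPCreate(comm, ksp));
  PetscCall(KSPSetOperators(*ksp, A, P)); /* apply A, precondition with P */
  PetscCall(KSPSetFromOptions(*ksp));
  PetscFunctionReturn(PETSC_SUCCESS);
}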
