Re: [petsc-dev] Improving and stabilizing GPU support

Karl Rupp Fri, 19 Jul 2013 18:29:34 -0700

Hi Dave,

That sounds very reasonable.  Regarding polynomial preconditioning, were you
thinking of least squares polynomial preconditioning or something else?

I haven't thought about anything specific yet, just about the
infrastructure for applying any p(A).
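
As a rough illustration of what "applying any p(A)" could mean in practice, here is a minimal sketch in plain Python that evaluates p(A)x with Horner's scheme using only matrix-vector products. The names apply_polynomial and matvec are hypothetical and not part of PETSc or ViennaCL; the coefficient ordering (constant term first) is an assumption of this sketch.

```python
def apply_polynomial(matvec, coeffs, x):
    """Return p(A) x for p(t) = coeffs[0] + coeffs[1]*t + ... + coeffs[n]*t**n.

    matvec(v) must return A v as a list; A itself is never formed or accessed.
    Horner's scheme needs exactly n matrix-vector products.
    (Hypothetical helper for illustration, not an existing library routine.)
    """
    if not coeffs:
        raise ValueError("need at least one coefficient")
    # Start with the leading coefficient: y = c_n * x
    y = [coeffs[-1] * xi for xi in x]
    # Work downwards: y <- A y + c_k x
    for c in reversed(coeffs[:-1]):
        ay = matvec(y)
        y = [ayi + c * xi for ayi, xi in zip(ay, x)]
    return y


if __name__ == "__main__":
    # Toy operator: A = diag(2, 3), given only through its action on a vector.
    diag = [2.0, 3.0]
    A = lambda v: [d * vi for d, vi in zip(diag, v)]
    # p(t) = 1 + 2 t + 3 t^2 applied to x = (1, 1)
    print(apply_polynomial(A, [1.0, 2.0, 3.0], [1.0, 1.0]))
```

Only the action of A is required here, which is why the same structure would work for a matrix-free operator or for a matrix living on a GPU.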

  > > Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
  > > When do you expect ViennaCL 1.5.0 to be available in PETSc?
  >
  > Jed gave me a good hint with respect to D-ILU0, which I'll also add to
  > PETSc. As with other GPU-accelerations using ILU, it will require a
  > proper matrix ordering to give good performance. I'm somewhat tempted to
  > port the SA-AMG implementation in CUSP to OpenCL as well, but this
  > certainly won't be in 1.5.0.

Porting SA-AMG to OpenCL also sounds attractive.  I was thinking that the
ViennaCL documentation already mentioned an algebraic preconditioner that was
in alpha or beta status.

The current AMG implementations all require a CPU-based setup stage and
thus limit the gain you could eventually get. In some cases where the
setup is less pronounced (e.g. lagging the preconditioner for nonlinear
or time-dependent problems) this is fine, but for stationary linear
problems with regular operators this is not very competitive.

I'm still trying to get my mind around the memory bandwidth issue for sparse
linear algebra.  Your report above of the Intel result adds to my confusion.
From my understanding, the theoretical peak memory bandwidth for some systems
of interest is as follows:

Dual socket Sandy Bridge:  102 GB/s
Nvidia Kepler K20X:        250 GB/s
Intel Xeon Phi:            350 GB/s

What I am trying to understand is what sort of memory bandwidth is achievable
by a good implementation for the sparse linear algebra that PETSc does with
an iterative solver like CG using Jacobi preconditioning.  The plots which I
sent links to yesterday seemed to show memory bandwidth for a dual socket
Sandy Bridge to be well below the theoretical peak, perhaps less than 50 GB/s
for 16 threads.  For Xeon Phi, you are saying that Intel could not get more
than 95 GB/s.  But I saw a presentation last week where Nvidia was getting
about 200 GB/s for a matrix transpose.  So it makes me wonder if the
different systems are equally good at exploiting their theoretical peak
memory bandwidths or whether one, like the Nvidia K20X, might be better.  If
that were the case, then I might expect a good implementation of sparse
linear algebra on a Kepler K20X to be 4-5 times faster than a good
implementation on a dual socket Sandy Bridge node rather than a 2.5x
difference.
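
To make the arithmetic behind the 2.5x versus 4-5x estimate explicit, here is a minimal sketch. It assumes the kernel is purely memory-bandwidth bound, so that runtime scales inversely with sustained bandwidth; the function name is hypothetical and the inputs are simply the figures quoted above.

```python
def bandwidth_bound_speedup(bandwidth_fast, bandwidth_slow):
    """Expected speedup of one system over another for a kernel whose
    runtime is limited only by memory bandwidth (both values in GB/s).
    Assumption of this sketch: runtime is proportional to 1 / bandwidth."""
    if bandwidth_fast <= 0 or bandwidth_slow <= 0:
        raise ValueError("bandwidths must be positive")
    return bandwidth_fast / bandwidth_slow


if __name__ == "__main__":
    # Theoretical peaks: K20X (250 GB/s) vs. dual-socket Sandy Bridge (102 GB/s)
    print("peak vs. peak:         %.2fx" % bandwidth_bound_speedup(250.0, 102.0))
    # Figures from the paragraph above: about 200 GB/s vs. less than 50 GB/s
    print("reported vs. reported: %.2fx" % bandwidth_bound_speedup(200.0, 50.0))
```

The second ratio is only meaningful if both numbers come from comparable kernels, which a matrix transpose and a sparse solver are not necessarily.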

Intel's marketing machinery was tricking you: the 350 GB/sec is the
peak bandwidth from the ring bus connecting the MIC cores to GDDR
memory. However, the internal ring bus operates at only 220 GB/sec (see
for example the following paper [1]). With some prefetching tricks and
Intel pragma/compiler magic one obtains about 160 GB/sec for the STREAM
benchmark, which is roughly 75% of that peak. The Intel OpenCL SDK adds
another loss here, resulting in only 95 GB/sec. This was why I got in
contact with Intel in order to find out whether this is a weakness of
the SDK or whether I missed something. Turned out to be the former...

As you know, for dual-socket systems one only gets good bandwidth if
the data is placed in memory in a NUMA-aware way. On such a dual-socket
system I recently managed to get 75 GB/sec with OpenCL, which is again
about 75% of peak performance. Unfortunately OpenCL does not consider
NUMA, so this is not very stable: you may get only half of it if all
data happens to reside on the same memory link.

On GPUs, including the K20X, one also obtains about 75% of peak: on a
Radeon 7970 I got 220 GB/sec out of 288 GB/sec theoretical peak, other
people even reported up to 250 GB/sec for a GTX Titan (288 GB/sec
theoretical peak), and I also got 131 GB/sec out of 159 GB/sec peak for
a rather dated GTX 285.

Overall, the rule of thumb seems to be 75% of peak if everything is done
correctly and if one finds the right baseline (the Xeon Phi is a beast
in this regard). These are numbers for sequential reads, so neither
cache effects nor other mechanisms such as paging introduce spurious
effects. When it comes to actual optimizations for sparse linear
algebra, CPUs and GPUs ask for slightly different sets of optimizations
because cache lines and memory controllers differ...
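
For reference, the fractions behind this rule of thumb can be recomputed from the figures quoted in this thread. This is a minimal sketch: the dictionary labels and function names are made up for illustration, and the 102 GB/sec dual-socket peak is taken from the table quoted earlier rather than from a separate measurement.

```python
# (achieved GB/s, peak GB/s) pairs as quoted in this thread.
# The labels are illustrative; the dual-socket peak of 102 GB/s is the
# Sandy Bridge figure from the quoted table (an assumption of this sketch).
MEASUREMENTS = {
    "Xeon Phi, STREAM vs. ring bus": (160.0, 220.0),
    "Dual-socket CPU, OpenCL": (75.0, 102.0),
    "Radeon 7970": (220.0, 288.0),
    "GTX Titan": (250.0, 288.0),
    "GTX 285": (131.0, 159.0),
}


def fraction_of_peak(achieved, peak):
    """Sustained bandwidth as a fraction of the chosen peak baseline."""
    if peak <= 0:
        raise ValueError("peak bandwidth must be positive")
    return achieved / peak


def all_fractions(measurements):
    """Map each label to its achieved/peak fraction."""
    return {name: fraction_of_peak(a, p) for name, (a, p) in measurements.items()}


if __name__ == "__main__":
    for name, frac in all_fractions(MEASUREMENTS).items():
        print("%-32s %5.1f%% of peak" % (name, 100.0 * frac))
```

The result depends strongly on which peak is used as the baseline: measured against the advertised 350 GB/sec instead of the 220 GB/sec ring bus, the Xeon Phi figure would look far worse.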


Best regards,
Karli

[1] http://arxiv.org/abs/1302.1078
