Hi Dave,

That sounds very reasonable.  Regarding polynomial preconditioning, were you
thinking of least squares polynomial preconditioning or something else?

I haven't thought about anything specific yet, just about the infrastructure for applying any p(A).
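
To make "applying any p(A)" a bit more concrete, here is a minimal sketch of what such infrastructure could boil down to. It is not ViennaCL or PETSc code. The names apply_polynomial and dense_matvec are hypothetical, and the monomial coefficient layout is an assumption chosen for brevity; a least-squares or Chebyshev variant would typically use a different basis. The only operation required from the matrix is a matrix-vector product, and Horner's rule keeps it at one product per polynomial degree.

```python
def dense_matvec(A):
    """Return a function v -> A*v for a small dense matrix given as a list of rows.

    Stand-in for a real sparse matrix-vector product (hypothetical helper).
    """
    return lambda v: [sum(a * vj for a, vj in zip(row, v)) for row in A]


def apply_polynomial(matvec, coeffs, x):
    """Compute p(A) x for p(t) = coeffs[0] + coeffs[1]*t + ... + coeffs[n]*t^n.

    Uses Horner's rule: p(A) x = c0*x + A(c1*x + A(c2*x + ...)),
    so only n matrix-vector products are needed and A is never formed
    as a power explicitly.
    """
    if not coeffs:
        return [0.0] * len(x)
    # Start with the highest-order coefficient.
    y = [coeffs[-1] * xi for xi in x]
    # Work downwards: y <- A*y + c_k * x
    for c in reversed(coeffs[:-1]):
        Ay = matvec(y)
        y = [ayi + c * xi for ayi, xi in zip(Ay, x)]
    return y


if __name__ == "__main__":
    A = [[2.0, 0.0], [0.0, 3.0]]
    # p(t) = 1 + t + t^2 applied to the vector of ones
    print(apply_polynomial(dense_matvec(A), [1.0, 1.0, 1.0], [1.0, 1.0]))
```

Since the polynomial only touches the operator through matvec, the same routine would work unchanged with a GPU-backed sparse product.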

  > > Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
  > > When do you expect ViennaCL 1.5.0 to be available in PETSc?
  >
  > Jed gave me a good hint with respect to D-ILU0, which I'll also add to
  > PETSc. As with other GPU-accelerations using ILU, it will require a
  > proper matrix ordering to give good performance. I'm somewhat tempted to
  > port the SA-AMG implementation in CUSP to OpenCL as well, but this
  > certainly won't be in 1.5.0.

Porting SA-AMG to OpenCL also sounds attractive.  I was thinking that the
ViennaCL documentation already mentioned an algebraic preconditioner that was
in alpha or beta status.

The current AMG implementations all require a CPU-based setup stage, which limits the gain you can eventually get. In cases where the setup cost matters less (e.g. when lagging the preconditioner for nonlinear or time-dependent problems) this is fine, but for stationary linear problems with regular operators it is not very competitive.


I'm still trying to get my mind around the memory bandwidth issue for sparse
linear algebra.  Your report above of the Intel result adds to my confusion.
From my understanding, the theoretical peak memory bandwidth for some systems
of interest is as follows:

Dual socket Sandy Bridge:  102 GB/s
Nvidia Kepler K20X:        250 GB/s
Intel Xeon Phi:            350 GB/s

What I am trying to understand is what sort of memory bandwidth is achievable
by a good implementation for the sparse linear algebra that PETSc does with
an iterative solver like CG using Jacobi preconditioning.  The plots which I
sent links to yesterday seemed to show memory bandwidth for a dual socket
Sandy Bridge to be well below the theoretical peak, perhaps less than 50 GB/s
for 16 threads.  For Xeon Phi, you are saying that Intel could not get more
than 95 GB/s.  But I saw a presentation last week where Nvidia was getting
about 200 GB/s for a matrix transpose.  So it makes me wonder if the
different systems are equally good at exploiting their theoretical peak
memory bandwidths or whether one, like the Nvidia K20X, might be better.  If
that were the case, then I might expect a good implementation of sparse
linear algebra on a Kepler K20X to be 4-5 times faster than a good
implementation on a dual socket Sandy Bridge node rather than a 2.5x
difference.

Intel's marketing machinery was tricking you: the 350 GB/sec is the peak bandwidth from the ring bus (which connects the MIC cores) to the GDDR memory. The internal ring bus itself, however, operates at only 220 GB/sec (see for example the paper [1]). With some prefetching tricks and Intel pragma/compiler magic one obtains about 160 GB/sec for the STREAM benchmark, which is roughly 75% of that 220 GB/sec peak. The Intel OpenCL SDK adds another loss here, resulting in only 95 GB/sec. This is why I got in contact with Intel: I wanted to find out whether this is a weakness of the SDK or whether I missed something. It turned out to be the former...

As you know, on dual-socket systems one only gets good bandwidth if the data is placed in memory in a NUMA-aware way. On such a dual-socket system I recently managed to get 75 GB/sec with OpenCL, which is again about 75% of peak. Unfortunately, OpenCL does not consider NUMA, so this is not very stable: you may get only half of it if all data happens to reside on the same memory link.

On GPUs, including the K20X, one also obtains about 75% of peak: on a Radeon 7970 I got 220 GB/sec out of 288 GB/sec theoretical peak, other people have even reported up to 250 GB/sec for a GTX Titan (288 GB/sec theoretical peak), and I got 131 GB/sec out of 159 GB/sec peak on a rather dated GTX 285.

Overall, the rule of thumb seems to be 75% of peak if everything is done correctly and if one finds the right baseline (the Xeon Phi is a beast in this regard). These are numbers for sequential reads, so neither cache effects nor other mechanisms such as paging distort them. When it comes to actual optimizations for sparse linear algebra, CPUs and GPUs ask for slightly different sets of optimizations because cache lines and memory controllers differ...
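
For what it is worth, the arithmetic behind this rule of thumb can be checked with a few lines. This is only a sketch using the figures quoted in this thread. The function names are made up for illustration, and pairing my 75 GB/sec dual-socket measurement with the 102 GB/sec peak from your table is an assumption. The last function shows that if every platform reaches the same fraction of its peak, the expected speedup for a bandwidth-bound kernel is simply the ratio of the peaks.

```python
# (label, measured GB/sec, baseline peak GB/sec), figures as quoted in this thread
MEASUREMENTS = [
    ("Xeon Phi, STREAM with prefetch/pragmas", 160.0, 220.0),
    ("Xeon Phi, Intel OpenCL SDK", 95.0, 220.0),
    # Assumption: 102 GB/sec peak taken from the dual socket Sandy Bridge entry
    ("Dual socket CPU, OpenCL, NUMA-friendly", 75.0, 102.0),
    ("Radeon 7970", 220.0, 288.0),
    ("GTX Titan (reported by others)", 250.0, 288.0),
    ("GTX 285", 131.0, 159.0),
]


def fraction_of_peak(measured, peak):
    """Achieved fraction of the theoretical peak bandwidth."""
    return measured / peak


def estimated_sustained(peak, efficiency=0.75):
    """Rule-of-thumb sustained bandwidth for streaming reads."""
    return efficiency * peak


def estimated_speedup(peak_a, peak_b, efficiency_a=0.75, efficiency_b=0.75):
    """Expected speedup of device A over device B for a bandwidth-bound kernel."""
    return estimated_sustained(peak_a, efficiency_a) / estimated_sustained(peak_b, efficiency_b)


if __name__ == "__main__":
    for label, measured, peak in MEASUREMENTS:
        print(f"{label:42s} {100.0 * fraction_of_peak(measured, peak):5.1f}% of peak")
    # K20X (250 GB/sec) versus dual socket Sandy Bridge (102 GB/sec)
    print(f"K20X vs. dual Sandy Bridge: {estimated_speedup(250.0, 102.0):.2f}x")
```

With equal efficiency on both sides the 0.75 factors cancel, so the K20X versus dual socket Sandy Bridge estimate stays at about 2.5x rather than 4-5x.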

Best regards,
Karli

[1] http://arxiv.org/abs/1302.1078
