On Tue, Feb 12, 2013 at 6:06 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> Hi guys,
>
> I finally got to play with the Intel Xeon Phi Beta hardware here. It's supposed to have a slightly higher peak memory bandwidth (352 GB/sec) than the release hardware (320 GB/sec), and it gives a first impression of what can be done with it. One thing to keep in mind is that the ring bus connecting the memory controllers with the cores saturates at 220 GB/sec, so this represents the theoretical peak performance for applications.
>
> A sparse matrix-vector multiplication paper [1] was published recently, but I'm more interested in what the Xeon Phi can do in terms of iterative solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon Phi in both native mode (everything runs on the Xeon Phi) and using OpenCL. I also tried the offload mode, i.e. one specifies via a #pragma that data should be moved to the MIC and computations are run there, but this #pragma handling turned out to be fairly unusable for anything where PCI-Express can be a bottleneck. For PETSc purposes this means that it is completely useless. Even though I haven't tried it yet, I think this consequently also applies to OpenACC in general.
>
> All benchmarks were run on Linux using double precision. Blue colors in the graphs denote Intel hardware, red colors AMD, and green colors NVIDIA. Although the test machine gets occasional updates of the Intel toolchain, I'm not entirely sure whether the latest version is installed.
>
> The first STREAM-like benchmark is the vector addition x = y + z in vector-timings.png. It is surprising that the OpenCL overhead at small vector sizes (less than 10^6 entries) is fairly large, so either the Beta stage of OpenCL on the MIC is indeed very beta, or the MIC is not designed for fast responses to requests from the host. OpenCL memory transfer rates reach the range of 25 GB/sec on the MIC, which is unexpectedly far from peak. With native execution on the MIC, one obtains around 75 GB/sec.
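(For readers unfamiliar with the benchmark: the kernel being timed is just the elementwise sum x = y + z, which loads two arrays and stores one, so its effective bandwidth is 3*N*8 bytes divided by the elapsed time. Below is a minimal plain C++/OpenMP sketch of such a kernel — the actual ViennaCL benchmark code differs, and the pragma stands in for whatever threading backend is in use.)

```cpp
#include <cstddef>
#include <vector>

// STREAM-like vector addition: x = y + z.
// Per element: 2 loads + 1 store of doubles, i.e. 24 bytes of traffic,
// so effective bandwidth = 3 * N * sizeof(double) / elapsed_time.
void vector_add(std::vector<double> &x,
                const std::vector<double> &y,
                const std::vector<double> &z)
{
    const std::size_t N = x.size();
    #pragma omp parallel for
    for (std::size_t i = 0; i < N; ++i)
        x[i] = y[i] + z[i];
}
```

Timing this loop over a range of N (and dividing the moved bytes by the runtime) is all a STREAM-style bandwidth plot like vector-timings.png requires.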
> This is in line with the results in [1]. Higher performance requires vectorization and prefetching, which apparently have to be inserted by the programmer and are thus not very convenient. The GPUs are about a factor of two faster and get close to their peak performance without any explicit vectorization or prefetching.
>
> The second benchmark is a sparse matrix-vector multiplication for a standard 2D finite-difference discretization of the Laplace operator on the unit square (sparse-timings.png). The performance of the MIC is better than that of the CPU, but again the overhead at smaller problem sizes is considerable and larger than for NVIDIA GPUs (both OpenCL and CUDA). The poor performance at around 10^4 unknowns on the MIC is reproducible, but I don't have an explanation for it. Overall, the GPUs are faster than the MIC by a factor of around 2-3. Further tuning might reduce this gap, as some experiments with vectorization on the MIC have shown mild improvements (~30%).

Karl, I am assuming that the places in the article where the Phi beats the K20 are for denser matrices where they have explicitly vectorized?

   Matt

> Finally, 50 iterations of a full conjugate gradient solver are benchmarked in cg-timings.png. One could hope that native execution on the Xeon Phi eliminates all the high-latency transfers via PCI-Express that burden the GPU case, but this is not so. While the MIC beats the OpenMP-accelerated CPU implementation, it fails to reach the performance of the GPUs. Some of the overhead of the MIC at smaller problem sizes was found to be due to OpenMP and can be reduced to somewhere between NVIDIA's CUDA and OpenCL implementations. However, either the cores on the MIC are too weak for the serial portions, or the costs of the ring bus and of thread startup and synchronization are too high to keep up with GPUs.
>
> Overall, I'm not very impressed by the Xeon Phi. In contrast to GPUs, it seems to require even more effort to get good memory bandwidth.
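(As a point of reference for the sparse benchmark discussed above: a 2D 5-point Laplace stencil in CSR format, together with the SpMV kernel that is timed, can be sketched as below. This is an illustrative reconstruction, not ViennaCL's code; row-major grid numbering and the `#pragma omp` parallelization are my assumptions.)

```cpp
#include <cstddef>
#include <vector>

// CSR storage: row_ptr has nrows+1 entries; col_idx/values hold the nonzeros.
struct CSRMatrix {
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> col_idx;
    std::vector<double>      values;
};

// Assemble the 5-point finite-difference Laplacian on an n x n grid
// (row-major numbering): 4 on the diagonal, -1 per existing neighbor.
CSRMatrix assemble_laplace_2d(std::size_t n)
{
    CSRMatrix A;
    A.row_ptr.push_back(0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            const std::size_t row = i * n + j;
            if (i > 0)     { A.col_idx.push_back(row - n); A.values.push_back(-1.0); }
            if (j > 0)     { A.col_idx.push_back(row - 1); A.values.push_back(-1.0); }
            A.col_idx.push_back(row); A.values.push_back(4.0);
            if (j < n - 1) { A.col_idx.push_back(row + 1); A.values.push_back(-1.0); }
            if (i < n - 1) { A.col_idx.push_back(row + n); A.values.push_back(-1.0); }
            A.row_ptr.push_back(A.col_idx.size());
        }
    return A;
}

// y = A * x: the memory-bandwidth-bound kernel timed in sparse-timings.png.
void spmv(const CSRMatrix &A, const std::vector<double> &x,
          std::vector<double> &y)
{
    const std::size_t nrows = A.row_ptr.size() - 1;
    #pragma omp parallel for
    for (std::size_t row = 0; row < nrows; ++row) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[row] = sum;
    }
}
```

With only five nonzeros per row, each row does very little arithmetic per byte of matrix data streamed in, which is why SpMV performance tracks achievable memory bandwidth rather than peak FLOP rates.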
> The OpenCL implementation on the MIC could do a lot better because it allows for more aggressive optimizations in principle, but this is not yet seen in practice. The offload pragma is useful, if at all, only for compute-intensive problems. The MIC might be a good fit for problems which map well to the 61 cores and can be pinned there, but I doubt that we want to run 61 MPI processes on a MIC within PETSc.
>
> Best regards,
> Karli
>
> [1] http://arxiv.org/abs/1302.1078

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
