I'm kind of surprised at the >10k element crossover myself. For the strong scaling cases, at high core counts, that's not terribly far from the number of DOFs per processor, is it? I guess CPUs will be slower than the Xeon in most cases (BGx), or fewer (Titan), but still.
- tim

On 02/12/2013 05:06 PM, Karl Rupp wrote:
> Hi guys,
>
> I finally got to play with the Intel Xeon Phi Beta hardware here. It's
> supposed to have a slightly higher peak memory bandwidth (352 GB/sec) than
> the release hardware (320 GB/sec), and it gives a first impression of what
> can be done with it. One thing to keep in mind is that the ring bus
> connecting the memory controllers with the cores saturates at 220 GB/sec,
> so this represents the theoretical peak performance for applications.
>
> A sparse matrix-vector multiplication paper [1] was published recently, but
> I'm more interested in what the Xeon Phi can do in terms of iterative
> solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon Phi in both
> native mode (everything runs on the Xeon Phi) and using OpenCL. I also
> tried the offload mode, i.e. one specifies via a #pragma that data should
> be moved to the MIC and computations run there, but this #pragma handling
> turned out to be fairly unusable for anything where PCI-Express can be a
> bottleneck. For PETSc's purposes this means that it is completely useless.
> Even though I haven't tried it yet, I think this consequently also applies
> to OpenACC in general.
>
> All benchmarks were run on Linux using double precision. Blue colors in the
> graphs denote Intel hardware, red colors AMD, and green colors NVIDIA.
> Although the test machine gets occasional updates of the Intel toolchain,
> I'm not entirely sure whether the latest version is installed.
>
> The first STREAM-like benchmark is the vector addition x = y + z in
> vector-timings.png. It is surprising that the OpenCL overhead at small
> vector sizes (fewer than 10^6 elements) is fairly large, so either the
> beta stage of OpenCL on MIC is indeed very beta, or the MIC is not designed
> for fast responses to requests from the host. OpenCL memory transfer rates
> reach the range of 25 GB/sec on MIC, which is unexpectedly far from peak.
> With native execution on the MIC, one obtains around 75 GB/sec. This is in
> line with the results in [1]. Higher performance requires vectorization and
> prefetching - apparently injected by the programmer and thus not very
> convenient. The GPUs are about a factor of two faster and get close to
> their peak performance without any explicit vectorization or prefetching.
>
> The second benchmark is a sparse matrix-vector multiplication for a
> standard 2D finite-difference discretization of the Laplace operator on the
> unit square (sparse-timings.png). The performance of the MIC is better than
> that of the CPU, but again the overhead at smaller problem sizes is
> considerable and larger than for NVIDIA GPUs (both OpenCL and CUDA). The
> poor performance at around 10^4 unknowns on the MIC is reproducible, but I
> don't have an explanation for it. Overall, the GPUs are faster than the MIC
> by a factor of around 2-3. Further tuning might reduce this gap, as some
> experiments with vectorization on the MIC have shown mild improvements
> (~30%).
>
> Finally, 50 iterations of a full conjugate gradient solver are benchmarked
> in cg-timings.png. One could hope that native execution on the Xeon Phi
> eliminates all the high-latency transfers via PCI-Express seen in the GPU
> case, but this is not so. While the MIC beats the OpenMP-accelerated CPU
> implementation, it fails to reach the performance of the GPUs. Some of the
> overhead of the MIC at smaller problem sizes was found to be due to OpenMP
> and can be reduced to somewhere between NVIDIA's CUDA and OpenCL
> implementations. However, either the cores on the MIC are too weak to run
> the serial portions, or the ring-bus and thread startup/synchronization
> costs are too high to keep up with the GPUs.
>
> Overall, I'm not very impressed by the Xeon Phi. In contrast to GPUs, it
> seems to require even more effort to get good memory bandwidth.
> The OpenCL implementation on MIC could in principle do a lot better because
> it allows for more aggressive optimizations, but this is not yet seen in
> practice. The offload pragma is useful - if at all - for compute-intensive
> problems. It might be a good fit for problems which map well to the 61
> cores and can be pinned there, but I doubt that we want to run 61 MPI
> processes on a MIC within PETSc.
>
> Best regards,
> Karli
>
> [1] http://arxiv.org/abs/1302.1078

--
================================================================
"You will keep in perfect peace him whose mind is steadfast,
because he trusts in you." Isaiah 26:3

Tim Tautges                     Argonne National Laboratory
(tautges at mcs.anl.gov)        (telecommuting from UW-Madison)
phone (gvoice): (608) 354-1459  1500 Engineering Dr.
fax: (608) 263-4499             Madison, WI 53706
