I'm kind of surprised at the >10k element crossover myself. For the strong scaling cases, at high core counts, that's not terribly far from the number of DOFs per processor, is it? I guess CPUs will be slower than the Xeon in most cases (BGx), or fewer (Titan), but still.
- tim

On 02/12/2013 05:06 PM, Karl Rupp wrote:
> Hi guys,
>
> I finally got to play with the Intel Xeon Phi Beta hardware here. It's
> supposed to have a slightly higher peak memory bandwidth (352 GB/sec) than
> the release hardware (320 GB/sec), and it gives a first impression of what
> can be done with it. One thing to keep in mind is that the ring bus
> connecting the memory controllers with the cores saturates at 220 GB/sec,
> so this represents the theoretical peak performance for applications.
>
> A sparse matrix-vector multiplication paper [1] was published recently, but
> I'm more interested in what the Xeon Phi can do in terms of iterative
> solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon Phi in both
> native mode (everything runs on the Xeon Phi) and using OpenCL. I also
> tried the offload mode, i.e. one specifies via a #pragma that data should
> be moved to the MIC and computations run there, but this #pragma handling
> turned out to be fairly unusable for anything where PCI-Express can be a
> bottleneck. For PETSc's purposes this means that it is completely useless.
> Even though I haven't tried it yet, I think this consequently also applies
> to OpenACC in general.
>
> All benchmarks were run on Linux using double precision. Blue colors in the
> graphs denote Intel hardware, red colors AMD, and green colors NVIDIA.
> Although the test machine gets occasional updates of the Intel toolchain,
> I'm not entirely sure whether the latest version is installed.
>
> The first STREAM-like benchmark is the vector addition x = y + z in
> vector-timings.png. It is surprising that the OpenCL overhead at small
> vector sizes (fewer than 10^6 elements) is fairly large, so either the
> beta stage of OpenCL on MIC is indeed very beta, or the MIC is not designed
> for fast responses to requests from the host. OpenCL memory transfer rates
> reach the range of 25 GB/sec on MIC, which is unexpectedly far from peak.
> With native execution on the MIC, one obtains around 75 GB/sec. This is in
> line with the results in [1]. Higher performance requires vectorization and
> prefetching - apparently injected by the programmer and thus not very
> convenient. The GPUs are about a factor of two faster and get close to
> their peak performance without any explicit vectorization or prefetching.
>
> The second benchmark is a sparse matrix-vector multiplication for a
> standard 2D finite-difference discretization of the Laplace operator on the
> unit square (sparse-timings.png). The performance of the MIC is better than
> that of the CPU, but again the overhead at smaller problem sizes is
> considerable and larger than for NVIDIA GPUs (both OpenCL and CUDA). The
> poor performance at around 10^4 unknowns on the MIC is reproducible, but I
> don't have an explanation for it. Overall, the GPUs are faster than the MIC
> by a factor of around 2-3. Further tuning might reduce this gap, as some
> experiments with vectorization on the MIC have shown mild improvements
> (~30%).
>
> Finally, 50 iterations of a full conjugate gradient solver are benchmarked
> in cg-timings.png. One could hope that native execution on the Xeon Phi
> eliminates all the high-latency transfers via PCI-Express seen in the GPU
> case, but this is not so. While the MIC beats the OpenMP-accelerated CPU
> implementation, it fails to reach the performance of the GPUs. Some of the
> overhead of the MIC at smaller problem sizes was found to be due to OpenMP
> and can be reduced to somewhere between NVIDIA's CUDA and OpenCL
> implementations. However, either the cores on the MIC are too weak to run
> the serial portions, or the ring-bus and thread startup/synchronization
> costs are too high to keep up with the GPUs.
>
> Overall, I'm not very impressed by the Xeon Phi. In contrast to GPUs, it
> seems to require even more effort to get good memory bandwidth.
> The OpenCL implementation on MIC could in principle do a lot better because
> it allows for more aggressive optimizations, but this is not yet seen in
> practice. The offload pragma is useful - if at all - for compute-intensive
> problems. It might be a good fit for problems which map well to the 61
> cores and can be pinned there, but I doubt that we want to run 61 MPI
> processes on a MIC within PETSc.
>
> Best regards,
> Karli
>
> [1] http://arxiv.org/abs/1302.1078

--
================================================================
"You will keep in perfect peace him whose mind is steadfast,
because he trusts in you." Isaiah 26:3

Tim Tautges                     Argonne National Laboratory
(tautges at mcs.anl.gov)        (telecommuting from UW-Madison)
phone (gvoice): (608) 354-1459  1500 Engineering Dr.
fax: (608) 263-4499             Madison, WI 53706
