As a point of comparison, I've been running a PETSc CG solver on an Nvidia K20. The simulation has 1.4e7 elements.
The PETSc AXPY takes 0.001 seconds in single precision. That's 26 GFlops. In another simulation using a double-complex BiCG algorithm with 1e6 unknowns, the PETSc MatMult on the K20 runs at 55 GFlops!

-Paul

> Hi guys,
>
> Today I got a gentle introduction to our testing machine equipped
> with two Intel MICs. They are still beta, yet I could run some simple
> kernels in native mode. As an example, without any modification of
> existing OpenMP code for a double-precision vector addition of 3e6
> elements, I got the following timings:
>
> -- Native mode, i.e. all code executed on MIC --
> Single-core time: 0.642 sec
> All-core time:    0.011 sec
>
> For offloaded execution (CPU <-> MIC, just like with GPUs), additional
> pragmas are required; I haven't tried that yet.
>
> For comparison, the same code on the CPU (Sandy Bridge, 8x2 cores, 2.6
> GHz) takes 0.060 sec without OpenMP and 0.030 sec with OpenMP. The
> conclusion is that one *really* needs to keep all cores on the MIC
> busy in order to get the full memory bandwidth. Hence, a plain 'just
> recompile for MIC and you get good performance' won't work for most
> applications in practice, simply because the serial performance is so
> limited.
>
> @Shri: It would be interesting to give pthreads a try, particularly
> to see how it compares with OpenMP. I'll be out of the lab until the
> beginning of January, but I can help you with getting an account and
> getting started.
>
> Btw: I just got a call regarding Altera hardware; we might have a
> chance to get our hands on their OpenCL-enabled hardware.
>
> Best regards,
> Karli
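
For reference, a minimal double-precision vector-addition kernel of the kind described in the quoted message might look like the sketch below. This is not the actual benchmark code; the array length, the variable names, and the use of omp_get_wtime() for timing are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 3000000;   /* 3e6 elements, as in the timings above */
        double *x = malloc(n * sizeof(double));
        double *y = malloc(n * sizeof(double));
        double *z = malloc(n * sizeof(double));

        for (long i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

        double t0 = omp_get_wtime();
        /* The parallel loop is what keeps all MIC cores busy; without it
           only one core (and a fraction of the memory bandwidth) is used. */
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            z[i] = x[i] + y[i];
        double t1 = omp_get_wtime();

        printf("vector add: %g sec\n", t1 - t0);
        free(x); free(y); free(z);
        return 0;
    }

When built natively for the MIC (e.g. with the Intel compiler's -mmic flag) the same source runs unchanged on the coprocessor; offloaded execution would additionally need Intel's offload pragmas around the data transfers and the loop.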

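For completeness, the AXPY timing quoted at the top corresponds roughly to a single call like the one sketched below. This is not Paul's actual code, just a minimal PETSc example: the vector length matches the 1.4e7 elements mentioned above, timing via PetscTime() is an assumption, and the GPU vector type for the K20 would be selected at run time via -vec_type.

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec            x, y;
        PetscLogDouble t0, t1;
        PetscInt       n = 14000000;   /* 1.4e7 elements, as above */

        PetscInitialize(&argc, &argv, NULL, NULL);
        VecCreate(PETSC_COMM_WORLD, &x);
        VecSetSizes(x, PETSC_DECIDE, n);
        VecSetFromOptions(x);          /* e.g. -vec_type cusp or cuda, depending on PETSc version */
        VecDuplicate(x, &y);
        VecSet(x, 1.0);
        VecSet(y, 2.0);

        PetscTime(&t0);
        VecAXPY(y, 2.0, x);            /* y <- y + 2*x : roughly 2n flops */
        PetscTime(&t1);
        PetscPrintf(PETSC_COMM_WORLD, "AXPY: %g sec\n", (double)(t1 - t0));

        VecDestroy(&x);
        VecDestroy(&y);
        PetscFinalize();
        return 0;
    }

In practice one would time many repetitions (or simply look at -log_view output) rather than a single call, but the arithmetic is the same: an AXPY on 1.4e7 elements is about 2.8e7 flops, so roughly a millisecond per call is on the order of the quoted 26 GFlops.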