Hi guys,

today I got a gentle introduction to our test machine, which is equipped with two Intel MICs. The hardware is still beta, yet I could run some simple kernels in native mode. As an example, without any modification of existing OpenMP code for vector addition in double precision with 3e6 elements, I got the following timings:
-- Native mode, i.e. all code executed on MIC --
Single core time: 0.642 sec
All-core time:    0.011 sec

For offloaded execution (CPU <-> MIC, just like with GPUs), additional pragmas are required. I haven't tried that yet.

For comparison, the same code on the CPU (Sandy Bridge, 8x2 cores, 2.6 GHz) takes 0.060 sec without OpenMP and 0.030 sec with OpenMP.

The conclusion is that one *really* needs to keep all cores on the MIC busy in order to get the full memory bandwidth: a single MIC core is roughly 10x slower than a single Sandy Bridge core here (0.642 sec vs. 0.060 sec), while all MIC cores together are almost 3x faster than the CPU with OpenMP (0.011 sec vs. 0.030 sec). Thus, a plain 'just recompile for MIC and you get good performance' won't work for most applications in practice, simply because the serial performance is so limited.

@Shri: It would be interesting to give pthreads a try, particularly to see how it compares with OpenMP. I'll be out of the lab until the beginning of January, but I can help you with getting an account and getting started.

Btw: I just got a call regarding Altera hardware. We might have a chance to get our hands on their OpenCL-enabled hardware.

Best regards,
Karli
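
PS: For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of an OpenMP vector addition with a simple timer. It is not the exact code that produced the numbers above; the names vec_add, wall_time and time_vec_add as well as the initial values are made up for illustration. The timer falls back to clock() when the code is built without OpenMP, in which case the loop runs serially anyway.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* c[i] = a[i] + b[i]; the pragma is ignored when OpenMP is disabled. */
void vec_add(const double *a, const double *b, double *c, long n)
{
    long i;
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Wall-clock time with OpenMP, CPU time of the (serial) run otherwise. */
double wall_time(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: time one vector addition of n doubles.
   Returns the elapsed seconds, or -1.0 if allocation fails. */
double time_vec_add(long n)
{
    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    double *c = malloc((size_t)n * sizeof *c);
    double start, elapsed;
    long i;

    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Arbitrary test data, chosen so the result is easy to check. */
    for (i = 0; i < n; ++i) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    start = wall_time();
    vec_add(a, b, c, n);
    elapsed = wall_time() - start;

    /* Sanity check so the compiler cannot drop the kernel. */
    assert(n == 0 || c[n - 1] == 3.0);

    free(a); free(b); free(c);
    return elapsed;
}
```

Calling time_vec_add(3000000) from main and printing the result gives one timing for the problem size used above. The number of threads can be set through the standard OMP_NUM_THREADS environment variable, so OMP_NUM_THREADS=1 corresponds to the single core case. Averaging over several runs is advisable, since the first parallel region also pays for thread startup.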
