I looked a bit deeper (i.e., found a machine where I have access to an Intel compiler, albeit not an up-to-date one; my shop is cursed by budget cuts). ICC breaks up a loop like

    for (i = 0; i < n; i++) { a[i] = exp(cos(b[i])); s += a[i]; }

into calls to vector math library functions plus a separate loop for the sum. The library is bundled with ICC; it's not MKL, but its domain overlaps with MKL's (hence my misapprehension), so your point stands. Something like blackscholes benefits from these vector library calls, and GCC doesn't do that.
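To make the transformation concrete, here is a hand-written sketch of what the compiler effectively produces. I'm using MKL's VML functions vdCos/vdExp as stand-ins for the bundled vector library (the real ICC output calls its own routines, so the names here are only illustrative):

    /* Rough illustration of ICC's loop distribution: the scalar
       exp(cos(...)) body becomes whole-array vector math calls, and
       the reduction is peeled off into its own loop.  vdCos/vdExp
       are MKL VML stand-ins for the library ICC actually bundles. */
    #include <mkl.h>

    double exp_cos_sum(const double *b, double *a, MKL_INT n)
    {
        double s = 0.0;
        vdCos(n, b, a);                  /* a[i] = cos(b[i]) for all i */
        vdExp(n, a, a);                  /* a[i] = exp(a[i]), in place */
        for (MKL_INT i = 0; i < n; i++)  /* separate loop for the sum  */
            s += a[i];
        return s;
    }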
It would be nice if Julia's LLVM pipeline included an optimization pass that invoked a vector math library when appropriate. I suspect that's a challenge outside the scope of ParallelAccelerator, but it might be good ground for some other project (see the sketch after the quoted thread; upstream clang already exposes a hook along these lines).

On Thursday, October 27, 2016 at 1:04:33 PM UTC-4, Todd Anderson wrote:
>
> That's interesting. I generally don't test with gcc, and my experiments
> with ICC/C have shown LLVM/native threads to be something like 20% slower
> for one class of benchmarks (like blackscholes) but 2-4x slower for some
> other benchmarks (like laplace-3d). The 20% may be attributable to ICC
> being better (including at vectorization, as you mention), but certainly
> not the 2-4x. These larger differences are still under investigation.
>
> I guess something we have said in the docs or our postings has created
> the impression that our performance gains are somehow related to MKL or
> BLAS in general. If you have MKL, you can compile Julia to use it through
> its LLVM path. ParallelAccelerator does not insert calls to MKL where
> they didn't exist in the incoming IR, and I don't think ICC does either.
> If MKL calls exist in the incoming IR, we don't modify them either.
>
> On Wednesday, October 26, 2016 at 7:51:33 PM UTC-7, Ralph Smith wrote:
>>
>> This is great stuff. Initial observations (under Linux/GCC) are that
>> native threads are about 20% faster than OpenMP, so I surmise you are
>> feeding LLVM some very tasty code. (I tested long loops with
>> straightforward memory access.)
>>
>> On the other hand, some of the earlier posts made me think that you
>> were leveraging the strong vector optimization of the Intel C compiler
>> and its tight coupling to MKL libraries. If so, is there any prospect
>> of getting LLVM to take advantage of MKL?
>>
>> On Wednesday, October 26, 2016 at 8:13:38 PM UTC-4, Todd Anderson wrote:
>>>
>>> Okay, METADATA with ParallelAccelerator version 0.2 has been merged,
>>> so if you do a standard Pkg.add() or update() you should get the
>>> latest version.
>>>
>>> For native threads, please note that we've identified some issues
>>> with reductions and stencils; the fixes will be released shortly in
>>> version 0.2.1. I will post here again when that release takes place.
>>>
>>> Again, please give it a try and report back with experiences or file
>>> bugs.
>>>
>>> thanks!
>>>
>>> Todd
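Postscript on the LLVM pass idea: upstream LLVM already has a hook of this kind. Clang's -fveclib flag tells the loop vectorizer (via TargetLibraryInfo) which vector math library it may call into, and sufficiently recent versions accept SVML, Intel's short vector math library. A minimal sketch, assuming a clang new enough to support -fveclib=SVML and an SVML implementation to link against:

    /* veclib_demo.c -- compile with:
     *     clang -O2 -ffast-math -fveclib=SVML -S veclib_demo.c
     * then look in the generated assembly for calls such as
     * __svml_cos4/__svml_exp4 in place of scalar cos/exp. */
    #include <math.h>

    double exp_cos_sum(const double *b, double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            a[i] = exp(cos(b[i])); /* candidates for vector library calls */
            s += a[i];   /* reduction; -ffast-math lets it reassociate */
        }
        return s;
    }

Presumably Julia's code generator could populate the same TargetLibraryInfo mappings to get the same effect; whether that belongs in ParallelAccelerator or a separate package, I can't say.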