One thing to note about this benchmark is that it doesn't exploit Julia's SIMD vectorization capabilities, because the value of u[i,j] depends on the just-computed value of u[i-1,j]. But I'd guess that Numba can also do SIMD when it's applicable?
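To make that dependency concrete, here's a minimal sketch (the array names and sizes are illustrative, not taken from the benchmark code). In the in-place sweep, each u[i,j] reads the u[i-1,j] written on the previous inner-loop iteration, which forces serial execution; a Jacobi-style sweep into a separate array has no such loop-carried dependency, so the compiler is free to vectorize it:

    nx, ny = 1000, 1000
    u    = rand(nx, ny)
    unew = copy(u)

    # In-place (Gauss-Seidel-style) sweep: u[i,j] reads the u[i-1,j] that
    # was just written during this same sweep, so the iterations are
    # inherently serial.
    for j = 2:ny-1, i = 2:nx-1
        u[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
    end

    # Jacobi-style sweep: reads only the old array and writes a fresh one,
    # so every iteration is independent and SIMD-friendly.
    for j = 2:ny-1, i = 2:nx-1
        unew[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
    end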
I suspect that most things that compile down to raw machine code will give
basically identical performance, whether Julia, Numba, or C. Where you'll
see Julia shine is when the algorithm gets more complicated: pop something
off a PriorityQueue, make a decision, do some linear algebra computation,
push the next thing onto the queue, and repeat until the queue stops
producing values that require further action (see the sketch at the end of
this message). I'm doing some of that kind of stuff right now, and even
though I've been doing Julia a while now, this afternoon I'm once again
having one of those blown-away moments where I am amazed at how friggin'
much computation I can get done in a short time.

--Tim

On Tuesday, September 30, 2014 07:41:45 PM Stefan Karpinski wrote:
> Your code looks quite good – and the devectorized version avoids creating
> copies of slices, which is currently one of the major performance issues
> with this kind of code (to be fixed in the next major release). You can
> see the inferred types of all the local variables like this:
> (@code_typed laplace_unvectorized())[1].args[2][2] – and everything has a
> concrete type, so there's nothing to improve in terms of typing.
>
> Adding the @inbounds annotation to the innermost for loop was one of the
> first things I thought to try, but that doesn't seem to have any benefit –
> I think the array accesses are all inlined and LLVM can hoist the bounds
> checks out of the loop (or eliminate them entirely).
>
> What does end up making a big difference is swapping the iteration order
> of i and j. Instead of doing `for i=2:nx-1, j=2:ny-1`, do
> `for j=2:ny-1, i=2:nx-1`. I don't recall whether NumPy arrays are
> row-major (I think they are), but Julia is column-major like Fortran, and
> you want to iterate along columns innermost. On my machine, doing it in
> the original order is 1.61x slower, which may be exactly the explanation
> of the difference you're seeing between Julia and Numba.
>
> On Tue, Sep 30, 2014 at 7:08 PM, <[email protected]> wrote:
> > Greetings,
> >
> > I'm a reasonably proficient user of MATLAB and Python/NumPy/SciPy doing
> > computational physics. Since Julia appears to be designed to be very
> > well suited to many such applications, I was curious to test its
> > performance before investing much time in converting any research code.
> > To start out, I wrote up the classic 2D regular finite-difference
> > Laplace benchmark in Julia, Python, and MATLAB, in both vectorized and
> > loop versions, and tested them all.
> >
> > The results are shown in the following Google spreadsheet (all results
> > were obtained using a 5000x5000 grid and 100 iterations for a
> > reasonable sample, on a Haswell i7-4710HQ CPU running Windows 8.1, with
> > Julia 0.3.1, Anaconda 2.0.1, and MATLAB R2014a):
> >
> > https://docs.google.com/spreadsheets/d/1mJ8wNiyYVszkVapRVHvRJZG9j9XQhLLPvaUJwiWrJXY/pubhtml
> >
> > The code itself is published as follows:
> > Julia: http://pastebin.com/AAdXXYZC
> > Python: http://pastebin.com/5hqi9xzf
> >
> > As can be clearly seen, Julia does handily beat both MATLAB and basic
> > Python/NumPy. It does, however, lose by a factor of 1.65 to a
> > Numba-jitted version of the same Python code (obtained by simply adding
> > a @jit(target='cpu') decorator on top of the appropriate function in
> > the naive Python code), which compiles to the same LLVM stack Julia
> > uses. I deliberately avoided using more complex JIT techniques for
> > Python (such as using Pythran to compile to OpenMP-enabled C code by
> > specifying function signatures) to stick to single-core performance
> > only.
> >
> > Given these results, and the virtual certainty that my Julia code is
> > naive, non-idiomatic, and just plain bad, I'd like to know if there's
> > anything I could improve to match or (if possible) beat JIT-compiled
> > Python.
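Stefan's loop-order suggestion, written out as a standalone sketch (the function names are illustrative, not the actual pastebin code):

    # The only difference between these two kernels is the order of the
    # loop nest. Julia arrays are column-major, so the first index should
    # vary fastest, i.e. i belongs in the innermost loop.
    function sweep_rows!(u, nx, ny)        # i outer, j inner: strided access
        for i = 2:nx-1, j = 2:ny-1
            @inbounds u[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
        end
    end

    function sweep_cols!(u, nx, ny)        # j outer, i inner: unit stride
        for j = 2:ny-1, i = 2:nx-1
            @inbounds u[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
        end
    end

    # Note: the first call to each function includes JIT compilation time;
    # run the timings twice for a fair comparison.
    u = rand(2000, 2000)
    @time sweep_rows!(u, 2000, 2000)   # expect this order to be slower
    @time sweep_cols!(u, 2000, 2000)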

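For what it's worth, the queue-driven pattern Tim describes might look roughly like this. This is an entirely made-up toy (the cost logic means nothing), just to show the shape of the loop; PriorityQueue here comes from the DataStructures package (in the Julia 0.3 era it lived in Base.Collections):

    using DataStructures

    # Toy instance of the pattern: pop the cheapest item, do a small
    # linear algebra step, decide whether it spawns more work, and repeat
    # until the queue runs dry. The cost function and cutoffs are
    # placeholders, not anything from Tim's actual code.
    function drain!(work::PriorityQueue{Int, Float64}, A::Matrix{Float64})
        processed = 0
        while !isempty(work)
            item = dequeue!(work)              # pop the lowest-cost item
            x = A[:, mod1(item, size(A, 2))]   # some linear algebra step
            cost = sqrt(sum(x .* x))
            if cost > 1e-6 && processed < 1000 # make a decision
                enqueue!(work, item + 1, cost) # push the next thing
            end
            processed += 1
        end
        return processed
    end

    work = PriorityQueue{Int, Float64}()
    enqueue!(work, 1, 0.0)                     # seed the queue
    drain!(work, rand(10, 10))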