Here's an interesting comparison: https://gist.github.com/simonster/6195af68c6df33ca965d
idiv for 64-bit integers is one of the most expensive extant x86-64 instructions. For 32-bit integers, it is much cheaper, and this function runs nearly twice as fast. When LLVM knows the divisor in advance, it can avoid idiv and perform the division as a multiplication and bit shift, which is faster still. But the biggest gain comes from avoiding rem in the inner loop entirely. Of course, this is only tackling the odd case, and additional performance can be gained for all of these benchmarks by avoiding allocation. Simon On Tuesday, August 26, 2014 9:11:47 AM UTC-4, Phillip Berndt wrote: > > > (1) allocate the output M outside of the core algorithm, and pass it as an >> input, i.e., >> > > I did that, though it can be argued that this is cheating given that the > competitors also have to allocate an array for each loop. With that version > (and some more slight optimization: Storing intermediate values in the for > loops, using column-major indexing and @simd) [ > https://gist.github.com/phillipberndt/7dc0aed7eb855f900f0d/21cce76664bdc59f6203ff6f3496e80e256f54cb], > > the overall time for the N=3..1000 test case is down to 3.67s. > > (2) @time (for i = 1:100; magic!(M); end). Did it allocate any memory? >> Then >> you have a problem. Use the profiler, or run julia with --track- >> allocation=user, to find out where it occurs. > > > It does, about 3 Mb on line 2 (if n % 2 == 1). Doesn't make much sense so > I guess the profiler interfered with the optimizer here?! I doubt that > trying to get rid of the 3Mb will gain another second though. > > >> (3) Even if it's not allocating, you may have a bottleneck. Use the >> profiler to >> find it. >> > > The line where the most time is spent is line 11, filling the array in the > odd case. I don't see how it could be optimized any further, so that's > probably as far as one gets?! > > - Phillip >
