I was experimenting with unsafe_load() indexing and "@inbounds begin ... end" speedups, and found that they are practically of the same speed, and are among the fastest versions. But the Julia code optimization turned out to be very fragile. Here is the code which can use only an 8 bytes array g for the gaps between divisors (at similar speed as an Int array).
https://gist.github.com/LaszloHars/c691e5c650d8b22bc8a3 This code beats the built in factor() function as it is written. However, if we replace the initialization d = d0 in the middle of the function with d += g[j0] as noted in the corresponding comment in the code, the running time doubles. Since this instruction is outside of loops, the speed difference should not be noticeable. It looks like the Julia code optimization gets screwed up with this constant expression. Weird!
