In the Cython code you turned off bounds checking. You can do the same in Julia with the @inbounds macro. Just use it in your loops like this:
@inbounds for i in whatever ... end
@simd may also help; it seems you can use it in a couple of the innermost loops. It also seems simple to parallelize with a shared array and a @parallel for (spelled @distributed in Julia 0.7 and later).
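To make the three suggestions concrete, here is a rough sketch on a made-up kernel (column sums of a matrix), since I don't have your actual loops in front of me. The function names `colsums!` and `colsums_shared` are hypothetical, and I'm assuming Julia 1.x, where the parallel loop macro is `@distributed` from the `Distributed` standard library and shared arrays live in `SharedArrays`.

```julia
using Distributed, SharedArrays

# Hypothetical kernel (not from the question): sum each column of A into out.
function colsums!(out::AbstractVector{Float64}, A::AbstractMatrix{Float64})
    nrows, ncols = size(A)
    # Check sizes once up front, because @inbounds removes the per-access checks.
    length(out) == ncols || throw(DimensionMismatch("out must have one entry per column"))
    @inbounds for j in 1:ncols
        s = 0.0
        # Innermost loop runs down one column, which is contiguous in memory
        # (Julia arrays are column-major), so it is a good candidate for @simd.
        @simd for i in 1:nrows
            s += A[i, j]
        end
        out[j] = s
    end
    return out
end

# Same kernel, with the outer loop split across worker processes.
# With no workers added it simply runs on the current process.
function colsums_shared(A::AbstractMatrix{Float64})
    nrows, ncols = size(A)
    out = SharedArray{Float64}(ncols)   # every entry is written below
    @sync @distributed for j in 1:ncols
        s = 0.0
        @inbounds @simd for i in 1:nrows
            s += A[i, j]
        end
        out[j] = s                      # each iteration writes a distinct index
    end
    return out
end
```

Call `addprocs(n)` before using the second version if you want it to actually run on several processes. Note that `@simd` lets the compiler reorder the floating-point additions, so results can differ in the last bits from a strictly sequential sum. Whether any of this helps depends on your real loops, so time it before and after.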
