The key is to structure it more like a C/Java program and less like a MATLAB script. MathWorks has put great effort into making poorly structured programs run fast. Julia focuses on generating fast machine code, but we currently don't optimize well for the common case where global variables don't change their type, so we end up doing the slow multiple dispatch lookup at every step of the loop, instead of only once at compile time.
Solution: wrap the code in a function, so that Julia can analyze the types.

To get really high performance, it is worth noting that Julia doesn't have a fast garbage collector. (Nobody really does, but many are apparently faster than ours.) It will often be useful to reduce the number of temporarily allocated objects, so that the GC kicks in less often. Solution: devectorize your code and manipulate arrays in place, to reduce the number of temporary arrays that are needed.
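
To illustrate the first point (wrapping the code in a function), here is a minimal sketch. The task (summing squares) and the names `total` and `sum_squares` are made up for illustration, and the `global` keyword in the script-style loop assumes Julia 1.x scoping rules for code run from a file.

```julia
# Hypothetical example: sum the squares of 1..n.

# Script style: `total` is a non-constant global, so the compiler cannot
# assume its type inside the loop and has to look up `+` dynamically
# on every iteration.
n = 1_000_000
total = 0.0
for i in 1:n
    global total += i^2
end

# Function style: `total` is a local variable whose type (Float64) can be
# inferred, so the loop body can be compiled once for concrete types.
function sum_squares(n)
    total = 0.0
    for i in 1:n
        total += i^2
    end
    return total
end

result = sum_squares(n)
```

Both versions compute the same value; only the second gives the compiler enough type information. Timing each with `@time` (after one warm-up call, so that compilation is not included) is one way to see the difference on your own machine.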

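The second point (devectorizing and working in place) could look roughly like the sketch below. The update rule `x + dt * v` and the function names `step_vectorized` and `step_inplace!` are hypothetical, chosen only to show the pattern.

```julia
# Vectorized style: every call allocates a temporary array for `dt * v`
# and another new array for the sum, which the GC later has to collect.
function step_vectorized(x, v, dt)
    return x + dt * v
end

# Devectorized, in-place style: the loop overwrites `x` element by
# element, so no temporary arrays are created.
function step_inplace!(x, v, dt)
    for i in eachindex(x, v)   # errors if x and v have different indices
        x[i] += dt * v[i]
    end
    return x
end

# Usage: advance the same array 100 times without allocating new arrays.
x = zeros(1000)
v = ones(1000)
for _ in 1:100
    step_inplace!(x, v, 0.01)
end
```

The trailing `!` is the Julia naming convention for functions that modify their arguments. `@allocated` can be used to compare how much memory each version allocates. In more recent Julia versions, the fused broadcast form `x .+= dt .* v` is another way to avoid the temporaries without writing the loop by hand.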