If you want more performance, first identify the hot spots by profiling your code. This is fairly easy in Julia: http://julia.readthedocs.org/en/latest/stdlib/profile/
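As a minimal sketch of what a profiling session can look like, assuming a recent Julia 1.x where `Profile` is a standard library (in older versions it was used differently). The function `sum_of_squares` is a hypothetical workload made up for this illustration, not something from your program:

```julia
using Profile  # standard library in Julia 1.x (assumption: a reasonably recent Julia)

# Hypothetical workload, used only to have something to profile.
function sum_of_squares(n)
    total = 0.0
    for i in 1:n
        total += i^2
    end
    return total
end

sum_of_squares(10)               # run once first so compilation is not what gets profiled
Profile.clear()                  # discard any samples collected earlier
@profile sum_of_squares(10^8)    # collect samples while the call runs
Profile.print(format=:flat)      # flat listing of sample counts per source line
```

The lines with the highest sample counts are where the time goes, and those are the only places worth optimizing.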
Making your code less readable to gain a 0.01% speed increase over your whole program isn't worth the pain. Once the bottlenecks are identified, there are plenty of ways to get more speed. You can devectorize explicitly or implicitly (https://github.com/lindahua/Devectorize.jl), and sometimes BLAS can be called directly with little or no change to the program structure. You also have to be careful about types and memory layout in these parts. If you need even more speed, you can still leverage the SIMD instructions of your CPU and, of course, multicore/multinode parallelism (http://julia.readthedocs.org/en/latest/manual/performance-tips/). But as I said, before optimizing, finish your program in a form you will still understand perfectly 6 months from now, profile it properly, and if it is too slow for you, optimize the real bottlenecks (not the imagined ones).
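To make the devectorization and SIMD points concrete, here is a small hand-written sketch. It does not use Devectorize.jl; it only shows the kind of loop such a rewrite amounts to. The function names `dot_vectorized` and `dot_devectorized` are hypothetical, and whether the loop is actually faster on your machine is something to measure, not assume:

```julia
# Vectorized form: x .* y allocates a temporary array before summing.
dot_vectorized(x, y) = sum(x .* y)

# Devectorized form: one explicit loop, no temporary array.
function dot_devectorized(x::Vector{Float64}, y::Vector{Float64})
    acc = 0.0                              # Float64 like the elements, so acc never changes type
    # eachindex(x, y) throws if the lengths differ, which makes @inbounds safe here.
    @inbounds @simd for i in eachindex(x, y)
        acc += x[i] * y[i]                 # @simd lets the compiler reorder this reduction
    end
    return acc
end

x = rand(1_000)
y = rand(1_000)
println(isapprox(dot_vectorized(x, y), dot_devectorized(x, y)))
```

The two results are compared with `isapprox` rather than `==` because `@simd` may change the order of the floating-point additions. Note how much less obvious the second version is than the first: that is the readability cost, so it only pays off in a loop the profiler has shown to be hot.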
