To be fair, the simplest implementation being the fastest isn't *necessarily* to Julia's credit, since it may also mean that Julia can only optimise the simplest code. Not saying that's the case here, but it's worth looking at it from that angle.
Chris' point gives me an idea for an @unroll macro: it could unroll a generic for loop to run n iterations per pass, or even remove the loop entirely when the range being iterated over is a compile-time constant.
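
Here's a rough sketch of the second case, just to make the idea concrete. This `@unroll` is hypothetical, not an existing Base macro, and it assumes the loop is written as `for i in a:b` with integer literals for `a` and `b`. Anything else is passed through unchanged, and it doesn't try to handle `break` or `continue` in the body, or the n-at-a-time case.

```julia
# Hypothetical @unroll: fully unrolls `for i in a:b ... end` when a and b
# are integer literals visible at macro-expansion time.
macro unroll(loop)
    loop isa Expr && loop.head === :for || error("@unroll expects a for loop")
    iter, body = loop.args          # `i in a:b` (parsed as `i = a:b`) and the loop body
    var, range = iter.args
    # Only handle literal ranges such as 1:4
    if range isa Expr && range.head === :call && range.args[1] === :(:) &&
       length(range.args) == 3 && all(x -> x isa Integer, range.args[2:3])
        lo, hi = range.args[2], range.args[3]
        # One copy of the body per iteration, each with its own binding of the loop variable
        copies = [:(let $var = $k; $body end) for k in lo:hi]
        return esc(Expr(:block, copies...))
    end
    # Bounds not known at expansion time: leave the loop as written
    return esc(loop)
end

# Usage inside a function, so that `s` is an ordinary local variable
function sum4(x)
    s = zero(eltype(x))
    @unroll for i in 1:4
        s += x[i]
    end
    return s
end
```

Since a macro only sees syntax, "compile-time constant" here really means "literal in the source"; something like `1:N` with a `const N` would fall through to the ordinary loop in this version.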
