@Kristoffer Carlsson, I do appreciate your help and your clever use of
Yeppp, which is limited to reals. I may be able to redesign my algorithm
with all reals and get faster execution with Julia than with Python, which
does not have a wrapper for Yeppp that I could find. Doing so may also
involve vectorized dot products with the BLAS library. Since I am
processing GB-sized vectors, the vectorized approach would create large
temporary vectors, and I may not have enough RAM. It also contradicts the
advice I was given previously in this forum to write my loops in pure
Julia for speed. So I
would still like to hear from forum members about how to get parallelized
pure Julia code to run faster than the single-threaded version. Maybe I
have to wait for Julia v0.5 for that speedup to materialize.