Thanks everyone for the suggestions! Here is my updated test:

using TimeIt
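# Note: in vec! and comp!, `y = ...` rebinds the local name y to a newly
# allocated array; the caller's y is not modified, despite the `!` in the name.
function vec!(x,y)
    y = x.*x
end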

function comp!(x,y)
    y = [xi*xi for xi in x]
end

function forloop!(x,y,n)
    for i = 1:n 
        y[i] = x[i]*x[i]
    end
end

function forloop2!(x,y,n)
    @simd for i = 1:n 
        @inbounds y[i] = x[i]*x[i] 
    end
end
    
function test()
    n = 10000
    x = linspace(0.0,1.0,n)
    y = zeros(x)
    @timeit vec!(x,y)
    @timeit comp!(x,y)
    @timeit forloop!(x,y,n)
    @timeit forloop2!(x,y,n)
end
test();

10000 loops, best of 3: 87.82 µs per loop
1000 loops, best of 3: 62.73 µs per loop
10000 loops, best of 3: 12.66 µs per loop
100000 loops, best of 3: 3.54 µs per loop


So an explicit for loop combined with the @simd and @inbounds macros gives
performance essentially equivalent to a call to numpy. I then switched from
@timeit to @time so I could see the allocations:

elapsed time: 2.467e-5 seconds (80512 bytes allocated)
elapsed time: 2.1358e-5 seconds (80048 bytes allocated)
elapsed time: 1.5124e-5 seconds (0 bytes allocated)
elapsed time: 6.108e-6 seconds (0 bytes allocated)
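
As a rough cross-check of the allocation figures above, here is a minimal sketch, assuming a recent Julia (1.x, which postdates this post, so `range` stands in for `linspace`). The helper name `square_comp` is hypothetical and simply mirrors the body of comp!. It makes one warm-up call so that compilation is not measured, then compares the bytes reported by `@allocated` with the size of the element data alone.

```julia
# Hypothetical helper with the same body as comp! in the test above.
square_comp(x) = [xi * xi for xi in x]

n = 10_000
x = collect(range(0.0, stop=1.0, length=n))

square_comp(x)                       # warm-up call so compilation is not measured
bytes = @allocated square_comp(x)    # bytes allocated by the second call

payload = n * sizeof(Float64)        # element data only: 10000 * 8 bytes
println("allocated: ", bytes, " bytes; element data: ", payload, " bytes")
```

The element data alone is 80000 bytes, so the 80048 and 80512 bytes reported above are consistent with a single temporary array plus a small amount of overhead.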


Looks like one temporary array has to be allocated in both the vectorized and
comprehension forms (about 80 KB, i.e. 10000 Float64 values at 8 bytes each).
In the @timeit results above, those two forms are about 5-7X slower than the
plain for loop. I suppose the size of the penalty depends on the exact
calculation being done and the size of the arrays involved, so it would have
to be tested on a case-by-case basis.
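
For comparison, the temporary can be avoided without writing the loop by hand. The following is a minimal sketch, assuming a recent Julia (1.x) where dotted assignment fuses the broadcast; it is not what the test above ran. The function names `square_broadcast!` and `square_map!` are hypothetical.

```julia
# Assumes Julia 1.x. Both functions write into the caller's y.
function square_broadcast!(y, x)
    y .= x .* x                  # fused broadcast: results are stored directly in y
    return y
end

function square_map!(y, x)
    map!(xi -> xi * xi, y, x)    # map! stores each result in y
    return y
end

n = 10_000
x = collect(range(0.0, stop=1.0, length=n))
y1 = zeros(n)
y2 = zeros(n)

square_broadcast!(y1, x)
square_map!(y2, x)
```

Unlike `y = x.*x`, both forms modify the array passed in, so the caller sees the result. Whether they match the @simd loop in speed would still need to be measured for the case at hand.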

Thanks for the help - I'm sure I'll be back with more questions!

Dallas
