Here is my version.
First, the Python benchmark:
%timeit x**2
100000 loops, best of 3: 4.43 µs per loop
%timeit [xi**2 for xi in x]
100 loops, best of 3: 2.28 ms per loop
Second, the various Julia versions:
f1 = x -> x.^2                    # vectorized power
f2 = x -> x.*x                    # vectorized multiply
f3 = x -> [xi*xi for xi in x]     # comprehension

function f4(x)
    y = x                         # note: binds y to the same array (no copy)
    for i = 1:length(x)
        y[i] = x[i]*x[i]
    end
    return y
end

function f5(x)                    # explicit in-place loop
    for i = 1:length(x)
        x[i] *= x[i]
    end
end

function f6(x)                    # in-place loop, SIMD, no bounds checks
    @simd for i = 1:length(x)
        @inbounds x[i] *= x[i]
    end
end
@timeit f1(x)
1000 loops, best of 3: 149.74 µs per loop
@timeit f2(x)
1000 loops, best of 3: 60.59 µs per loop
@timeit f3(x)
10 loops, best of 3: 4.38 ms per loop
@timeit f4(x)
100000 loops, best of 3: 9.08 µs per loop
@timeit f5(x)
100000 loops, best of 3: 9.18 µs per loop
@timeit f6(x)
100000 loops, best of 3: 2.97 µs per loop
Comparing f1 and f2 shows that x.*x is faster than x.^2.
Comparing f2 and f3 shows that the comprehension is slower still.
The strong performance of f4 and f5 shows that for the best performance
we should write explicit loops rather than vectorized code.
Finally, f6 beats the Python timing: with @simd (letting LLVM vectorize
the loop) and @inbounds (dropping bounds checks), Julia can ultimately
be faster than NumPy.
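Incidentally, f4 and f5 are equivalent in effect: in Julia, y = x binds a second name to the same array rather than making a copy, so f4 also mutates its argument in place. That is consistent with their near-identical timings. A minimal sketch of the aliasing (the values here are illustrative):

```julia
x = [1.0, 2.0, 3.0]
y = x              # binds y to the same array; no copy is made
y[1] = 10.0        # mutating through y...
x[1]               # ...changes x as well (now 10.0)

z = copy(x)        # an independent copy
z[2] = 99.0
x[2]               # still 2.0; x is unaffected
```

If an untouched input is wanted, f4 should start with y = copy(x) (or y = similar(x)) instead.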
Furthermore, I noticed an interesting phenomenon in NumPy:
%timeit x**2
100000 loops, best of 3: 4.54 µs per loop
%timeit x**4
1000 loops, best of 3: 774 µs per loop
%timeit (x**2)**2
100000 loops, best of 3: 9.85 µs per loop
This suggests that NumPy special-cases **2 rather than optimizing the ** operator itself!
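Since (x**2)**2 computes the same values as x**4 (up to rounding), repeated squaring is a cheap workaround whenever the exponent is a power of two. The analogous check is easy to run on the Julia side; a sketch in the 0.3-era syntax used in this thread (timings will of course vary by machine):

```julia
using TimeIt                  # provides the @timeit macro used above
x = linspace(0.0, 1.0, 10000)
@timeit x.^4                  # general elementwise power
@timeit (x.^2).^2             # repeated squaring, same values
```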
On Monday, March 16, 2015 at 3:31:41 PM UTC+1, Sisyphuss wrote:
>
> That's interesting!
>
> On Sunday, March 15, 2015 at 4:18:34 PM UTC+1, Dallas Morisette wrote:
>>
>> Thanks everyone for the suggestions! Here is my updated test:
>>
>> using TimeIt
>> function vec!(x,y)
>>     y = x.*x
>> end
>>
>> function comp!(x,y)
>>     y = [xi*xi for xi in x]
>> end
>>
>> function forloop!(x,y,n)
>>     for i = 1:n
>>         y[i] = x[i]*x[i]
>>     end
>> end
>>
>> function forloop2!(x,y,n)
>>     @simd for i = 1:n
>>         @inbounds y[i] = x[i]*x[i]
>>     end
>> end
>>
>> function test()
>>     n = 10000
>>     x = linspace(0.0,1.0,n)
>>     y = zeros(x)
>>     @timeit vec!(x,y)
>>     @timeit comp!(x,y)
>>     @timeit forloop!(x,y,n)
>>     @timeit forloop2!(x,y,n)
>> end
>> test();
>>
>> 10000 loops, best of 3: 87.82 µs per loop
>> 1000 loops, best of 3: 62.73 µs per loop
>> 10000 loops, best of 3: 12.66 µs per loop
>> 100000 loops, best of 3: 3.54 µs per loop
>>
>>
>> So the SIMD macros combined with a literal for loop give performance
>> essentially equivalent to a call to numpy. I switched to @time so I could
>> see the allocations:
>>
>> elapsed time: 2.467e-5 seconds (80512 bytes allocated)
>> elapsed time: 2.1358e-5 seconds (80048 bytes allocated)
>> elapsed time: 1.5124e-5 seconds (0 bytes allocated)
>> elapsed time: 6.108e-6 seconds (0 bytes allocated)
>>
>>
>> Looks like one temporary array has to be allocated in both vectorized and
>> comprehension forms, which reduced the performance by about 5-7X. I suppose
>> this would depend on the exact calculation being done and the size of the
>> arrays involved and would have to be tested on a case-by-case basis.
>>
>> Thanks for the help - I'm sure I'll be back with more questions!
>>
>> Dallas
>>
>