Here is my version.

First, the Python (NumPy) benchmark:
%timeit x**2

100000 loops, best of 3: 4.43 µs per loop

%timeit [xi**2 for xi in x]

100 loops, best of 3: 2.28 ms per loop
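
For anyone who wants to rerun this outside IPython, here is a minimal sketch using the standard timeit module. The definition of x is not shown above, so the 10000-element array is an assumption taken from n = 10000 in the quoted Julia test, and best_time is a made-up helper name.

```python
import timeit
import numpy as np

# Assumption: x is a 10000-element float64 array, matching n = 10000
# in the quoted Julia test; the post itself does not show how x was built.
x = np.linspace(0.0, 1.0, 10000)

def best_time(stmt, number, repeat=3):
    """Best per-loop time in seconds over `repeat` runs (hypothetical helper)."""
    timer = timeit.Timer(stmt, globals={"x": x})
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Vectorized NumPy power versus a pure-Python list comprehension.
vectorized = best_time("x**2", number=10000)
comprehension = best_time("[xi**2 for xi in x]", number=10)

print("x**2:                %.2f us per loop" % (vectorized * 1e6))
print("[xi**2 for xi in x]: %.2f us per loop" % (comprehension * 1e6))
```

The absolute numbers will depend on the machine and the NumPy version, so they should not be expected to match the ones above.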


Second, several Julia versions:
f1 = x -> x.^2
f2 = x -> x.*x
f3 = x -> [xi*xi for xi in x]

function f4(x)
    y = x  # note: y is the same array as x (no copy), so x is overwritten
    for i = 1:length(x) 
        y[i] = x[i]*x[i]
    end
    return y
end

function f5(x)
    for i = 1:length(x) 
        x[i] *= x[i]
    end
end

function f6(x)
    @simd for i = 1:length(x) 
        @inbounds x[i] *= x[i]
    end
end

@timeit f1(x)

1000 loops, best of 3: 149.74 µs per loop
@timeit f2(x)

1000 loops, best of 3: 60.59 µs per loop
@timeit f3(x)

10 loops, best of 3: 4.38 ms per loop
@timeit f4(x)

100000 loops, best of 3: 9.08 µs per loop

@timeit f5(x)

100000 loops, best of 3: 9.18 µs per loop

@timeit f6(x)

100000 loops, best of 3: 2.97 µs per loop


Comparing f1 and f2 shows that .* is faster than .^2.
Comparing f2 and f3 shows that the comprehension is slower than the 
vectorized form.
The speed of f4 and f5 suggests that, for best performance, we should 
write explicit loops rather than vectorized code.
f6 being faster than the Python x**2 shows that, with @simd and 
@inbounds, Julia can end up faster than NumPy.
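
A plausible reason f4 through f6 are fast is that they write into an existing array instead of allocating a result on every call. The same effect can be sketched on the NumPy side with the out= argument of np.multiply. This is only an illustration: the array size is assumed from the quoted test, and the names y, square_alloc and square_into are invented for this example.

```python
import timeit
import numpy as np

x = np.linspace(0.0, 1.0, 10000)  # assumed size, as in the quoted test
y = np.empty_like(x)              # preallocated output buffer

def square_alloc(x):
    # Allocates a fresh result array on every call.
    return x * x

def square_into(x, y):
    # Writes the result into the existing buffer y; no new array is created.
    np.multiply(x, x, out=y)
    return y

t_alloc = min(timeit.repeat(lambda: square_alloc(x), number=10000, repeat=3)) / 10000
t_into = min(timeit.repeat(lambda: square_into(x, y), number=10000, repeat=3)) / 10000
print("allocating:   %.2f us per loop" % (t_alloc * 1e6))
print("preallocated: %.2f us per loop" % (t_into * 1e6))
```

Whether the preallocated form is measurably faster will depend on the array size and the machine, so it is worth timing both.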

I also found an interesting behavior in NumPy:
%timeit x**2
%timeit x**4
%timeit (x**2)**2

100000 loops, best of 3: 4.54 µs per loop
1000 loops, best of 3: 774 µs per loop
100000 loops, best of 3: 9.85 µs per loop

This suggests that NumPy special-cases **2 rather than optimizing the ** 
operator in general!
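
To check whether this still holds on a given NumPy version, a small script along these lines could be used. The array size is again an assumption (x is not defined in the post), and x*x*x*x is an extra candidate added here for comparison; it was not part of the measurements above.

```python
import timeit
import numpy as np

x = np.linspace(0.0, 1.0, 10000)  # assumed size; not shown in the post

# Expressions to time; the last one is an extra candidate, not from the post.
candidates = {
    "x**2": lambda: x**2,
    "x**4": lambda: x**4,
    "(x**2)**2": lambda: (x**2)**2,
    "x*x*x*x": lambda: x*x*x*x,
}

# The three ways of computing the fourth power should agree to rounding error.
fourth = x**4
assert np.allclose(fourth, (x**2)**2)
assert np.allclose(fourth, x*x*x*x)

for label, fn in candidates.items():
    per_loop = min(timeit.repeat(fn, number=200, repeat=3)) / 200
    print("%-10s %8.2f us per loop" % (label, per_loop * 1e6))
```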


On Monday, March 16, 2015 at 3:31:41 PM UTC+1, Sisyphuss wrote:
>
> That's interesting!
>
> On Sunday, March 15, 2015 at 4:18:34 PM UTC+1, Dallas Morisette wrote:
>>
>> Thanks everyone for the suggestions! Here is my updated test:
>>
>> using TimeIt
>> function vec!(x,y)
>>     y = x.*x
>> end
>>
>> function comp!(x,y)
>>     y = [xi*xi for xi in x]
>> end
>>
>> function forloop!(x,y,n)
>>     for i = 1:n 
>>         y[i] = x[i]*x[i]
>>     end
>> end
>>
>> function forloop2!(x,y,n)
>>     @simd for i = 1:n 
>>         @inbounds y[i] = x[i]*x[i] 
>>     end
>> end
>>     
>> function test()
>>     n = 10000
>>     x = linspace(0.0,1.0,n)
>>     y = zeros(x)
>>     @timeit vec!(x,y)
>>     @timeit comp!(x,y)
>>     @timeit forloop!(x,y,n)
>>     @timeit forloop2!(x,y,n)
>> end
>> test();
>>
>> 10000 loops, best of 3: 87.82 µs per loop
>> 1000 loops, best of 3: 62.73 µs per loop
>> 10000 loops, best of 3: 12.66 µs per loop
>> 100000 loops, best of 3: 3.54 µs per loop
>>
>>
>> So the SIMD macros combined with a literal for loop give performance 
>> essentially equivalent to a call to numpy. I switched to @time so I could 
>> see the allocations:
>>
>> elapsed time: 2.467e-5 seconds (80512 bytes allocated)
>> elapsed time: 2.1358e-5 seconds (80048 bytes allocated)
>> elapsed time: 1.5124e-5 seconds (0 bytes allocated)
>> elapsed time: 6.108e-6 seconds (0 bytes allocated)
>>
>>
>> Looks like one temporary array has to be allocated in both vectorized and 
>> comprehension forms, which reduced the performance by about 5-7X. I suppose 
>> this would depend on the exact calculation being done and the size of the 
>> arrays involved and would have to be tested on a case-by-case basis. 
>>
>> Thanks for the help - I'm sure I'll be back with more questions!
>>
>> Dallas
>>
>
