Indeed, x^2 seems to be a special-cased optimization in NumPy. Compare with x^3:

Julia:
julia> @timeit y = x.^3
1000 loops, best of 3: 265.99 µs per loop

Python:
In [1]: %timeit y = x**3
1000 loops, best of 3: 259 µs per loop
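A minimal sketch of this comparison using only the standard-library timeit module (absolute timings are machine-dependent; the array size of 10000 is an assumption matching the tests later in the thread):

```python
import numpy as np
import timeit

x = np.linspace(0.0, 1.0, 10000)

# Compare the special-cased square against an explicit product and a cube.
for expr in ("x**2", "x*x", "x**3"):
    t = timeit.timeit(expr, globals={"x": x, "np": np}, number=1000)
    print(f"{expr}: {t / 1000 * 1e6:.2f} µs per call")

# Whatever the timings, the squared results agree numerically.
assert np.allclose(x**2, x*x)
```

Typically `x**2` and `x*x` land in the same ballpark while `x**3` is much slower, consistent with the numbers above.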


On Monday, March 16, 2015 at 4:08:55 PM UTC, Sisyphuss wrote:
>
> Here is my version.
>
> Firstly, the benchmark of Python.
> %timeit x**2
>
> 100000 loops, best of 3: 4.43 µs per loop
>
> %timeit [xi**2 for xi in x]
>
> 100 loops, best of 3: 2.28 ms per loop
>
>
> Secondly, the various Julia versions.
> f1 = x -> x.^2
> f2 = x -> x.*x
> f3 = x -> [xi*xi for xi in x]
>
> function f4(x)
>     y = similar(x)  # note: `y = x` would merely alias x, not allocate a copy
>     for i = 1:length(x) 
>         y[i] = x[i]*x[i]
>     end
>     return y
> end
>
> function f5(x)
>     for i = 1:length(x) 
>         x[i] *= x[i]
>     end
> end
>
> function f6(x)
>     @simd for i = 1:length(x) 
>         @inbounds x[i] *= x[i]
>     end
> end
>
> @timeit f1(x)
>
> 1000 loops, best of 3: 149.74 µs per loop
> @timeit f2(x)
>
> 1000 loops, best of 3: 60.59 µs per loop
> @timeit f3(x)
>
> 10 loops, best of 3: 4.38 ms per loop
> @timeit f4(x)
>
> 100000 loops, best of 3: 9.08 µs per loop
>
> @timeit f5(x)
>
> 100000 loops, best of 3: 9.18 µs per loop
>
> @timeit f6(x)
>
> 100000 loops, best of 3: 2.97 µs per loop
>
>
> The comparison of f1 and f2 shows that .* is faster than .^2.
> The comparison of f2 and f3 shows that the comprehension is slower.
> The high performance of f4 and f5 shows that, for the best performance, we 
> should write the loops ourselves instead of vectorized code.
> The higher performance of f6 compared to Python shows that Julia is 
> ultimately faster than NumPy here.
>
> Furthermore, I find an interesting phenomenon for NumPy:
> %timeit x**2
> %timeit x**4
> %timeit (x**2)**2
>
> 100000 loops, best of 3: 4.54 µs per loop
> 1000 loops, best of 3: 774 µs per loop
> 100000 loops, best of 3: 9.85 µs per loop
>
> That suggests NumPy only optimized **2, not the ** operator itself!
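A timing-free way to check the observation above: (x**2)**2 computes the same values as x**4, so chaining the cheap squares is a workaround whenever the general power path is the bottleneck. A small sketch:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10000)

# x**4 goes through the general power path; x**2 is special-cased,
# so composing two squares can sidestep the slow path.
y_pow = x**4
y_sq = (x**2)**2

# Both routes produce the same values (up to floating-point tolerance).
assert np.allclose(y_pow, y_sq)
```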
>
>
> On Monday, March 16, 2015 at 3:31:41 PM UTC+1, Sisyphuss wrote:
>>
>> That's interesting!
>>
>> On Sunday, March 15, 2015 at 4:18:34 PM UTC+1, Dallas Morisette wrote:
>>>
>>> Thanks everyone for the suggestions! Here is my updated test:
>>>
>>> using TimeIt
>>> function vec!(x,y)
>>>     y[:] = x.*x  # assign in place; `y = x.*x` would only rebind the local
>>> end
>>>
>>> function comp!(x,y)
>>>     y[:] = [xi*xi for xi in x]  # same: copy into y rather than rebind
>>> end
>>>
>>> function forloop!(x,y,n)
>>>     for i = 1:n 
>>>         y[i] = x[i]*x[i]
>>>     end
>>> end
>>>
>>> function forloop2!(x,y,n)
>>>     @simd for i = 1:n 
>>>         @inbounds y[i] = x[i]*x[i] 
>>>     end
>>> end
>>>     
>>> function test()
>>>     n = 10000
>>>     x = linspace(0.0,1.0,n)
>>>     y = zeros(x)
>>>     @timeit vec!(x,y)
>>>     @timeit comp!(x,y)
>>>     @timeit forloop!(x,y,n)
>>>     @timeit forloop2!(x,y,n)
>>> end
>>> test();
>>>
>>> 10000 loops, best of 3: 87.82 µs per loop
>>> 1000 loops, best of 3: 62.73 µs per loop
>>> 10000 loops, best of 3: 12.66 µs per loop
>>> 100000 loops, best of 3: 3.54 µs per loop
>>>
>>>
>>> So the SIMD macros combined with a literal for loop give performance 
>>> essentially equivalent to a call to NumPy. I switched to @time so I could 
>>> see the allocations:
>>>
>>> elapsed time: 2.467e-5 seconds (80512 bytes allocated)
>>> elapsed time: 2.1358e-5 seconds (80048 bytes allocated)
>>> elapsed time: 1.5124e-5 seconds (0 bytes allocated)
>>> elapsed time: 6.108e-6 seconds (0 bytes allocated)
>>>
>>>
>>> Looks like one temporary array has to be allocated in both the vectorized 
>>> and comprehension forms, which reduces performance here by about 5-7x. I 
>>> suppose this would depend on the exact calculation being done and the size 
>>> of the arrays involved and would have to be tested on a case-by-case basis. 
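For comparison, NumPy can avoid that temporary too, by writing the ufunc result directly into a preallocated array via the `out=` parameter, which mirrors the in-place Julia loops above. A minimal sketch (array size is illustrative):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10000)
y = np.empty_like(x)

# Write the product straight into y: no temporary result array is allocated.
np.multiply(x, x, out=y)

assert np.allclose(y, x**2)
```

Whether this matters depends, as noted, on the size of the arrays and the cost of the rest of the expression.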
>>>
>>> Thanks for the help - I'm sure I'll be back with more questions!
>>>
>>> Dallas
>>>
>>
