Indeed, x^2 seems to be an optimization case in NumPy. See, for x^3:

Julia:
julia> @timeit y = x.^3
1000 loops, best of 3: 265.99 µs per loop
Python:
In [1]: %timeit y = x**3
1000 loops, best of 3: 259 µs per loop

On Monday, March 16, 2015 at 4:08:55 PM UTC, Sisyphuss wrote:
>
> Here is my version.
>
> Firstly, the benchmark of Python:
> %timeit x**2
> 100000 loops, best of 3: 4.43 µs per loop
>
> %timeit [xi**2 for xi in x]
> 100 loops, best of 3: 2.28 ms per loop
>
> Secondly, the various Julia versions:
> f1 = x -> x.^2
> f2 = x -> x.*x
> f3 = x -> [xi*xi for xi in x]
>
> function f4(x)
>     y = x
>     for i = 1:length(x)
>         y[i] = x[i]*x[i]
>     end
>     return y
> end
>
> function f5(x)
>     for i = 1:length(x)
>         x[i] *= x[i]
>     end
> end
>
> function f6(x)
>     @simd for i = 1:length(x)
>         @inbounds x[i] *= x[i]
>     end
> end
>
> @timeit f1(x)
> 1000 loops, best of 3: 149.74 µs per loop
> @timeit f2(x)
> 1000 loops, best of 3: 60.59 µs per loop
> @timeit f3(x)
> 10 loops, best of 3: 4.38 ms per loop
> @timeit f4(x)
> 100000 loops, best of 3: 9.08 µs per loop
> @timeit f5(x)
> 100000 loops, best of 3: 9.18 µs per loop
> @timeit f6(x)
> 100000 loops, best of 3: 2.97 µs per loop
>
> The comparison of f1 and f2 shows that .* is faster than .^2.
> The comparison of f2 and f3 shows that a comprehension is slower.
> The high performance of f4 and f5 shows that for higher performance we
> should write loops ourselves instead of writing vectorized code.
> The higher performance of f6 compared to Python shows that Julia is
> ultimately faster than NumPy.
>
> Furthermore, I find an interesting phenomenon in NumPy:
> %timeit x**2
> %timeit x**4
> %timeit (x**2)**2
>
> 100000 loops, best of 3: 4.54 µs per loop
> 1000 loops, best of 3: 774 µs per loop
> 100000 loops, best of 3: 9.85 µs per loop
>
> This suggests NumPy only optimized **2, not the ** operator itself!
>
> On Monday, March 16, 2015 at 3:31:41 PM UTC+1, Sisyphuss wrote:
>>
>> That's interesting!
>>
>> On Sunday, March 15, 2015 at 4:18:34 PM UTC+1, Dallas Morisette wrote:
>>>
>>> Thanks everyone for the suggestions!
>>> Here is my updated test:
>>>
>>> using TimeIt
>>>
>>> function vec!(x,y)
>>>     y = x.*x
>>> end
>>>
>>> function comp!(x,y)
>>>     y = [xi*xi for xi in x]
>>> end
>>>
>>> function forloop!(x,y,n)
>>>     for i = 1:n
>>>         y[i] = x[i]*x[i]
>>>     end
>>> end
>>>
>>> function forloop2!(x,y,n)
>>>     @simd for i = 1:n
>>>         @inbounds y[i] = x[i]*x[i]
>>>     end
>>> end
>>>
>>> function test()
>>>     n = 10000
>>>     x = linspace(0.0,1.0,n)
>>>     y = zeros(x)
>>>     @timeit vec!(x,y)
>>>     @timeit comp!(x,y)
>>>     @timeit forloop!(x,y,n)
>>>     @timeit forloop2!(x,y,n)
>>> end
>>> test();
>>>
>>> 10000 loops, best of 3: 87.82 µs per loop
>>> 1000 loops, best of 3: 62.73 µs per loop
>>> 10000 loops, best of 3: 12.66 µs per loop
>>> 100000 loops, best of 3: 3.54 µs per loop
>>>
>>> So the SIMD macros combined with a literal for loop give performance
>>> essentially equivalent to a call to NumPy. I switched to @time so I could
>>> see the allocations:
>>>
>>> elapsed time: 2.467e-5 seconds (80512 bytes allocated)
>>> elapsed time: 2.1358e-5 seconds (80048 bytes allocated)
>>> elapsed time: 1.5124e-5 seconds (0 bytes allocated)
>>> elapsed time: 6.108e-6 seconds (0 bytes allocated)
>>>
>>> It looks like one temporary array has to be allocated in both the vectorized
>>> and comprehension forms, which reduced the performance by about 5-7x. I
>>> suppose this would depend on the exact calculation being done and the size
>>> of the arrays involved, and would have to be tested on a case-by-case basis.
>>>
>>> Thanks for the help - I'm sure I'll be back with more questions!
>>>
>>> Dallas
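A side note on Sisyphuss's `**` observation above: decomposing a larger integer power into repeated squarings, e.g. computing `x**4` as `(x**2)**2`, produces the same values up to floating-point rounding, which is consistent with the `(x**2)**2` timing tracking two applications of the fast `**2` path. A minimal NumPy sketch of the equivalence (the array size is my assumption; the thread doesn't state it):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10000)

# x**4 computed directly vs. as two applications of the **2 path.
direct = x**4
squared_twice = (x**2)**2

# The two routes agree to within floating-point rounding, so the rewrite
# is safe whenever an exponent factors into smaller powers: x**(2*2) == (x**2)**2.
print(np.allclose(direct, squared_twice))  # True

# And **2 matches an explicit elementwise multiply, which is what the
# fast special case amounts to.
print(np.allclose(x**2, x * x))  # True
```

Whether this decomposition is worth doing depends on the timings above: it only pays off when the direct `**` call is much slower than two cheap squarings.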

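Dallas's allocation point at the end carries over to the NumPy side as well: `y = x * x` allocates a fresh output array on every call, while the ufunc `out=` argument writes into a preallocated buffer, which is the NumPy analogue of `forloop!` filling `y[i] = x[i]*x[i]` in place. A sketch, with names and sizes of my own choosing:

```python
import numpy as np

n = 10000
x = np.linspace(0.0, 1.0, n)
y = np.empty(n)  # preallocated output, reused across calls

def square_alloc(x):
    # Allocating form: builds a new result array every call.
    return x * x

def square_inplace(x, y):
    # In-place form: np.multiply writes into the preallocated buffer,
    # analogous to the zero-allocation Julia loops in the thread.
    np.multiply(x, x, out=y)
    return y

assert np.array_equal(square_alloc(x), square_inplace(x, y))
assert square_inplace(x, y) is y  # no new array was allocated
```

As in the Julia measurements, whether skipping the temporary matters in practice depends on the size of the arrays and the surrounding computation.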