Actually, if you want this to be fast I don't think you can avoid the if y 
== 2 branch, although ideally it should be outside the loop and not inside 
it. LLVM will optimize x^2, but if it doesn't know what the y in x^y is at 
compile time, it will just call libm. It's not going to emit a branch to 
handle the case where y == 2, since it has no idea what arguments the 
function will actually be called with and it doesn't want to slow things 
down if it's not called with y == 2. It just so happens that X.^2 is pretty 
common in real life.
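One way to keep that branch out of the inner loop is to test y once and pick the loop body up front. A minimal sketch (the name pow_elementwise is mine, not anything in Base):

```julia
# Hoist the y == 2 test out of the loop: the common squaring case runs a
# tight multiply loop, and everything else falls back to the generic x^y.
function pow_elementwise(x::Array{Float64}, y::Number)
    out = similar(x)
    if y == 2
        for i = 1:length(x)
            @inbounds out[i] = x[i]*x[i]
        end
    else
        for i = 1:length(x)
            @inbounds out[i] = x[i]^y
        end
    end
    out
end
```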

In the Octave source I see that Octave special-cases X.^2, X.^3, and 
X.^(-1). (Notice how X.^4 is ~6x slower.) We could do the same, or inline 
the loop so that LLVM's optimizers can get a crack at it in the constant 
power case, or dynamically generate a specialized loop that uses multiply 
operations for arbitrary integer powers (but see here 
<https://github.com/JuliaLang/julia/issues/2741#issuecomment-47008166> for 
a caveat: this will be faster than calling libm, but it will be less 
accurate). But if you want a fast version of X.^2, right now you can just 
call abs2(X).
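For real inputs abs2(x) is just x*x, so applying it elementwise gives the same values as squaring with .^ while skipping the generic power routine entirely. A quick illustration:

```julia
# abs2 squares with a multiply rather than the generic power function.
xs = [1.5, -2.0, 0.25]
squares = [abs2(v) for v in xs]   # same values as xs.^2
```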

Tony is right that sumabs2(X, 1) will be faster, both versus sum(abs2(X), 1) 
and versus anything you could achieve with Octave, since it avoids 
allocating a temporary array and exploits BLAS. (On my machine it's about 
9x faster than sum(X.^2, 1) in Octave.) But on my machine sum(abs2(X), 1) 
in Julia is only a few percent slower than sum(X.^2, 1) in Octave if that's 
all you want.
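The reduction sumabs2 performs can be written out as a plain double loop to see why no temporary is needed; a sketch (the name colsumabs2 is mine):

```julia
# Sum of squares down each column, accumulated in place: no full-size
# temporary like sum(X.^2, 1) would allocate.
function colsumabs2(x::Matrix{Float64})
    m, n = size(x)
    out = zeros(1, n)
    for j = 1:n
        s = 0.0
        for i = 1:m
            @inbounds s += x[i,j]*x[i,j]
        end
        out[1,j] = s
    end
    out
end
```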

Simon

On Monday, September 8, 2014 12:37:16 PM UTC-4, Tony Kelman wrote:
>
> Calm down. It's very easy to match Octave here, by writing the standard 
> library in C. Julia chose not to go that route. Of course everyone would 
> like the concise vectorized syntax to be as fast as the for-loop syntax. If 
> it were easy to do that *without* writing the standard library in C, it 
> would've been done already. In the meantime it isn't easy and hasn't been 
> done yet, so when someone asks how to write code that's faster than Octave, 
> this is the answer.
>
> There have been various LLVM bugs regarding automatic optimization of 
> small integer powers into multiplications. The "if y == 2" code really 
> shouldn't be necessary here; LLVM is smarter than that, but it's sometimes 
> fickle and buggy under the demands Julia puts on it.
>
> And in this case the answer is clearly a reduction for which Julia has a 
> very good set of tools to express efficiently. If your output is much less 
> data than your input, who cares how fast you can calculate temporaries that 
> you don't need to save?
>
>
> On Monday, September 8, 2014 9:18:02 AM UTC-7, Jeff Waller wrote:
>>
>>
>>
>> On Monday, September 8, 2014 7:37:52 AM UTC-4, Mohammed El-Beltagy wrote:
>>>
>>> Octave seems to be doing a pretty good job for this type of calculation. 
>>> If you want to squeeze a bit more performance out of Julia, you can try to 
>>> use explicit loops (devectorize as in 
>>> http://julialang.org/blog/2013/09/fast-numeric/). Then you might also 
>>> remove bounds checking in a loop for faster performance. 
>>>
>>> Try this function:
>>> function doCalc(x::Array{Float64,2})
>>>     m, n = size(x)
>>>     XX = Array(Float64, m, n)
>>>     for j = 1:n, i = 1:m
>>>         @inbounds XX[i,j] = x[i,j]*x[i,j]
>>>     end
>>>     XX
>>> end
>>>
>>> Followed by 
>>> @time XX=doCalc(X);
>>>
>>
>>
>> I am totally not cool with this. Simply making things fast at whatever 
>> cost is not good enough. You can't tell someone "Oh sure, Julia does 
>> support that extremely convenient syntax, but because it's 6 times slower 
>> (and it is), you need to essentially write C". There's no reason Julia 
>> should be any slower than Octave here, except for bugs that will 
>> eventually be fixed. I think it's perfectly OK to say devectorize if you 
>> want to do even better, but it must be at least equal.
>>
>> Here's the implementation of .^
>>
>>
>> .^(x::StridedArray, y::Number) = reshape([ x[i] ^ y for i=1:length(x) ], 
>> size(x))
>> How is that any less vectorized?
>>
>> Yes, R, Matlab, and Octave do conflate vectorization for speed with 
>> vectorization for style, and that's a mistake. Julia tries not to do 
>> that, which is better. But do you expect people to throw away the Matlab 
>> syntax? No way.
>>
>> for me:
>>
>> julia> x = rand(7000,7000);
>>
>> julia> @time x.^2;
>>
>> elapsed time: 1.340131226 seconds (392000256 bytes allocated)
>>
>> And in Octave....
>>
>> >> x=rand(7000);
>>
>> >> tic;y=x.^2;toc
>>
>> Elapsed time is 0.201613 seconds.
>>
>> Comprehensions might be the culprit, or simply the use of the generic 
>> x^y. Here's an alternate implementation I threw together:
>>
>> julia> function .^{T}(x::Array{T}, y::Number)
>>            z = *(size(x)...)
>>            new_x = Array(T, z)
>>            tmp_x = reshape(x, z)
>>            for i in 1:z
>>                @inbounds if y == 2
>>                    new_x[i] = tmp_x[i]*tmp_x[i]
>>                else
>>                    new_x[i] = tmp_x[i]^y
>>                end
>>            end
>>            reshape(new_x, size(x))
>>        end
>> .^ (generic function with 24 methods)
>>
>> julia> @time x.^2;
>>
>> elapsed time: 0.248816016 seconds (392000360 bytes allocated, 4.73% gc 
>> time)
>>
>> Close....
>>
>>
>>
