On Thursday, 12 March 2015 at 10:31 -0700, Phil Tomson wrote:
> 
> 
> On Thursday, March 12, 2015 at 2:14:34 AM UTC-7, Mauro wrote:
>         Julia is not yet very good at producing fast vectorized code
>         which does not allocate temporaries.  The temporaries are what
>         get you here.
>         
>         However, running your example, I get a slightly different
>         *.mem file (which makes more sense to me):
>         
>                 - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>                 0   nl.hx = x 
>         248832000   wx = nl.w * nl.hx 
>         348364800   nl.pa = nl.b+wx 
>         1094864752   nl.pr = tanh(nl.pa).*nl.scale 
>                 - end 
> 
> I would have guessed it should look more like that; why would the
> multiplication not result in temporaries (in my case)? That was a bit
> mysterious. 
> 
>         
>         (What version of Julia are you running? I'm on 0.3.6.)
> 
> 0.3.4 in my case.
> 
>  
> 
>         So every time forward_propagate is called, some temporaries
>         are allocated.  So in performance-critical code you have to
>         write loops instead:
> 
> Will this always be the case, or is this a current limitation of the
> Julia compiler? It seems like the more idiomatic, compact code should
> be handled more efficiently. Having to break this out into nested
> for-loops definitely hurts both readability and productivity.
There has been a discussion about finding a syntax that would allow
efficient, non-allocating element-wise operations. See
https://github.com/JuliaLang/julia/issues/8450

I think your code is a typical pattern that should be handled by this
design. By making element-wise operators like .+ and .* return
generators rather than newly allocated arrays, one would be able to
replace the allocating
nl.pa = nl.b + wx
with a non-allocating:
nl.pa[:] = nl.b .+ wx

But these are just discussions at the moment.
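
In the meantime, you can get most of the way there by hand with the
in-place functions that already exist. Here is a rough sketch (untested;
it assumes A_mul_B! and broadcast! are available in your 0.3.x, and it
makes the caller pass in a preallocated wx buffer):

function forward_propagate!(nl::NeuralLayer, x::Vector{Float32},
                            wx::Vector{Float32})
    nl.hx = x
    A_mul_B!(wx, nl.w, nl.hx)        # wx = nl.w * nl.hx, written in place
    broadcast!(+, nl.pa, nl.b, wx)   # nl.pa = nl.b + wx, no temporary
    for i in 1:length(nl.pr)         # nl.pr = tanh(nl.pa) .* nl.scale
        nl.pr[i] = tanh(nl.pa[i]) * nl.scale[i]
    end
    return nl.pr
end

That keeps the code close to the vectorized original while allocating
nothing on each call.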


Regards

>         
>         function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>             nl.hx = x # note: nl.hx now points to the same chunk of memory
>             for i=1:size(nl.w,1) 
>                 nl.pa[i] = 0.; 
>                 for j=1:size(nl.w,2) 
>                     nl.pa[i] += nl.w[i,j]*nl.hx[j] 
>                 end 
>                 nl.pa[i] += nl.b[i] 
>                 nl.pr[i] = tanh(nl.pa[i])*nl.scale[i] 
>             end 
>         end 
>         
>  
>         This does not allocate any memory and runs your test case at
>         about 2x the speed.
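
A quick way to sanity-check that claim (just a sketch, reusing the test
harness from the file quoted below; in 0.3, @time reports the number of
bytes allocated along with the elapsed time):

function bench(nl, x, n)
    for i in 1:n
        forward_propagate(nl, x)
    end
end

bench(nl, x, 1)                 # warm up so compilation isn't counted
@time bench(nl, x, 1920*1080)   # compare time and bytes for both versions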
>         
>         Also a note on the code in your first email.  Instead of: 
>         
>           for y in 1:img.height 
>             @simd for x in 1:img.wid 
>               if 1 < x < img.wid 
>                 @inbounds left   = img.data[x-1,y] 
>                 @inbounds center = img.data[x,y] 
>                 @inbounds right  = img.data[x+1,y] 
>         
>         you should be able to write: 
>         
>           @inbounds for y in 1:img.height 
>             @simd for x in 1:img.wid 
>               if 1 < x < img.wid 
>                 left   = img.data[x-1,y] 
>                 center = img.data[x,y] 
>                 right  = img.data[x+1,y] 
>         
>         Also, did you check that @simd works?  I'm no expert on that,
>         but my understanding is that most of the time it doesn't work
>         with if-else.  If that is the case, maybe special-case the
>         first and last iterations and run the loop like:
>         @simd for x in 2:img.wid-1 .  In fact, that would save you a
>         comparison in each iteration irrespective of @simd. 
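
For what it's worth, the special-cased version could look roughly like
this (only a sketch; the "..." stands for the rest of the original loop
body, which isn't shown in the quoted code):

@inbounds for y in 1:img.height
    # handle x == 1 and x == img.wid separately here, using whatever
    # the original else-branch did for the border columns
    @simd for x in 2:img.wid-1
        left   = img.data[x-1,y]
        center = img.data[x,y]
        right  = img.data[x+1,y]
        # ... rest of the original loop body, now branch-free ...
    end
end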
>         
>         On Thu, 2015-03-12 at 02:17, Phil Tomson <[email protected]> wrote: 
>         > I transformed it into a single-file testcase: 
>         > 
>         > ######################################################### 
>         > type NeuralLayer 
>         >     w::Matrix{Float32}   # weights 
>         >     cm::Matrix{Float32}  # connection matrix 
>         >     b::Vector{Float32}   # biases 
>         >     scale::Vector{Float32}  # 
>         >     a_func::Symbol     # activation function 
>         >     hx::Vector{Float32}  # input values 
>         >     pa::Vector{Float32}  # pre activation values 
>         >     pr::Vector{Float32}  # predictions (activation values) 
>         >     frozen::Bool 
>         > end 
>         > 
>         > function forward_propagate(nl::NeuralLayer,x::Vector{Float32}) 
>         >   nl.hx = x 
>         >   wx = nl.w * nl.hx 
>         >   nl.pa = nl.b+wx 
>         >   nl.pr = tanh(nl.pa).*nl.scale 
>         > end 
>         > 
>         > out_dim = 10 
>         > in_dim = 10 
>         > b = sqrt(6) / sqrt(in_dim + out_dim) 
>         > 
>         > nl = NeuralLayer( 
>         >        float32(2.0b * rand(Float32,out_dim,in_dim) - b), #setup rand weights 
>         >        ones(Float32,out_dim,in_dim), #connection matrix 
>         >        float32(map(x->x*(randbool()?-1:1),rand(out_dim)*rand(1:4))), #biases 
>         >        rand(Float32,out_dim),  # scale 
>         >        :tanh, 
>         >        rand(Float32,in_dim), 
>         >        rand(Float32,out_dim), 
>         >        rand(Float32,out_dim), 
>         >        false 
>         >     ) 
>         > 
>         > x = ones(Float32,in_dim) 
>         > forward_propagate(nl,x) 
>         > clear_malloc_data() 
>         > for i in 1:(1920*1080) 
>         >   forward_propagate(nl,x) 
>         > end 
>         > println("nl.pr is: $(nl.pr)") 
>         >
>         > ############################################################################# 
>         > 
>         > Now the interesting part of the  .mem file looks like this: 
>         > 
>         >        - function forward_propagate(nl::NeuralLayer,x::Vector{Float32}) 
>         >         0   nl.hx = x 
>         >         0   wx = nl.w * nl.hx 
>         >   348368752   nl.pa = nl.b+wx 
>         >         0   nl.pr = tanh(nl.pa).*nl.scale 
>         >         - end 
>         > 
>         > I split up the matrix multiply and the addition of the bias
>         > vector into two separate lines, and it looks like it's the
>         > vector addition that's allocating all of the memory (which
>         > seems surprising, but maybe I'm missing something). 
>         > 
>         > Phil 
>         
