Julia is not yet very good at producing fast vectorized code that does
not allocate temporaries.  The temporaries are what get you here.

However, running your example, I get a slightly different *.mem file
(which makes more sense to me):

        - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
        0   nl.hx = x 
248832000   wx = nl.w * nl.hx
348364800   nl.pa = nl.b+wx
1094864752   nl.pr = tanh(nl.pa).*nl.scale 
        - end

(What version of Julia are you running?  I'm on 0.3.6.)  So every time
forward_propagate is called, some temporaries are allocated.  In
performance-critical code you have to write loops instead:

function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
    nl.hx = x # note: nl.hx now points to the same chunk of memory as x
    for i=1:size(nl.w,1)
        nl.pa[i] = 0.;
        for j=1:size(nl.w,2)
            nl.pa[i] += nl.w[i,j]*nl.hx[j]
        end
        nl.pa[i] += nl.b[i]
        nl.pr[i] = tanh(nl.pa[i])*nl.scale[i]
    end
end

This does not allocate any memory and runs your test case at about 2x
the speed.
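
You can double-check the allocation claim with @time (just a quick
sanity check; in 0.3 it prints a bytes-allocated figure):

  forward_propagate(nl,x)           # warm up so compilation isn't counted
  @time for i in 1:(1920*1080)      # bytes allocated should be close to zero
    forward_propagate(nl,x)         # with the loop version
  end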

Also, a note on the code in your first email.  Instead of:

  for y in 1:img.height
    @simd for x in 1:img.wid
      if 1 < x < img.wid
        @inbounds left   = img.data[x-1,y]
        @inbounds center = img.data[x,y]
        @inbounds right  = img.data[x+1,y]

you should be able to write:

  @inbounds for y in 1:img.height
    @simd for x in 1:img.wid
      if 1 < x < img.wid
        left   = img.data[x-1,y]
        center = img.data[x,y]
        right  = img.data[x+1,y]

Also, did you check that @simd actually works there?  I'm no expert on
that, but my understanding is that most of the time it doesn't work
with if-else.  If that is the case, maybe special-case the first and
last iterations and run the loop like: @simd for x in 2:img.wid-1 .  In
fact that would save you a comparison in each iteration irrespective of
@simd.
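
Something like this sketch, for example (out and the 3-tap average are
just placeholders for whatever your per-pixel computation actually does
with left/center/right):

  function filter_rows!(out, img)
    @inbounds for y in 1:img.height
      out[1,y]       = img.data[1,y]        # handle the edge columns however
      out[img.wid,y] = img.data[img.wid,y]  # your real code needs to
      @simd for x in 2:img.wid-1            # branch-free body, @simd-friendly
        left   = img.data[x-1,y]
        center = img.data[x,y]
        right  = img.data[x+1,y]
        out[x,y] = 0.25f0*left + 0.5f0*center + 0.25f0*right
      end
    end
  end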

On Thu, 2015-03-12 at 02:17, Phil Tomson <[email protected]> wrote:
> I transformed it into a single-file testcase:
>
> #########################################################
> type NeuralLayer
>     w::Matrix{Float32}   # weights 
>     cm::Matrix{Float32}  # connection matrix 
>     b::Vector{Float32}   # biases 
>     scale::Vector{Float32}  # 
>     a_func::Symbol     # activation function
>     hx::Vector{Float32}  # input values
>     pa::Vector{Float32}  # pre activation values
>     pr::Vector{Float32}  # predictions (activation values)
>     frozen::Bool
> end
>
> function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>   nl.hx = x 
>   wx = nl.w * nl.hx
>   nl.pa = nl.b+wx
>   nl.pr = tanh(nl.pa).*nl.scale 
> end
>
> out_dim = 10
> in_dim = 10
> b = sqrt(6) / sqrt(in_dim + out_dim)
>
> nl = NeuralLayer(
>        float32(2.0b * rand(Float32,out_dim,in_dim) - b), #setup rand weights
>        ones(Float32,out_dim,in_dim), #connection matrix
>          float32(map(x->x*(randbool()?-1:1),rand(out_dim)*rand(1:4))), 
> #biases
>        rand(Float32,out_dim),  # scale 
>        :tanh, 
>        rand(Float32,in_dim),
>        rand(Float32,out_dim),
>        rand(Float32,out_dim),
>        false
>     )
>
> x = ones(Float32,in_dim)
> forward_propagate(nl,x)
> clear_malloc_data()
> for i in 1:(1920*1080)
>   forward_propagate(nl,x)
> end
> println("nl.pr is: $(nl.pr)")
> #############################################################################
>
> Now the interesting part of the  .mem file looks like this:
>
>        - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>         0   nl.hx = x
>         0   wx = nl.w * nl.hx
>   348368752   nl.pa = nl.b+wx
>         0   nl.pr = tanh(nl.pa).*nl.scale
>         - end
>
> I split up the matrix multiply and the addition of bias vector into two 
> separate lines and it looks like it's the vector addition that's allocating 
> all of the memory (which seems surprising, but maybe I'm missing something).
>
> Phil
