On Thursday, March 12, 2015 at 2:14:34 AM UTC-7, Mauro wrote:
>
> Julia is not yet very good at producing fast vectorized code which
> does not allocate temporaries. The temporaries are what gets you here.
>
> However, running your example, I get a slightly different *.mem file
> (which makes more sense to me):
>
> - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> 0 nl.hx = x
> 248832000 wx = nl.w * nl.hx
> 348364800 nl.pa = nl.b+wx
> 1094864752 nl.pr = tanh(nl.pa).*nl.scale
> - end
>
I would have guessed it should look more like that; why would the
multiplication not result in temporaries (in my case)? That was a bit
mysterious.
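(For anyone reproducing this: the *.mem files quoted above come from running
the script with allocation tracking turned on and then reading the per-line
byte counts written next to the source when julia exits. Roughly like the
following, where the filename is only a stand-in for however the testcase is
saved:

    julia --track-allocation=user my_testcase.jl

The clear_malloc_data() call in the testcase further down resets the counters
after a warm-up call, so the reported numbers only cover the timed loop.)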
>
> (what version of julia are you running, me 0.3.6).
0.3.4 in my case.
> So every time
> forward_propagate is called some temporaries are allocated. So in
> performance-critical code you have to write loops instead:
>
Will this always be the case, or is it a current limitation of the Julia
compiler? It seems like the more idiomatic, compact code should be handled
more efficiently. Having to break this out into nested for-loops definitely
hurts both readability and productivity.
>
> function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>     nl.hx = x   # note: nl.hx now points to the same chunk of memory as x
>     for i=1:size(nl.w,1)
>         nl.pa[i] = 0.
>         for j=1:size(nl.w,2)
>             nl.pa[i] += nl.w[i,j]*nl.hx[j]
>         end
>         nl.pa[i] += nl.b[i]
>         nl.pr[i] = tanh(nl.pa[i])*nl.scale[i]
>     end
> end
>
>
> This does not allocate any memory and runs your test case at about 2x
> the speed.
>
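(As a middle ground between the fully vectorized version and hand-written
loops, something along these lines should also avoid the temporaries. This is
only an untested sketch; it assumes A_mul_B!, the in-place matrix-vector
product in Base, can be used here:

    function forward_propagate!(nl::NeuralLayer, x::Vector{Float32})
        nl.hx = x
        A_mul_B!(nl.pa, nl.w, nl.hx)       # in-place w*hx, no temporary vector
        for i in 1:length(nl.pa)
            nl.pa[i] += nl.b[i]            # add the bias in place
            nl.pr[i] = tanh(nl.pa[i]) * nl.scale[i]
        end
        nl.pr
    end

It keeps the matrix-vector product as a single BLAS call and only devectorizes
the cheap elementwise part, so it stays a bit closer to the original code.)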
> Also a note on the code in your first email. Instead of:
>
> for y in 1:img.height
>     @simd for x in 1:img.wid
>         if 1 < x < img.wid
>             @inbounds left = img.data[x-1,y]
>             @inbounds center = img.data[x,y]
>             @inbounds right = img.data[x+1,y]
>
> you should be able to write:
>
> @inbounds for y in 1:img.height
>     @simd for x in 1:img.wid
>         if 1 < x < img.wid
>             left = img.data[x-1,y]
>             center = img.data[x,y]
>             right = img.data[x+1,y]
>
> Also, did you check that the @simd works? I'm no expert on that but my
> understanding is that most of the time it doesn't work with if-else. If
> that is the case, maybe special-case the first and last iteration and
> run the loop like: @simd for x in 2:img.wid-1 . In fact that would save
> you the comparisons in each iteration irrespective of @simd.
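(A rough sketch of that restructuring. The real loop body from the first email
isn't quoted here, so the line writing to out below is only a placeholder for
whatever is actually computed from left/center/right:

    @inbounds for y in 1:img.height
        # handle x == 1 and x == img.wid separately, in whatever way the
        # original code treats the borders, then run the interior branch-free:
        @simd for x in 2:img.wid-1
            left   = img.data[x-1, y]
            center = img.data[x,   y]
            right  = img.data[x+1, y]
            out[x, y] = left + center + right   # placeholder for the real body
        end
    end

Dropping the if from the inner loop is what gives @simd a chance to vectorize
it, and it removes the comparisons from every iteration either way.)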
>
> On Thu, 2015-03-12 at 02:17, Phil Tomson wrote:
> > I transformed it into a single-file testcase:
> >
> > #########################################################
> > type NeuralLayer
> > w::Matrix{Float32} # weights
> > cm::Matrix{Float32} # connection matrix
> > b::Vector{Float32} # biases
> > scale::Vector{Float32} #
> > a_func::Symbol # activation function
> > hx::Vector{Float32} # input values
> > pa::Vector{Float32} # pre activation values
> > pr::Vector{Float32} # predictions (activation values)
> > frozen::Bool
> > end
> >
> > function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> > nl.hx = x
> > wx = nl.w * nl.hx
> > nl.pa = nl.b+wx
> > nl.pr = tanh(nl.pa).*nl.scale
> > end
> >
> > out_dim = 10
> > in_dim = 10
> > b = sqrt(6) / sqrt(in_dim + out_dim)
> >
> > nl = NeuralLayer(
> > float32(2.0b * rand(Float32,out_dim,in_dim) - b), # setup rand weights
> > ones(Float32,out_dim,in_dim), # connection matrix
> > float32(map(x->x*(randbool()?-1:1),rand(out_dim)*rand(1:4))), # biases
> > rand(Float32,out_dim), # scale
> > :tanh,
> > rand(Float32,in_dim),
> > rand(Float32,out_dim),
> > rand(Float32,out_dim),
> > false
> > )
> >
> > x = ones(Float32,in_dim)
> > forward_propagate(nl,x)
> > clear_malloc_data()
> > for i in 1:(1920*1080)
> > forward_propagate(nl,x)
> > end
> > println("nl.pr is: $(nl.pr)")
> >
> > #############################################################################
> >
> >
> > Now the interesting part of the .mem file looks like this:
> >
> > - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> > 0 nl.hx = x
> > 0 wx = nl.w * nl.hx
> > 348368752 nl.pa = nl.b+wx
> > 0 nl.pr = tanh(nl.pa).*nl.scale
> > - end
> >
> > I split up the matrix multiply and the addition of bias vector into two
> > separate lines and it looks like it's the vector addition that's
> > allocating all of the memory (which seems surprising, but maybe I'm
> > missing something).
> >
> > Phil
>
>