Le jeudi 12 mars 2015 à 10:31 -0700, Phil Tomson a écrit :
>
>
> On Thursday, March 12, 2015 at 2:14:34 AM UTC-7, Mauro wrote:
> Julia is not yet very good at producing fast vectorized code which
> does not allocate temporaries. The temporaries are what get you here.
>
> However, running your example, I get a slightly different *.mem file
> (which makes more sense to me):
>
> - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> 0 nl.hx = x
> 248832000 wx = nl.w * nl.hx
> 348364800 nl.pa = nl.b+wx
> 1094864752 nl.pr = tanh(nl.pa).*nl.scale
> - end
>
> I would have guessed it should look more like that; why would the
> multiplication not result in temporaries (in my case)? That was a bit
> mysterious.
>
>
> (What version of Julia are you running? I'm on 0.3.6.)
>
> 0.3.4 in my case.
>
>
>
> So every time forward_propagate is called, some temporaries are
> allocated. So in performance-critical code you have to write loops
> instead:
>
> Will this always be the case or is this a current limitation of the
> Julia compiler? It seems like the more idiomatic, compact code should
> be handled more efficiently. Having to break this out into nested
> for-loops definitely hurts both readability as well as productivity.
There's been a discussion to find a syntax that would allow efficient
non-allocating element-wise operations. See
https://github.com/JuliaLang/julia/issues/8450
I think your code is a typical pattern that should be handled by this
design. By adapting the element-wise operators and making them return
generators rather than newly allocated arrays, one would be able to
replace the allocating
nl.pa = nl.b + wx
with a non-allocating
nl.pa[:] = nl.b .+ wx
(Note that nl.w * nl.hx is a matrix-vector product, not an element-wise
operation, so it would need a different in-place form.)
But these are just discussions at the moment.
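While that syntax is being discussed, one way to avoid the temporary today is broadcast!, which writes an element-wise result directly into a preallocated array. A minimal sketch with standalone arrays (not the actual NeuralLayer fields):

```julia
# broadcast!(op, dest, A, B) applies op element-wise and stores the
# result in dest, so no temporary array is created for the result.
a   = rand(Float32, 10)
b   = rand(Float32, 10)
out = similar(a)          # preallocated destination

broadcast!(+, out, a, b)  # like out[:] = a .+ b, minus the temporary
```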
Regards
>
> function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> nl.hx = x # note: nl.hx now points to the same chunk of memory
> for i=1:size(nl.w,1)
> nl.pa[i] = 0.;
> for j=1:size(nl.w,2)
> nl.pa[i] += nl.w[i,j]*nl.hx[j]
> end
> nl.pa[i] += nl.b[i]
> nl.pr[i] = tanh(nl.pa[i])*nl.scale[i]
> end
> end
>
>
> This does not allocate any memory and runs your test case at about 2x
> the speed.
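For the matrix-vector product itself there is also an in-place alternative to the hand-written loop: the BLAS wrapper A_mul_B! (as named in Julia 0.3/0.4; later versions renamed it LinearAlgebra.mul!). A sketch with standalone arrays, not the original NeuralLayer fields:

```julia
# A_mul_B!(y, A, x) computes y = A*x in place, so the product does not
# allocate a fresh vector on every call.
w  = rand(Float32, 10, 10)
hx = rand(Float32, 10)
pa = zeros(Float32, 10)   # preallocated output

A_mul_B!(pa, w, hx)       # pa now holds w * hx
# (on Julia >= 0.7 the equivalent is LinearAlgebra.mul!(pa, w, hx))
```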
>
> Also a note on the code in your first email. Instead of:
>
> for y in 1:img.height
> @simd for x in 1:img.wid
> if 1 < x < img.wid
> @inbounds left = img.data[x-1,y]
> @inbounds center = img.data[x,y]
> @inbounds right = img.data[x+1,y]
>
> you should be able to write:
>
> @inbounds for y in 1:img.height
> @simd for x in 1:img.wid
> if 1 < x < img.wid
> left = img.data[x-1,y]
> center = img.data[x,y]
> right = img.data[x+1,y]
>
> Also, did you check that @simd works? I'm no expert on that, but my
> understanding is that most of the time it doesn't work with if-else.
> If that is the case, maybe special-case the first and last iterations
> and run the loop like: @simd for x in 2:img.wid-1 . In fact that would
> save you a comparison in each iteration, irrespective of @simd.
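A sketch of that restructuring, with the borders hoisted out so the @simd body is branch-free (the function name and the plain Matrix argument are stand-ins for the original img type):

```julia
# Sum each row's left/center/right neighborhoods over the interior
# columns only; x-1 and x+1 are then always in bounds, so the inner
# loop needs no if and can vectorize.
function horiz_sum(data::Matrix{Float32})
    wid, height = size(data)
    out = zeros(Float32, height)
    @inbounds for y in 1:height
        s = 0.0f0
        @simd for x in 2:wid-1
            s += data[x-1,y] + data[x,y] + data[x+1,y]
        end
        out[y] = s
        # handle x == 1 and x == wid separately here if needed
    end
    return out
end
```

Running the interior over 2:wid-1 both drops the per-iteration comparison and gives the vectorizer a straight-line body.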
>
> On Thu, 2015-03-12 at 02:17, Phil Tomson <[email protected]> wrote:
> > I transformed it into a single-file testcase:
> >
> > #########################################################
> > type NeuralLayer
> > w::Matrix{Float32} # weights
> > cm::Matrix{Float32} # connection matrix
> > b::Vector{Float32} # biases
> > scale::Vector{Float32} #
> > a_func::Symbol # activation function
> > hx::Vector{Float32} # input values
> > pa::Vector{Float32} # pre activation values
> > pr::Vector{Float32} # predictions (activation values)
> > frozen::Bool
> > end
> >
> > function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> > nl.hx = x
> > wx = nl.w * nl.hx
> > nl.pa = nl.b+wx
> > nl.pr = tanh(nl.pa).*nl.scale
> > end
> >
> > out_dim = 10
> > in_dim = 10
> > b = sqrt(6) / sqrt(in_dim + out_dim)
> >
> > nl = NeuralLayer(
> > float32(2.0b * rand(Float32,out_dim,in_dim) - b), # setup rand weights
> > ones(Float32,out_dim,in_dim), # connection matrix
> > float32(map(x->x*(randbool()?-1:1),rand(out_dim)*rand(1:4))), # biases
> > rand(Float32,out_dim), # scale
> > :tanh,
> > rand(Float32,in_dim),
> > rand(Float32,out_dim),
> > rand(Float32,out_dim),
> > false
> > )
> >
> > x = ones(Float32,in_dim)
> > forward_propagate(nl,x)
> > clear_malloc_data()
> > for i in 1:(1920*1080)
> > forward_propagate(nl,x)
> > end
> > println("nl.pr is: $(nl.pr)")
> >
> > #############################################################################
> >
> > Now the interesting part of the .mem file looks like this:
> >
> > - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> > 0 nl.hx = x
> > 0 wx = nl.w * nl.hx
> > 348368752 nl.pa = nl.b+wx
> > 0 nl.pr = tanh(nl.pa).*nl.scale
> > - end
> >
> > I split up the matrix multiply and the addition of the bias vector
> > into two separate lines and it looks like it's the vector addition
> > that's allocating all of the memory (which seems surprising, but
> > maybe I'm missing something).
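One quick way to double-check which expression is responsible, without rerunning the whole --track-allocation pass, is the @allocated macro, which reports the bytes allocated by a single expression (after a warm-up call so compilation is not counted):

```julia
b  = rand(Float32, 10)
wx = rand(Float32, 10)

b + wx                      # warm up: compile before measuring
bytes = @allocated(b + wx)  # bytes allocated by the vector addition
println("b + wx allocates $bytes bytes per call")
```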
> >
> > Phil
>