Julia is not yet very good at producing fast vectorized code that does
not allocate temporaries, and the temporaries are what get you here.
However, running your example, I get a slightly different *.mem file
(which makes more sense to me):
- function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
0 nl.hx = x
248832000 wx = nl.w * nl.hx
348364800 nl.pa = nl.b+wx
1094864752 nl.pr = tanh(nl.pa).*nl.scale
- end
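For reference, I'm assuming you generated the *.mem file the same way I
did: running the script with allocation tracking enabled, e.g.

julia --track-allocation=user test.jl   # test.jl is just a placeholder name

and then reading the .mem file that gets written next to the source when
julia exits.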
(What version of Julia are you running? I'm on 0.3.6.) So every time
forward_propagate is called some temporaries are allocated. So in
performance critical code you have to write loops instead:
function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
    nl.hx = x # note: nl.hx now points to the same chunk of memory as x
    for i=1:size(nl.w,1)
        nl.pa[i] = 0.
        for j=1:size(nl.w,2)
            nl.pa[i] += nl.w[i,j]*nl.hx[j]
        end
        nl.pa[i] += nl.b[i]
        nl.pr[i] = tanh(nl.pa[i])*nl.scale[i]
    end
end
This does not allocate any memory and runs your test case at about 2x
the speed.
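A quick way to check both claims, using the same nl and x as in your
testcase (this is just a sketch of how I timed it, not part of your code):

forward_propagate(nl,x)   # run once so compilation isn't counted
gc()
@time for i in 1:(1920*1080)
    forward_propagate(nl,x)
end

With the loop version, the bytes-allocated part of the @time output should
stay essentially flat instead of climbing into the hundreds of megabytes.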
Also a note on the code in your first email. Instead of:
for y in 1:img.height
    @simd for x in 1:img.wid
        if 1 < x < img.wid
            @inbounds left = img.data[x-1,y]
            @inbounds center = img.data[x,y]
            @inbounds right = img.data[x+1,y]
you should be able to write:
@inbounds for y in 1:img.height
    @simd for x in 1:img.wid
        if 1 < x < img.wid
            left = img.data[x-1,y]
            center = img.data[x,y]
            right = img.data[x+1,y]
Also, did you check that the @simd actually works here? I'm no expert on
that, but my understanding is that most of the time it doesn't work with
if-else. If that is the case, maybe special-case the first and last
iteration and run the loop like: @simd for x in 2:img.wid-1. In fact that
would save you the comparisons in each iteration irrespective of @simd.
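Something along these lines (just a sketch; I'm guessing at what your loop
body does with left/center/right and at how you want to handle the
boundary columns):

@inbounds for y in 1:img.height
    # handle x == 1 and x == img.wid separately here, however your
    # boundary case treats them
    @simd for x in 2:img.wid-1
        left   = img.data[x-1,y]
        center = img.data[x,y]
        right  = img.data[x+1,y]
        # ... rest of the loop body ...
    end
end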
On Thu, 2015-03-12 at 02:17, Phil Tomson <[email protected]> wrote:
> I transformed it into a single-file testcase:
>
> #########################################################
> type NeuralLayer
> w::Matrix{Float32} # weights
> cm::Matrix{Float32} # connection matrix
> b::Vector{Float32} # biases
> scale::Vector{Float32} #
> a_func::Symbol # activation function
> hx::Vector{Float32} # input values
> pa::Vector{Float32} # pre activation values
> pr::Vector{Float32} # predictions (activation values)
> frozen::Bool
> end
>
> function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> nl.hx = x
> wx = nl.w * nl.hx
> nl.pa = nl.b+wx
> nl.pr = tanh(nl.pa).*nl.scale
> end
>
> out_dim = 10
> in_dim = 10
> b = sqrt(6) / sqrt(in_dim + out_dim)
>
> nl = NeuralLayer(
> float32(2.0b * rand(Float32,out_dim,in_dim) - b), #setup rand weights
> ones(Float32,out_dim,in_dim), #connection matrix
> float32(map(x->x*(randbool()?-1:1),rand(out_dim)*rand(1:4))), #biases
> rand(Float32,out_dim), # scale
> :tanh,
> rand(Float32,in_dim),
> rand(Float32,out_dim),
> rand(Float32,out_dim),
> false
> )
>
> x = ones(Float32,in_dim)
> forward_propagate(nl,x)
> clear_malloc_data()
> for i in 1:(1920*1080)
> forward_propagate(nl,x)
> end
> println("nl.pr is: $(nl.pr)")
> #############################################################################
>
> Now the interesting part of the .mem file looks like this:
>
> - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> 0 nl.hx = x
> 0 wx = nl.w * nl.hx
> 348368752 nl.pa = nl.b+wx
> 0 nl.pr = tanh(nl.pa).*nl.scale
> - end
>
> I split up the matrix multiply and the addition of bias vector into two
> separate lines and it looks like it's the vector addition that's allocating
> all of the memory (which seems surprising, but maybe I'm missing something).
>
> Phil