Thanks Mauro! I made the change and it works wonderfully.
I’m still a little confused why this change in particular makes such a big
difference. Is it the indexing with l1, l2 (they still have the constraint of
<: Int)? Or is it that v isn’t concrete and that gets propagated to everything
else through
diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
?
Lastly, are there tools I could use to check this sort of thing in the future?
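For instance, would something like the following catch it? I’m only guessing
that these are the right tools to reach for (model and solver are the same
objects as in the timing snippet below):

    isleaftype(eltype(model.covec))             # false if the element type isn't concrete?
    @code_warntype GloVe.train!(model, solver)  # look for Any / Union types in the output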

On Sun, May 24, 2015 at 2:38 PM, Mauro <[email protected]> wrote:
The problem is in:
type Model{T}
    W_main::Matrix{T}
    W_ctx::Matrix{T}
    b_main::Vector{T}
    b_ctx::Vector{T}
    W_main_grad::Matrix{T}
    W_ctx_grad::Matrix{T}
    b_main_grad::Vector{T}
    b_ctx_grad::Vector{T}
    covec::Vector{Cooccurence} # <- needs to be concrete type
end

Instead use

    covec::Vector{Cooccurence{Int, Int, T}}

or some more complicated parameterisation.
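To illustrate what I mean, here is a sketch. I'm guessing at your Cooccurence
definition (based on the i, j, v fields your loop uses), and ModelAbstract /
ModelConcrete are just made-up names for the comparison:

immutable Cooccurence{Ti<:Integer, Tj<:Integer, T}
    i::Ti
    j::Tj
    v::T
end

type ModelAbstract               # like your current Model: element type not concrete
    covec::Vector{Cooccurence}
end

type ModelConcrete               # covec[i].v is known to be Float64 at compile time
    covec::Vector{Cooccurence{Int, Int, Float64}}
end

With the abstract field the compiler cannot infer the type of m.covec[i].v, so
everything computed from it (diff, fdiff, the updates) goes through boxed values.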

Then, when testing timings you usually do one warm-up to exclude
compilation time:

GloVe.train!(model, solver) # warm up
@time 1 # @time needs a warm up too
@time GloVe.train!(model, solver)

Timings I get:

stock clone from github:
elapsed time: 0.001617218 seconds (2419024 bytes allocated)

with improvements mentioned above:
elapsed time: 0.001344645 seconds (2335552 bytes allocated)

with improvements mentioned above and your loop-version:
elapsed time: 0.00030488 seconds (3632 bytes allocated)

Hope that helps.

On Sun, 2015-05-24 at 19:21, Dominique Luna <[email protected]> wrote:
> Loop code
>
>
> # TODO: figure out memory issue
> function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
>     J = 0.0
>     shuffle!(m.covec)
>     vecsize = size(m.W_main, 1)
>     eltype = typeof(m.b_main[1])
>     vm = zeros(eltype, vecsize)
>     vc = zeros(eltype, vecsize)
>     grad_main = zeros(eltype, vecsize)
>     grad_ctx = zeros(eltype, vecsize)
>     for n = 1:s.niter
>         # shuffle indices
>         for i = 1:length(m.covec)
>             @inbounds l1 = m.covec[i].i # main index
>             @inbounds l2 = m.covec[i].j # context index
>             @inbounds v = m.covec[i].v
>             #= vm[:] = m.W_main[:, l1] =#
>             #= vc[:] = m.W_ctx[:, l2] =#
>             @inbounds for j = 1:vecsize
>                 vm[j] = m.W_main[j, l1]
>                 vc[j] = m.W_ctx[j, l2]
>             end
>             diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
>             fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
>             J += 0.5 * fdiff * diff
>             fdiff *= s.lrate
>             # inc memory by ~200 MB && running time by 2x
>             #= grad_main[:] = fdiff * m.W_ctx[:, l2] =#
>             #= grad_ctx[:] = fdiff * m.W_main[:, l1] =#
>             @inbounds for j = 1:vecsize
>                 grad_main[j] = fdiff * m.W_ctx[j, l2]
>                 grad_ctx[j] = fdiff * m.W_main[j, l1]
>             end
>             # Adaptive learning
>             # inc ~ 600MB + 0.75s
>             #= m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1]) =#
>             #= m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2]) =#
>             #= m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
>             #= m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
>             @inbounds for j = 1:vecsize
>                 m.W_main[j, l1] -= grad_main[j] / sqrt(m.W_main_grad[j, l1])
>                 m.W_ctx[j, l2] -= grad_ctx[j] / sqrt(m.W_ctx_grad[j, l2])
>             end
>             m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
>             m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
>             # Gradients
>             fdiff *= fdiff
>             #= m.W_main_grad[:, l1] += grad_main .^ 2 =#
>             #= m.W_ctx_grad[:, l2] += grad_ctx .^ 2 =#
>             #= m.b_main_grad[l1] += fdiff =#
>             #= m.b_ctx_grad[l2] += fdiff =#
>             @inbounds for j = 1:vecsize
>                 m.W_main_grad[j, l1] += grad_main[j] ^ 2
>                 m.W_ctx_grad[j, l2] += grad_ctx[j] ^ 2
>             end
>             m.b_main_grad[l1] += fdiff
>             m.b_ctx_grad[l2] += fdiff
>         end
>         #= if n % 10 == 0 =#
>         #=     println("iteration $n, cost $J") =#
>         #= end =#
>     end
> end
>
>
>
>
> And the respective timings
>
> @time GloVe.train!(model, GloVe.Adagrad(500))
> 7.097 seconds (96237 k allocations: 1468 MB, 7.01% gc time)
>
> Slower and more memory.
>
> On Sun, May 24, 2015 at 4:21 AM, Mauro <[email protected]> wrote:
>
> Loops should run without allocations. Can you post your loop-code?
>
> > A[i, :] = 0.5 * B[i, :]
>
> To state the obvious, as loop:
>
> for j = 1:size(A, 2)
>     A[i, j] = 0.5 * B[i, j]
> end
>
> this shouldn't allocate, if i is an integer. Unless A and B have
> different element types, in which case allocation might happen.
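>
> If you want to check that, here is a quick sketch (f!, A, B are just
> placeholders; after a warm-up call, @allocated should report zero for an
> allocation-free loop):
>
> function f!(A, B, i)
>     for j = 1:size(A, 2)
>         A[i, j] = 0.5 * B[i, j]
>     end
> end
> A = rand(100, 100); B = rand(100, 100)
> f!(A, B, 1)              # warm up so compilation isn't counted
> @allocated f!(A, B, 1)   # should be 0 bytes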
>
> On Sun, 2015-05-24 at 05:00, Dom Luna <[email protected]> wrote:
> > Reposting this from Gitter chat since it seems this is more active.
> >
> > I'm writing a GloVe module to learn Julia.
> >
> > How can I avoid memory allocations? My main function deals with a lot of
> > random indexing in Matrices.
> >
> > A[i, :] = 0.5 * B[i, :]
> >
> > In this case i isn't from a linear sequence. I'm not sure that matters.
> > Anyway, I’ve done some analysis and I know B[i, :] is the issue here since
> > it’s creating a copy.
> >
> > https://github.com/JuliaLang/julia/blob/master/base/array.jl#L309 makes the copy.
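> >
> > One thing I considered (just a sketch, not sure it’s the right approach) was
> > taking a view with sub instead of slicing, so only a small SubArray wrapper
> > is allocated rather than a copy of the whole row:
> >
> > Bi = sub(B, i, 1:size(B, 2))   # view into row i, no copy of the row data
> > for j = 1:size(A, 2)
> >     A[i, j] = 0.5 * Bi[j]
> > end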
> >
> >
> > I tried to do it via a loop but it looks like that doesn’t help either. In
> > fact, it seems to allocate slightly more memory, which seems really odd.
> >
> > Here’s some of the code. It’s a little messy since I’m commenting out
> > different approaches I’m trying.
> >
> > type Model{T}
> >     W_main::Matrix{T}
> >     W_ctx::Matrix{T}
> >     b_main::Vector{T}
> >     b_ctx::Vector{T}
> >     W_main_grad::Matrix{T}
> >     W_ctx_grad::Matrix{T}
> >     b_main_grad::Vector{T}
> >     b_ctx_grad::Vector{T}
> >     covec::Vector{Cooccurence}
> > end
> >
> > # Each vocab word is associated with a main vector and a context vector.
> > # The paper initializes them to values in [-0.5, 0.5] / (vecsize + 1) and
> > # the gradients to 1.0.
> > #
> > # The +1 term is for the bias.
> > function Model(comatrix; vecsize=100)
> >     vs = size(comatrix, 1)
> >     Model(
> >         (rand(vecsize, vs) - 0.5) / (vecsize + 1),
> >         (rand(vecsize, vs) - 0.5) / (vecsize + 1),
> >         (rand(vs) - 0.5) / (vecsize + 1),
> >         (rand(vs) - 0.5) / (vecsize + 1),
> >         ones(vecsize, vs),
> >         ones(vecsize, vs),
> >         ones(vs),
> >         ones(vs),
> >         CoVector(comatrix), # not required in 0.4
> >     )
> > end
> >
> > # TODO: figure out memory issue
> > # the memory comments are from 500 loop test with vecsize=100
> > function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
> >     J = 0.0
> >     shuffle!(m.covec)
> >
> >     vecsize = size(m.W_main, 1)
> >     eltype = typeof(m.b_main[1])
> >     vm = zeros(eltype, vecsize)
> >     vc = zeros(eltype, vecsize)
> >     grad_main = zeros(eltype, vecsize)
> >     grad_ctx = zeros(eltype, vecsize)
> >
> >     for n = 1:s.niter
> >         # shuffle indices
> >         for i = 1:length(m.covec)
> >             @inbounds l1 = m.covec[i].i # main index
> >             @inbounds l2 = m.covec[i].j # context index
> >             @inbounds v = m.covec[i].v
> >
> >             vm[:] = m.W_main[:, l1]
> >             vc[:] = m.W_ctx[:, l2]
> >
> >             diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
> >             fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
> >             J += 0.5 * fdiff * diff
> >
> >             fdiff *= s.lrate
> >             # inc memory by ~200 MB && running time by 2x
> >             grad_main[:] = fdiff * m.W_ctx[:, l2]
> >             grad_ctx[:] = fdiff * m.W_main[:, l1]
> >
> >             # Adaptive learning
> >             # inc ~ 600MB + 0.75s
> >             #= @inbounds for ii = 1:vecsize =#
> >             #=     m.W_main[ii, l1] -= grad_main[ii] / sqrt(m.W_main_grad[ii, l1]) =#
> >             #=     m.W_ctx[ii, l2] -= grad_ctx[ii] / sqrt(m.W_ctx_grad[ii, l2]) =#
> >             #=     m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
> >             #=     m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
> >             #= end =#
> >
> >             m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1])
> >             m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2])
> >             m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
> >             m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
> >
> >             # Gradients
> >             fdiff *= fdiff
> >             m.W_main_grad[:, l1] += grad_main .^ 2
> >             m.W_ctx_grad[:, l2] += grad_ctx .^ 2
> >             m.b_main_grad[l1] += fdiff
> >             m.b_ctx_grad[l2] += fdiff
> >         end
> >
> >         #= if n % 10 == 0 =#
> >         #=     println("iteration $n, cost $J") =#
> >         #= end =#
> >     end
> > end
> >
> >
> > Here’s the entire repo, which might be helpful:
> > https://github.com/domluna/GloVe.jl
> >
> > I tried doing some loops but they allocate more memory (oddly enough) and
> > get slower.
> >
> > You’ll notice the word vectors are indexed by column; I changed the
> > representation to that to see if it would make a difference during the
> > loop. It didn’t seem to.
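> >
> > (My reasoning was that Julia is column-major, so the first index varies
> > fastest in memory and a column slice like W_main[:, l1] is contiguous.
> > A toy illustration, with made-up sizes:)
> >
> > W = rand(100, 19)                  # vecsize x vocab, like the model here
> > function colwise_sum(W)
> >     s = 0.0
> >     for k = 1:size(W, 2)           # word index
> >         for j = 1:size(W, 1)       # vector components: contiguous in memory
> >             s += W[j, k]
> >         end
> >     end
> >     return s
> > end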
> >
> > The memory analysis showed
> >
> > Julia Version 0.4.0-dev+4893
> > Commit eb5da26* (2015-05-19 11:51 UTC)
> > Platform Info:
> > System: Darwin (x86_64-apple-darwin14.4.0)
> > CPU: Intel(R) Core(TM) i5-2557M CPU @ 1.70GHz
> > WORD_SIZE: 64
> > BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > LAPACK: libopenblas
> > LIBM: libopenlibm
> > LLVM: libLLVM-3.3
> >
> > Here the model consists of 100x19 matrices and 100-element vectors: 19 words
> > in the vocab, 100-element word vectors.
> >
> > @time GloVe.train!(model, GloVe.Adagrad(500))
> > 1.990 seconds (6383 k allocations: 1162 MB, 10.82% gc time)
> >
> > 0.3 is a bit slower due to worse gc, but same memory.
> >
> > Any help would be greatly appreciated!
>
>
>
>
> cheers,
>
> dom
>
>





cheers,

dom



