Thanks Mauro! I made the change and it works wonderfully. I'm still a little confused about why this change in particular makes such a big difference. Is it when using l1, l2 to index (they still have the constraint of <: Int)? Or is it that v isn't concrete, and that gets propagated to everything else through diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)? Lastly, are there tools I could use to check this sort of thing in the future? (To make the question concrete: right below is a small toy sketch of the kind of check I have in mind, and at the very bottom, under the quoted thread, a standalone loop-vs-vectorised timing comparison.)
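Here is the toy I mean. Everything in it (Cooc, LooseModel, TightModel, sumv) is a made-up stand-in for the real Cooccurence/Model types, not code from the package, and I'm assuming @code_warntype is available on a recent 0.4-dev build:

# Toy mock-up: a parametric element type (mirroring the <: Int constraint)
# and two container fields, one abstractly and one concretely parameterised.
immutable Cooc{Ti<:Int, Tj<:Int, T}
    i::Ti
    j::Tj
    v::T
end

type LooseModel
    covec::Vector{Cooc}                     # element type left abstract
end

type TightModel
    covec::Vector{Cooc{Int, Int, Float64}}  # concrete, as suggested
end

# Same body for both models: sum up the v fields.
function sumv(m)
    s = 0.0
    for k = 1:length(m.covec)
        s += m.covec[k].v
    end
    s
end

data  = [Cooc(k, k, rand()) for k = 1:10]
loose = LooseModel(data)   # converted to Vector{Cooc} by the constructor
tight = TightModel(data)

Base.isleaftype(eltype(loose.covec))  # false -> v's type can't be inferred
Base.isleaftype(eltype(tight.covec))  # true

# These should show s and v inferred as Any in the loose case only:
@code_warntype sumv(loose)
@code_warntype sumv(tight)

Is @code_warntype / isleaftype the kind of thing you'd reach for here, or is there something better?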
On Sun, May 24, 2015 at 2:38 PM, Mauro <[email protected]> wrote:

The problem is in:

type Model{T}
    W_main::Matrix{T}
    W_ctx::Matrix{T}
    b_main::Vector{T}
    b_ctx::Vector{T}
    W_main_grad::Matrix{T}
    W_ctx_grad::Matrix{T}
    b_main_grad::Vector{T}
    b_ctx_grad::Vector{T}
    covec::Vector{Cooccurence}  # <- needs to be concrete type
end

Instead use covec::Vector{Cooccurence{Int, Int, T}} or some more complicated parameterisation.

Then, when testing timings you usually do one warm-up to exclude compilation time:

GloVe.train!(model, solver)  # warm up
@time 1                      # @time needs a warm up too
@time GloVe.train!(model, solver)

Timings I get:

stock clone from github:
elapsed time: 0.001617218 seconds (2419024 bytes allocated)

with improvements mentioned above:
elapsed time: 0.001344645 seconds (2335552 bytes allocated)

with improvements mentioned above and your loop-version:
elapsed time: 0.00030488 seconds (3632 bytes allocated)

Hope that helps.

On Sun, 2015-05-24 at 19:21, Dominique Luna <[email protected]> wrote:
> Loop code
>
> # TODO: figure out memory issue
> function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
>     J = 0.0
>     shuffle!(m.covec)
>     vecsize = size(m.W_main, 1)
>     eltype = typeof(m.b_main[1])
>     vm = zeros(eltype, vecsize)
>     vc = zeros(eltype, vecsize)
>     grad_main = zeros(eltype, vecsize)
>     grad_ctx = zeros(eltype, vecsize)
>     for n = 1:s.niter
>         # shuffle indices
>         for i = 1:length(m.covec)
>             @inbounds l1 = m.covec[i].i  # main index
>             @inbounds l2 = m.covec[i].j  # context index
>             @inbounds v = m.covec[i].v
>             #= vm[:] = m.W_main[:, l1] =#
>             #= vc[:] = m.W_ctx[:, l2] =#
>             @inbounds for j = 1:vecsize
>                 vm[j] = m.W_main[j, l1]
>                 vc[j] = m.W_ctx[j, l2]
>             end
>             diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
>             fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
>             J += 0.5 * fdiff * diff
>             fdiff *= s.lrate
>             # inc memory by ~200 MB && running time by 2x
>             #= grad_main[:] = fdiff * m.W_ctx[:, l2] =#
>             #= grad_ctx[:] = fdiff * m.W_main[:, l1] =#
>             @inbounds for j = 1:vecsize
>                 grad_main[j] = fdiff * m.W_ctx[j, l2]
>                 grad_ctx[j] = fdiff * m.W_main[j, l1]
>             end
>             # Adaptive learning
>             # inc ~ 600MB + 0.75s
>             #= m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1]) =#
>             #= m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2]) =#
>             #= m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
>             #= m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
>             @inbounds for j = 1:vecsize
>                 m.W_main[j, l1] -= grad_main[j] / sqrt(m.W_main_grad[j, l1])
>                 m.W_ctx[j, l2] -= grad_ctx[j] / sqrt(m.W_ctx_grad[j, l2])
>             end
>             m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
>             m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
>             # Gradients
>             fdiff *= fdiff
>             #= m.W_main_grad[:, l1] += grad_main .^ 2 =#
>             #= m.W_ctx_grad[:, l2] += grad_ctx .^ 2 =#
>             #= m.b_main_grad[l1] += fdiff =#
>             #= m.b_ctx_grad[l2] += fdiff =#
>             @inbounds for j = 1:vecsize
>                 m.W_main_grad[j, l1] += grad_main[j] ^ 2
>                 m.W_ctx_grad[j, l2] += grad_ctx[j] ^ 2
>             end
>             m.b_main_grad[l1] += fdiff
>             m.b_ctx_grad[l2] += fdiff
>         end
>         #= if n % 10 == 0 =#
>         #=     println("iteration $n, cost $J") =#
>         #= end =#
>     end
> end
>
> And the respective timings
>
> @time GloVe.train!(model, GloVe.Adagrad(500))
> 7.097 seconds (96237 k allocations: 1468 MB, 7.01% gc time)
>
> Slower and more memory.
>
> On Sun, May 24, 2015 at 4:21 AM, Mauro <[email protected]> wrote:
> > Loops should run without allocations. Can you post your loop-code?
> >
> > > A[i, :] = 0.5 * B[i, :]
> >
> > To state the obvious, as a loop:
> >
> > for j = 1:size(A, 2)
> >     A[i, j] = 0.5 * B[i, j]
> > end
> >
> > this shouldn't allocate, if i is an integer. Unless A and B have
> > different types, then allocation might happen.
> >
> > On Sun, 2015-05-24 at 05:00, Dom Luna <[email protected]> wrote:
> > > Reposting this from Gitter chat since it seems this is more active.
> > >
> > > I'm writing a GloVe module to learn Julia.
> > >
> > > How can I avoid memory allocations? My main function deals with a lot
> > > of random indexing into matrices.
> > >
> > > A[i, :] = 0.5 * B[i, :]
> > >
> > > In this case i isn't from a linear sequence. I'm not sure that matters.
> > > Anyway, I've done some analysis and I know B[i, :] is the issue here,
> > > since it creates a copy.
> > >
> > > https://github.com/JuliaLang/julia/blob/master/base/array.jl#L309 makes the copy.
> > >
> > > I tried to do it via a loop but it looks like that doesn't help either.
> > > In fact, it seems to allocate slightly more memory, which seems really odd.
> > >
> > > Here's some of the code. It's a little messy since I'm commenting out
> > > the different approaches I'm trying.
> > >
> > > type Model{T}
> > >     W_main::Matrix{T}
> > >     W_ctx::Matrix{T}
> > >     b_main::Vector{T}
> > >     b_ctx::Vector{T}
> > >     W_main_grad::Matrix{T}
> > >     W_ctx_grad::Matrix{T}
> > >     b_main_grad::Vector{T}
> > >     b_ctx_grad::Vector{T}
> > >     covec::Vector{Cooccurence}
> > > end
> > >
> > > # Each vocab word is associated with a main vector and a context vector.
> > > # The paper initializes them to values in [-0.5, 0.5] / (vecsize + 1) and
> > > # the gradients to 1.0.
> > > #
> > > # The +1 term is for the bias.
> > > function Model(comatrix; vecsize=100)
> > >     vs = size(comatrix, 1)
> > >     Model(
> > >         (rand(vecsize, vs) - 0.5) / (vecsize + 1),
> > >         (rand(vecsize, vs) - 0.5) / (vecsize + 1),
> > >         (rand(vs) - 0.5) / (vecsize + 1),
> > >         (rand(vs) - 0.5) / (vecsize + 1),
> > >         ones(vecsize, vs),
> > >         ones(vecsize, vs),
> > >         ones(vs),
> > >         ones(vs),
> > >         CoVector(comatrix),  # not required in 0.4
> > >     )
> > > end
> > >
> > > # TODO: figure out memory issue
> > > # the memory comments are from a 500-loop test with vecsize=100
> > > function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
> > >     J = 0.0
> > >     shuffle!(m.covec)
> > >
> > >     vecsize = size(m.W_main, 1)
> > >     eltype = typeof(m.b_main[1])
> > >     vm = zeros(eltype, vecsize)
> > >     vc = zeros(eltype, vecsize)
> > >     grad_main = zeros(eltype, vecsize)
> > >     grad_ctx = zeros(eltype, vecsize)
> > >
> > >     for n = 1:s.niter
> > >         # shuffle indices
> > >         for i = 1:length(m.covec)
> > >             @inbounds l1 = m.covec[i].i  # main index
> > >             @inbounds l2 = m.covec[i].j  # context index
> > >             @inbounds v = m.covec[i].v
> > >
> > >             vm[:] = m.W_main[:, l1]
> > >             vc[:] = m.W_ctx[:, l2]
> > >
> > >             diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
> > >             fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
> > >             J += 0.5 * fdiff * diff
> > >
> > >             fdiff *= s.lrate
> > >             # inc memory by ~200 MB && running time by 2x
> > >             grad_main[:] = fdiff * m.W_ctx[:, l2]
> > >             grad_ctx[:] = fdiff * m.W_main[:, l1]
> > >
> > >             # Adaptive learning
> > >             # inc ~ 600MB + 0.75s
> > >             #= @inbounds for ii = 1:vecsize =#
> > >             #=     m.W_main[ii, l1] -= grad_main[ii] / sqrt(m.W_main_grad[ii, l1]) =#
> > >             #=     m.W_ctx[ii, l2] -= grad_ctx[ii] / sqrt(m.W_ctx_grad[ii, l2]) =#
> > >             #=     m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
> > >             #=     m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
> > >             #= end =#
> > >
> > >             m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1])
> > >             m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2])
> > >             m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
> > >             m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
> > >
> > >             # Gradients
> > >             fdiff *= fdiff
> > >             m.W_main_grad[:, l1] += grad_main .^ 2
> > >             m.W_ctx_grad[:, l2] += grad_ctx .^ 2
> > >             m.b_main_grad[l1] += fdiff
> > >             m.b_ctx_grad[l2] += fdiff
> > >         end
> > >
> > >         #= if n % 10 == 0 =#
> > >         #=     println("iteration $n, cost $J") =#
> > >         #= end =#
> > >     end
> > > end
> > >
> > > Here's the entire repo: https://github.com/domluna/GloVe.jl . It might be helpful.
> > >
> > > I tried doing some loops but it allocates more memory (oddly enough) and gets slower.
> > >
> > > You'll notice the word vectors are indexed by column; I changed the
> > > representation to that to see if it would make a difference during the
> > > loop. It didn't seem to.
> > >
> > > The memory analysis showed:
> > >
> > > Julia Version 0.4.0-dev+4893
> > > Commit eb5da26* (2015-05-19 11:51 UTC)
> > > Platform Info:
> > >   System: Darwin (x86_64-apple-darwin14.4.0)
> > >   CPU: Intel(R) Core(TM) i5-2557M CPU @ 1.70GHz
> > >   WORD_SIZE: 64
> > >   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > >   LAPACK: libopenblas
> > >   LIBM: libopenlibm
> > >   LLVM: libLLVM-3.3
> > >
> > > Here the model consists of 100x19 matrices and 100-element vectors: 19
> > > words in the vocab, 100-element word vectors.
> > >
> > > @time GloVe.train!(model, GloVe.Adagrad(500))
> > > 1.990 seconds (6383 k allocations: 1162 MB, 10.82% gc time)
> > >
> > > 0.3 is a bit slower due to worse gc, but same memory.
> > >
> > > Any help would be greatly appreciated!
> > >
> > > cheers,
> > > dom

cheers,
dom
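P.S. For completeness, here is the stripped-down loop-vs-vectorised comparison mentioned at the top. It is a standalone toy with made-up sizes (100x19, matching the model dimensions in the thread), not the package code; update_vec!, update_loop! and the run_* helpers are names I invented, and the vectorised sqrt on an array is 0.4-era style to match the code above. It should show the allocation difference, with the warm-up done as Mauro suggests:

# One Adagrad-style column update, written two ways.
# Vectorised: every slice on the right-hand side allocates a temporary array.
function update_vec!(W, G, g, j)
    W[:, j] -= g ./ sqrt(G[:, j])
    return W
end

# Explicit loop: same arithmetic, no temporaries.
function update_loop!(W, G, g, j)
    @inbounds for k = 1:size(W, 1)
        W[k, j] -= g[k] / sqrt(G[k, j])
    end
    return W
end

# Apply the update to every column, once per timing run.
function run_vec!(W, G, g)
    for j = 1:size(W, 2)
        update_vec!(W, G, g, j)
    end
end

function run_loop!(W, G, g)
    for j = 1:size(W, 2)
        update_loop!(W, G, g, j)
    end
end

W = rand(100, 19); G = ones(100, 19); g = 0.01 * rand(100)

run_vec!(W, G, g); run_loop!(W, G, g); @time 1   # warm-ups
@time run_vec!(W, G, g)    # allocates several temporaries per column
@time run_loop!(W, G, g)   # should report (close to) zero allocation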
