> Thanks Mauro! I made the change and it works wonderfully.
>
> I'm still a little confused why this change in particular makes such a big
> difference. Is it when using l1, l2 to index (they still have the constraint
> of <: Int)? Or is it that v isn't concrete and that gets propagated to
> everything else through
>
>     diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
>
> ?
The manual section David mentioned covers this.

> Lastly, are there tools I could use to check this sort of thing in the future?

@code_warntype can help.

> On Sun, May 24, 2015 at 2:38 PM, Mauro <[email protected]> wrote:
>
> The problem is in:
>
>     type Model{T}
>         W_main::Matrix{T}
>         W_ctx::Matrix{T}
>         b_main::Vector{T}
>         b_ctx::Vector{T}
>         W_main_grad::Matrix{T}
>         W_ctx_grad::Matrix{T}
>         b_main_grad::Vector{T}
>         b_ctx_grad::Vector{T}
>         covec::Vector{Cooccurence}  # <- needs to be a concrete type
>     end
>
> Instead use
>
>     covec::Vector{Cooccurence{Int, Int, T}}
>
> or some more complicated parameterisation.
>
> Then, when testing timings you usually do one warm-up run to exclude
> compilation time:
>
>     GloVe.train!(model, solver)  # warm up
>     @time 1                      # @time needs a warm-up too
>     @time GloVe.train!(model, solver)
>
> Timings I get:
>
> stock clone from github:
> elapsed time: 0.001617218 seconds (2419024 bytes allocated)
>
> with the improvements mentioned above:
> elapsed time: 0.001344645 seconds (2335552 bytes allocated)
>
> with the improvements mentioned above and your loop-version:
> elapsed time: 0.00030488 seconds (3632 bytes allocated)
>
> Hope that helps.
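To see concretely what Mauro is describing, here is a minimal, self-contained sketch (0.4-era syntax, to match the thread). The Cooccurence definition and the two wrapper types below are assumptions for illustration, not the actual GloVe.jl source; only the field names i, j, v come from the code above. With an abstractly typed field like covec::Vector{Cooccurence}, the compiler cannot infer the type of m.covec[i].v, so v comes back as Any and that uncertainty propagates into diff, fdiff, J and every update that uses them. Parameterising the field fixes the inference, and @code_warntype makes the difference visible:

    # Julia 0.4-era syntax; the type names here are illustrative, not the GloVe.jl source.
    immutable Cooccurence{Ti<:Int, Tj<:Int, T}
        i::Ti
        j::Tj
        v::T
    end

    type AbstractCovec               # abstractly typed field: element type unknown to the compiler
        covec::Vector{Cooccurence}
    end

    type ConcreteCovec{T}            # concretely typed field: field accesses fully inferred
        covec::Vector{Cooccurence{Int, Int, T}}
    end

    getv(m, i) = m.covec[i].v

    covec = [Cooccurence(1, 2, 0.5), Cooccurence(3, 4, 1.5)]

    @code_warntype getv(AbstractCovec(covec), 1)           # return type shows up as Any
    @code_warntype getv(ConcreteCovec{Float64}(covec), 1)  # return type is Float64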
> On Sun, 2015-05-24 at 19:21, Dominique Luna <[email protected]> wrote:
>>
>> Loop code:
>>
>>     # TODO: figure out memory issue
>>     function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
>>         J = 0.0
>>         shuffle!(m.covec)
>>         vecsize = size(m.W_main, 1)
>>         eltype = typeof(m.b_main[1])
>>         vm = zeros(eltype, vecsize)
>>         vc = zeros(eltype, vecsize)
>>         grad_main = zeros(eltype, vecsize)
>>         grad_ctx = zeros(eltype, vecsize)
>>         for n = 1:s.niter
>>             # shuffle indices
>>             for i = 1:length(m.covec)
>>                 @inbounds l1 = m.covec[i].i  # main index
>>                 @inbounds l2 = m.covec[i].j  # context index
>>                 @inbounds v = m.covec[i].v
>>                 #= vm[:] = m.W_main[:, l1] =#
>>                 #= vc[:] = m.W_ctx[:, l2] =#
>>                 @inbounds for j = 1:vecsize
>>                     vm[j] = m.W_main[j, l1]
>>                     vc[j] = m.W_ctx[j, l2]
>>                 end
>>                 diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
>>                 fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
>>                 J += 0.5 * fdiff * diff
>>                 fdiff *= s.lrate
>>                 # inc memory by ~200 MB && running time by 2x
>>                 #= grad_main[:] = fdiff * m.W_ctx[:, l2] =#
>>                 #= grad_ctx[:] = fdiff * m.W_main[:, l1] =#
>>                 @inbounds for j = 1:vecsize
>>                     grad_main[j] = fdiff * m.W_ctx[j, l2]
>>                     grad_ctx[j] = fdiff * m.W_main[j, l1]
>>                 end
>>                 # Adaptive learning
>>                 # inc ~ 600MB + 0.75s
>>                 #= m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1]) =#
>>                 #= m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2]) =#
>>                 #= m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
>>                 #= m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
>>                 @inbounds for j = 1:vecsize
>>                     m.W_main[j, l1] -= grad_main[j] / sqrt(m.W_main_grad[j, l1])
>>                     m.W_ctx[j, l2] -= grad_ctx[j] / sqrt(m.W_ctx_grad[j, l2])
>>                 end
>>                 m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
>>                 m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
>>                 # Gradients
>>                 fdiff *= fdiff
>>                 #= m.W_main_grad[:, l1] += grad_main .^ 2 =#
>>                 #= m.W_ctx_grad[:, l2] += grad_ctx .^ 2 =#
>>                 #= m.b_main_grad[l1] += fdiff =#
>>                 #= m.b_ctx_grad[l2] += fdiff =#
>>                 @inbounds for j = 1:vecsize
>>                     m.W_main_grad[j, l1] += grad_main[j] ^ 2
>>                     m.W_ctx_grad[j, l2] += grad_ctx[j] ^ 2
>>                 end
>>                 m.b_main_grad[l1] += fdiff
>>                 m.b_ctx_grad[l2] += fdiff
>>             end
>>             #= if n % 10 == 0 =#
>>             #=     println("iteration $n, cost $J") =#
>>             #= end =#
>>         end
>>     end
>>
>> And the respective timings:
>>
>>     @time GloVe.train!(model, GloVe.Adagrad(500))
>>     7.097 seconds (96237 k allocations: 1468 MB, 7.01% gc time)
>>
>> Slower and more memory.
>>
>> On Sun, May 24, 2015 at 4:21 AM, Mauro <[email protected]> wrote:
>>
>> Loops should run without allocations. Can you post your loop-code?
>>
>> > A[i, :] = 0.5 * B[i, :]
>>
>> To state the obvious, as a loop:
>>
>>     for j = 1:size(A, 2)
>>         A[i, j] = 0.5 * B[i, j]
>>     end
>>
>> This shouldn't allocate if i is an integer, unless A and B have different
>> element types, in which case allocation might happen.
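Mauro's claim is easy to check directly. The snippet below is an illustrative sketch (the function names and array sizes are made up, not from the repo): it compares the slice assignment against the equivalent explicit loop with @allocated, after one warm-up call so compilation is not counted. On 0.4-era Julia the slice version allocates a temporary for B[i, :] and another for 0.5 * ..., while the loop touches only scalars:

    # Illustrative check of "loops should not allocate"; names are made up for the example.
    function scale_row_slice!(A, B, i)
        A[i, :] = 0.5 * B[i, :]      # B[i, :] copies the row, 0.5 * ... allocates again
        return A
    end

    function scale_row_loop!(A, B, i)
        for j = 1:size(A, 2)
            A[i, j] = 0.5 * B[i, j]  # scalar operations only, no temporaries
        end
        return A
    end

    A = zeros(3, 1000); B = rand(3, 1000); i = 2

    scale_row_slice!(A, B, i); scale_row_loop!(A, B, i)   # warm up (compile) first
    println(@allocated scale_row_slice!(A, B, i))         # expect thousands of bytes
    println(@allocated scale_row_loop!(A, B, i))          # expect 0

If the loop version still allocates, that usually points at a type-inference problem in the surrounding code (as with the abstract covec field above) rather than at the loop itself.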
>> On Sun, 2015-05-24 at 05:00, Dom Luna <[email protected]> wrote:
>> > Reposting this from Gitter chat since it seems this is more active.
>> >
>> > I'm writing a GloVe module to learn Julia.
>> >
>> > How can I avoid memory allocations? My main function deals with a lot of
>> > random indexing into Matrices.
>> >
>> >     A[i, :] = 0.5 * B[i, :]
>> >
>> > In this case i isn't from a linear sequence. I'm not sure that matters.
>> > Anyway, I've done some analysis and I know B[i, :] is the issue here,
>> > since it's creating a copy.
>> >
>> > https://github.com/JuliaLang/julia/blob/master/base/array.jl#L309 makes the copy.
>> >
>> > I tried to do it via a loop but it looks like that doesn't help either.
>> > In fact, it seems to allocate slightly more memory, which seems really odd.
>> >
>> > Here's some of the code; it's a little messy since I'm commenting out
>> > different approaches I'm trying.
>> >
>> >     type Model{T}
>> >         W_main::Matrix{T}
>> >         W_ctx::Matrix{T}
>> >         b_main::Vector{T}
>> >         b_ctx::Vector{T}
>> >         W_main_grad::Matrix{T}
>> >         W_ctx_grad::Matrix{T}
>> >         b_main_grad::Vector{T}
>> >         b_ctx_grad::Vector{T}
>> >         covec::Vector{Cooccurence}
>> >     end
>> >
>> >     # Each vocab word is associated with a main vector and a context vector.
>> >     # The paper initializes them to values in [-0.5, 0.5] / (vecsize + 1) and
>> >     # the gradients to 1.0.
>> >     #
>> >     # The +1 term is for the bias.
>> >     function Model(comatrix; vecsize=100)
>> >         vs = size(comatrix, 1)
>> >         Model(
>> >             (rand(vecsize, vs) - 0.5) / (vecsize + 1),
>> >             (rand(vecsize, vs) - 0.5) / (vecsize + 1),
>> >             (rand(vs) - 0.5) / (vecsize + 1),
>> >             (rand(vs) - 0.5) / (vecsize + 1),
>> >             ones(vecsize, vs),
>> >             ones(vecsize, vs),
>> >             ones(vs),
>> >             ones(vs),
>> >             CoVector(comatrix),  # not required in 0.4
>> >         )
>> >     end
>> >
>> >     # TODO: figure out memory issue
>> >     # the memory comments are from a 500-iteration test with vecsize=100
>> >     function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
>> >         J = 0.0
>> >         shuffle!(m.covec)
>> >
>> >         vecsize = size(m.W_main, 1)
>> >         eltype = typeof(m.b_main[1])
>> >         vm = zeros(eltype, vecsize)
>> >         vc = zeros(eltype, vecsize)
>> >         grad_main = zeros(eltype, vecsize)
>> >         grad_ctx = zeros(eltype, vecsize)
>> >
>> >         for n = 1:s.niter
>> >             # shuffle indices
>> >             for i = 1:length(m.covec)
>> >                 @inbounds l1 = m.covec[i].i  # main index
>> >                 @inbounds l2 = m.covec[i].j  # context index
>> >                 @inbounds v = m.covec[i].v
>> >
>> >                 vm[:] = m.W_main[:, l1]
>> >                 vc[:] = m.W_ctx[:, l2]
>> >
>> >                 diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
>> >                 fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
>> >                 J += 0.5 * fdiff * diff
>> >
>> >                 fdiff *= s.lrate
>> >                 # inc memory by ~200 MB && running time by 2x
>> >                 grad_main[:] = fdiff * m.W_ctx[:, l2]
>> >                 grad_ctx[:] = fdiff * m.W_main[:, l1]
>> >
>> >                 # Adaptive learning
>> >                 # inc ~ 600MB + 0.75s
>> >                 #= @inbounds for ii = 1:vecsize =#
>> >                 #=     m.W_main[ii, l1] -= grad_main[ii] / sqrt(m.W_main_grad[ii, l1]) =#
>> >                 #=     m.W_ctx[ii, l2] -= grad_ctx[ii] / sqrt(m.W_ctx_grad[ii, l2]) =#
>> >                 #=     m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
>> >                 #=     m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
>> >                 #= end =#
>> >
>> >                 m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1])
>> >                 m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2])
>> >                 m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
>> >                 m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
>> >
>> >                 # Gradients
>> >                 fdiff *= fdiff
>> >                 m.W_main_grad[:, l1] += grad_main .^ 2
>> >                 m.W_ctx_grad[:, l2] += grad_ctx .^ 2
>> >                 m.b_main_grad[l1] += fdiff
>> >                 m.b_ctx_grad[l2] += fdiff
>> >             end
>> >
>> >             #= if n % 10 == 0 =#
>> >             #=     println("iteration $n, cost $J") =#
>> >             #= end =#
>> >         end
>> >     end
>> >
>> > Here's the entire repo: https://github.com/domluna/GloVe.jl. It might be helpful.
>> >
>> > I tried doing some loops but it allocates more memory (oddly enough) and gets slower.
>> >
>> > You'll notice the word vectors are indexed by column; I changed the
>> > representation to that to see if it would make a difference during the
>> > loop. It didn't seem to.
>> >
>> > The memory analysis was run on
>> >
>> >     Julia Version 0.4.0-dev+4893
>> >     Commit eb5da26* (2015-05-19 11:51 UTC)
>> >     Platform Info:
>> >       System: Darwin (x86_64-apple-darwin14.4.0)
>> >       CPU: Intel(R) Core(TM) i5-2557M CPU @ 1.70GHz
>> >       WORD_SIZE: 64
>> >       BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
>> >       LAPACK: libopenblas
>> >       LIBM: libopenlibm
>> >       LLVM: libLLVM-3.3
>> >
>> > Here the model consists of 100x19 Matrices and 100-element vectors: 19 words
>> > in the vocab, 100-element word vectors.
>> >
>> >     @time GloVe.train!(model, GloVe.Adagrad(500))
>> >     1.990 seconds (6383 k allocations: 1162 MB, 10.82% gc time)
>> >
>> > 0.3 is a bit slower due to worse gc but same memory.
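For pinpointing exactly which lines allocate, rather than reading it off @time totals, Julia's built-in allocation tracker is useful. The session below is a sketch, not output from the repo: it assumes a co-occurrence matrix comatrix is already in scope and that the package loads as GloVe. The --track-allocation=user flag and Profile.clear_malloc_data() are standard tooling of this era (on current Julia you would need using Profile first; if clear_malloc_data is unavailable on your build, just do the measurement run in a fresh session):

    # Start Julia with per-line allocation tracking for user code:
    #
    #     $ julia --track-allocation=user
    #
    using GloVe

    model  = GloVe.Model(comatrix)    # `comatrix` assumed to be in scope
    solver = GloVe.Adagrad(500)

    GloVe.train!(model, solver)       # first run compiles; ignore its allocations
    Profile.clear_malloc_data()       # reset the counters
    GloVe.train!(model, solver)       # only this run is attributed to source lines
    # On exit, Julia writes a .mem file next to each source file (e.g. src/GloVe.jl.mem)
    # with a byte count in front of every line, so slice copies like B[i, :] stand out.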
>> > Any help would be greatly appreciated!
>>
>> cheers,
>> dom
>
> cheers,
> dom
