I passed "--check-bounds=no" to julia when launching the REPL to avoid writing @inbounds explicitly.
Simply adding @simd before the for loops doesn't seem to help; it actually slows the code down slightly.

@nexprs is exactly what I need to avoid redundant code when unrolling loops. I will use it to simplify the code. I will look at KernelTools.jl later.

At this point, my Julia code runs roughly 3.5x slower than the C code, and I'm pleased with that. I'd like to move on to adding the features I wanted in the first place. :-)

Thanks,
Yang Zhixuan

On Saturday, February 21, 2015 at 10:49:57 PM UTC+8, Tim Holy wrote:
>
> Just to check, in writing out your own version of gemv! you're using
> @inbounds @simd, right?
>
> The @nexprs macro (documented in the Base.Cartesian section of the manual)
> lets you unroll loops manually. Also, see the (currently alpha) KernelTools.jl
> repository for some ideas about improving cache efficiency---perhaps the @tile
> macro will help.
>
> --Tim
>
> On Saturday, February 21, 2015 01:17:24 PM Mauro wrote:
> > > After all, the C code was optimized intensively at the cache level and
> > > all loops were unrolled.
> >
> > Julia is good at unrolling loops using macros.
> >
> > > Maybe I should try to use more computers. Currently my code is parallelized
> > > by using pmap(). I hope the communication overhead will not be a new
> > > bottleneck if I run on a local network cluster.
> > >
> > > Thanks for your help!
> > >
> > > Regards, Yang Zhixuan
> > >
> > > On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
> > >
> > >> So, where is the performance now compared to the C program? I don't think
> > >> MKL will give you much if you were on the order of 100x slower to start
> > >> with.
> > >>
> > >> -viral
> > >>
> > >> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
> > >>> Mauro, Sean, and Tim, thanks for your help.
> > >>>
> > >>> Following your suggestions, I removed keyword arguments and split the
> > >>> function to avoid conditional statements. These helped a bit.
> > >>>
> > >>> But I got a surprising result after replacing the BLAS functions with simple
> > >>> for loops: the for loops are about 1.5x faster than the BLAS calls. My Julia is
> > >>> compiled on my computer with the default configuration (the versioninfo()
> > >>> output is listed below). Do you think it would help to compile Julia with a
> > >>> faster and more optimized BLAS implementation such as Intel's MKL?
> > >>>
> > >>> Julia Version 0.3.6-pre+70
> > >>> Commit 638fa02 (2015-02-12 13:59 UTC)
> > >>>
> > >>> Platform Info:
> > >>>   System: Darwin (x86_64-apple-darwin14.1.0)
> > >>>   CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
> > >>>   WORD_SIZE: 64
> > >>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
> > >>>   LAPACK: libopenblas
> > >>>   LIBM: libopenlibm
> > >>>   LLVM: libLLVM-3.3
> > >>>
> > >>> Regards, Yang Zhixuan
> > >>>
> > >>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
> > >>>
> > >>>> Hello everyone,
> > >>>>
> > >>>> Recently I've been working on my first Julia project, a word embedding
> > >>>> training program similar to Google's word2vec
> > >>>> <https://code.google.com/p/word2vec/> (the code of word2vec is indeed
> > >>>> very high-quality, but I want to add more features, so I decided to write a
> > >>>> new one). Thanks to Julia's expressiveness, it took me less than 2 days to
> > >>>> write the entire program. But it runs really slowly, about 100x slower than
> > >>>> the C code of word2vec (the algorithm is the same). I've been trying to
> > >>>> optimize my code for several days (adding type annotations, using BLAS to
> > >>>> do computation, eliminating memory allocations ...), but it is still 30x
> > >>>> slower than the C code.
> > >>>>
> > >>>> The critical part of my program is the following function (it also
> > >>>> consumes most of the time according to the profiling result):
> > >>>>
> > >>>> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
> > >>>>                    α :: Float64 = 0.025,
> > >>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
> > >>>>     predict!(c, x)
> > >>>>     c.outputs[y] -= 1
> > >>>>
> > >>>>     if input_gradient != nothing
> > >>>>         # input_gradient = ( c.weights * outputs' )'
> > >>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
> > >>>>     end
> > >>>>
> > >>>>     # c.weights -= α * x' * outputs;
> > >>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
> > >>>> end
> > >>>>
> > >>>> function predict!(c :: LinearClassifier, x :: Array{Float64})
> > >>>>     c.outputs = vec(softmax(x * c.weights))
> > >>>> end
> > >>>>
> > >>>> type LinearClassifier
> > >>>>     k :: Int64 # number of outputs
> > >>>>     n :: Int64 # number of inputs
> > >>>>     weights :: Array{Float64, 2} # k * n weight matrix
> > >>>>     outputs :: Vector{Float64}
> > >>>> end
> > >>>>
> > >>>> And the entire program can be found here
> > >>>> <https://github.com/yangzhixuan/embed>. Could you please check my code
> > >>>> and tell me what I can do to get performance comparable to C?
> > >>>>
> > >>>> Regards,
> > >>>> Yang Zhixuan
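
For readers following the thread, the hand-written replacement for BLAS.gemv! that Tim asks about, and the @nexprs unrolling he suggests, might look roughly like the sketch below. This is not the poster's actual code: the names gemv_loops! and dot_unrolled4 and their argument orders are hypothetical, the unroll factor of 4 is arbitrary, and the sketch assumes current Julia 1.x syntax rather than the 0.3 syntax quoted above.

```julia
using Base.Cartesian: @nexprs

# Hypothetical helper (not from the thread): y = α*A*x + β*y with plain loops.
function gemv_loops!(y::Vector{Float64}, α::Float64, A::Matrix{Float64},
                     x::Vector{Float64}, β::Float64)
    m, n = size(A)
    # Check sizes once up front so that skipping bounds checks below is safe.
    (length(x) == n && length(y) == m) || throw(DimensionMismatch("size mismatch"))
    @inbounds @simd for i in 1:m
        y[i] *= β
    end
    # Julia arrays are column-major, so the inner loop walks down one column of A.
    @inbounds for j in 1:n
        αxj = α * x[j]
        @simd for i in 1:m
            y[i] += A[i, j] * αxj
        end
    end
    return y
end

# Hypothetical helper: dot product unrolled by 4 with @nexprs.
function dot_unrolled4(a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    length(b) == n || throw(DimensionMismatch("length mismatch"))
    # Expands to s_1 = 0.0; s_2 = 0.0; s_3 = 0.0; s_4 = 0.0
    @nexprs 4 k -> (s_k = 0.0)
    i = 1
    @inbounds while i + 3 <= n
        # Expands to four statements, one per accumulator s_1 .. s_4.
        @nexprs 4 k -> (s_k += a[i + k - 1] * b[i + k - 1])
        i += 4
    end
    # Handle the leftover elements when n is not a multiple of 4.
    @inbounds while i <= n
        s_1 += a[i] * b[i]
        i += 1
    end
    return (s_1 + s_2) + (s_3 + s_4)
end
```

The size check before the loops is what makes @inbounds safe here without launching Julia with --check-bounds=no. Whether @simd or manual unrolling actually helps depends on the machine and the Julia version, so it is worth timing each variant, as the poster did.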
