Just to check, in writing out your own version of gemv! you're using @inbounds @simd, right?
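For concreteness, a hand-written gemv!-style loop with those annotations might look like the sketch below. The name my_gemv! and its argument order are hypothetical, and the syntax is for current Julia 1.x rather than the 0.3 series discussed in this thread.

```julia
# Hypothetical hand-written stand-in for BLAS.gemv!('N', α, A, x, 1.0, y):
# updates y in place so that y += α * A * x.
function my_gemv!(y::Vector{Float64}, α::Float64,
                  A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    length(x) == n || throw(DimensionMismatch("length(x) must equal size(A, 2)"))
    length(y) == m || throw(DimensionMismatch("length(y) must equal size(A, 1)"))
    # Julia arrays are column-major, so the inner loop walks down one column
    # of A and touches memory contiguously.
    for j in 1:n
        s = α * x[j]
        # The sizes were checked above, so skipping bounds checks is safe here.
        @inbounds @simd for i in 1:m
            y[i] += s * A[i, j]
        end
    end
    return y
end
```

Keeping the row index in the inner loop is the main design choice: it matches the column-major layout, which gives @simd a contiguous stretch of memory to vectorize.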
The @nexprs macro (documented in the Base.Cartesian section of the manual) lets you unroll loops manually. Also, see the (currently alpha) KernelTools.jl repository for some ideas about improving cache efficiency; perhaps the @tile macro will help.

--Tim

On Saturday, February 21, 2015 01:17:24 PM Mauro wrote:
> > After all, the C code was optimized intensively at the cache level and
> > all loops were unrolled.
>
> Julia is good at unrolling loops using macros.
>
> > Maybe I should try to use more computers. Currently my code is
> > parallelized using pmap(). I hope the communication overhead will not
> > become a new bottleneck if I run on a local network cluster.
> >
> > Thanks for your help!
> >
> > Regards, Yang Zhixuan
> >
> > On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
> >
> >> So, where is the performance now compared to the C program? I don't think
> >> MKL will give you much if you were on the order of 100x slower to start
> >> with.
> >>
> >> -viral
> >>
> >> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
> >>> Mauro, Sean, and Tim, thanks for your help.
> >>>
> >>> Following your suggestions, I removed keyword arguments and split the
> >>> function to avoid conditional statements. These helped a bit.
> >>>
> >>> But I got a surprising result after replacing the BLAS functions with
> >>> simple for loops: the for loops are about 1.5x faster than the BLAS
> >>> calls. My Julia is compiled on my computer with the default
> >>> configuration (the versioninfo() is listed below). Do you think it
> >>> will help to compile Julia with a faster and more optimized BLAS
> >>> implementation such as Intel's MKL?
> >>>
> >>> Julia Version 0.3.6-pre+70
> >>> Commit 638fa02 (2015-02-12 13:59 UTC)
> >>>
> >>> Platform Info:
> >>>   System: Darwin (x86_64-apple-darwin14.1.0)
> >>>   CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
> >>>   WORD_SIZE: 64
> >>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
> >>>   LAPACK: libopenblas
> >>>   LIBM: libopenlibm
> >>>   LLVM: libLLVM-3.3
> >>>
> >>> Regards, Yang Zhixuan
> >>>
> >>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
> >>>
> >>>> Hello everyone,
> >>>>
> >>>> Recently I've been working on my first Julia project, a word embedding
> >>>> training program similar to Google's word2vec
> >>>> <https://code.google.com/p/word2vec/> (the code of word2vec is indeed
> >>>> very high-quality, but I want to add more features, so I decided to
> >>>> write a new one). Thanks to Julia's expressiveness, it took me less
> >>>> than 2 days to write the entire program. But it runs really slowly,
> >>>> about 100x slower than the C code of word2vec (the algorithm is the
> >>>> same). I've been trying to optimize my code for several days (adding
> >>>> type annotations, using BLAS to do computation, eliminating memory
> >>>> allocations ...), but it is still 30x slower than the C code.
> >>>>
> >>>> The critical part of my program is the following function (it also
> >>>> consumes most of the time according to the profiling result):
> >>>>
> >>>> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
> >>>>                    α :: Float64 = 0.025,
> >>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
> >>>>     predict!(c, x)
> >>>>     c.outputs[y] -= 1
> >>>>
> >>>>     if input_gradient != nothing
> >>>>         # input_gradient = ( c.weights * outputs' )'
> >>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
> >>>>     end
> >>>>
> >>>>     # c.weights -= α * x' * outputs;
> >>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
> >>>> end
> >>>>
> >>>> function predict!(c :: LinearClassifier, x :: Array{Float64})
> >>>>     c.outputs = vec(softmax(x * c.weights))
> >>>> end
> >>>>
> >>>> type LinearClassifier
> >>>>     k :: Int64 # number of outputs
> >>>>     n :: Int64 # number of inputs
> >>>>     weights :: Array{Float64, 2} # k * n weight matrix
> >>>>
> >>>>     outputs :: Vector{Float64}
> >>>> end
> >>>>
> >>>> And the entire program can be found here
> >>>> <https://github.com/yangzhixuan/embed>. Could you please check my code
> >>>> and tell me what I can do to get performance comparable to C.
> >>>>
> >>>> Regards.
> >>>> Yang Zhixuan
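
As an illustration of the manual unrolling with @nexprs mentioned at the top of this message, here is a minimal sketch of a dot product unrolled by four. The function name dot_unrolled4 is made up for this example and the code is not taken from the embed repository. It assumes current Julia 1.x syntax.

```julia
using Base.Cartesian: @nexprs

# Hypothetical example: dot product with the main loop unrolled four times.
function dot_unrolled4(a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    length(b) == n || throw(DimensionMismatch("a and b must have the same length"))
    # Expands to s_1 = 0.0; s_2 = 0.0; s_3 = 0.0; s_4 = 0.0
    @nexprs 4 k -> (s_k = 0.0)
    i = 1
    @inbounds while i + 3 <= n
        # Expands to four statements, one per accumulator s_1 ... s_4
        @nexprs 4 k -> (s_k += a[i + k - 1] * b[i + k - 1])
        i += 4
    end
    s = (s_1 + s_2) + (s_3 + s_4)
    # Handle the leftover elements when n is not a multiple of 4.
    @inbounds while i <= n
        s += a[i] * b[i]
        i += 1
    end
    return s
end
```

Because the four accumulators change the order of summation, the result can differ from a plain loop in the last few bits for general floating-point input.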
