You may want to try using a profiler. I recently used the ProfileView.jl package <https://github.com/timholy/ProfileView.jl> with great success.
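
In case it helps, here is roughly what a profiling session looks like. This is only a sketch: it assumes a Julia 1.x install, where the profiler lives in the `Profile` standard library (on the 0.3 series used in this thread I believe `@profile` is available without the `using` line), and `hot_loop` is a made-up stand-in for whatever function you suspect is slow.

```julia
# Sketch for Julia 1.x, where the profiler is the `Profile` standard library.
using Profile

# Hypothetical stand-in for the hot function under investigation.
function hot_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end

hot_loop(10)                  # run once so compilation is not what gets profiled
Profile.clear()               # drop samples left over from earlier runs
@profile hot_loop(10^7)       # collect stack samples while the call runs
Profile.print(maxdepth = 12)  # text report; counts show where samples landed

# With ProfileView.jl installed, the same samples can be shown as a flame graph:
# using ProfileView
# ProfileView.view()
```

Wide bars (or large counts in the text report) that point at allocation or dynamic dispatch rather than arithmetic are usually the first thing worth chasing.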
On Friday, February 20, 2015 at 11:53:56 PM UTC-8, Zhixuan Yang wrote:
>
> After recompiling a native-arch version of Julia and OpenBLAS, it's about
> 8x slower than the C code, and I think that is close to the highest
> performance my code can achieve. After all, the C code was optimized
> intensively at the cache level and all loops were unrolled. But my Julia
> code is much more flexible and extensible.
>
> Maybe I should try to use more computers. Currently my code is parallelized
> using pmap(). I hope the communication overhead will not become a new
> bottleneck if I run on a local network cluster.
>
> Thanks for your help!
>
> Regards, Yang Zhixuan
>
> On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
>>
>> So, where is the performance now compared to the C program? I don't think
>> MKL will give you much if you were on the order of 100x slower to start
>> with.
>>
>> -viral
>>
>> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
>>>
>>> Mauro, Sean, and Tim, thanks for your help.
>>>
>>> Following your suggestions, I removed keyword arguments and split the
>>> function to avoid conditional statements. These helped a bit.
>>>
>>> But I got a surprising result after replacing BLAS functions with simple
>>> for loops: the for loops are about 1.5x faster than the BLAS calls. My
>>> Julia is compiled on my computer with the default configuration (the
>>> versioninfo() output is listed below). Do you think it will help to
>>> compile Julia with a faster and more optimized BLAS implementation such
>>> as Intel's MKL?
>>>
>>> Julia Version 0.3.6-pre+70
>>> Commit 638fa02 (2015-02-12 13:59 UTC)
>>> Platform Info:
>>>   System: Darwin (x86_64-apple-darwin14.1.0)
>>>   CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
>>>   WORD_SIZE: 64
>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>   LAPACK: libopenblas
>>>   LIBM: libopenlibm
>>>   LLVM: libLLVM-3.3
>>>
>>>
>>> Regards, Yang Zhixuan
>>>
>>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I've recently been working on my first Julia project, a word embedding
>>>> training program similar to Google's word2vec
>>>> <https://code.google.com/p/word2vec/> (the word2vec code is indeed
>>>> very high-quality, but I want to add more features, so I decided to
>>>> write a new one). Thanks to Julia's expressiveness, it took me less
>>>> than 2 days to write the entire program. But it runs really slowly,
>>>> about 100x slower than the C code of word2vec (the algorithm is the
>>>> same). I've been trying to optimize my code for several days (adding
>>>> type annotations, using BLAS to do the computation, eliminating memory
>>>> allocations ...), but it is still 30x slower than the C code.
>>>>
>>>> The critical part of my program is the following function (it also
>>>> consumes most of the time according to the profiling result):
>>>>
>>>> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
>>>>                    α :: Float64 = 0.025,
>>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
>>>>     predict!(c, x)
>>>>     c.outputs[y] -= 1
>>>>
>>>>     if input_gradient != nothing
>>>>         # input_gradient = ( c.weights * outputs' )'
>>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
>>>>     end
>>>>
>>>>     # c.weights -= α * x' * outputs;
>>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
>>>> end
>>>>
>>>> function predict!(c :: LinearClassifier, x :: Array{Float64})
>>>>     c.outputs = vec(softmax(x * c.weights))
>>>> end
>>>>
>>>> type LinearClassifier
>>>>     k :: Int64 # number of outputs
>>>>     n :: Int64 # number of inputs
>>>>     weights :: Array{Float64, 2} # n * k weight matrix
>>>>
>>>>     outputs :: Vector{Float64}
>>>> end
>>>>
>>>> And the entire program can be found here
>>>> <https://github.com/yangzhixuan/embed>. Could you please check my code
>>>> and tell me what I can do to get performance comparable to C?
>>>>
>>>> Regards.
>>>> Yang Zhixuan
>>>>
>>>
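
For anyone following along, the "simple for loops" replacement for the two BLAS calls quoted above might look something like the sketch below. This is not the code from the linked repository. It is written in Julia 1.x syntax rather than 0.3, the function names `accumulate_input_gradient!` and `rank1_update!` are made up for illustration, and it assumes `weights` is stored as an n-by-k matrix (inputs by outputs), which is how the quoted `gemv!` and `ger!` calls use it.

```julia
# Sketch in Julia 1.x syntax; function names are hypothetical.
# `weights` is assumed to be n-by-k (inputs by outputs).

# Loop form of BLAS.gemv!('N', α, weights, outputs, 1.0, input_gradient):
#   input_gradient += α * weights * outputs
function accumulate_input_gradient!(input_gradient::Vector{Float64},
                                    weights::Matrix{Float64},
                                    outputs::Vector{Float64}, α::Float64)
    n, k = size(weights)
    (length(input_gradient) == n && length(outputs) == k) ||
        throw(DimensionMismatch("expected lengths $n and $k"))
    @inbounds for j in 1:k      # column-major storage: row index goes innermost
        s = α * outputs[j]
        for i in 1:n
            input_gradient[i] += s * weights[i, j]
        end
    end
    return input_gradient
end

# Loop form of BLAS.ger!(-α, x, outputs, weights):
#   weights -= α * x * outputs'
function rank1_update!(weights::Matrix{Float64}, x::Vector{Float64},
                       outputs::Vector{Float64}, α::Float64)
    n, k = size(weights)
    (length(x) == n && length(outputs) == k) ||
        throw(DimensionMismatch("expected lengths $n and $k"))
    @inbounds for j in 1:k
        s = α * outputs[j]
        for i in 1:n
            weights[i, j] -= s * x[i]
        end
    end
    return weights
end
```

Both functions update their first argument in place and allocate nothing. Whether they beat the BLAS calls will depend on the vector sizes and the BLAS build, so it is worth timing both on the real data.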
