After recompiling Julia and OpenBLAS for my native architecture, my code is now about 8x slower than the C code, and I think that is close to the best performance it can achieve. After all, the C code is heavily optimized at the cache level and all of its loops are unrolled. But my Julia code is much more flexible and extensible.
Maybe I should try to use more computers. Currently my code is parallelized with pmap(). I hope the communication overhead will not become a new bottleneck if I run it on a local network cluster.

Thanks for your help!

Regards,
Yang Zhixuan

On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
>
> So, where is the performance now compared to the C program? I don't think
> MKL will give you much if you were on the order of 100x slower to start
> with.
>
> -viral
>
> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
>>
>> Mauro, Sean, and Tim, thanks for your help.
>>
>> Following your suggestions, I removed keyword arguments and split the
>> function to avoid conditional statements. These helped a bit.
>>
>> But I got a surprising result after replacing BLAS functions with simple
>> for loops: the for loops are about 1.5x faster than the BLAS calls. My
>> Julia is compiled on my computer with the default configuration (the
>> versioninfo() output is listed below). Do you think it will help to
>> compile Julia with a faster and more optimized BLAS implementation such
>> as Intel's MKL?
>>
>> Julia Version 0.3.6-pre+70
>> Commit 638fa02 (2015-02-12 13:59 UTC)
>> Platform Info:
>>   System: Darwin (x86_64-apple-darwin14.1.0)
>>   CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>   LAPACK: libopenblas
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.3
>>
>>
>> Regards, Yang Zhixuan
>>
>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
>>>
>>> Hello everyone,
>>>
>>> Recently I've been working on my first Julia project, a word embedding
>>> training program similar to Google's word2vec
>>> <https://code.google.com/p/word2vec/> (the code of word2vec is indeed
>>> very high-quality, but I want to add more features, so I decided to
>>> write a new one). Thanks to Julia's expressiveness, it took me less
>>> than 2 days to write the entire program.
>>> But it runs really slowly, about 100x slower than the C code of
>>> word2vec (the algorithm is the same). I've been trying to optimize my
>>> code for several days (adding type annotations, using BLAS to do the
>>> computation, eliminating memory allocations ...), but it is still 30x
>>> slower than the C code.
>>>
>>> The critical part of my program is the following function (it also
>>> consumes most of the time according to the profiling result):
>>>
>>> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
>>>                    α :: Float64 = 0.025,
>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
>>>     predict!(c, x)
>>>     c.outputs[y] -= 1
>>>
>>>     if input_gradient != nothing
>>>         # input_gradient = ( c.weights * outputs' )'
>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
>>>     end
>>>
>>>     # c.weights -= α * x' * outputs;
>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
>>> end
>>>
>>> function predict!(c :: LinearClassifier, x :: Array{Float64})
>>>     c.outputs = vec(softmax(x * c.weights))
>>> end
>>>
>>> type LinearClassifier
>>>     k :: Int64 # number of outputs
>>>     n :: Int64 # number of inputs
>>>     weights :: Array{Float64, 2} # k * n weight matrix
>>>
>>>     outputs :: Vector{Float64}
>>> end
>>>
>>> And the entire program can be found here
>>> <https://github.com/yangzhixuan/embed>. Could you please check my code
>>> and tell me what I can do to get performance comparable to C?
>>>
>>> Regards,
>>> Yang Zhixuan
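
For readers following the thread, here is a minimal sketch of what replacing the two BLAS calls with "simple for loops" could look like. It is written for current Julia (1.x), not the 0.3 used above. The function name `update_loops!` and the fusion of both updates into one pass are assumptions for illustration, not the code from the embed repository. It also assumes the weight matrix is stored as n × k, which is what the `gemv!` and `ger!` calls in the quoted function imply.

```julia
using LinearAlgebra

# Hypothetical helper (not from the original program). Loop version of:
#   BLAS.gemv!('N', α, W, out, 1.0, g)   # g += α * W * out
#   BLAS.ger!(-α, x, out, W)             # W -= α * x * out'
# Assumed shapes: W is n × k, x and g have length n, out has length k.
function update_loops!(W::Matrix{Float64}, g::Vector{Float64},
                       x::Vector{Float64}, out::Vector{Float64}, α::Float64)
    n, k = size(W)
    @inbounds for j in 1:k              # arrays are column-major: walk one column at a time
        oj = out[j]
        @simd for i in 1:n
            g[i] += α * W[i, j] * oj    # reads the weight before it is updated
            W[i, j] -= α * x[i] * oj
        end
    end
    return W
end

# Small usage example with made-up numbers.
W = reshape(collect(1.0:15.0), 5, 3) ./ 10
g = zeros(5)
x = collect(1.0:5.0) ./ 5
out = [0.2, 0.3, -0.5]
update_loops!(W, g, x, out, 0.025)
```

Whether this beats OpenBLAS depends on the vector sizes and the build. For short embedding vectors the fixed cost of each BLAS call can matter, which may be part of the 1.5x difference reported above, but that is a guess and should be checked with a profiler.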
