Plain loops beating BLAS definitely happens sometimes, especially with small
matrices. OpenBLAS seems to be tuned for larger matrices, and it's really good
on those.
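
If you want to check where the crossover is on your machine, here is a rough sketch. It assumes a recent Julia (1.x), where BLAS lives in the LinearAlgebra standard library; on 0.3 the `using` line should not be needed. The names `ger_loops!` and `time_updates` are made up for this example, and the sizes and repetition count are arbitrary.

```julia
using LinearAlgebra  # assumption: Julia 1.x; provides the BLAS module

# Rank-1 update A += alpha * x * y' written as plain loops.
# The inner loop runs down a column, matching Julia's column-major layout.
function ger_loops!(alpha, x, y, A)
    m, n = size(A)
    @inbounds for j in 1:n
        ay = alpha * y[j]
        @simd for i in 1:m
            A[i, j] += ay * x[i]
        end
    end
    return A
end

# Time `reps` calls of `f` on a private copy of A.
function time_updates(f, reps, alpha, x, y, A)
    B = copy(A)
    f(alpha, x, y, B)  # warm-up call so compilation is not included in the timing
    return @elapsed for _ in 1:reps
        f(alpha, x, y, B)
    end
end

# Compare the loop version against BLAS.ger! at a few (arbitrary) sizes.
for n in (8, 64, 512)
    x, y, A = rand(n), rand(n), zeros(n, n)
    t_loop = time_updates(ger_loops!, 1000, 0.025, x, y, A)
    t_blas = time_updates(BLAS.ger!, 1000, 0.025, x, y, A)
    println("n = $n: loops $(t_loop) s, BLAS.ger! $(t_blas) s")
end
```

The numbers will depend on the CPU, the BLAS build, and the number of BLAS threads, so treat this only as a way to see the trend for your own sizes.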

If you want to try MKL, see 
https://github.com/JuliaLang/julia#intel-compilers-and-math-kernel-libraries

--Tim

On Friday, February 20, 2015 06:49:50 AM Zhixuan Yang wrote:
> Mauro, Sean, and Tim, thanks for your help.
> 
> Following your suggestions, I removed keyword arguments and split the
> function to avoid conditional statements. These helped a bit.
> 
> But I got a surprising result after replacing the BLAS functions with simple
> for loops: the loops are about 1.5x faster than the BLAS calls. My Julia was
> compiled on my computer with the default configuration (the versioninfo()
> output is listed below). Do you think it would help to compile Julia with a
> faster, more optimized BLAS implementation such as Intel's MKL?
> 
> Julia Version 0.3.6-pre+70
> Commit 638fa02 (2015-02-12 13:59 UTC)
> Platform Info:
>  System: Darwin (x86_64-apple-darwin14.1.0)
>  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
>  WORD_SIZE: 64
>  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>  LAPACK: libopenblas
>  LIBM: libopenlibm
>  LLVM: libLLVM-3.3
> 
> 
> Regards, Yang Zhixuan
> 
> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
> 
> > Hello everyone,
> > 
> > I've recently been working on my first Julia project, a word embedding
> > training program similar to Google's word2vec
> > <https://code.google.com/p/word2vec/> (the word2vec code is indeed very
> > high-quality, but I want to add more features, so I decided to write a new
> > one). Thanks to Julia's expressiveness, it took me less than 2 days to
> > write the entire program. But it runs really slowly, about 100x slower
> > than the C code of word2vec (the algorithm is the same).
> > 
> > I've been trying to optimize my code for several days (adding type
> > annotations, using BLAS to do the computation, eliminating memory
> > allocations, ...), but it is still 30x slower than the C code.
> > 
> > The critical part of my program is the following function (it also
> > consumes most of the time according to the profiling result):
> > 
> > function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
> >                    α :: Float64 = 0.025,
> >                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
> > 
> >     predict!(c, x)
> >     c.outputs[y] -= 1
> >     
> >     if input_gradient != nothing
> >     
> >         # input_gradient += α * c.weights * c.outputs
> >         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
> >     
> >     end
> >     
> >     # c.weights -= α * x' * outputs;
> >     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
> > 
> > end
> > 
> > function predict!(c :: LinearClassifier, x :: Array{Float64})
> > 
> >     c.outputs = vec(softmax(x * c.weights))
> > 
> > end
> > 
> > type LinearClassifier
> > 
> >     k :: Int64 # number of outputs
> >     n :: Int64 # number of inputs
> >     weights :: Array{Float64, 2} # n * k weight matrix (inputs by outputs)
> >     
> >     outputs :: Vector{Float64}
> > 
> > end
> > 
> > The entire program can be found here
> > <https://github.com/yangzhixuan/embed>. Could you please check my code
> > and tell me what I can do to get performance comparable to C?
> > 
> > Regards.
> > Yang Zhixuan
