Just to check, in writing out your own version of gemv! you're using @inbounds 
@simd, right?
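
For reference, here is a minimal sketch of what such a hand-written loop might look like. The name my_gemv! is made up for illustration, and the code assumes current Julia syntax rather than the 0.3 series discussed in this thread. It only covers the 'N' (non-transposed) case of BLAS.gemv!, and whether it beats BLAS will depend on the matrix sizes involved.

```julia
# Hypothetical helper (not part of Base): computes y = α*A*x + β*y in place,
# i.e. the 'N' case of BLAS.gemv!.
function my_gemv!(α::Float64, A::Matrix{Float64}, x::Vector{Float64},
                  β::Float64, y::Vector{Float64})
    m, n = size(A)
    # Check sizes up front so that skipping bounds checks below is safe.
    (length(x) == n && length(y) == m) || throw(DimensionMismatch("size mismatch"))

    # Scale y by β first.
    @inbounds @simd for i in 1:m
        y[i] *= β
    end

    # Julia arrays are column-major, so the inner loop walks down one column
    # of A and touches memory contiguously.
    @inbounds for j in 1:n
        αxj = α * x[j]
        @simd for i in 1:m
            y[i] += A[i, j] * αxj
        end
    end
    return y
end
```

Called as my_gemv!(α, c.weights, c.outputs, 1.0, input_gradient), this would stand in for the gemv! call in the quoted train_one, assuming the arrays have matching sizes.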

The @nexprs macro (documented in the Base.Cartesian section of the manual) 
lets you unroll loops manually. Also, see the (currently alpha) KernelTools.jl 
repository for some ideas about improving cache efficiency---perhaps the @tile 
macro will help.
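
As a rough illustration of manual unrolling with @nexprs, the sketch below unrolls a dot product by a factor of 4. The function name dot_unrolled4 and the unroll factor are arbitrary choices for the example, and the code assumes current Julia syntax. Inside @nexprs, a name like s_k expands to s_1, s_2, and so on.

```julia
using Base.Cartesian: @nexprs

# Hypothetical example: dot product with the loop body unrolled 4 times.
function dot_unrolled4(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    length(y) == n || throw(DimensionMismatch("vectors must have equal length"))

    # Expands to four statements: s_1 = 0.0, s_2 = 0.0, s_3 = 0.0, s_4 = 0.0
    @nexprs 4 k -> s_k = 0.0

    i = 1
    @inbounds while i + 3 <= n
        # Expands to s_1 += x[i]*y[i], ..., s_4 += x[i+3]*y[i+3]
        @nexprs 4 k -> s_k += x[i + k - 1] * y[i + k - 1]
        i += 4
    end
    s = s_1 + s_2 + s_3 + s_4

    # Remainder loop for lengths that are not a multiple of 4.
    @inbounds while i <= n
        s += x[i] * y[i]
        i += 1
    end
    return s
end
```

Using several independent accumulators changes the order of the floating-point additions, so the result can differ from a plain loop in the last few bits.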

--Tim

On Saturday, February 21, 2015 01:17:24 PM Mauro wrote:
> > After all, the C code was optimized intensively in the cache level and
> > all loops were unrolled.
> 
> Julia is good at unrolling loops using macros.
> 
> > Maybe I should try to use more computers. Currently my code is
> > parallelized with pmap(). I hope the communication overhead will not
> > become a new bottleneck if I run on a local network cluster.
> > 
> > Thanks for your help!
> > 
> > Regards, Yang Zhixuan
> > 
> > On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
> > 
> >> So, where is the performance now compared to the C program? I don't think
> >> MKL will give you much if you were on the order of 100x slower to start
> >> with.
> >> 
> >> -viral
> >> 
> >> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
> >>> Mauro, Sean, and Tim, thanks for your help.
> >>> 
> >>> Following your suggestions, I removed keyword arguments and split the
> >>> function to avoid conditional statements. These helped a bit.
> >>> 
> >>> But I got a surprising result after replacing the BLAS functions with
> >>> simple for loops: the for loops are about 1.5x faster than the BLAS
> >>> calls. My Julia is compiled on my computer with the default
> >>> configuration (the versioninfo() output is listed below). Do you think
> >>> it would help to compile Julia with a faster, more optimized BLAS
> >>> implementation such as Intel's MKL?
> >>> 
> >>> Julia Version 0.3.6-pre+70
> >>> Commit 638fa02 (2015-02-12 13:59 UTC)
> >>> 
> >>> Platform Info:
> >>>  System: Darwin (x86_64-apple-darwin14.1.0)
> >>>  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
> >>>  WORD_SIZE: 64
> >>>  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
> >>>  LAPACK: libopenblas
> >>>  LIBM: libopenlibm
> >>>  LLVM: libLLVM-3.3
> >>> 
> >>> Regards, Yang Zhixuan
> >>> 
> >>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
> >>> 
> >>>> Hello everyone,
> >>>> 
> >>>> Recently I've been working on my first Julia project, a word embedding
> >>>> training program similar to Google's word2vec
> >>>> <https://code.google.com/p/word2vec/> (the code of word2vec is indeed
> >>>> very high-quality, but I want to add more features, so I decided to
> >>>> write a new one). Thanks to Julia's expressiveness, it took me less
> >>>> than 2 days to write the entire program. But it runs really slowly,
> >>>> about 100x slower than the C code of word2vec (the algorithm is the
> >>>> same). I've been trying to optimize my code for several days (adding
> >>>> type annotations, using BLAS to do computation, eliminating memory
> >>>> allocations ...), but it is still 30x slower than the C code.
> >>>> 
> >>>> The critical part of my program is the following function (it also
> >>>> consumes most of the time according to the profiling result):
> >>>> 
> >>>> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
> >>>>                    α :: Float64 = 0.025,
> >>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
> >>>> 
> >>>>     predict!(c, x)
> >>>>     c.outputs[y] -= 1
> >>>>     
> >>>>     if input_gradient != nothing
> >>>>     
> >>>>         # input_gradient = ( c.weights * outputs' )'
> >>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
> >>>>     
> >>>>     end
> >>>>     
> >>>>     # c.weights -= α * x' * outputs;
> >>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
> >>>> 
> >>>> end
> >>>> 
> >>>> function predict!(c :: LinearClassifier, x :: Array{Float64})
> >>>> 
> >>>>     c.outputs = vec(softmax(x * c.weights))
> >>>> 
> >>>> end
> >>>> 
> >>>> type LinearClassifier
> >>>> 
> >>>>     k :: Int64 # number of outputs
> >>>>     n :: Int64 # number of inputs
> >>>>     weights :: Array{Float64, 2} # k * n weight matrix
> >>>>     
> >>>>     outputs :: Vector{Float64}
> >>>> 
> >>>> end
> >>>> 
> >>>> The entire program can be found here
> >>>> <https://github.com/yangzhixuan/embed>. Could you please check my code
> >>>> and tell me what I can do to get performance comparable to C?
> >>>> 
> >>>> Regards.
> >>>> Yang Zhixuan
