I passed "--check-bounds=no" to julia when launching the REPL to avoid 
writing @inbounds explicitly. 

Simply adding @simd before the for loops doesn't seem to help; it 
slightly slows the code down.
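
For concreteness, a hand-written gemv!-style kernel with @inbounds and @simd on the inner loop could be sketched roughly as below. The name loop_gemv! and its signature are hypothetical and are not the code from this thread; the sketch uses current Julia syntax and has not been benchmarked. The outer loop runs over columns so that the inner loop walks contiguous memory, which is the access pattern @simd is most likely to benefit from.

```julia
# Hypothetical sketch: y += α * A * x written as explicit loops.
function loop_gemv!(y::Vector{Float64}, α::Float64,
                    A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    # Explicit size checks are what make @inbounds safe below.
    length(y) == m || throw(DimensionMismatch("length(y) != size(A, 1)"))
    length(x) == n || throw(DimensionMismatch("length(x) != size(A, 2)"))
    for j in 1:n
        s = α * x[j]
        # Julia arrays are column-major, so A[i, j] is contiguous in i.
        @inbounds @simd for i in 1:m
            y[i] += s * A[i, j]
        end
    end
    return y
end
```

Whether @simd actually helps depends on the loop body and the LLVM version, so it seems worth timing the loop with and without it.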

@nexprs is exactly what I need to avoid redundant code when unrolling 
loops. I will use this to simplify the code.
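
As an illustration of the kind of unrolling @nexprs allows, here is a minimal sketch of a dot product unrolled by 4 with separate accumulators. The function dot_unrolled is a made-up example, not part of the program discussed here, and it assumes a Julia version where Base.Cartesian exports @nexprs.

```julia
using Base.Cartesian: @nexprs

# Hypothetical sketch: dot product unrolled by 4 with @nexprs.
function dot_unrolled(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    length(y) == n || throw(DimensionMismatch("x and y differ in length"))
    # Expands to: s_1 = 0.0; s_2 = 0.0; s_3 = 0.0; s_4 = 0.0
    @nexprs 4 k -> (s_k = 0.0)
    i = 1
    @inbounds while i + 3 <= n
        # Expands to four statements, one per accumulator s_1 .. s_4.
        @nexprs 4 k -> (s_k += x[i + k - 1] * y[i + k - 1])
        i += 4
    end
    s = s_1 + s_2 + s_3 + s_4
    # Handle the leftover elements when n is not a multiple of 4.
    @inbounds while i <= n
        s += x[i] * y[i]
        i += 1
    end
    return s
end
```

The macro only generates the repeated statements; the unroll factor of 4 is an arbitrary choice here and would need tuning against the real kernel.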

I will look at KernelTools.jl later. 

At this point, my Julia code runs roughly 3.5x slower than the C code, and 
I'm pleased with this. I'd be glad to move on to adding the features I 
wanted in the first place. :-) 

Thanks, Yang Zhixuan

On Saturday, February 21, 2015 at 10:49:57 PM UTC+8, Tim Holy wrote:
>
> Just to check, in writing out your own version of gemv! you're using 
> @inbounds @simd, right? 
>
> The @nexprs macro (documented in the Base.Cartesian section of the manual) 
> lets you unroll loops manually. Also, see the (currently alpha) 
> KernelTools.jl repository for some ideas about improving cache 
> efficiency; perhaps the @tile macro will help. 
>
> --Tim 
>
> On Saturday, February 21, 2015 01:17:24 PM Mauro wrote: 
> > > After all, the C code was optimized intensively in the cache level and 
> > > all loops were unrolled. 
> > 
> > Julia is good at unrolling loops using macros. 
> > 
> > > Maybe I should try to use more computers. Currently my code is 
> > > parallelized using pmap(). I hope the communication overhead will not 
> > > become a new bottleneck if I run on a local network cluster. 
> > > 
> > > Thanks for your help! 
> > > 
> > > Regards, Yang Zhixuan 
> > > 
> > > On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote: 
> > > 
> > > >> So, where is the performance now compared to the C program? I don't think 
> > > >> MKL will give you much if you were on the order of 100x slower to start 
> > > >> with. 
> > >> 
> > >> -viral 
> > >> 
> > > >> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote: 
> > >>> Mauro, Sean, and Tim, thanks for your help. 
> > >>> 
> > > >>> Following your suggestions, I removed keyword arguments and split the 
> > > >>> function to avoid conditional statements. These helped a bit. 
> > > >>> 
> > > >>> But I got a surprising result after replacing the BLAS functions with 
> > > >>> simple for loops: the for loops are about 1.5x faster than the BLAS 
> > > >>> calls. My Julia is compiled on my computer with the default 
> > > >>> configuration (the versioninfo() output is listed below). Do you think 
> > > >>> it would help to compile Julia with a faster, more optimized BLAS 
> > > >>> implementation such as Intel's MKL? 
> > >>> 
> > >>> Julia Version 0.3.6-pre+70 
> > >>> Commit 638fa02 (2015-02-12 13:59 UTC) 
> > >>> 
> > >>> Platform Info: 
> > >>>  System: Darwin (x86_64-apple-darwin14.1.0) 
> > >>>  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz 
> > >>>  WORD_SIZE: 64 
> > >>>  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell) 
> > >>>  LAPACK: libopenblas 
> > >>>  LIBM: libopenlibm 
> > >>>  LLVM: libLLVM-3.3 
> > >>> 
> > >>> Regards, Yang Zhixuan 
> > >>> 
> > > >>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote: 
> > >>> 
> > >>>> Hello everyone, 
> > >>>> 
> > > >>>> Recently I've been working on my first Julia project, a word embedding 
> > > >>>> training program similar to Google's word2vec 
> > > >>>> <https://code.google.com/p/word2vec/> (the code of word2vec is indeed 
> > > >>>> very high-quality, but I want to add more features, so I decided to 
> > > >>>> write a new one). Thanks to Julia's expressiveness, it took me less 
> > > >>>> than 2 days to write the entire program. But it runs really slowly, 
> > > >>>> about 100x slower than the C code of word2vec (the algorithm is the 
> > > >>>> same). I've been trying to optimize my code for several days (adding 
> > > >>>> type annotations, using BLAS to do computation, eliminating memory 
> > > >>>> allocations ...), but it is still 30x slower than the C code. 
> > >>>> 
> > >>>> The critical part of my program is the following function (it also 
> > >>>> consumes most of the time according to the profiling result): 
> > >>>> 
> > > >>>> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64; 
> > > >>>>                    α :: Float64 = 0.025, 
> > > >>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing) 
> > >>>> 
> > >>>>     predict!(c, x) 
> > >>>>     c.outputs[y] -= 1 
> > >>>>     
> > >>>>     if input_gradient != nothing 
> > >>>>     
> > >>>>         # input_gradient = ( c.weights * outputs' )' 
> > >>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, 
> input_gradient) 
> > >>>>     
> > >>>>     end 
> > >>>>     
> > >>>>     # c.weights -= α * x' * outputs; 
> > >>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights) 
> > >>>> 
> > >>>> end 
> > >>>> 
> > >>>> function predict!(c :: LinearClassifier, x :: Array{Float64}) 
> > >>>> 
> > >>>>     c.outputs = vec(softmax(x * c.weights)) 
> > >>>> 
> > >>>> end 
> > >>>> 
> > >>>> type LinearClassifier 
> > >>>> 
> > >>>>     k :: Int64 # number of outputs 
> > >>>>     n :: Int64 # number of inputs 
> > >>>>     weights :: Array{Float64, 2} # k * n weight matrix 
> > >>>>     
> > >>>>     outputs :: Vector{Float64} 
> > >>>> 
> > >>>> end 
> > >>>> 
> > > >>>> And the entire program can be found here 
> > > >>>> <https://github.com/yangzhixuan/embed>. Could you please check my code 
> > > >>>> and tell me what I can do to get performance comparable to C? 
> > >>>> 
> > >>>> Regards. 
> > >>>> Yang Zhixuan 
>
>
