This does look like a nice benchmark. I would love to see what it takes to 
narrow down the gap further. Playing around with it now. Perhaps the threads 
branch is also worth a shot.

-viral



> On 21-Feb-2015, at 1:23 pm, Zhixuan Yang wrote:
> 
> After recompiling Julia and OpenBLAS for my native architecture, my code is 
> about 8x slower than the C code, and I think that is close to the best 
> performance it can achieve. After all, the C code was heavily optimized at 
> the cache level and all its loops were unrolled. But my Julia code is much 
> more flexible and extensible. 
> 
> Maybe I should try using more machines. Currently my code is parallelized 
> with pmap(). I hope the communication overhead will not become a new 
> bottleneck if I run on a local network cluster.
> 
> Thanks for your help! 
> 
> Regards, Yang Zhixuan
> 
> On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
> So, where is the performance now compared to the C program? I don't think MKL 
> will give you much if you were on the order of 100x slower to start with.
> 
> -viral
> 
> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
> Mauro, Sean, and Tim, thanks for your help. 
> 
> Following your suggestions, I removed keyword arguments and split the 
> function to avoid conditional statements. These helped a bit. 
> 
> But I got a surprising result after replacing the BLAS calls with simple for 
> loops: the for loops are about 1.5x faster than the BLAS calls. My Julia is 
> compiled on my computer with the default configuration (the versioninfo() 
> output is listed below). Do you think it would help to compile Julia with a 
> faster, more optimized BLAS implementation such as Intel's MKL? 
> 
> Julia Version 0.3.6-pre+70
> Commit 638fa02 (2015-02-12 13:59 UTC)
> Platform Info:
>  System: Darwin (x86_64-apple-darwin14.1.0)
>  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
>  WORD_SIZE: 64
>  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>  LAPACK: libopenblas
>  LIBM: libopenlibm
>  LLVM: libLLVM-3.3
> 
> 
> 
> Regards, Yang Zhixuan
> 
> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
> Hello everyone, 
> 
> I've recently been working on my first Julia project, a word embedding 
> training program similar to Google's word2vec (the word2vec code is indeed 
> very high-quality, but I want to add more features, so I decided to write a 
> new one). Thanks to Julia's expressiveness, it took me less than 2 days to 
> write the entire program. But it runs really slowly, about 100x slower than 
> the C code of word2vec (the algorithm is the same). I've been trying to 
> optimize my code for several days (adding type annotations, using BLAS for 
> the computation, eliminating memory allocations ...), but it is still 30x 
> slower than the C code. 
> 
> The critical part of my program is the following function (it also consumes 
> most of the time according to the profiling result):
> 
> function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64;
>                    α :: Float64 = 0.025,
>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
>     predict!(c, x)
>     c.outputs[y] -= 1
> 
>     if input_gradient != nothing
>         # input_gradient += α * c.weights * c.outputs
>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
>     end
> 
>     # c.weights -= α * x' * outputs;
>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
> end
> 
> function predict!(c :: LinearClassifier, x :: Array{Float64})
>     c.outputs = vec(softmax(x * c.weights))
> end
> 
> type LinearClassifier
>     k :: Int64 # number of outputs
>     n :: Int64 # number of inputs
>     weights :: Array{Float64, 2} # n * k weight matrix (x * weights gives k outputs)
> 
>     outputs :: Vector{Float64}
> end
> 
> The entire program can be found here. Could you please check my code and 
> tell me what I can do to get performance comparable to C? 
> 
> Regards.
> Yang Zhixuan
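
The changes described in the thread (no keyword arguments, one function per 
case instead of a conditional, plain for loops in place of the BLAS calls) 
might look roughly like the sketch below. It is not the code from the 
project. It is written for Julia 1.x rather than the 0.3 syntax quoted above, 
it assumes the weight matrix is n x k (inputs by outputs) and that x is a 
plain vector, and the helper name update_weights! and the two train_one! 
methods are hypothetical names chosen for illustration.

```julia
# Sketch for Julia 1.x; the quoted code uses 0.3-era syntax (`type`, `Union(...)`).
# Assumption: weights is n x k (inputs x outputs), as implied by x * c.weights.
struct LinearClassifier
    weights::Matrix{Float64}   # n x k
    outputs::Vector{Float64}   # length k, overwritten in place
end

LinearClassifier(n::Int, k::Int) = LinearClassifier(zeros(n, k), zeros(k))

# Compute softmax(W' * x) into c.outputs without allocating temporaries.
function predict!(c::LinearClassifier, x::Vector{Float64})
    W, out = c.weights, c.outputs
    n, k = size(W)
    @inbounds for j in 1:k
        s = 0.0
        @simd for i in 1:n
            s += W[i, j] * x[i]          # walks down a column (column-major)
        end
        out[j] = s
    end
    m = maximum(out)                     # subtract the max for numerical stability
    total = 0.0
    @inbounds for j in 1:k
        out[j] = exp(out[j] - m)
        total += out[j]
    end
    @inbounds for j in 1:k
        out[j] /= total
    end
    return out
end

# Hypothetical helper: W -= α * x * outputs' (the rank-1 update done by ger!).
function update_weights!(c::LinearClassifier, x::Vector{Float64}, α::Float64)
    W, out = c.weights, c.outputs
    n, k = size(W)
    @inbounds for j in 1:k
        a = α * out[j]
        @simd for i in 1:n
            W[i, j] -= a * x[i]
        end
    end
    return c
end

# Case 1: no input gradient requested.
function train_one!(c::LinearClassifier, x::Vector{Float64}, y::Int, α::Float64)
    predict!(c, x)
    c.outputs[y] -= 1.0
    return update_weights!(c, x, α)
end

# Case 2: also accumulate the input gradient (a separate method, no `if`).
function train_one!(c::LinearClassifier, x::Vector{Float64}, y::Int,
                    α::Float64, input_gradient::Vector{Float64})
    predict!(c, x)
    c.outputs[y] -= 1.0
    W, out = c.weights, c.outputs
    n, k = size(W)
    # input_gradient += α * W * outputs, computed before W is updated
    @inbounds for j in 1:k
        a = α * out[j]
        @simd for i in 1:n
            input_gradient[i] += a * W[i, j]
        end
    end
    return update_weights!(c, x, α)
end
```

Both inner loops run down the columns of the weight matrix, which matches 
Julia's column-major layout. Whether this beats BLAS on a given machine 
depends on the matrix sizes, so it is worth timing both versions.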
