Great to see that Tobias' PR rocks ;) I am still getting a weird segfault, and I cannot reproduce it in a simpler piece of code. I will keep working on it and post it as soon as I nail it down.
Tobias: any pointers towards possible incompatibilities in the current state of the PR? Thanks.

------------------------------------------
Carlos


On Sun, May 18, 2014 at 5:26 PM, Tobias Knopp <tobias.kn...@googlemail.com> wrote:

> And I am pretty excited that it seems to scale so well on your setup. I
> have only 2 cores, so I could not see whether it scales to more cores.
>
> On Sunday, May 18, 2014 16:40:18 UTC+2, Tobias Knopp wrote:
>
>> Well, when I started I got segfaults all the time :-)
>>
>> Could you please send me a minimal code example that segfaults? That
>> would be great! It is the only way we can get this stable.
>>
>> On Sunday, May 18, 2014 16:35:47 UTC+2, Carlos Becker wrote:
>>>
>>> Sounds great!
>>> I just gave it a try, and with 16 threads I get 0.07 sec, which is
>>> impressive.
>>>
>>> That was when I tried it in isolated code. When put together with
>>> other Julia code I have, it segfaults. Have you experienced this as
>>> well?
>>> On May 18, 2014 16:05, "Tobias Knopp" <tobias...@googlemail.com>
>>> wrote:
>>>
>>>> Sure. Note that the function is Base.parapply, though; I had
>>>> explicitly imported it.
>>>>
>>>> In the case of vectorize_1arg it would be great to automatically
>>>> parallelize comprehensions. If someone could tell me where the actual
>>>> looping happens, that would be great. I have not found it yet; it
>>>> seems to be somewhere in the parser.
>>>>
>>>> On Sunday, May 18, 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>>>>
>>>>> By the way, does the code you just sent work as-is with your pull
>>>>> request branch?
>>>>>
>>>>> ------------------------------------------
>>>>> Carlos
>>>>>
>>>>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tobias, I saw your pull request and have been following it
>>>>>> closely. Nice work ;)
>>>>>>
>>>>>> Though, in the case of element-wise matrix operations like tanh,
>>>>>> there is no need for extra allocations, since the buffer should be
>>>>>> allocated only once.
>>>>>>
>>>>>> From your first code snippet: is Julia smart enough to pre-compute
>>>>>> i*N/2?
>>>>>> In such cases, creating a kind of array view on the original data
>>>>>> would probably be faster, right? (Though I don't know how
>>>>>> allocations work here.)
>>>>>>
>>>>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for
>>>>>> known operations such as the trigonometric ones, which benefit a
>>>>>> lot from multi-threading.
>>>>>> I know this is a hack, but it is quick to implement and brings an
>>>>>> amazing speed-up (8x in the case of the code I posted above).
>>>>>>
>>>>>> ------------------------------------------
>>>>>> Carlos
>>>>>>
>>>>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <
>>>>>> tobias...@googlemail.com> wrote:
>>>>>>
>>>>>>> Hi Carlos,
>>>>>>>
>>>>>>> I am working on something that will allow multithreading of Julia
>>>>>>> functions (https://github.com/JuliaLang/julia/pull/6741).
>>>>>>> Implementing vectorize_1arg_openmp is actually a lot less trivial,
>>>>>>> as the Julia runtime is not thread-safe (yet).
>>>>>>>
>>>>>>> Your example is great. I first got a 10x slowdown because the
>>>>>>> example revealed a locking issue. With a little trick I now get a
>>>>>>> speedup of 1.75 on a 2-core machine. Not too bad, taking into
>>>>>>> account that memory allocation cannot be parallelized.
>>>>>>>
>>>>>>> The tweaked code looks like:
>>>>>>>
>>>>>>> function tanh_core(x, y, i)
>>>>>>>     N = length(x)
>>>>>>>     half = div(N, 2)  # integer division; N/2 would give a non-integer index
>>>>>>>     for l = 1:half
>>>>>>>         y[l + i*half] = tanh(x[l + i*half])
>>>>>>>     end
>>>>>>> end
>>>>>>>
>>>>>>> function ptanh(x; numthreads = 2)
>>>>>>>     y = similar(x)
>>>>>>>     parapply(tanh_core, (x, y), 0:1, numthreads = numthreads)
>>>>>>>     y
>>>>>>> end
>>>>>>>
>>>>>>> I actually want this to also be fast for:
>>>>>>>
>>>>>>> function tanh_core(x, y, i)
>>>>>>>     y[i] = tanh(x[i])
>>>>>>> end
>>>>>>>
>>>>>>> function ptanh(x; numthreads = 2)
>>>>>>>     y = similar(x)
>>>>>>>     N = length(x)
>>>>>>>     parapply(tanh_core, (x, y), 1:N, numthreads = numthreads)
>>>>>>>     y
>>>>>>> end
>>>>>>>
>>>>>>> On Sunday, May 18, 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>>>>
>>>>>>>> Now that I think about it, maybe OpenBLAS has nothing to do with
>>>>>>>> this, since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>>>>
>>>>>>>> If that's the case, wouldn't it be advantageous to have a
>>>>>>>> vectorize_1arg_openmp() function (defined in C/C++) that works
>>>>>>>> for element-wise operations on scalar arrays, multi-threading
>>>>>>>> them with OpenMP?
>>>>>>>>
>>>>>>>> On Sunday, May 18, 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>>>>
>>>>>>>>> Forgot to add versioninfo():
>>>>>>>>>
>>>>>>>>> julia> versioninfo()
>>>>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>>>>> Platform Info:
>>>>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>>>>   CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
>>>>>>>>>   WORD_SIZE: 64
>>>>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>>>>   LAPACK: libopenblas
>>>>>>>>>   LIBM: libopenlibm
>>>>>>>>>
>>>>>>>>> On Sunday, May 18, 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>>>>
>>>>>>>>>> This is probably related to OpenBLAS, but it seems that tanh()
>>>>>>>>>> is not multi-threaded, which prevents a considerable speed
>>>>>>>>>> improvement. For example, MATLAB does multi-thread it and gets
>>>>>>>>>> around a 3x speed-up over the single-threaded version.
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> x = rand(100000, 200);
>>>>>>>>>> @time y = tanh(x);
>>>>>>>>>>
>>>>>>>>>> yields:
>>>>>>>>>> - 0.71 sec in Julia
>>>>>>>>>> - 0.76 sec in MATLAB with -singleCompThread
>>>>>>>>>> - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>>>>>
>>>>>>>>>> The good news is that Julia (with OpenBLAS) is competitive with
>>>>>>>>>> the single-threaded MATLAB version, though setting the
>>>>>>>>>> environment variable OPENBLAS_NUM_THREADS has no effect on the
>>>>>>>>>> timings, nor do I see higher CPU usage in 'top'.
>>>>>>>>>>
>>>>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am
>>>>>>>>>> I missing?
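The chunking scheme discussed in the thread can be illustrated without the PR itself. Below is a minimal, single-process Julia sketch of the same indexing: `div` gives an integer half-length, and array views replace the manual `i*N/2` offset arithmetic that Carlos asked about. The names `tanh_chunk!` and `ptanh_sketch` are hypothetical, and a real `parapply` would run one chunk per thread instead of this serial loop:

```julia
# Hypothetical sketch of the chunked element-wise map. Each chunk is a
# contiguous slice of the array; views keep the inner loop free of
# explicit offset arithmetic.
function tanh_chunk!(ychunk, xchunk)
    for l in eachindex(xchunk)
        ychunk[l] = tanh(xchunk[l])
    end
    ychunk
end

function ptanh_sketch(x; nchunks = 2)
    y = similar(x)
    N = length(x)
    len = div(N, nchunks)            # assumes nchunks divides N evenly
    for i in 0:nchunks-1             # parapply would map this range onto threads
        r = (i*len + 1):((i + 1)*len)
        tanh_chunk!(view(y, r), view(x, r))
    end
    y
end
```

Splitting into contiguous chunks like this also keeps each thread's writes on separate cache lines for large arrays, which is why the per-chunk form in the thread scales better than a naive per-element dispatch.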