Great to see that Tobias' PR rocks ;) I am still getting a weird segfault, and I cannot reproduce it in a simpler piece of code. I will keep working on it and post it as soon as I nail it down.
Tobias: any pointers towards possible incompatibilities in the current state of the PR? Thanks.

------------------------------------------
Carlos


On Sun, May 18, 2014 at 5:26 PM, Tobias Knopp <tobias.kn...@googlemail.com> wrote:

> And I am pretty excited that it seems to scale so well on your setup. I
> have only 2 cores, so I could not see whether it scales to more cores.
>
> On Sunday, May 18, 2014 16:40:18 UTC+2, Tobias Knopp wrote:
>
>> Well, when I started I got segfaults all the time :-)
>>
>> Could you please send me a minimal code example that segfaults? That
>> would be great! It is the only way we can get this stable.
>>
>> On Sunday, May 18, 2014 16:35:47 UTC+2, Carlos Becker wrote:
>>>
>>> Sounds great!
>>> I just gave it a try, and with 16 threads I get 0.07 sec, which is
>>> impressive.
>>>
>>> That was when I tried it in isolated code. When put together with
>>> other Julia code I have, it segfaults. Have you experienced this as
>>> well?
>>> On May 18, 2014 16:05, "Tobias Knopp" <tobias...@googlemail.com>
>>> wrote:
>>>
>>>> Sure. Note that the function is Base.parapply, though; I had
>>>> explicitly imported it.
>>>>
>>>> In the case of vectorize_1arg it would be great to automatically
>>>> parallelize comprehensions. If someone could tell me where the actual
>>>> looping happens, that would be great. I have not found it yet; it
>>>> seems to be somewhere in the parser.
>>>>
>>>> On Sunday, May 18, 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>>>>
>>>>> By the way, does the code you just sent work as-is with your pull
>>>>> request branch?
>>>>>
>>>>> ------------------------------------------
>>>>> Carlos
>>>>>
>>>>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tobias, I saw your pull request and have been following it
>>>>>> closely. Nice work ;)
>>>>>>
>>>>>> Though, in the case of element-wise matrix operations like tanh,
>>>>>> there is no need for extra allocations, since the buffer should be
>>>>>> allocated only once.
>>>>>>
>>>>>> From your first code snippet: is Julia smart enough to pre-compute
>>>>>> i*N/2?
>>>>>> In such cases, creating a kind of array view on the original data
>>>>>> would probably be faster, right? (Though I don't know how
>>>>>> allocations work here.)
>>>>>>
>>>>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for
>>>>>> known operations such as the trigonometric ones, which benefit a
>>>>>> lot from multi-threading.
>>>>>> I know this is a hack, but it is quick to implement and brings an
>>>>>> amazing speed-up (8x in the case of the code I posted above).
>>>>>>
>>>>>> ------------------------------------------
>>>>>> Carlos
>>>>>>
>>>>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <
>>>>>> tobias...@googlemail.com> wrote:
>>>>>>
>>>>>>> Hi Carlos,
>>>>>>>
>>>>>>> I am working on something that will allow multithreading of Julia
>>>>>>> functions (https://github.com/JuliaLang/julia/pull/6741).
>>>>>>> Implementing vectorize_1arg_openmp is actually a lot less trivial,
>>>>>>> as the Julia runtime is not thread-safe (yet).
>>>>>>>
>>>>>>> Your example is great. I first got a 10x slowdown because the
>>>>>>> example revealed a locking issue. With a little trick I now get a
>>>>>>> speedup of 1.75 on a 2-core machine. Not too bad, taking into
>>>>>>> account that memory allocation cannot be parallelized.
>>>>>>>
>>>>>>> The tweaked code looks like:
>>>>>>>
>>>>>>> function tanh_core(x, y, i)
>>>>>>>     N = length(x)
>>>>>>>     half = div(N, 2)  # integer division; N/2 would give a non-integer index
>>>>>>>     for l = 1:half
>>>>>>>         y[l + i*half] = tanh(x[l + i*half])
>>>>>>>     end
>>>>>>> end
>>>>>>>
>>>>>>> function ptanh(x; numthreads = 2)
>>>>>>>     y = similar(x)
>>>>>>>     parapply(tanh_core, (x, y), 0:1, numthreads = numthreads)
>>>>>>>     y
>>>>>>> end
>>>>>>>
>>>>>>> I actually want this to also be fast for:
>>>>>>>
>>>>>>> function tanh_core(x, y, i)
>>>>>>>     y[i] = tanh(x[i])
>>>>>>> end
>>>>>>>
>>>>>>> function ptanh(x; numthreads = 2)
>>>>>>>     y = similar(x)
>>>>>>>     N = length(x)
>>>>>>>     parapply(tanh_core, (x, y), 1:N, numthreads = numthreads)
>>>>>>>     y
>>>>>>> end
>>>>>>>
>>>>>>> On Sunday, May 18, 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>>>>
>>>>>>>> Now that I think about it, maybe OpenBLAS has nothing to do with
>>>>>>>> this, since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>>>>
>>>>>>>> If that's the case, wouldn't it be advantageous to have a
>>>>>>>> vectorize_1arg_openmp() function (defined in C/C++) that works
>>>>>>>> for element-wise operations on scalar arrays, multi-threading
>>>>>>>> them with OpenMP?
>>>>>>>>
>>>>>>>> On Sunday, May 18, 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>>>>
>>>>>>>>> Forgot to add versioninfo():
>>>>>>>>>
>>>>>>>>> julia> versioninfo()
>>>>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>>>>> Platform Info:
>>>>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>>>>   CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
>>>>>>>>>   WORD_SIZE: 64
>>>>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>>>>   LAPACK: libopenblas
>>>>>>>>>   LIBM: libopenlibm
>>>>>>>>>
>>>>>>>>> On Sunday, May 18, 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>>>>
>>>>>>>>>> This is probably related to OpenBLAS, but it seems that tanh()
>>>>>>>>>> is not multi-threaded, which prevents a considerable speed
>>>>>>>>>> improvement. For example, MATLAB does multi-thread it and gets
>>>>>>>>>> around a 3x speed-up over the single-threaded version.
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> x = rand(100000, 200);
>>>>>>>>>> @time y = tanh(x);
>>>>>>>>>>
>>>>>>>>>> yields:
>>>>>>>>>> - 0.71 sec in Julia
>>>>>>>>>> - 0.76 sec in MATLAB with -singleCompThread
>>>>>>>>>> - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>>>>>
>>>>>>>>>> The good news is that Julia (with OpenBLAS) is competitive with
>>>>>>>>>> the single-threaded MATLAB version, though setting the
>>>>>>>>>> environment variable OPENBLAS_NUM_THREADS has no effect on the
>>>>>>>>>> timings, nor do I see higher CPU usage in 'top'.
>>>>>>>>>>
>>>>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am
>>>>>>>>>> I missing?
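The chunking scheme discussed in the thread can be illustrated without the PR itself. Below is a minimal, single-process Julia sketch of the same indexing: `div` gives an integer half-length, and array views replace the manual `i*N/2` offset arithmetic that Carlos asked about. The names `tanh_chunk!` and `ptanh_sketch` are hypothetical, and a real `parapply` would run one chunk per thread instead of this serial loop:

```julia
# Hypothetical sketch of the chunked element-wise map. Each chunk is a
# contiguous slice of the array; views keep the inner loop free of
# explicit offset arithmetic.
function tanh_chunk!(ychunk, xchunk)
    for l in eachindex(xchunk)
        ychunk[l] = tanh(xchunk[l])
    end
    ychunk
end

function ptanh_sketch(x; nchunks = 2)
    y = similar(x)
    N = length(x)
    len = div(N, nchunks)            # assumes nchunks divides N evenly
    for i in 0:nchunks-1             # parapply would map this range onto threads
        r = (i*len + 1):((i + 1)*len)
        tanh_chunk!(view(y, r), view(x, r))
    end
    y
end
```

Splitting into contiguous chunks like this also keeps each thread's writes on separate cache lines for large arrays, which is why the per-chunk form in the thread scales better than a naive per-element dispatch.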