Your "one" function is not type-stable: f starts out as an Int (f=0) and
becomes a Float64 on the first addition. Judging from the allocation, two()
must also be boxing something, but I'm not sure why (this might be a Julia
bug). In any event, with all this allocation, the timings tell you nothing
about what the actual performance will be once the code is working properly.
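
As a rough illustration of the type-stability point, here is a minimal sketch
of a type-stable variant of one(). The name one_stable and the choice to pass
x in as an argument are assumptions made for this example, not part of the
original code. Running @code_warntype on the original one() should flag f as
changing type.

```julia
# Hypothetical rewrite of one(); the name and the argument are illustrative.
function one_stable(x::Vector{Float64})
    f = 0.0            # Float64 from the start, so f never changes type
    for xi in x
        f += xi
    end
    return f           # return the sum so the loop's result is actually used
end

x = randn(1000000)
one_stable(x)          # first call compiles the function
@time one_stable(x)    # second call times only the loop
```

Creating x outside the function also keeps the cost of randn out of the
timing, so the measurement reflects the loop alone.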

When you see that much allocation on a loop that shouldn't allocate, focus on 
eliminating it before worrying about anything else.
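
One way to check whether a loop allocates, sketched under the assumption that
a helper like the hypothetical sum_loop below stands in for your own code: warm
the function up first, then wrap the call in @allocated. Doing the measurement
inside a function keeps global variables out of the picture.

```julia
# Hypothetical stand-in for the loop being tested.
function sum_loop(x::Vector{Float64})
    f = 0.0
    for xi in x
        f += xi
    end
    return f
end

# Hypothetical helper: returns the bytes allocated by one call to sum_loop.
function bytes_allocated(x::Vector{Float64})
    sum_loop(x)                    # warm-up call, so compilation is not measured
    return @allocated sum_loop(x)  # bytes allocated by the second call
end

x = randn(1000000)
println("bytes allocated: ", bytes_allocated(x))
```

If the reported number grows with length(x), something inside the loop is
allocating on every iteration.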

Best,
--Tim

On Thursday, March 03, 2016 03:39:31 PM Erik Schnetter wrote:
> Tomas
> 
> In your example, different threads access nearby elements in the array
> f. They are likely to be in the same cache line, leading to much
> unnecessary inter-CPU communication. This is also called "false
> sharing". Leaving a gap between the elements used by each thread might
> speed things up.
> 
> Your example is also memory-bound, not compute-bound. Most of the time
> will be spent accessing the elements of x, not performing a
> computation. On a two-socket machine, you can expect at most a speedup
> of two.
> 
> Finally, function one is quite simple, and there's a good chance it
> will be vectorized and/or unrolled. I'm less certain about function
> two. If it is not, you lose a factor of between two and sixteen.
> 
> Distributing work over threads has an overhead. It might be better to
> use an example that has a non-trivial amount of work per thread, e.g.
> sqrt(sqrt(sqrt(...(sqrt(x)))), with ten or a hundred sqrt invocations.
> 
> -erik
> 
> On Thu, Mar 3, 2016 at 2:39 PM,  <[email protected]> wrote:
> > Hi All,
> > I would like to ask if anyone has experience with Threads as they are
> > implemented at the moment in the master branch.
> > After a successful compilation (with JULIA_THREADS=1 in Make.user),
> > I have played with different levels of granularity, but usually the code
> > was slower than, or about the same speed as, the single-threaded version.
> > I have even tried a totally stupid example like this
> > 
> > using Base.Threads;
> > function one()
> > 
> >   x=randn(1000000);
> >   f=0;
> >   for i in x
> >   
> >     f+=i;
> >   
> >   end
> > 
> > end
> > 
> > function two()
> > 
> >   x=randn(1000000);
> >   f=zeros(nthreads())
> >   @inbounds @threads for i in 1:length(x)
> >   
> >     f[threadid()]+=x[i];
> >   
> >   end
> >   sum(f)
> > 
> > end
> > 
> > one()
> > @time one()
> > 
> > two()
> > @time two()
> > 
> > and the times on my 2013 Macbook air were
> > 
> >   0.068617 seconds (2.00 M allocations: 38.157 MB, 9.72% gc time)
> >   0.394164 seconds (5.72 M allocations: 99.015 MB, 5.00% gc time)
> > 
> > Wow, that is quite poor. I would expect some overhead, but not this
> > much.
> > 
> > Can anyone suggest what is going wrong? I have tried a profiler, but
> > it does not help. It seems that it does not work with Threads at the
> > moment. Or is it because Threads are still not really supported?
> > 
> > I would like to get the speed-up shown in this video
> > https://www.youtube.com/watch?v=GvLhseZ4D8M
> > 
> > Any suggestions welcomed.
> > Tomas
