Re: multithread/concurrency/parallel methods and performance

2018-02-19 Thread Dmitry Olshansky via Digitalmars-d-learn

On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
wrote:
The operation is trivial and the dataset is rather small. In such 
cases SIMD with e.g. array ops is the way to go:

results[] = values1[] * values2[];


Yes, absolutely right :)

I made a simple example to understand why the threads are not 
scaling the way I thought they would.


Yeah, the world is an ugly place where trivial math sometimes 
doesn't work.


I suggest to:
- run with different numbers of threads, from 1 to n
- vary sizes from 100k to 10m (see the sweep sketch below)
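(A sketch of such a sweep, assuming std.parallelism and a float
kernel like the one in the original post; the sizes, thread counts
and names are illustrative:)

import std.datetime.stopwatch : benchmark;
import std.parallelism : TaskPool;
import std.stdio : writefln;

void main()
{
    foreach (size; [100_000, 1_000_000, 10_000_000])
    {
        auto a = new float[size], b = new float[size], r = new float[size];
        a[] = 2.0f;
        b[] = 3.0f;
        foreach (nThreads; 1 .. 8)
        {
            // nThreads worker threads (the submitting thread helps too)
            auto pool = new TaskPool(nThreads);
            auto res = benchmark!({
                foreach (i, ref x; pool.parallel(r))
                    x = a[i] * b[i];
            })(10); // 10 repetitions per configuration
            pool.finish();
            writefln("size=%s threads=%s avg=%s", size, nThreads, res[0] / 10);
        }
    }
}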

For your numbers: 400ms / 64 is ~6ms; if we divide by the # of cores 
it's 6/7 ~ 0.86ms, which is a good deal smaller than a CPU timeslice.


In essence a single core runs fast b/c it doesn't wait for all the 
others to complete via join, easily burning its quota in one go. 
In MT I bet some of the overhead comes from not all threads finishing 
(and starting) at once, so the joins block in the kernel.


You could run your MT code with strace to see if it hits the 
futex call or some such; if it does, that's where you are getting 
delays. (That's assuming you are on Linux.)
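For example (the binary name is illustrative):

strace -f -c ./app

-f follows the worker threads and -c prints a per-syscall summary; a
large futex count there would point at threads blocking on
join/wakeup.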


The std.parallelism version is a bit faster b/c I think it caches the 
created thread pool, so you don't start threads anew on each run.
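(A minimal illustration, assuming the arrays from the original post.
std.parallelism's global taskPool is created lazily and kept alive, so
its worker threads are reused by every subsequent parallel foreach:)

import std.parallelism : taskPool;

void multiply(float[] results, const(float)[] values1, const(float)[] values2)
{
    // first use lazily creates the global pool; later calls reuse its threads
    foreach (i, ref r; taskPool.parallel(results))
        r = values1[i] * values2[i];
}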


I imagine that, if one core's work is done in 200ms, a 4-core 
version will be done in 50ms plus some overhead, since they are 
working on separate blocks of memory, without need of sync, and 
without false sharing, etc. (at least I think I don't have this 
problem here).


If you had a long queue of small tasks like that, and you don't 
wait to join all threads until absolutely required, you get near 
perfect scalability (unless hitting other bottlenecks like RAM).
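(A sketch of that pattern with std.parallelism tasks; the chunking and
names are illustrative. The point is that yieldForce is only called
once everything has been enqueued:)

import std.parallelism : task, taskPool;

void mulChunk(float[] r, const(float)[] a, const(float)[] b)
{
    r[] = a[] * b[];
}

void runAll(float[] results, const(float)[] v1, const(float)[] v2, size_t nChunks)
{
    auto chunkSize = results.length / nChunks;
    alias T = typeof(task!mulChunk(results, v1, v2));
    T[] pending;
    foreach (c; 0 .. nChunks)
    {
        auto lo = c * chunkSize;
        auto hi = (c == nChunks - 1) ? results.length : lo + chunkSize;
        auto t = task!mulChunk(results[lo .. hi], v1[lo .. hi], v2[lo .. hi]);
        taskPool.put(t); // enqueue without waiting
        pending ~= t;
    }
    foreach (t; pending) // join only once, at the very end
        t.yieldForce();
}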






Re: multithread/concurrency/parallel methods and performance

2018-02-19 Thread SrMordred via Digitalmars-d-learn
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
wrote:
The operation is trivial and the dataset is rather small. In such 
cases SIMD with e.g. array ops is the way to go:

results[] = values1[] * values2[];


Yes, absolutely right :)

I made a simple example to understand why the threads are not 
scaling the way I thought they would.
I imagine that, if one core's work is done in 200ms, a 4-core 
version will be done in 50ms plus some overhead, since they are 
working on separate blocks of memory, without need of sync, and 
without false sharing, etc. (at least I think I don't have this 
problem here).


Re: multithread/concurrency/parallel methods and performance

2018-02-19 Thread SrMordred via Digitalmars-d-learn
On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson 
wrote:
As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache 
for 32-bit) it may be that the concurrency overhead itself adds a 
significant share of the total time.


That 'concurrency overhead' is what I'm not getting.
Since the array is big, dividing it among 6 or 7 cores will not 
thrash L1, since the chunks are very far from each other, right? 
Or is L2 cache thrashing also a problem in this case?
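(For scale, assuming 4-byte elements: SIZE = 1024*1024 means each
array is 4 MiB, so a 7-way split gives each core a disjoint chunk of
roughly 600 KiB per array, far larger than a typical 32 KiB L1 or
256 KiB per-core L2 either way.)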



_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs



Yes, on my PC I was using -release.

Yet: 150ms for 1 core, 120-134ms for X cores. 
Shouldn't it be way faster? I'm trying to understand where the 
overhead is, and whether it is possible to get rid of it (perfect 
thread scaling).


Re: multithread/concurrency/parallel methods and performance

2018-02-18 Thread Dmitry Olshansky via Digitalmars-d-learn

On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:

I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this base work:

foreach(i ; 0 .. SIZE)
{
    results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn and 
Thread.


these were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )

All methods show only minor speedup gains. I was expecting a lot 
more.

Since I have 7 cores I expected something below 100ms.


The operation is trivial and the dataset is rather small. In such 
cases SIMD with e.g. array ops is the way to go:

results[] = values1[] * values2[];
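(A self-contained sketch of that version; the element type and init
values are assumed, not taken from the original code:)

import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

void main()
{
    enum SIZE = 1024 * 1024;
    auto values1 = new float[SIZE];
    auto values2 = new float[SIZE];
    auto results = new float[SIZE];
    values1[] = 2.0f;
    values2[] = 3.0f;

    // the array op lowers to one per-element loop that the
    // compiler/runtime can vectorize; no threads involved
    auto dur = benchmark!({ results[] = values1[] * values2[]; })(100);
    writeln(dur[0] / 100);
}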


Parallelism gets more interesting with more expensive 
operations. You may also try bigger sizes, or both.
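(One way to see scaling kick in is to make the per-element work
artificially heavier; an illustrative kernel, not from the original
post:)

import std.math : sqrt;
import std.parallelism : parallel;

void main()
{
    enum SIZE = 1024 * 1024;
    auto values1 = new double[SIZE];
    auto values2 = new double[SIZE];
    auto results = new double[SIZE];
    values1[] = 2.0;
    values2[] = 3.0;

    foreach (i, ref r; parallel(results))
    {
        double x = values1[i] * values2[i];
        foreach (_; 0 .. 100) // 100 dependent sqrts per element
            x = sqrt(x + 1.0);
        r = x;
    }
}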


I'm not seeing false sharing in this case, or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!





Re: multithread/concurrency/parallel methods and performance

2018-02-18 Thread Nicholas Wilson via Digitalmars-d-learn

On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:

I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this base work:

foreach(i ; 0 .. SIZE)
{
    results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn and 
Thread.


these were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )

All methods show only minor speedup gains. I was expecting a lot 
more.

Since I have 7 cores I expected something below 100ms.

I'm not seeing false sharing in this case, or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!


As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache 
for 32-bit) it may be that the concurrency overhead itself adds a 
significant share of the total time. Also, the run.dlang.io link 
has no -O flag and thus no optimisations.
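(For reference, a typical optimized build line, assuming dmd and a
source file named app.d:)

dmd -O -release -inline app.d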


without -O I get

_base : 323 ms, 92 μs, and 6 hnsecs
_parallel : 276 ms, 649 μs, and 3 hnsecs
_concurrency : 221 ms, 931 μs, and 7 hnsecs
_thread : 212 ms, 277 μs, and 3 hnsecs

with it I get
_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs

with SIZE = 16*1024*1024 without -O I get
_base : 5 secs, 835 ms, 240 μs, and 9 hnsecs
_parallel : 4 secs, 802 ms, 279 μs, and 8 hnsecs
_concurrency : 2 secs, 133 ms, 685 μs, and 3 hnsecs
_thread : 2 secs, 108 ms, 860 μs, and 9 hnsecs

with SIZE = 16*1024*1024 with -O I get
_base : 2 secs, 502 ms, 523 μs, and 4 hnsecs
_parallel : 1 sec, 769 ms, 945 μs, and 3 hnsecs
_concurrency : 1 sec, 362 ms, 747 μs, and 1 hnsec
_thread : 1 sec, 335 ms, 720 μs, and 1 hnsec




Re: multithread/concurrency/parallel methods and performance

2018-02-18 Thread Jordan Wilson via Digitalmars-d-learn

On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:

I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this base work:

foreach(i ; 0 .. SIZE)
{
    results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn and 
Thread.


these were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )

All methods show only minor speedup gains. I was expecting a lot 
more.

Since I have 7 cores I expected something below 100ms.

I'm not seeing false sharing in this case, or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!


It may be due to thread local storage:
https://tour.dlang.org/tour/en/multithreading/thread-local-storage
https://dlang.org/articles/migrate-to-shared.html

I'm not sure though, as I don't know how your results array is 
initialised.
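(A minimal sketch of what thread-local means in D: module-level
variables are per-thread unless marked shared or __gshared:)

import core.thread : thread_joinAll;
import std.concurrency : spawn;
import std.stdio : writeln;

int counter; // thread-local by default: every thread gets its own copy

void worker()
{
    counter = 42; // writes this thread's copy only
}

void main()
{
    spawn(&worker);
    thread_joinAll();
    writeln(counter); // prints 0: main's copy was never touched
}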


Jordan


multithread/concurrency/parallel methods and performance

2018-02-18 Thread SrMordred via Digitalmars-d-learn

I'm experimenting with threads and related topics recently.
(I've just started, so there may be some terrible mistakes here.)

With this base work:

foreach(i ; 0 .. SIZE)
{
    results[i] = values1[i] * values2[i];
}

and then with these 3 other methods: parallel, spawn and Thread.
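(The full code is behind the link below; as an illustration, a
hand-rolled Thread variant could look like this. The chunking and
names are mine, not the exact code from the post:)

import core.thread : Thread, thread_joinAll;

enum SIZE = 1024 * 1024;

void main()
{
    auto values1 = new float[SIZE];
    auto values2 = new float[SIZE];
    auto results = new float[SIZE];
    values1[] = 2.0f;
    values2[] = 3.0f;

    enum nThreads = 7;
    enum chunk = SIZE / nThreads;

    // helper so each thread captures its own lo/hi: D closures share the
    // enclosing stack frame, so loop-local variables would be reused
    void spawnChunk(size_t lo, size_t hi)
    {
        new Thread({
            foreach (i; lo .. hi)
                results[i] = values1[i] * values2[i];
        }).start();
    }

    foreach (t; 0 .. nThreads)
        spawnChunk(t * chunk, t == nThreads - 1 ? SIZE : (t + 1) * chunk);

    thread_joinAll(); // wait for every spawned thread
}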

these were my results:

_base : 456 ms and 479 us
_parallel : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread : 369 ms, 565 us, and 3 hnsecs

(code here : https://run.dlang.io/is/2pdmmk )

All methods show only minor speedup gains. I was expecting a lot more.
Since I have 7 cores I expected something below 100ms.

I'm not seeing false sharing in this case, or am I wrong?

If someone can expand on this, I'll be grateful.

Thanks!