@mratsim:

> Very nice analysis.

Thank you.

> Nim threadpool actually does this. And it should be the other way around, 
> channels are the building blocks and you build Flowvars on top. And you also 
> need a Channel type that is aware of async frameworks to bridge the async 
> world and the sync/multithreaded world. In Weave, Flowvars are just a ptr 
> Channel.

This is a difference of opinion: Golang does it the way you describe, Haskell 
builds everything on MVars, and F# has both async and task parallelism via the 
Task type it gets from .NET (which includes the FlowVar concept), with those 
Tasks able to interoperate with Async. F# doesn't have built-in channels, so if 
they were required one would have to build them from primitives.

My opinion could change if the version of FlowVars I propose truly can't work 
with a version of async, but I'd first like to see an implementation along the 
lines of the F# research.

> Did you look into how they implement data parallelism? If they use something 
> similar to OpenMP they have to pack the inner body of the parallel for into a 
> closure that can be distributed to other threads.

I spent about a week fighting task parallelism while looking into why it was so 
slow, but I didn't have an immediate need for data parallelism, so I didn't 
look into it much. However, I see that much of the compilation time is spent on 
various data-flow analyses, so the final way parallelism is implemented depends 
on the particular case. Regarding data parallelism, the documentation states 
that each inner element is not guaranteed to be handled by a separate thread, 
so it seems some grouping is taking place, and internal thread pools may be in 
use as well; otherwise the feature would be almost useless. The documentation 
for data parallelism shows parallel operations written on an individual 
array-element basis, so the implementation clearly isn't taking the naive 
one-task-per-element approach.
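I can only guess at Chapel's actual grouping strategy, but the general idea, 
sketched below in Nim with the standard threadpool module (the chunking policy 
and names are my assumptions, not Chapel's), is to split the element range into 
one chunk per core and spawn a task per chunk rather than per element:

```nim
# Sketch of element grouping for a data-parallel loop: one task per
# chunk of the range, not one per element. Compile with --threads:on.
import std/[cpuinfo, threadpool]

proc processChunk(a: ptr UncheckedArray[float], lo, hi: int) =
  for i in lo .. hi:
    a[i] = a[i] * 2.0          # stand-in for the loop body

proc parallelDouble(data: var seq[float]) =
  let nChunks = max(1, countProcessors())
  let p = cast[ptr UncheckedArray[float]](addr data[0])
  let step = (data.len + nChunks - 1) div nChunks
  for c in 0 ..< nChunks:
    let lo = c * step
    let hi = min(lo + step, data.len) - 1
    if lo <= hi:
      spawn processChunk(p, lo, hi)
  sync()                       # wait for all chunk tasks to finish

var a = newSeq[float](1_000_000)
parallelDouble(a)
```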

All parallelism seems to be implemented using the pthread library directly 
rather than through OpenMP; however, the compiler internally generates some 
sort of closure emulation (remember that neither Chapel nor low-level C has 
real closures), in which the required parameters/captured values are passed as 
arguments to the function that forms the body of the background task. It seems 
reasonably intelligent about how it does that for data parallelism, so the 
overheads are greatly reduced.
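That closure emulation corresponds roughly to the following Nim sketch over 
raw threads: like pthread_create, Nim's createThread takes a single argument, 
so the "captured" values are packed into a plain record (the names here are 
illustrative, not Chapel's generated code):

```nim
# Closure emulation without real closures: pack the captured values
# into a record and pass it to the thread entry point explicitly.
# Compile with --threads:on.
type TaskArgs = object         # hypothetical capture record
  lo, hi: int
  data: ptr UncheckedArray[float]

proc taskBody(args: TaskArgs) {.thread.} =
  # The body only sees what was explicitly packed into `args`.
  for i in args.lo .. args.hi:
    args.data[i] += 1.0

var buf = newSeq[float](100)
let p = cast[ptr UncheckedArray[float]](addr buf[0])
var worker: Thread[TaskArgs]
createThread(worker, taskBody, TaskArgs(lo: 0, hi: buf.high, data: p))
joinThread(worker)
```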

My interest was more in low-level task parallelism, as I had an application in 
which I wanted to control execution order, and the reasonably simple way I 
found to do that was to assign tasks to a thread pool and control their order 
with sync-variable queues.
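Translated into Nim terms, that pattern looks roughly like this hedged sketch, 
with the built-in Channel standing in for the sync-variable queues (the 
sentinel shutdown and pool size are my own choices):

```nim
# A fixed pool of workers pulling from a shared work queue, so that
# enqueue order controls execution order. Compile with --threads:on.
type Task = object
  id: int                      # id < 0 is a hypothetical stop sentinel

var
  workQ: Channel[Task]
  doneQ: Channel[int]
  workers: array[4, Thread[void]]

proc worker() {.thread.} =
  while true:
    let t = workQ.recv()       # blocks until a task is queued
    if t.id < 0: break         # sentinel: shut this worker down
    doneQ.send(t.id * t.id)    # stand-in for the real task body

workQ.open()
doneQ.open()
for w in workers.mitems: createThread(w, worker)
for i in 1 .. 16: workQ.send(Task(id: i))    # enqueue in desired order
for _ in 1 .. 16: discard doneQ.recv()       # gather the results
for _ in workers: workQ.send(Task(id: -1))   # one sentinel per worker
for w in workers.mitems: joinThread(w)
workQ.close()
doneQ.close()
```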

> There is no reason to have such a high overhead especially in high 
> performance computing. OpenMP is way way way lower.

I agree, and I'm still not 100% sure where all the time is going, just that the 
lost time went away when I implemented thread pools. Creating individual 
pthreads directly is on the order of ten times faster than spawning Chapel 
tasks. That said, with its fancy data-flow analysis Chapel sometimes surprises: 
when I ran a little low-level toy test with extremely simple tasks, it was able 
to deliver the expected multi-threaded boost (a speedup of about the number of 
real cores) while doing almost no work per task. I think it somehow saw that 
the work was negligible/trivial and automatically grouped the tasks into thread 
pools.

My actual application was much more complex, so it couldn't see that an 
automatically generated thread pool was what was needed; hence my fix of 
generating thread pools manually.

> As an example, it takes 0.33 ns to do 4 additions on a vectorized 3 GHz CPU, 
> i.e. you can do 12 additions per ns, so 12M additions per millisecond; with 
> your figures, you would need to work on matrices of size 3000x4000 for 
> parallelization to start being useful, which seems way too large.

As per my discussion above and the documentation examples, Chapel is very much 
specialized for parallel array/matrix manipulation, so it almost certainly does 
something in data parallelism such that one would never experience these 
delays, even for much smaller matrices than you describe.

> Intel TBB also targets tasks in the 10000-cycle range, which would translate 
> to 3.33 µs.

Chapel uses pthread mutexes and condition variables (monitors) to implement 
sync variables (as does Nim), so much of the delay for task parallelism through 
a thread pool, when a work queue/channel and a result queue/channel are used, 
will be whatever execution delays those impose, which is likely the same as for 
Intel TBB or any language using these mechanisms. That is more than adequate 
when the actual task "chunk" takes about one millisecond (or much less, down to 
say 100 microseconds), and I could increase the chunk to as much as about 50 
milliseconds if I wanted, so the overhead would be minuscule.
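To make that mechanism concrete, here is a hedged Nim sketch of a Chapel-style 
sync variable built from exactly those primitives, a mutex plus condition 
variables; the empty/full semantics mirror Chapel's readFE/writeEF, but this is 
not Chapel's actual implementation:

```nim
# A sync variable as a monitor: a value guarded by a lock plus two
# condition variables tracking the empty/full state.
import std/locks

type SyncVar[T] = object
  lock: Lock
  becameFull, becameEmpty: Cond
  full: bool
  value: T

proc init[T](sv: var SyncVar[T]) =
  initLock(sv.lock)
  initCond(sv.becameFull)
  initCond(sv.becameEmpty)

proc writeEF[T](sv: var SyncVar[T], v: T) =
  ## Write when empty, leave full; blocks while the variable is full.
  acquire(sv.lock)
  while sv.full:
    wait(sv.becameEmpty, sv.lock)
  sv.value = v
  sv.full = true
  signal(sv.becameFull)        # wake one blocked reader
  release(sv.lock)

proc readFE[T](sv: var SyncVar[T]): T =
  ## Read when full, leave empty; blocks while the variable is empty.
  acquire(sv.lock)
  while not sv.full:
    wait(sv.becameFull, sv.lock)
  result = sv.value
  sv.full = false
  signal(sv.becameEmpty)       # wake one blocked writer
  release(sv.lock)
```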

My main lament about using Chapel in this regard is that it has neither a 
"channels" library (it seems it used to have one, as I see references to it in 
online searches) nor a "threadpool" library, either of which would likely have 
made my coding work a little easier.
