@mratsim:

> Very nice analysis.

Thank you.

> Nim threadpool actually does this. And it should be the other way around, channels are the building blocks and you build Flowvars on top. And you also need a Channel type that is aware of async frameworks to bridge the async world and the sync/multithreaded world. In Weave, Flowvars are just a ptr Channel.

Difference of opinion: Go seems to do it the way you describe, Haskell builds everything on MVars, and F# has both async and task parallelism via the Task type it gets from .NET, including the Flowvar concept, but with those Tasks able to work together with async. F# doesn't have built-in channels, so if required one would have to build them from primitives. My opinion could be changed if the version of Flowvars I propose truly can't work with a version of async, but I'd like to see an implementation along the lines of the F# research first.

> Did you look into how they implement data parallelism? If they use something similar to OpenMP they have to pack the inner body of the parallel for into a closure that can be distributed to other threads.

I spent about a week fighting task parallelism, looking into why it was so slow, but I didn't have an immediate need for data parallelism, so I didn't look into it much. However, I see that much of the compilation time is spent doing various data-flow analyses, so the final way of doing parallelism depends on the particular case. Regarding data parallelism, the documentation states that it isn't guaranteed that each inner element will be handled by a separate thread, so it seems that some grouping may be taking place, and internal thread pools may be in use as well; otherwise the feature would be almost useless. The documentation for data parallelism shows parallel operations being carried out on individual array elements, so it isn't taking just the naive approach.
All parallelism seems to be implemented using the pthread library directly rather than through OpenMP; however, the compiler does internally generate some sort of closure or closure emulation (remember that neither Chapel nor low-level C has real closures), where the required parameters/captured values are passed in as arguments to a function that forms the body of the background task. It seems to be reasonably intelligent about how it does this for data parallelism, so the overheads are greatly reduced. My interest was more in low-level task parallelism, as I had an application in which I wanted to control execution order, and the reasonably simple way I found to do that was to assign tasks to a thread pool and control their order with sync-variable queues.

> There is no reason to have such a high overhead especially in high performance computing. OpenMP is way way way lower.

I agree, and I'm still not 100% sure where all the time is going, just that the loss of time went away when I implemented thread pools. Creating individual pthreads directly is on the order of ten times faster. That said, with its fancy data-flow analysis Chapel sometimes surprises: when I ran a little toy low-level test with extremely simple tasks, it was able to give the expected boost in multi-threaded performance (a multiplier of the number of real cores) while doing almost no work per task, and I think it somehow saw that the work was negligible/trivial and automatically grouped the tasks into thread pools. My actual application was much more complex, so it couldn't see that an automatically generated thread pool was what was needed; hence my fix of manually generating thread pools.

> As an example it takes 0.33 ns to do 4 additions on a vectorized 3GHz CPU, i.e. you can do 12 additions per ns, so 12M additions per millisecond, i.e. with your figures, you would need to work on matrices of size 3000x4000 for parallelization to start being useful, that seems way too large.

As per my discussion above and the documentation examples, Chapel is very much specialized for parallel array/matrix manipulation, so it almost certainly does something in data parallelism such that one would never experience these delays, even for matrices much smaller than you describe.

> Intel TBB also target tasks in the 10000 cycles range which would translate to 3.33 µs.

Chapel uses pthread mutexes and monitors to implement sync variables (as does Nim), so much of the delay for task parallelism using a thread pool, when a work queue/channel and result queue/channel are in use, will be whatever execution delays those impose, which is likely the same as for Intel TBB or any language using these mechanisms. That is more than adequate when the actual task "chunk" takes about one millisecond (or much less, down to say 100 microseconds), and I could increase the chunk to about 50 milliseconds if I wanted, so the overhead would be minuscule. My main lament about using Chapel in this regard is that it has neither a "channels" library (it seems it used to have one, as I see references to it in online searches) nor a "threadpool" library, either of which would likely have made my coding work a little easier.
