Very nice analysis.

A couple of points:

> I like the idea of sync variables (or FlowVar's) being at the top of the 
> available features and the idea of building the channel library on top of 
> such a feature would seem to be a good one as it hides all the complexities 
> of mutexes and conditions behind it.

Nim's threadpool actually does this. But it should be the other way around: 
channels are the building blocks, and you build Flowvars on top. You also 
need a Channel type that is aware of async frameworks to bridge the async world 
and the sync/multithreaded world. In Weave, Flowvars are just a `ptr Channel` 
([https://github.com/mratsim/weave/blob/9f0c384f/weave/datatypes/flowvars.nim#L30-L50](https://github.com/mratsim/weave/blob/9f0c384f/weave/datatypes/flowvars.nim#L30-L50))
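
To make the layering concrete, here is a minimal sketch (not Weave's actual 
implementation) of a Flowvar as a thin wrapper over a one-shot channel, using 
Nim's built-in `Channel` type (requires `--threads:on`); the names `newFlowvar`, 
`complete` and `sync` are illustrative:

```nim
# A Flowvar built on top of a channel: the worker sends exactly one
# result, the consumer blocks on recv until it arrives.
type
  Flowvar[T] = object
    chan: ptr Channel[T]   # shared-heap channel carrying the result

proc newFlowvar[T](): Flowvar[T] =
  result.chan = cast[ptr Channel[T]](allocShared0(sizeof(Channel[T])))
  open(result.chan[])

proc complete[T](fv: Flowvar[T]; value: T) =
  fv.chan[].send(value)     # producer side: publish the result

proc sync[T](fv: Flowvar[T]): T =
  result = fv.chan[].recv() # consumer side: block until the result is ready
  close(fv.chan[])
  deallocShared(fv.chan)
```

This is single-producer/single-consumer and one-shot by construction, which is 
why a Flowvar can be so much cheaper than a general-purpose channel.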

> Along that line, Chapel has First Class Functions (FCF's) and lambdas 
> (anonymous functions) but neither of these are closures (capable of capturing 
> bindings external to their function blocks and/or input parameters). This 
> puts some severe limitations on things one would commonly do using closures 
> as one is forced to emulate closures using classes, which isn't always 
> possible as in the following discussion.

Did you look into how they implement data parallelism? If they use something 
similar to OpenMP, they have to pack the inner body of the parallel for into a 
closure that can be distributed to other threads.
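
For reference, this is roughly the transform an OpenMP-style runtime applies: 
the loop body is outlined into a closure over its captured variables, and each 
worker receives a sub-range. A sketch in Nim (executed serially here for 
clarity; a real runtime would hand each `LoopTask` to a different thread — 
`LoopTask` and `parallelFor` are made-up names, not any real API):

```nim
type LoopTask = object
  first, last: int
  body: proc (i: int)      # the outlined loop body, a closure

proc runChunk(t: LoopTask) =
  for i in t.first .. t.last:
    t.body(i)

proc parallelFor(n, nChunks: int; body: proc (i: int)) =
  # Split 0..<n into nChunks sub-ranges and run the body over each.
  let chunk = (n + nChunks - 1) div nChunks
  for c in 0 ..< nChunks:
    let first = c * chunk
    if first >= n: break
    runChunk(LoopTask(first: first, last: min(first + chunk, n) - 1, body: body))

# The body closes over `acc` — exactly the kind of capture that
# Chapel's non-closure FCFs/lambdas cannot express.
var acc = 0
parallelFor(10, 4, proc (i: int) = acc += i)
```

Without real closures, that capture of `acc` has to be emulated by hand, e.g. 
with a class holding the captured state.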

> The real reason for Chapel's existence is its support for multi "locale" 
> parallelism over many computing cores over many locations linked by a 
> specified communication protocol, which is well beyond Nim's goals. 
> Accordingly, Chapel has ways of specifying Array domain maps to where the 
> actual Array's or parts/slices of Array's are actually located across 
> possibly many "locales". Arrays are the main mechanism supported for 
> implementing "data parallelism". This use of multi-"locales" is beyond my 
> direct interest, although I can see that it could be of interest if one had 
> access to such a machine or group of machines, which most of us probably 
> never will.

I expect this is similar to Coarray Fortran 
([https://en.wikipedia.org/wiki/Coarray_Fortran](https://en.wikipedia.org/wiki/Coarray_Fortran)); 
cmacmackin wanted to implement something similar for Nim: 
[https://github.com/mratsim/Arraymancer/issues/419](https://github.com/mratsim/Arraymancer/issues/419)

I'm also interested in extending Weave to clusters, but I don't have access to 
such machines, so I can't work on that.

> One seeming disadvantage of the powerful possibility of distributed computing 
> as described above is that task/thread context switching/process starting are 
> slower than for the direct use of pthread (which are part of the means of 
> multi-threading used under-the-covers) and it takes something in the order of 
> 10 to 12 milliseconds per "task" at minimum. As what I was trying to 
> accomplish only took about one millisecond for the task work, the simple way 
> of implementing multi-threading meant that it was taking ten times as long to 
> execute as multi-threaded than not. The solution for me was to implement a 
> simple thread pool using the built-in sync variables (or I could have used 
> channels) which worked very well as the implementation of sync variables is 
> very efficient, taking an average of about 2.5 microseconds to process in and 
> out sync'ed queues.

There is no reason to have such high overhead, especially in high-performance 
computing; OpenMP's per-task overhead is way, way lower.

As an example, it takes 0.33 ns to do 4 additions on a vectorized 3 GHz CPU, 
i.e. you can do 12 additions per ns, or 12M additions per millisecond. So with 
your figures, you would need to work on matrices of size 3000x4000 for 
parallelization to start being useful; that seems way too large.

Intel TBB also targets tasks in the 10,000-cycle range, which translates to 
about 3.33 µs at 3 GHz.
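
Spelling the figures out (assuming a 3 GHz clock and one 4-wide SIMD add 
retired per cycle, as in the estimate above):

```nim
let
  ghzClock  = 3.0
  simdWidth = 4.0
  addsPerNs = ghzClock * simdWidth   # 12 additions per nanosecond
  addsPerMs = addsPerNs * 1e6        # 12 million additions per millisecond

# A 3000 x 4000 matrix has 12M elements: one element-wise pass over it
# takes about 1 ms, matching the quoted per-task work duration.
doAssert 3000 * 4000 == 12_000_000

# Intel TBB's ~10,000-cycle task granularity at 3 GHz, in microseconds:
let tbbTaskUs = 10_000.0 / (ghzClock * 1e3)   # ~3.33 µs
```

So a 10 ms scheduling overhead only amortizes over tasks thousands of times 
larger than what a well-tuned runtime targets.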
