Replying mostly to your first reply @GordonBGood

> threadpool/spawn/FlowVar is a beautiful modern concept for multi-threading

Agreed

> So in a few hours I took my wrapper implementations and re-wrote my required
> parts of threadpool to not consider GC'ed data structures in about 150 lines
> (including the wrapper implementations), and that "just worked" without
> compiler magic or generic code limitations. In fact, this implementation is
> almost twice as fast as the original even when it worked and that is in spite
> of using atomically reference counted wrappers as a general memory management
> destructor control system; the newruntime's ownership on ref's model will be
> even faster and I think will work in this applications's use case as
> ownership doesn't need to be shared beyond the creating entities. I can post
> this code somewhere if anyone is interested.

That's also the approach I took in my proof-of-concept.

> As to "cycle stealing"/"load balancing", the threadpool/spawn model actually
> takes care of that as the underlying platform implementation of threads will
> have a scheduler that automatically "time-slices" across all the currently
> running threads from the threadpool; it is only necessary to provide
> tasks/work to these threads to be in big enough time units so as to be
> significantly larger than the overheads of doing the multi-threading.

Work-stealing / load balancing is needed. Most parallel tree algorithms lead to unbalanced thread loads. The industry standard benchmark to reproduce those issues is the Unbalanced Tree Search benchmark: [https://sourceforge.net/p/uts-benchmark/wiki/Home/](https://sourceforge.net/p/uts-benchmark/wiki/Home/)

> I won't be commenting on your Picasso RFC as, in my mind for everyday users
> and me, it is too complex, so I'll let those that perhaps have more
> sophisticated needs contribute to it and possibly implement it if it seems to
> fill their needs.

I think you misunderstood the RFC audience/section or I was not clear enough.

The API for library users is spawn/^; you can see 2 examples in my proof-of-concept:

* [Fibonacci](https://github.com/mratsim/weave/blob/master/e04_channel_based_work_stealing/tests/fib.nim#L17) (async will be renamed spawn and await will be renamed ^)
* [Single producer multi-consumer task loop](https://github.com/mratsim/weave/blob/master/e04_channel_based_work_stealing/tests/spc.nim#L80)

The rest of the RFC is about giving a full picture to the Nim community about all the blocks and considerations that should go into writing a multithreading runtime, i.e. those are implementation "details" (as much as I hate the expression).

Most of the sophistication is needed to support the whole range of usage: from naive usage by people who follow multithreading tutorials with pi/fibonacci examples, to unbalanced workloads, to high-performance computing. The simplest implementation with no scheduler moves the complexity from the library writer to the library user, who most likely is not an expert in efficient multithreading.

---

Also the current threadpool/spawn/^ does not work on the "Hello World!" of multithreading, which for better or worse is fibonacci (which is even sillier than the pi algorithm), see [https://github.com/nim-lang/Nim/issues/11922](https://github.com/nim-lang/Nim/issues/11922). Note that even GCC's version of OpenMP chokes on that benchmark because it uses a naive global task queue (like Nim's threadpool) that is a contention point, while LLVM's OpenMP uses work-stealing.
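
To make the global-queue vs work-stealing point a bit more concrete, here is a toy single-threaded simulation, written in Python only for brevity. It is a rough sketch under loud assumptions, not how Nim's threadpool, either OpenMP runtime, or my proof-of-concept is actually implemented: workers take turns in rounds, a task `fib(n)` just spawns `fib(n-1)` and `fib(n-2)` (no join, no result), and "contention" is approximated by counting how often a worker touches a structure shared with other workers. The names `fib_children`, `run_global_queue` and `run_work_stealing` are made up for this illustration.

```python
from collections import deque


def fib_children(n):
    """Child tasks spawned by the task fib(n); leaves (n < 2) spawn nothing."""
    return [] if n < 2 else [n - 1, n - 2]


def run_global_queue(n, workers):
    """All workers share one queue: every push and every pop is a shared access."""
    queue = deque([n])
    shared_accesses = 1  # pushing the root task
    executed = [0] * workers
    while queue:
        for w in range(workers):
            if not queue:
                break
            task = queue.popleft()
            shared_accesses += 1
            executed[w] += 1
            for child in fib_children(task):
                queue.append(child)
                shared_accesses += 1
    return executed, shared_accesses


def run_work_stealing(n, workers):
    """Each worker owns a deque; only steal attempts touch another worker's deque."""
    deques = [deque() for _ in range(workers)]
    deques[0].append(n)
    shared_accesses = 0
    executed = [0] * workers
    while any(deques):
        for w in range(workers):
            if deques[w]:
                task = deques[w].pop()  # own deque: newest task first
            else:
                shared_accesses += 1  # one steal attempt (assumption: counted once)
                victim = next(
                    (v for v in ((w + k) % workers for k in range(1, workers))
                     if deques[v]),
                    None,
                )
                if victim is None:
                    continue  # nothing to steal this round
                task = deques[victim].popleft()  # steal the oldest task
            executed[w] += 1
            deques[w].extend(fib_children(task))
    return executed, shared_accesses


if __name__ == "__main__":
    for name, run in (("global queue", run_global_queue),
                      ("work stealing", run_work_stealing)):
        executed, shared = run(15, 4)
        print(f"{name}: tasks per worker = {executed}, shared accesses = {shared}")
```

Both variants execute the same task tree. In this model the global queue pays one shared access per push and per pop, while the work-stealing variant only pays when a worker runs dry and has to steal. Real runtimes differ in many ways (actual threads, joins, lock-free deques, victim selection), so treat the numbers as an intuition pump, not a benchmark.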