It's been a while since the RFC for the Picasso multithreading runtime ([https://github.com/nim-lang/RFCs/issues/160](https://github.com/nim-lang/RFCs/issues/160) / [https://forum.nim-lang.org/t/5083](https://forum.nim-lang.org/t/5083)).
The project now lives at [https://github.com/mratsim/weave](https://github.com/mratsim/weave). It's well tested on Linux with 32-bit and 64-bit CI, and also on ARM64, with Travis offering a whopping 32-core undisclosed ARM CPU. Windows is not supported yet; I'm only lacking a low-level wrapper for [Synchronization Barriers](https://docs.microsoft.com/en-us/windows/win32/sync/synchronization-barriers). OSX should work, but somehow it trips some assertions on Travis, so your mileage may vary.

Weave offers both task parallelism and data parallelism. The task parallelism API is similar to async/await on Futures, except that you call spawn/sync on Flowvars. The data parallelism API is similar to OpenMP. One important caveat: it doesn't support GC-ed types; you need to pass a pointer (there is an example with seq in the README) or use Nim channels.

There are a couple of low-level routines that may be of interest:

* Stackless/queueless depth-first and breadth-first iterators for binary trees stored in arrays: [https://github.com/mratsim/weave/blob/v0.1.0/weave/datatypes/binary_worker_trees.nim#L91-L161](https://github.com/mratsim/weave/blob/v0.1.0/weave/datatypes/binary_worker_trees.nim#L91-L161). They are fast and don't require any heap allocation.
* A threadsafe memory pool and look-aside buffer. They can efficiently handle the spawning of trillions of tasks within a couple of milliseconds, and they can release memory back to the OS. They might be of interest to game developers that need to handle billions of particles within a single frame. There are 3 markdown files that detail the memory challenges, with the README providing an overview of the solution: [https://github.com/mratsim/weave/tree/v0.1.0/weave/memory](https://github.com/mratsim/weave/tree/v0.1.0/weave/memory). The actual implementation is short, under 500 lines without comments, and is based on the state-of-the-art research behind Microsoft's mimalloc (the fastest general-purpose allocator) and snmalloc (a very fast message-passing-based allocator).
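To give a feel for the task-parallel API, here is a minimal sketch of the classic parallel Fibonacci using spawn/sync on a Flowvar. This is adapted from my reading of the README at v0.1.0, so treat the `init(Weave)`/`exit(Weave)` setup calls as illustrative and check the repository for the authoritative example:

```nim
# Minimal sketch of Weave's task-parallel API (spawn/sync on a Flowvar).
# Adapted from the v0.1.0 README; see the repository for the exact syntax.
import weave

proc fib(n: int): int =
  if n < 2:
    return n
  let x = spawn fib(n-1)   # may run on another worker; returns a Flowvar[int]
  let y = fib(n-2)         # computed on the current worker meanwhile
  result = sync(x) + y     # sync blocks until the Flowvar is resolved

init(Weave)                # start the runtime's worker threads
echo fib(20)               # prints 6765
exit(Weave)                # tear the runtime down
```

Note that, unlike async/await, `sync` on a Flowvar is a blocking structured join: the scheduler is free to run other tasks on the worker while it waits.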
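To illustrate why those tree iterators need neither a stack nor a queue: when a binary tree is stored level-order in an array, a node's children and parent are pure index arithmetic (children of `i` are `2*i+1` and `2*i+2`, parent is `(i-1) div 2`), so traversal is just plain loops over indices. A rough sketch of the idea (not Weave's actual code, which is more involved):

```nim
# Sketch only: implicit binary tree stored level-order in an array.
# Traversal needs no heap allocation, only index arithmetic.

iterator breadthFirst(numNodes: int): int =
  ## Level-order traversal of an implicit tree is a left-to-right scan.
  for i in 0 ..< numNodes:
    yield i

iterator leftSpine(start, numNodes: int): int =
  ## Descend along left children from `start` with arithmetic alone.
  var i = start
  while i < numNodes:
    yield i
    i = 2*i + 1
```

Because the structure is implicit in the indices, these iterators allocate nothing and are trivially cache-friendly, which is what makes them suitable inside a scheduler hot path.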
There are 10 benchmarks available that stress several aspects of the runtime, 8 of them being as fast as or much faster than established runtimes like Intel TBB or GCC/Clang/Intel OpenMP (the 2 slow ones being parallel reductions):

Name | Parallelism | Notable for stressing
---|---|---
Black & Scholes (Finance) | Data Parallelism |
DFS (Depth-First Search) | Task Parallelism | Scheduler overhead
Fibonacci | Task Parallelism | Scheduler overhead
Heat diffusion (Physics) | Task Parallelism |
Matrix Multiplication (Cache-Oblivious) | Task Parallelism |
Matrix Transposition | Nested Data Parallelism | Nested loops
Nqueens | Task Parallelism | Conditional parallelism
SPC (Single Task Producer) | Task Parallelism | Load balancing
