It's been a while since the RFC for the Picasso multithreading runtime 
([RFC #160](https://github.com/nim-lang/RFCs/issues/160) / 
[forum thread](https://forum.nim-lang.org/t/5083)).

The project lives at 
[https://github.com/mratsim/weave](https://github.com/mratsim/weave)

It's well tested on Linux with 32-bit and 64-bit CI, and also on ARM64, where 
Travis offers a whopping 32-core undisclosed ARM CPU. Windows is not supported 
yet; I'm only lacking a low-level wrapper for [Synchronization 
Barriers](https://docs.microsoft.com/en-us/windows/win32/sync/synchronization-barriers).
 OSX should work, but it somehow trips some assertions on Travis, so your 
mileage may vary.

It offers both task parallelism and data parallelism. The task parallelism API 
is similar to async/await on Futures, except that you call spawn/sync on a 
Flowvar. The data parallelism API is similar to OpenMP.
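For illustration, here is a minimal task-parallelism sketch in the spirit of the README's fib example; it assumes the spawn/sync/Flowvar API described above and the init/exit runtime calls (a sketch, not a verbatim copy of the README):

```nim
import weave

proc fib(n: int): int =
  # Deliberately naive: every recursive call below spawns a task,
  # which is exactly what stresses the scheduler in the benchmarks.
  if n < 2:
    return n
  let x = spawn fib(n - 1)   # returns a Flowvar[int]
  let y = fib(n - 2)
  result = sync(x) + y       # blocks until the spawned task completes

proc main() =
  init(Weave)                # start the runtime
  let f = fib(20)
  exit(Weave)                # tear the runtime down
  echo f                     # 6765

main()
```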

One important caveat: it doesn't support GC-ed types. You need to pass a 
pointer (see the seq example in the README) or use Nim channels.
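As a hedged sketch of the pointer workaround (this mirrors, but is not, the README's seq example; the helper name and the cast are my own):

```nim
import weave

proc sum(buf: ptr UncheckedArray[int], len: int): int =
  for i in 0 ..< len:
    result += buf[i]

proc main() =
  var data = newSeq[int](100)
  for i in 0 ..< data.len:
    data[i] = i
  init(Weave)
  # A seq is GC-managed, so pass a raw pointer to its buffer instead.
  # The seq must stay alive until sync returns.
  let f = spawn sum(cast[ptr UncheckedArray[int]](data[0].addr), data.len)
  let total = sync(f)
  exit(Weave)
  echo total   # 4950

main()
```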

There are a couple of low-level routines that may be of interest:

  * Stackless/queueless depth-first and breadth-first iterators for binary 
trees stored in arrays: 
[https://github.com/mratsim/weave/blob/v0.1.0/weave/datatypes/binary_worker_trees.nim#L91-L161](https://github.com/mratsim/weave/blob/v0.1.0/weave/datatypes/binary_worker_trees.nim#L91-L161).
 They are fast and don't require any heap allocation.
  * A threadsafe memory pool and look-aside buffer. These can efficiently 
handle the spawning of trillions of tasks within a couple of milliseconds, and 
they can release memory back to the OS. They might be of interest to game 
developers who need to handle billions of particles within a single frame. 
Three markdown files detail the memory challenge, with the README providing an 
overview of the solution: 
[https://github.com/mratsim/weave/tree/v0.1.0/weave/memory](https://github.com/mratsim/weave/tree/v0.1.0/weave/memory).
 The actual implementation is short: under 500 lines excluding comments. It is 
based on the state-of-the-art research behind Microsoft's mimalloc (the fastest 
general-purpose allocator) and snmalloc (a very fast message-passing-based 
allocator).
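To give a feel for the first bullet, here is a hypothetical, self-contained sketch (my own code, not Weave's actual implementation) of a stackless preorder traversal of a binary tree stored in an array, with the children of node `i` at indices `2i+1` and `2i+2`. It uses only index arithmetic: no stack, no queue, no heap allocation beyond the result:

```nim
proc preorder(n: int): seq[int] =
  ## Visits the indices 0 ..< n of an array-stored binary tree
  ## in depth-first (preorder) order, using O(1) extra space.
  if n == 0: return
  var i = 0
  while true:
    result.add i
    if 2*i + 1 < n:            # descend to the left child
      i = 2*i + 1
    else:
      # Climb until we reach a left child (odd index) whose right
      # sibling exists, then step to that sibling; stop at the root.
      while true:
        if i == 0: return
        if i mod 2 == 1 and i + 1 < n:
          i.inc
          break
        i = (i - 1) div 2

echo preorder(6)   # @[0, 1, 3, 4, 2, 5]
```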

There are 10 benchmarks available that stress several aspects of the runtime, 
8 of them as fast as or much faster than established runtimes like Intel TBB 
or GCC/Clang/Intel OpenMP (the 2 slower ones being parallel reductions):

| Name | Parallelism | Notable for stressing |
|---|---|---|
| Black & Scholes (Finance) | Data parallelism | |
| DFS (Depth-First Search) | Task parallelism | Scheduler overhead |
| Fibonacci | Task parallelism | Scheduler overhead |
| Heat diffusion (Physics) | Task parallelism | |
| Matrix Multiplication (Cache-Oblivious) | Task parallelism | |
| Matrix Transposition | Nested data parallelism | Nested loops |
| Nqueens | Task parallelism | Conditional parallelism |
| SPC (Single Task Producer) | Task parallelism | Load balancing |
