@Alogani

> Well you can have both in the same program but I have never seen a cooperative scheduling system produce parallelism.
What I mean is that you can use stackful coroutines/fibers AND a multithreading runtime to have parallelism.

Now there seems to be a misalignment about what "cooperative" means. I use cooperative as in: you yield explicitly. Given that only the OS, or code cooperating with the OS (via signals), can preempt execution, most schedulers are cooperative.

> Of course, stackless coroutines are more efficient on both memory usage and CPU cost, but only stackful coroutines allow suspending and resuming execution at arbitrary depth, if I am not wrong.

You're not wrong, but this is not a problem; I mention it here: <https://github.com/mratsim/weave-io-research/blob/master/design/design_1_coroutines.md>

> Coroutines come in flavors: asymmetric vs symmetric. An asymmetric coroutine can only hand control back to its current owner (caller or resumer). They have asymmetric power. A symmetric coroutine can switch to any other coroutine, not only its caller/resumer. Asymmetric and symmetric can be implemented in terms of each other (with overhead). The difference is in the ergonomics we want by default.

Unless you're doing manual scheduling Lua-style, coroutines are usually managed by a scheduler, and that scheduler can simulate jumping anywhere in the stack. In practice, jumping anywhere in the stack breaks debugging and exceptions.

@elcritch

> One thing that always seemed better about stackful coroutines vs async futures would be much fewer allocations. Every async call that has any state incurs allocation overhead.

That's not a fundamental problem but an engineering one. Rust tried hard to have zero-cost futures with no allocation, but their futures are incompatible with the completion model of IO from io_uring and Windows IOCP (IO Completion Ports) because:

1. Rust futures are aligned with a readiness-based model, where the OS signals that data is ready and Rust then fetches it, copying from a kernel buffer to a user-space buffer.
2. To enable OS zero-copy, the completion model instead requires a buffer from the user up front, which introduces lifetime issues that must be solved by heap allocation.
3. The OS copies data directly into that buffer and signals completion.

Hence the best futures can do is a single allocation in the general case. Constantine's threadpool and the future Weave-IO have this: <https://github.com/mratsim/weave-io/blob/8672f1c/weave_io/crossthread/tasks_flowvars.nim#L14-L52>

Everything is intrusive:

* The task has an intrusive linked-list field that can be used in MPSC task queues.
* The ownership state is intrusive, to synchronize handover between producers and consumers and to deallocate the task. (Some tasks have no result and thus no consumer, so producers must deallocate them; others have a consumer.)
* The result of the future is intrusive.

And in the optimal case, if we have a guarantee that a task cannot escape, i.e. that a task and its descendants are finished before we exit a scope (also called structured parallelism), we can reach zero allocations with the compiler's help by allocating the task on the stack. This optimization is often called "heap allocation elision"; see coroutine heap allocation elision: <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html>.

Note: this is not possible with fibers/stackful coroutines, because they mmap their own stack, making them unsuitable for implementing iterators / higher-order functions like <https://godbolt.org/g/26viuZ>

@Alogani

> I have been working hard to address some issues in my NimGo library. I have implemented custom stacktrace management in debug mode to allow for better debugging: in case of a raise, it is possible to see, in a nested way, where the error occurs and who the callers were.
>
> The author of the paper you gave is the creator of the implementation of stackless coroutines in C++, so it is logical that he thinks his model is better.
> However, the cost of function coloring is IMHO just too high for non-performance-sensitive usage. I have also found this paper by one of his peers that disagrees with him; you might find it interesting: <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0866r0.pdf>

I'm aware of the paper. My main issue with fibers in Nim is the handling of exceptions; I'm not sure how it works with `try / except` and `try / finally`.

Do note that all the languages that tried fibers (Java 0.1, Go and Rust) abandoned them, also because of segmented stacks:

* Rust: <https://web.archive.org/web/20131107015411/https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html>
* Go: <https://docs.google.com/document/d/1wAaf1rYoM4S4gtnPh0zOlGzWtrZFQ5suE8qr2sD8uWQ/pub>

In any case, it's good that you made headway regarding exceptions; IIRC <https://nim-lang.org/docs/coro.html> was not compatible with exceptions.

* * *

You might also want to read through the research, design and implementation ideas that I stored in <https://github.com/mratsim/weave-io-research>.

While Go does not have user-visible coloring, it certainly has calling-convention coloring, meaning you cannot call Go functions from C directly. The granddaddy of multithreading, Cilk, also had this limitation, and even generated two versions of each function: one with a custom stack to implement continuation stealing, and one with the normal ABI that can be called externally. Dealing with that calling-convention coloring is extremely annoying; see <https://words.filippo.io/rustgo/>
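To make the function-coloring point concrete: Python generators behave like stackless coroutines, and a suspension point must be threaded through every intermediate frame with `yield from`. An inner plain function cannot suspend its callers. A minimal sketch (all names are illustrative):

```python
# Stackless coroutines (Python generators): a suspension must be
# propagated through every frame with `yield from` -- this is
# "function coloring" in miniature.

def leaf():
    # The only way to suspend here is to be a generator itself...
    yield "suspended at depth 2"
    return "leaf done"

def middle():
    # ...and every caller on the path must be "colored" too.
    result = yield from leaf()
    return f"middle saw: {result}"

def root():
    result = yield from middle()
    yield f"root saw: {result}"

g = root()
print(next(g))  # -> suspended at depth 2
print(next(g))  # -> root saw: middle saw: leaf done
```

A stackful coroutine/fiber, by contrast, could suspend from inside `leaf` even if `middle` and `root` were ordinary uncolored functions, because it owns its entire stack. That is exactly the ergonomic trade-off the two papers above argue about.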
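The readiness-vs-completion distinction can also be sketched in a few lines. This is a hypothetical illustration using a local socketpair: the readiness half uses the stdlib `selectors` module (epoll/kqueue style), while the completion half only *simulates* the io_uring/IOCP shape with `recv_into`, since the Python stdlib has no real completion-based IO:

```python
# Readiness model vs completion model, sketched with a socketpair.
import selectors
import socket

a, b = socket.socketpair()
b.send(b"hello")

# --- Readiness model (epoll/kqueue style) ---
# 1. Ask the OS to tell us when `a` is readable.
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)
sel.select(timeout=1)       # OS: "data is ready"
# 2. Only now do we fetch it: the copy from the kernel buffer into a
#    freshly created user buffer happens here, after readiness.
data = a.recv(1024)
print(data)                 # b'hello'
sel.unregister(a)

# --- Completion model (io_uring / IOCP style), simulated ---
# The caller hands a buffer to the OS *up front*; it must stay alive
# until the operation completes -- the lifetime problem that forces
# heap allocation in Rust futures.
buf = bytearray(1024)       # buffer provided before the IO happens
b.send(b"world")
n = a.recv_into(buf)        # the OS fills our buffer directly
print(bytes(buf[:n]))       # b'world'

a.close()
b.close()
```

The key difference is *when* the user buffer enters the picture: after readiness in the first half, before submission in the second.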
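And for the single-allocation, everything-intrusive task layout: the idea is that the queue link, the ownership state, and the result slot are all fields of the one task object, instead of three separate allocations (queue node + synchronization state + future result). A hypothetical Python sketch, only loosely mirroring the real lock-free Nim code linked above (a `Lock` stands in for the MPSC protocol):

```python
# Single-allocation future: queue link, ownership state and result
# slot are all intrusive fields of the one Task object.
from threading import Lock

class Task:
    __slots__ = ("fn", "next", "state", "result")
    def __init__(self, fn):
        self.fn = fn
        self.next = None        # intrusive MPSC queue link
        self.state = "pending"  # intrusive ownership/handover state
        self.result = None      # intrusive result slot of the future

class IntrusiveQueue:
    # Tasks themselves form the linked list: pushing allocates nothing.
    def __init__(self):
        self.head = None
        self.tail = None
        self.lock = Lock()      # stand-in for the lock-free protocol

    def push(self, task):
        with self.lock:
            if self.tail is None:
                self.head = self.tail = task
            else:
                self.tail.next = task
                self.tail = task

    def pop(self):
        with self.lock:
            task = self.head
            if task is not None:
                self.head = task.next
                if self.head is None:
                    self.tail = None
                task.next = None
            return task

q = IntrusiveQueue()
t = Task(lambda: 21 * 2)
q.push(t)                  # no extra queue node allocated

worker = q.pop()           # consumer side dequeues the task itself
worker.result = worker.fn()
worker.state = "completed"

print(t.state, t.result)   # completed 42
```

In the real thing, `state` is an atomic that also decides which side (producer or consumer) frees the task; with structured parallelism and heap-allocation elision, `Task` could live on the stack and the count drops to zero allocations.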