The new scheduler is still far from complete, but there is enough in place that we can start to get a sense for its performance characteristics. I've done a simulation of making message-passing calls into a layout task and have some preliminary numbers.

The new scheduler is a work-stealing scheduler. It generally executes tasks greedily, so when it sends a message and sees that the other task is already waiting to receive, it will context switch directly to the other task and process the message, deferring the running task to a work-stealing deque (where it may be stolen by other threads). The scheduler does not actually do multithreading yet, but it does have the parallel work queue in place. What I've tested is the single-threaded scenario where the script task can context switch directly to layout. This won't be a perfect test because in the full scheduler you might have other threads that can steal the dormant script task and e.g. run the GC, but I think we're in the ballpark of the overall cost model.
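The greedy send path described above can be sketched as a small state machine. This is a toy model, not the scheduler's actual code: the names `Scheduler`, `Next`, and `on_send` are illustrative, and the real deque holds `~Task` pointers rather than names.

```rust
use std::collections::VecDeque;

// Toy model of the greedy send path: if the receiver is already blocked
// waiting, the running sender is deferred to the work queue (where another
// thread could steal it) and the scheduler switches into the receiver.
#[derive(Debug, PartialEq)]
enum Next {
    KeepRunning,      // receiver not waiting: leave the message, carry on
    SwitchToReceiver, // receiver was blocked: context switch into it
}

struct Scheduler {
    work_queue: VecDeque<&'static str>, // deferred tasks, by name here
}

impl Scheduler {
    fn on_send(&mut self, sender: &'static str, receiver_blocked: bool) -> Next {
        if receiver_blocked {
            // Defer the running sender so other threads may steal it.
            self.work_queue.push_back(sender);
            Next::SwitchToReceiver
        } else {
            Next::KeepRunning
        }
    }
}
```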

This test makes 50000 calls to a no-op layout function, either through a task or as a direct function call, then calculates the average time. The cross-task call amounts to two send/recv pairs. The source is here: https://github.com/brson/rust-sched-bench/blob/master/coroutine-call.rs. Requires my 'io' branch of Rust.
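The shape of the timing loop is roughly the following (a re-creation, not the linked source; `layout_noop` and the structure here are assumptions):

```rust
use std::time::Instant;

const ITERS: u32 = 50_000;

// Stands in for the no-op layout function in the benchmark.
fn layout_noop(x: u64) -> u64 { x }

// Time ITERS direct calls and report the average in nanoseconds per call.
fn avg_ns_per_call() -> u64 {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..ITERS as u64 {
        acc = acc.wrapping_add(layout_noop(i));
    }
    // Keep `acc` observable so the calls are not optimized away entirely.
    std::hint::black_box(acc);
    start.elapsed().as_nanos() as u64 / ITERS as u64
}
```

The cross-task variant replaces the direct call with a send/recv round trip through the layout task, but the averaging is the same.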

# The numbers

function call ns per call:                     2
new scheduler cross-task call ns per call:  2217
old scheduler cross-task call ns per call: 14737

The [profile] is encouraging in that most of the hottest functions are not the synchronization and context switching functions that I expect to be our limiting factors, but this performance is probably still on the order of what we can ultimately expect.

[profile]: https://gist.github.com/brson/5605807

Various observations:

* The accounting we do on the global heap, just atomic inc/dec pairs, has a huge cost. This is the only thing the Rust global_heap::malloc/free functions do before deferring to libc via fast_ffi, and these are the two functions at the top of the profile. We're going to have to turn this off eventually.

* A large amount of time is spent in malloc/free. There are two main offenders here: one is the uv idle callback used in the scheduler, which is trivially optimizable, and the other is the pipes allocation (they allocate on send). I suspect that the protocol between script and layout is bounded, so we can use a non-allocating message abstraction here (this could be an argument for keeping the pipes compiler, but we'll have to think about it more).
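A non-allocating, bounded message slot could look something like the sketch below. The buffer is allocated once and reused for every send, instead of allocating per send as pipes do today. The names (`Slot`, `send`, `try_recv`) are illustrative, and a mutex stands in for the real swap-based synchronization:

```rust
use std::sync::Mutex;

// One reusable slot: send() moves the payload in place, try_recv() moves
// it out. No per-message allocation occurs after the slot is created.
struct Slot<T> {
    inner: Mutex<Option<T>>,
}

impl<T> Slot<T> {
    fn new() -> Slot<T> {
        Slot { inner: Mutex::new(None) }
    }

    // Returns the previous payload if the slot was still full, i.e. the
    // bounded protocol was violated by sending twice without a receive.
    fn send(&self, msg: T) -> Option<T> {
        self.inner.lock().unwrap().replace(msg)
    }

    fn try_recv(&self) -> Option<T> {
        self.inner.lock().unwrap().take()
    }
}
```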

* The time hitting thread-local storage will be minimized once we add split stacks and can avoid the pthread_getspecific FFI call by using the TCB directly. There are some other hotspots relating to FFI calls around TLS, like `rust_get_tls_key`, which doesn't need to exist.

* The `run_cleanup_job` function is one I've been worried about and it shows up high in the profile. This function executes a single command after every context switch on behalf of the previous context and contains a virtual call.

* The `shim` functions will go away with pending FFI changes.

* `try_recv` is not fully optimized and currently always costs 2 context switches and an atomic swap with full fence. This can be optimized in a way that would remove two context switches from one of the receives here.
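The swap-with-full-fence handshake can be modeled like this. It is a deliberately simplified sketch (the real packet holds a payload and a `~Task` pointer, not just a state word): both sides swap their state into the packet, and whoever swaps second sees the other side's state and knows how to proceed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EMPTY: usize = 0;
const DATA_READY: usize = 1;
const RECEIVER_BLOCKED: usize = 2;

struct Packet {
    state: AtomicUsize,
}

impl Packet {
    fn new() -> Packet {
        Packet { state: AtomicUsize::new(EMPTY) }
    }

    // Sender announces the payload; the swap's return value tells it
    // whether the receiver is blocked and must be switched to / woken.
    fn send(&self) -> bool {
        self.state.swap(DATA_READY, Ordering::SeqCst) == RECEIVER_BLOCKED
    }

    // Receiver parks itself; the swap's return value tells it whether
    // data already arrived (true) or it must block (false).
    fn try_recv(&self) -> bool {
        self.state.swap(RECEIVER_BLOCKED, Ordering::SeqCst) == DATA_READY
    }
}
```

`Ordering::SeqCst` plays the role of the full fence here; the optimization mentioned above is about avoiding the extra context switches around this swap, not the swap itself.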

* `send` and `try_recv::anon` are both dominated by atomic swaps.

* The work queue here is implemented with a lock, which in this test is never contested. The final work queue will be a Chase/Lev lock free deque. The costs will be different.
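The current lock-based queue has the shape below (a sketch, not the scheduler's code): the owner pushes and pops at one end, thieves steal from the other. A Chase/Lev deque keeps this interface but removes the mutex from the owner's fast path.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct WorkQueue<T> {
    deque: Mutex<VecDeque<T>>,
}

impl<T> WorkQueue<T> {
    fn new() -> WorkQueue<T> {
        WorkQueue { deque: Mutex::new(VecDeque::new()) }
    }
    // Owner side: LIFO for locality.
    fn push(&self, t: T) {
        self.deque.lock().unwrap().push_back(t);
    }
    fn pop(&self) -> Option<T> {
        self.deque.lock().unwrap().pop_back()
    }
    // Thief side: FIFO, stealing the oldest work.
    fn steal(&self) -> Option<T> {
        self.deque.lock().unwrap().pop_front()
    }
}
```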

* The scheduler doesn't yet have any facility for waking sleeping schedulers. This will have some performance impact on the receive path.

Finally, for the curious I've listed here the exact sequence of events that happens in a single call to layout. I've left out the calls to thread-local storage because I don't think they are particularly illuminating.

# The request

* script task allocates a new 'oneshot' pipe that layout will use to send the response.
* script task allocates a new buffer as part of the 'stream' send.
* script task places the message payload into the pipe buffer.
* script task does an atomic swap with full fence to take the ~Task pointer of the blocked layout task from the 'stream'.
* script task context switches to the layout task.
* layout task (on behalf of the scheduler) pushes script task onto the work queue, here taking and dropping an uncontested mutex.
* layout task (on behalf of the scheduler) schedules an event loop idle callback to check up on the work queue later.
* layout task takes the message payload.
* layout task performs layout.

# The response

* layout task places the response message payload into the 'oneshot' buffer sent by script.
* layout does an atomic swap with full fence to check the status of the script task and sees that it is not waiting for the response message (it's in the work queue).
* layout goes to wait for the next message from script...
* layout context switches to the scheduler to recv the next request.
* layout (in scheduler context) does an atomic swap with full fence to place its ~Task pointer into the pipe buffer and discover that no message is available. Layout is now blocked.
* the scheduler drops back into the uv event loop.
* the scheduler wakes up later in response to the idle callback.
* the scheduler takes the work queue lock.
* the scheduler takes the script task out of the work queue.
* the scheduler drops the work queue lock.
* the scheduler context switches to the script task.
* the script task (as part of the unoptimized try_recv) context switches *back* to scheduler context to recv.
* the script task (in scheduler context) does an atomic swap with full fence and sees that the response payload is available.
* the script task context switches back to its own context and takes the response payload.
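The whole round trip above can be modeled with std channels as an analogue: script sends a request together with a oneshot for the reply, and layout answers on the oneshot. (The real code uses pipes and direct context switches between green tasks, not OS threads, so this only shows the message shape, not the costs.)

```rust
use std::sync::mpsc;
use std::thread;

// Script-side round trip: send (payload, reply channel), block on the reply.
fn round_trip(n: u32) -> u32 {
    let (req_tx, req_rx) = mpsc::channel::<(u32, mpsc::Sender<u32>)>();

    // "layout task": take each payload, perform (no-op) layout, respond.
    let layout = thread::spawn(move || {
        while let Ok((msg, reply_tx)) = req_rx.recv() {
            let _ = reply_tx.send(msg);
        }
    });

    let (reply_tx, reply_rx) = mpsc::channel();
    req_tx.send((n, reply_tx)).unwrap();
    let out = reply_rx.recv().unwrap();

    drop(req_tx); // close the request stream so layout's loop exits
    layout.join().unwrap();
    out
}
```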



_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev