The new scheduler is still far from complete, but there is enough in place that we can start to get a sense for its performance characteristics. I've done a simulation of making message-passing calls into a layout task and have some preliminary numbers.

The new scheduler is a work-stealing scheduler. It generally executes tasks greedily, so when it sends a message and sees that the other task is already waiting to receive, it will context switch directly to the other task and process the message, deferring the running task to a work-stealing deque (where it may be stolen by other threads). The scheduler does not actually do multithreading yet, but it does have the parallel work queue in place. What I've tested is the single-threaded scenario where the script task can context switch directly to layout. This won't be a perfect test because in the full scheduler you might have other threads that can steal the dormant script task and e.g. run the GC, but I think we're in the ballpark of the overall cost model.
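The greedy send path described above can be sketched as a small state machine. This is a toy model, not the scheduler's actual code: the names `Scheduler`, `Next`, and `on_send` are illustrative, and the real deque holds `~Task` pointers rather than names.

```rust
use std::collections::VecDeque;

// Toy model of the greedy send path: if the receiver is already blocked
// waiting, the running sender is deferred to the work queue (where another
// thread could steal it) and the scheduler switches into the receiver.
#[derive(Debug, PartialEq)]
enum Next {
    KeepRunning,      // receiver not waiting: leave the message, carry on
    SwitchToReceiver, // receiver was blocked: context switch into it
}

struct Scheduler {
    work_queue: VecDeque<&'static str>, // deferred tasks, by name here
}

impl Scheduler {
    fn on_send(&mut self, sender: &'static str, receiver_blocked: bool) -> Next {
        if receiver_blocked {
            // Defer the running sender so other threads may steal it.
            self.work_queue.push_back(sender);
            Next::SwitchToReceiver
        } else {
            Next::KeepRunning
        }
    }
}
```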

This test makes 50000 calls to a no-op layout function, either through a task or as a direct function call, then calculates the average time. The cross-task call amounts to two send/recv pairs. The source is here: https://github.com/brson/rust-sched-bench/blob/master/coroutine-call.rs. Requires my 'io' branch of Rust.
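The shape of the timing loop is roughly the following (a re-creation, not the linked source; `layout_noop` and the structure here are assumptions):

```rust
use std::time::Instant;

const ITERS: u32 = 50_000;

// Stands in for the no-op layout function in the benchmark.
fn layout_noop(x: u64) -> u64 { x }

// Time ITERS direct calls and report the average in nanoseconds per call.
fn avg_ns_per_call() -> u64 {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..ITERS as u64 {
        acc = acc.wrapping_add(layout_noop(i));
    }
    // Keep `acc` observable so the calls are not optimized away entirely.
    std::hint::black_box(acc);
    start.elapsed().as_nanos() as u64 / ITERS as u64
}
```

The cross-task variant replaces the direct call with a send/recv round trip through the layout task, but the averaging is the same.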

# The numbers

function call ns per call:                     2
new scheduler cross-task call ns per call:  2217
old scheduler cross-task call ns per call: 14737

The [profile] is encouraging in that most of the hottest functions are not the synchronization and context switching functions that I expect to be our limiting factors, but this performance is probably still on the order of what we can ultimately expect.

[profile]: https://gist.github.com/brson/5605807

Various observations:

* The accounting we do on the global heap, just atomic inc/dec pairs, has a huge cost. This is the only thing the Rust global_heap::malloc/free functions do before deferring to libc via fast_ffi, and these are the two functions at the top of the profile. We're going to have to turn this off eventually.

* A large amount of time is spent in malloc/free. There are two main offenders here: one is the uv idle callback used in the scheduler, which is trivially optimizable, and the other is the pipes allocation (they allocate on send). I suspect that the protocol between script and layout is bounded, so we can use a non-allocating message abstraction here (this could be an argument for keeping the pipes compiler, but we'll have to think about it more).
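A non-allocating, bounded message slot could look something like the sketch below. The buffer is allocated once and reused for every send, instead of allocating per send as pipes do today. The names (`Slot`, `send`, `try_recv`) are illustrative, and a mutex stands in for the real swap-based synchronization:

```rust
use std::sync::Mutex;

// One reusable slot: send() moves the payload in place, try_recv() moves
// it out. No per-message allocation occurs after the slot is created.
struct Slot<T> {
    inner: Mutex<Option<T>>,
}

impl<T> Slot<T> {
    fn new() -> Slot<T> {
        Slot { inner: Mutex::new(None) }
    }

    // Returns the previous payload if the slot was still full, i.e. the
    // bounded protocol was violated by sending twice without a receive.
    fn send(&self, msg: T) -> Option<T> {
        self.inner.lock().unwrap().replace(msg)
    }

    fn try_recv(&self) -> Option<T> {
        self.inner.lock().unwrap().take()
    }
}
```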

* The time hitting thread-local storage will be minimized once we add split stacks and can avoid the pthread_getspecific FFI call by using the TCB directly. There are some other hotspots relating to FFI calls around TLS, like `rust_get_tls_key`, which doesn't need to exist.

* The `run_cleanup_job` function is one I've been worried about and it shows up high in the profile. This function executes a single command after every context switch on behalf of the previous context and contains a virtual call.

* The `shim` functions will go away with pending FFI changes.

* `try_recv` is not fully optimized and currently always costs 2 context switches and an atomic swap with full fence. This can be optimized in a way that would remove two context switches from one of the receives here.
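The swap-with-full-fence handshake can be modeled like this. It is a deliberately simplified sketch (the real packet holds a payload and a `~Task` pointer, not just a state word): both sides swap their state into the packet, and whoever swaps second sees the other side's state and knows how to proceed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EMPTY: usize = 0;
const DATA_READY: usize = 1;
const RECEIVER_BLOCKED: usize = 2;

struct Packet {
    state: AtomicUsize,
}

impl Packet {
    fn new() -> Packet {
        Packet { state: AtomicUsize::new(EMPTY) }
    }

    // Sender announces the payload; the swap's return value tells it
    // whether the receiver is blocked and must be switched to / woken.
    fn send(&self) -> bool {
        self.state.swap(DATA_READY, Ordering::SeqCst) == RECEIVER_BLOCKED
    }

    // Receiver parks itself; the swap's return value tells it whether
    // data already arrived (true) or it must block (false).
    fn try_recv(&self) -> bool {
        self.state.swap(RECEIVER_BLOCKED, Ordering::SeqCst) == DATA_READY
    }
}
```

`Ordering::SeqCst` plays the role of the full fence here; the optimization mentioned above is about avoiding the extra context switches around this swap, not the swap itself.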

* `send` and `try_recv::anon` are both dominated by atomic swaps.

* The work queue here is implemented with a lock, which in this test is never contested. The final work queue will be a Chase/Lev lock free deque. The costs will be different.
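The current lock-based queue has the shape below (a sketch, not the scheduler's code): the owner pushes and pops at one end, thieves steal from the other. A Chase/Lev deque keeps this interface but removes the mutex from the owner's fast path.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct WorkQueue<T> {
    deque: Mutex<VecDeque<T>>,
}

impl<T> WorkQueue<T> {
    fn new() -> WorkQueue<T> {
        WorkQueue { deque: Mutex::new(VecDeque::new()) }
    }
    // Owner side: LIFO for locality.
    fn push(&self, t: T) {
        self.deque.lock().unwrap().push_back(t);
    }
    fn pop(&self) -> Option<T> {
        self.deque.lock().unwrap().pop_back()
    }
    // Thief side: FIFO, stealing the oldest work.
    fn steal(&self) -> Option<T> {
        self.deque.lock().unwrap().pop_front()
    }
}
```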

* The scheduler doesn't yet have any facility for waking sleeping schedulers. This will have some performance impact on the receive path.

Finally, for the curious I've listed here the exact sequence of events that happens in a single call to layout. I've left out the calls to thread-local storage because I don't think they are particularly illuminating.

# The request

* script task allocates a new 'oneshot' pipe that layout will use to send the response.
* script task allocates a new buffer as part of the 'stream' send.
* script task places the message payload into the pipe buffer.
* script task does an atomic swap with full fence to take the ~Task pointer of the blocked layout task from the 'stream'.
* script task context switches to the layout task.
* layout task (on behalf of the scheduler) pushes script task onto the work queue, here taking and dropping an uncontested mutex.
* layout task (on behalf of the scheduler) schedules an event loop idle callback to check up on the work queue later.
* layout task takes the message payload.
* layout task performs layout.

# The response

* layout task places the response message payload into the 'oneshot' buffer sent by script.
* layout does an atomic swap with full fence to check the status of the script task and sees that it is not waiting for the response message (it's in the work queue).
* layout goes to wait for the next message from script...
* layout context switches to the scheduler to recv the next request.
* layout (in scheduler context) does an atomic swap with full fence to place its ~Task pointer into the pipe buffer and discover that no message is available. Layout is now blocked.
* the scheduler drops back into the uv event loop.
* the scheduler wakes up later in response to the idle callback.
* the scheduler takes the work queue lock.
* the scheduler takes the script task out of the work queue.
* the scheduler drops the work queue lock.
* the scheduler context switches to the script task.
* the script task (as part of the unoptimized try_recv) context switches *back* to scheduler context to recv.
* the script task (in scheduler context) does an atomic swap with full fence and sees that the response payload is available.
* the script task context switches back to its own context and takes the response payload.
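The whole round trip above can be modeled with std channels as an analogue: script sends a request together with a oneshot for the reply, and layout answers on the oneshot. (The real code uses pipes and direct context switches between green tasks, not OS threads, so this only shows the message shape, not the costs.)

```rust
use std::sync::mpsc;
use std::thread;

// Script-side round trip: send (payload, reply channel), block on the reply.
fn round_trip(n: u32) -> u32 {
    let (req_tx, req_rx) = mpsc::channel::<(u32, mpsc::Sender<u32>)>();

    // "layout task": take each payload, perform (no-op) layout, respond.
    let layout = thread::spawn(move || {
        while let Ok((msg, reply_tx)) = req_rx.recv() {
            let _ = reply_tx.send(msg);
        }
    });

    let (reply_tx, reply_rx) = mpsc::channel();
    req_tx.send((n, reply_tx)).unwrap();
    let out = reply_rx.recv().unwrap();

    drop(req_tx); // close the request stream so layout's loop exits
    layout.join().unwrap();
    out
}
```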



_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev