On 11-05-25 01:53 PM, Graydon Hoare wrote:
Hi,

We had a brief discussion on IRC yesterday that ended in me storming off
in a very unprofessional manner. I'd like to publicly apologize for that
behavior; it was not cool and had little to do with the conversation at
hand. My stress level was very high coming into work yesterday, and I
was letting personal life spill into work life. My fault, sorry.

I'd also like to restart (or at least restate) parts of that discussion
here so we can actually get this worked out to everyone's satisfaction
(including Rafael, who clearly has strong feelings on the matter). This
will be a long rambly email full of back-story to set the stage; if you
have specific points to follow up on, just snip those parts out for your
replies.


Preface
~~~~~~~

We know (or at least, anyone who's poked at it knows) that the tasking
and threading model in rustboot was pretty unsatisfactory. It passed
some interesting tests ("millions of tasks, real cheap!") and had a
number of needs it was trying to meet -- it was not designed in
*complete* ignorance -- but it also imposed strange burdens that we'd
like to dispense with this time around. Hopefully without losing the
good stuff.

So, some background "goals" in order to make this story make sense. The
three primary design pressures are:

(1) Support *isolation*, in some useful sense, between tasks. That is, a
task should be able to reason locally about its data and code without
worrying about whether other tasks are mucking with that data and code
behind its back. With the exception of message-IO points and unsafe
blocks, which obviously involve the potential for non-isolated action.

(2) Support *lots* of tasks. Millions. Such that a programmer has no
fear about making "too many" tasks in a system, if it decomposes nicely.
If for no other reason than support for isolation-based local reasoning
(though concurrent, or interleaved, or truly parallel execution is also
nice to exploit whenever it's surfaced this way).

(3) Run at relatively high, but more importantly *predictable*,
performance. As few magical parts as possible in the concurrency model.
Take the M:N performance penalty if necessary to achieve the other two
goals, so long as there are no random performance discontinuities in the
model.

Concurrency model is intimately connected to memory model, unwinding,
GC, and several other things; so when I say we're going to be revisiting
design decisions in rustboot's concurrency model, I implicitly include
revisiting parts of the memory model (and its neighbors) as well.


The Past (rustboot)
~~~~~~~~~~~~~~~~~~~

The "lots of tasks" pressure breaks down into two sub-issues: making
tasks small (in the sense of memory) and making them independently
scheduled. We approached the "small" issue via growable stacks (doubling
vectors with pointer-rewriting) and a very large dose of ugly magic for
doing calls "between stacks" (from rust to C). This had lots of
unfortunate fallout: debuggers and tools got upset, calling back into
rust code from C was mostly impossible, and to support it safely we'd
need to be flushing pointers to the stack and re-reading them
*constantly*, much more than the "make sure values are pinned
somewhere" discipline necessary for GC. We approached the "scheduling"
issue with even *more* magic: return-address patching during suspended C
calls, and a custom cooperative scheduler.

The "isolation" pressure was approached by stratifying the heap memory
model into private and shared (between-tasks) memory, with the shared
stuff always immutable and acyclic. Sharing was not always possible --
not between threads and not between processes -- but between
tasks-in-a-thread it could work, and we figured that scenario was
valuable to users. Cheap sending of shared bits between tasks in a
thread. Then we'd do a deep copy when we hit domain boundaries.

But to support that scenario, tasks had to be pinned to threads. That
is, the concurrency scheme in rustboot involved tasks running within
domains (threads or processes; though the latter never materialized),
where the user explicitly constructed domains and injected threads into
the domain. Once spawned in a domain, a task could not leave it (be
migrated to another thread), because it might have pointers into the
"shared memory" of the domain. This pinning to domains has the
unfortunate performance characteristic of *requiring* a user to pay
attention to task:thread assignments; this could be a benefit in some
cases -- explicit control can be good -- but in many cases it seemed
bad, or at least over-complex. It's not just a matter of "having an M:N
scheduler in userspace" (which will, says the literature, always
underperform a 1:1 scheduler with the kernel involved) but also pinning
each individual task in M to a thread in N, such that one task blocking
(or even just monopolizing a core) could block or slow down a whole
group of tasks on the same thread. This is a usability hazard.


The Future (rustc)
~~~~~~~~~~~~~~~~~~

Rustboot is dead (yay!) and we're in the process of working through the
leftover cruft in the runtime and removing parts which were bad ideas
and/or only around to support rustboot's various limitations and design
choices. Rustc doesn't really *have* a tasking system yet -- there are
pieces of the communication layer, but eholk is just getting spawn
working this week -- so we're mostly doing "rewrite this subsystem" work
now. There's been a fair amount of disjointed conversation over this
matter; I'm hoping to consolidate what's on the agenda here.

The "lots of tasks" issue still has two sub-parts: size and scheduling.

We're going to approach "size" via "the other, more standard technique",
which is to use a linked list of stack segments and never move them. We
will most likely reuse the *exact* technique Go is using here, in the
sense of trying to be ABI compatible (at least insofar as this is
possible or makes sense) and possibly even use the same runtime support
library. This approach is easier for LLVM to cope with (there's a GSoC
student implementing it currently), and more tools understand it. It
also makes stack segments recyclable between tasks, which should reduce
overall memory pressure (equivalent to "shrinking" in our other model).
We're also going to use the same "unified" approach to growth and
cross-language calling as Go uses -- just grow into a "sufficiently big"
segment that may be recycled between tasks in between C calls -- and
that may well permit C to call back into rust (assuming it can provide a
task* and can be made to play nice with unwinding and GC; see below).
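
To make this concrete, here's roughly the shape of thing I have in mind
-- a made-up C sketch of a linked-list stack segment record, *not* the
actual Go or rustc runtime layout, just an illustration of "grow by
chaining, never move":

    /* Hypothetical segmented-stack record; field names are illustrative. */
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct stk_seg {
        struct stk_seg *prev;   /* segment we grew out of (NULL at the base) */
        struct stk_seg *next;   /* cached next segment, reusable after shrink */
        size_t size;            /* usable bytes in data[] */
        unsigned char data[];   /* stack space; never moves once allocated */
    } stk_seg;

    /* Grow by chaining a fresh segment onto the current one, rather than
     * reallocating and rewriting pointers, so existing frames stay put. */
    static stk_seg *
    stk_grow(stk_seg *current, size_t min_size)
    {
        stk_seg *seg = malloc(sizeof(stk_seg) + min_size);
        seg->prev = current;
        seg->next = NULL;
        seg->size = min_size;
        if (current)
            current->next = seg;
        return seg;
    }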

We're also going to approach "scheduling" via "the other, more standard
technique", which is to use the POSIX (and before that, System V)
<ucontext.h> schedulable user contexts and (sadly) again our own
scheduler. Where ucontext isn't OS-provided, we'll provide our own
implementation; it's not actually much more than a "save registers to
structure A and load them from structure B" routine anyway, just with a
standard API. And on some OSes -- specifically those where we discover
threads are sufficiently cheap, if running on small stacks -- we're
going to lean much more heavily on the OS thread scheduler. See below.
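
For anyone who hasn't played with <ucontext.h>: this is the entire
primitive we'd be building on. A minimal, self-contained C sketch (a
real scheduler juggles many contexts and its own stack segments, of
course):

    /* One task, one scheduler, switching back and forth via ucontext. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t scheduler_ctx, task_ctx;

    static void task_body(void)
    {
        printf("task: running\n");
        swapcontext(&task_ctx, &scheduler_ctx);  /* yield to the scheduler */
        printf("task: resumed, finishing\n");
    }   /* returning resumes uc_link, i.e. the scheduler */

    int main(void)
    {
        char *stack = malloc(64 * 1024);

        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = stack;
        task_ctx.uc_stack.ss_size = 64 * 1024;
        task_ctx.uc_link = &scheduler_ctx;       /* where to go on return */
        makecontext(&task_ctx, task_body, 0);

        swapcontext(&scheduler_ctx, &task_ctx);  /* run task until it yields */
        printf("scheduler: task yielded, resuming it\n");
        swapcontext(&scheduler_ctx, &task_ctx);  /* run task to completion */
        printf("scheduler: task finished\n");
        free(stack);
        return 0;
    }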

We're going to approach the "isolation" pressure differently. Rather
than permit tasks to share pointers at all, we'll be shifting to a
stratification of memory based on unique pointers. This will mean that
the only possible kinds of send are "move" and "deep copy". Move will
happen everywhere in-OS-process, deep-copy between processes. Move
semantics -- making a copy while indivisibly de-initializing the source
of the copy -- are going to be the focus of a fair bit of work over the
next while, and we're betting on them for the messaging/isolation system.
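
At the runtime level, "move" is nothing fancier than "hand over the
unique pointer and de-initialize the source"; a made-up C sketch of the
idea (illustrative names, not a real interface):

    #include <assert.h>
    #include <stddef.h>

    typedef struct msg { int value; } msg;   /* some uniquely-owned payload */

    /* Move: transfer ownership of the pointer and null out the source,
     * so no two tasks ever hold the same pointer at once. */
    static msg *
    move_out(msg **src)
    {
        msg *m = *src;
        *src = NULL;              /* de-initialize the source of the copy */
        return m;
    }

    /* An in-process channel send is then, roughly: */
    static void
    chan_send(msg **queue_slot, msg **src)
    {
        assert(*queue_slot == NULL);
        *queue_slot = move_out(src);   /* receiver owns it; sender's copy is gone */
    }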

Along with minimizing refcounting (and a host of other thorny semantic
issues associated with accidental copying, such as environment capture
and double-execution of destructors) this will permit tasks to migrate
between threads. Or, put more simply, it will permit us to treat threads
as an undifferentiated pool of N workers, and tasks as a pool of M work
units; when a thread blocks in C code it will have no effect on the
other tasks (other than temporarily tying up a "large segment" of stack),
and M>N runnable tasks should always "saturate" the threads (thus cores)
with work. Moreover, when we're on an OS that has "really cheap threads"
we can spin up N == M threads and fall into the case of 1:1 scheduling:
back off, let the OS kernel do all the scheduling, and have our scheduler
always say "oh, just keep running the same task you're running" every
time it checks at a yield point. On Linux, for example, I believe that
this may very well work better than getting in the way with our own
scheduler and ucontext logic. I'm not sure about other OSes, but I'd like
to retain this "dial" to be able to dial scheduling back into our
runtime if we're on an OS with horrendously bad or otherwise expensive
kernel threads.

In the process of this change, we'll eliminate the concept of a domain
from the language and runtime, and just model OS processes as OS
processes. We'll still need some runtime support for interacting with
subprocesses, of course, just avoid trying to mix metaphors.

Moving to a pool-of-threads model should also permit leaning on "the
other, more standard technique" for unwinding: the C++ unwinder (or a
large part of it). Since a task blocked-in-C doesn't necessarily block
any other tasks (they can run on other threads), we don't need to
deschedule the unwinder, which was a large part of my concern about
whether it could be usable in this role. We can just let unwinding run to
completion (scheduler yield-points can always opt to not-yield when
unwinding). A secondary concern has to do with double-faulting (failing
within a destructor) but we'll cross that bridge when we come to it; I
doubt the policy to call terminate() in C++ is so hard-wired into the
unwinder that there are no possible ways of overriding it. Opinions on
this welcome.

(Incidentally, the more we drift away from "our own per-frame metadata
tables" towards relying on stock components, the more I think using a
conservative stack scanner for GC root-finding may be perfectly fine,
obviating the need for per-frame GC info and explicit GC-safe points.
But that's not terribly related to the choice of concurrency strategy; just
an interesting note for anyone following along with the Saga Of Rust Frame
Info.)

Our argument yesterday hit a breaking point when we were discussing the
relationship between C++ unwind semantics and pthread cancellation. I
think this was actually a red herring: 'fail' could only really map to
'pthread_cancel' in the case that we were hard-wired to a 1:1 scheduling
model, always using a kernel thread for a task, which as I've said I
would like to retain only as an option (when on an OS with cheap /
fast kernel threads) rather than a semantic requirement. And even if we
*did* swallow that requirement, I was arguing yesterday (and I believe
further googling today supports) the contention that pthread_cancel is
just not a good fit for 'fail' (or kill). It will:

(a) Only cancel at cancellation points, not "any instruction"; so it's
probing the presence of a flag in the pthread library anyway. Not
terribly different, cost-wise, from using our own flag. (See the short
sketch after this list.)

(b) Not run the C++ unwinder in a reliable, portable fashion;
on some platforms this works but on some it does not, and Boost
has opted not to expose the interface for precisely this reason.
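
To illustrate (a): with the default deferred cancelability, a cancel
request only lands at a cancellation point, so a tight compute loop has
to probe for it explicitly anyway. Plain pthreads, nothing rust-specific:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; ; i++) {
            /* ... pure computation: no cancellation points in here ... */
            if (i % 100000000UL == 0)
                pthread_testcancel();   /* explicit cancellation point */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);
        pthread_cancel(t);     /* request is deferred... */
        pthread_join(t, NULL); /* ...and lands at the next testcancel() */
        printf("worker cancelled at a cancellation point\n");
        return 0;
    }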

Overall I don't think pthread_cancel was a productive avenue for the
discussion and I'm sorry we wound up in it. I think it's more useful to
talk about ways to make cooperative scheduling (== kill) points cheap
enough to live with -- including options for unsafe loops that omit them
-- even in the degenerate case where they always say "keep running the
task you're running" (1:1 mapping to threads). At least until killed.

Phew! Ok. Done. Comments? Preferences? Field datapoints to contribute?

-Graydon

My main concern is that we have to keep costs and priorities in mind.

Yesterday you described C/C++ as crazy/insane. It so happens that that
insanity is used to implement all the kernels that we are targeting. It is
used by every program I use (sometimes with an extra interpreter on top)
and is used by the compiler framework we selected. If all of those are
crazy, I would say there is not a sufficient market for sanity.

Programmers like to complain about the hard-to-reason-about semantics of
C++, but those semantics do get the job done. Look at the cost we are
proposing in order to get something a bit better: every loop will have a
check to know whether it should yield or not. Is this really worth it?
What this is basically doing is trying to give tasks semantics similar to
what threads already have.

About data points: no, I don't have them. It is too early for that. We
don't need tasks right now. C, C++ and Java have come a *long* way without
them, and they have a significant market share.

With the changes to how channels work, tasks are not a fundamentally new
concept. You even acknowledge that they could be implemented as 1:1. Given
that we must also provide access to raw threads, let's start with that. My
proposal is (see the small sketch after this list):

* Implement OS threads
* Have a notion of a task: a language abstraction that provides only the
subset of thread features that is common to all operating systems. You
could not, for example, send a signal to a task.
* Implement them now with a 1:1 mapping.
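
To be concrete, something like this made-up C sketch is all I mean by a
task to start with: a 1:1 wrapper over an OS thread that only exposes the
lowest-common-denominator operations (names are illustrative only):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct task { pthread_t thread; } task;   /* hidden OS thread */
    typedef void (*task_fn)(void *);

    struct spawn_args { task_fn fn; void *arg; };

    static void *task_trampoline(void *p)
    {
        struct spawn_args a = *(struct spawn_args *)p;
        free(p);
        a.fn(a.arg);
        return NULL;
    }

    /* Spawn and join are the whole task API; no signals, no priorities,
     * nothing that isn't common to every OS we care about. */
    static task task_spawn(task_fn fn, void *arg)
    {
        task t;
        struct spawn_args *a = malloc(sizeof *a);
        a->fn = fn;
        a->arg = arg;
        pthread_create(&t.thread, NULL, task_trampoline, a);
        return t;
    }

    static void task_join(task t) { pthread_join(t.thread, NULL); }

    static void hello(void *arg) { printf("hello from task %s\n", (char *)arg); }

    int main(void)
    {
        task t = task_spawn(hello, "A");
        task_join(t);
        return 0;
    }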

Once the language is more mature, we can consider what comes next. It
might be another "insanity" that is known to work, like thread pools and
continuation-style code, or it might be user-space tasks.

The important point is that once we have web servers, GUI toolkits, image
decoders, audio APIs, etc. available in rust, we will have data points. It
might be that, as happened with Java, 1:1 turns out to be best. If not, we
can then implement M:N as an optimization.

Cheers,
Rafael
