On 18/12/2010 9:03 AM, Sebastian Sylvan wrote:

This thread brings to mind a couple of issues that I'd like to just
float for your consideration.

Sure! I appreciate the comments, even if this is just a forum to air some design discussion. I realize there's always going to be a bit of an imbalance between what's written down and what's been discussed or planned, so I'm happy to shed some light. I should do some roadmap-writing at some point too.

1) Many apps build up massive immutable data structures that need to be
accessed by tons of tasks (e.g. video games, my field). Copying these
values would be prohibitively impractical, so any general-purpose
language /must/ support real sharing of immutable data somehow.
Realistically, before Rust sees mainstream use we'll have >64 cores to
keep busy, so anything that imposes limits on read-only data sharing
between actual /cores/ is going to be a big burden too.

Yeah. There's some truth to this. Games and browser rendering trees both :)

Keep in mind, however, that there are multiple forms of parallelism. Task parallelism is essentially MIMD; we're building task parallelism into the language, but it's not the only option. Rust also tracks function purity as an effect, so it's quite conceivable that we might introduce an OpenMP-like parallel for loop: if we played some games with deferring refcounts in the loop body, we could do SIMD-like fork/join on (say) pure loop bodies. That's semantically plausible, and possibly a better fit than forcing everything into the task model.
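(Purely for illustration, here's a rough sketch of what such a pure parallel loop could feel like, written in later Rust syntax with the rayon library standing in for the hypothetical OpenMP-like construct; none of this is committed design, and the names are illustrative only.)

    // Illustrative only: rayon's par_iter() stands in for a built-in
    // parallel-for over a pure loop body.
    use rayon::prelude::*;

    fn main() {
        let inputs: Vec<u64> = (0..1_000_000).collect();

        // The per-element body is pure: it touches no shared mutable
        // state and generates no per-element refcount traffic, so
        // fork/join across cores is safe.
        let total: u64 = inputs
            .par_iter()
            .map(|&x| x * x % 7919) // hypothetical pure work item
            .sum();

        println!("total = {}", total);
    }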

I have a couple ideas for the task model as well, of course. More below:

2) I'd also argue that sharing of immutable data is the simpler and more
obvious semantics. Especially when you take destructors into account. I
think you should therefore have to opt-in for "copy, don't share"
semantics as an optimization when you have data that ends up getting
loads of incref/decref operations on it.

The dtor story is a bit in flux at the moment, but we'll probably wind up shifting it around so it's only viable on non-shared values. At that point Rafael is right that it's semantically equivalent (though performance-observable) whether a true copy is made. There's some leeway in implementation, however...

3) The cost of reference counting in a multi-core scenario is a concern.
Rust already limits the number of reference operations using aliases,
which presumably gets rid of a lot of the cost. Is Rust wedded to the
idea of having accurate refcounts at all times? As far as I can tell,
modern ref-counting algorithms seem to be about on par with modern GC
algorithms for performance (just barely) even in the context of
multithreading, but they achieve this feat through deferred ref-counting
and things like that. Should Rust not do that instead? I kind of think
refcounts may be the way forward in the future, because the cost of
GC'ing in a ref-count system is proportional to the amount of actual
garbage, rather than being proportional to the size of the heap, but it
seems like the consensus on this issue in the literature is that
refcounts are only performant when they're not being constantly kept up
to date. Am I wrong?

You make three points here and I'll address them independently.

1. Multi-threaded RC operations are going to be atomic -> expensive. Yes. We're trying to avoid atomic RC, though as Rafael points out we can conceivably fall back to it in cases where we have performance evidence that the hurt will be less than the hurt of doing a full copy. There's a knob to turn there, but I'd prefer not to turn it at all. There are a couple of other options (more below, honest!).
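(For concreteness, here's a tiny sketch of the distinction, written in later Rust syntax; Rc/Arc are stand-ins for per-thread vs. atomic counting, not anything currently in the compiler.)

    use std::rc::Rc;
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // Plain (non-atomic) counting, confined to a single thread.
        let local = Rc::new(vec![1, 2, 3]);
        let local2 = Rc::clone(&local);      // cheap increment
        assert_eq!(Rc::strong_count(&local2), 2);

        // Atomic counting, the only kind allowed to cross threads.
        let shared = Arc::new(vec![1, 2, 3]);
        let shared2 = Arc::clone(&shared);   // atomic increment
        let handle = thread::spawn(move || shared2.len());
        assert_eq!(handle.join().unwrap(), 3);

        // thread::spawn(move || local.len()); // rejected: Rc isn't Send
    }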

2. It's true that on massive heaps, the GC-research consensus I've seen lately is that you want to do primary RC plus secondary cycle collection, since you walk the heap less that way. And that fits with our model, generally, in a performance-characteristic sense.

3. Deferred RC is definitely an option we've discussed. There are multiple schemes for it, with different relationships to whatever other GC it's interacting with. It's not (IMO) essential that we "always have 100% accurate refcounts at all times", merely that we can say with certainty when we will next have them. It's similar to our preemption story: we're going to say that two tasks on one thread will only preempt one another at particular (well-defined, easily controlled) points, not at "any single instruction".
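(To make the deferral idea concrete, here's a minimal sketch, again in later Rust syntax with hypothetical names, deferring only the decrement half for simplicity: retired references are parked in a buffer and the count adjustments only happen at explicit, well-defined points, so counts are only guaranteed accurate at those points.)

    use std::rc::Rc;

    /// Hypothetical sketch of deferred decrements: retiring a
    /// reference just parks it; the actual count adjustments happen
    /// at an explicit "safe point".
    struct DeferBuffer<T> {
        pending: Vec<Rc<T>>,
    }

    impl<T> DeferBuffer<T> {
        fn new() -> Self {
            DeferBuffer { pending: Vec::new() }
        }

        /// Retire a reference without touching the count yet.
        fn retire(&mut self, r: Rc<T>) {
            self.pending.push(r);
        }

        /// Safe point: perform all deferred decrements at once.
        fn flush(&mut self) {
            self.pending.clear();
        }
    }

    fn main() {
        let shared = Rc::new(vec![1, 2, 3]);
        let mut buf = DeferBuffer::new();
        for _ in 0..4 {
            let alias = Rc::clone(&shared); // increment on clone
            // ... do some work with `alias` ...
            buf.retire(alias);              // decrement deferred
        }
        buf.flush();                        // counts accurate again here
        assert_eq!(Rc::strong_count(&shared), 1);
    }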

4) If the cost of atomic ref counts is still too high, perhaps all
allocations should be task-local unless you explicitly tag them as being
shared (with task-local data not being allowed over channels)? This
seems conceptually simple and impossible to mess up. You tag shared data
in some special way and only for that data you pay the extra cost of
more expensive ref-counting. If you forget to tag something as shared,
then you'll get a compile-time error when you try to send it to another
task. Note that you're still sharing data here, so you support the
scenario in 1), you're just not incurring the cost for everything. I'd
prefer if this wasn't necessary (i.e. if refcount operations could be
statically or dynamically elided often enough that any immutable data
can be sent over a channel), but you could always make the "shared"
keyword a no-op in the future if that materializes.

Finally, the "more below" issue! Yes. Assuming we're just talking about independent control paths (so *some* task-parallelism), and assuming we want to avoid heavy atomic RC (other choices are also possibilities, but those were discussed above), we can *also* twiddle the semantics of sharing a bit, to distinguish thread-sharing from non-thread-sharing.

The scheme you propose would entail two things I'd prefer to avoid (though it'd work): lots of atomic RC on the shared bits, and another layer in the memory-layering system (the "shared" layer).

I'll describe a scheme I'd prefer to explore; you tell me if you think it'd be tolerable:

We make a native data type called pinned[T]. A pinned[T] value holds a T value as well as a list of viewers (in C code; pinned[T] is opaque to Rust clients). When you put a T value into a pinned[T], the system walks the data structure (once) and does a detachment operation (making sure each node is singly referenced, copying as required; this is necessary for the 'freeze' operator to work), then writes the magic constant-refcount (which already exists, to handle compile-time constant data) into every rc field in the structure. The structure is now "pinned": it can be safely shared with multiple concurrent readers. It's more than just frozen; uninformed parties will think it's in read-only memory!

From a given pinned[T] value, multiple view[T] values can be manufactured (helper function, also native). A view[T] is atomically added to the pinned[T] 'viewer' list, and when a pinned[T] is destructed it enters an "expiring" state that walks the viewer list, invalidates all inactive views, then waits for the last active view to end, and destructs the T. All view/pinned synchronization is atomic (or effectively so; at best carefully-reasoned lockless C code).

Meanwhile, if I send a view[T] to some other thread, that thread can pull (via an iterator / one-shot reference-returning accessor, as Dave and Patrick have been discussing) an &option[T] out of the view[T]. If the underlying pinned[T] is dead, the view[T] has been invalidated, and the option[T] will come back none. Sorry, no data to view. But if it comes back as &some[T](?t), then the viewing thread can work with the target 't' data "as though it's const": no rc traffic to keep reconciled. It's working as though the data were a compile-time constant in read-only memory.

There are atomic operations in this scheme, but only at the "root" of the data structure: the pinned[T] / view[T] values do atomic ops, and everything else works with aliases-and-constants while the view[T] is "in use".
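(Here's a hedged approximation of the pinned[T]/view[T] idea in later Rust syntax, using Arc/Weak as stand-ins for the hand-written atomic root and with Pinned/View as purely hypothetical names. The lifetime details differ slightly from the scheme above -- an upgraded view keeps the value alive rather than the root waiting on it -- but the shape is the same: atomic ops only at the root, plain aliases inside.)

    use std::sync::{Arc, Weak};
    use std::thread;

    // Hypothetical names; Arc/Weak stand in for the pinned root and
    // its viewer bookkeeping.
    struct Pinned<T>(Arc<T>);
    struct View<T>(Weak<T>);

    impl<T> Pinned<T> {
        fn new(value: T) -> Self {
            // In the scheme above this is where the structure would be
            // detached/frozen; here we just take ownership.
            Pinned(Arc::new(value))
        }
        fn view(&self) -> View<T> {
            View(Arc::downgrade(&self.0)) // atomic op at the root only
        }
    }

    impl<T> View<T> {
        // Pull access: None if the pinned value is already gone.
        fn pull(&self) -> Option<Arc<T>> {
            self.0.upgrade()
        }
    }

    fn main() {
        let pinned = Pinned::new(vec![1, 2, 3, 4]);
        let view = pinned.view();

        let reader = thread::spawn(move || match view.pull() {
            // Interior reads are plain aliases: no per-node refcount
            // traffic, just the one atomic op in pull().
            Some(data) => data.iter().sum::<i32>(),
            None => 0, // the pinned value was already destructed
        });

        println!("sum = {}", reader.join().unwrap());
    }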

This scheme has the added advantage that you can do "dataflow" multi-version publish/subscribe with it as well: the pinned value can be updated to a new one and the viewers -- if they're pulling from a view[T] via an iterator -- could just "get the next version" next time around the loop, after the writer updates.
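(And a hedged sketch of that publish/subscribe flavour, again in later Rust syntax with std primitives standing in for the lockless C described above: the publisher swaps whole immutable snapshots into a slot, and a subscriber picks up whatever version is current each time around its loop.)

    use std::sync::{Arc, RwLock};
    use std::thread;
    use std::time::Duration;

    fn main() {
        // A slot holding the current immutable snapshot.
        let published = Arc::new(RwLock::new(Arc::new(vec![0i32; 4])));

        // Subscriber: grab the current snapshot, read it freely,
        // then loop around and see whatever version is current next.
        let slot = Arc::clone(&published);
        let reader = thread::spawn(move || {
            for _ in 0..5 {
                let guard = slot.read().unwrap();
                let snapshot = Arc::clone(&*guard); // clone the Arc, not the data
                drop(guard);                        // lock released before reading
                println!("reader sees version {}", snapshot[0]);
                thread::sleep(Duration::from_millis(10));
            }
        });

        // Publisher: build a fresh immutable value and swap it in whole.
        for version in 1..=5 {
            *published.write().unwrap() = Arc::new(vec![version; 4]);
            thread::sleep(Duration::from_millis(10));
        }

        reader.join().unwrap();
    }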

Again, as I said up top, there are multiple forms of parallelism, and I'm not sure it'll be necessary to force everything into the MIMD task-parallel model. I want to support the task-parallel variant *well*, because even when running serially/multiplexed, I think it's an essential ingredient in correctness: isolating tasks is a basic mechanism for decoupling their effects and isolating their failures. But it's not the only way; I've sketched in this email some variants we could explore to support any/all of:

SIMD - some kind of OpenMP-like pure-parallel loop
MISD - some kind of pin/view, publish/subscribe dataflow model
MIMD - existing task model

(And of course lonely old serial SISD, which we already do just fine)

-Graydon