I've been thinking and sketching a lot about vectors & arrays lately.
There are a _lot_ of design pressures on this part of the language, but
our current approach is clearly unsatisfactory:
- Existing vecs are always unique. Sometimes you want shared, but
boxing them as @[] causes double-indirection, feels awkward.
- Existing vecs are always in the heap. So:
- There's no way to alloca one (in stack)
- There's no way to allocate one in constant memory
- There's no way to allocate one inside another structure
- The latter is particularly egregious when it comes to C interop.
C has the idea of a dense block of fixed-size, int[10] is
10-ints-long. We have no way to model this in rust right now.
I think we need to deal with this stew of disappointments in our vectors
soon, probably in the 0.2 - 0.3 timeframe. I think with the appearance
of region pointers we have a reasonable leg to stand on in terms of
formulating lifecycles. Here's my proposal:
- [int] means (&int,len), it's a slice
- str means ((&u8,len) : is_utf8(*)), a textual slice
- the region system is used to relate the visibility of [t] and str
to their region-of-allocation, the same way it relates normal &t
values. The additional 'length' integer on a slice is used only
by unsafe code such as that implementing operator[] and iteration.
- [int/10] and str/10 mean "fixed-size, 10-elements long"
and are _interior values_. The number is part of the type.
The type's size is N * sizeof(T). We define const-expressions and
integer-const-expressions as part of this work, so that number
can in some cases be symbolic.
- [int/~] is a heap-resident unique vector, like [int] today
- [int/@] is a heap-resident shared vector, like old [int]
- str/~ and str/@ are the same for strings
- The compiler open-codes all of this
- [int] <: [int/10]
[int] <: [int/~]
[int] <: [int/@]
str <: str/10
str <: str/~
str <: str/@
- In other words, you can cast anything down to the "slice" form,
the same way you can cast any of the closures to fn() type
- Literals are subtle; assume p,q,r are dynamic vals:
- [1,2,3,4,5] -- constant memory, type [int]
- [1,2,3,4,5]/5 -- constant memory, type [int/5]
- [1,2,3,4,5]/_ -- same, type [int/5], type-size inferred
- "hello" -- constant memory, type str
- "hello"/5 -- constant memory, type str/5
- "hello"/_ -- same, type str/5, type-size inferred
- [p,q,r] -- alloca'ed memory, type [int]
- [p,q,r]/3 -- alloca'ed memory, type [int/3]
- [p,q,r]/_ -- same, type [int/3], type-size inferred
- [p,q,r]/@ -- heap-resident shared, type [int/@]
- [p,q,r]/~ -- heap-resident unique, type [int/~]
- The compiler has to know about all these types. It's a bit of a
bestiary but it means all the use-cases are satisfied and I think
it composes well and is congruent with the rest of the language.
- Things like foo[i], +, += and such are pushed down to library code,
use operator overloading. The compiler only does lifecycle:
literals, allocation, casting, passing and destruction.
Positives of this approach:
- The common representation is cheapest, is a subtype of all, and is
the default type of the easiest-to-write literal forms. Most people
will write 'let x = "hello";' and get x:str inferred. And it'll be
cheap to pass, and constant-memory. The next-most-common (vectors of
dynamic values) are alloca'ed, but also passed as slices, cheap.
- It's possible to denote the aggregate, not just a pointer-to-it,
iff you fix its size. If you can't commit to a size, you can't
denote the aggregate, can only get a pointer-to-it.
- In particular, you can now translate C array types directly to a
rust type. { x: int, y: [int/15], z: int } is a dense structure.
- Pointers-to-fixed-sized-types are still possible: &[int/1024] is
a pointer. As is *[int/1024]. Orthogonality is maintained.
- Prefix box operators ~ and @ still mean what they always mean; there
is no weirdness that arises by trying to make ~[] special and then
confront the inconsistency of ~T != ~[] when T == []. Like with
fn closures, we make the "how you manage your associated storage" a
syntactically distinct component of the tycon.
- Heap-resident versions exist; like in any data structure, you have
to discuss uniquieness or shared-ness to some extent. If you want a
record with 3 strings you can send over a channel, it has to be of
type {x:str/~, y:str/~, z:str/~}.
So, thoughts, likes, dislikes?
Are the semantics over-complex? I think we really have a large number of
use-cases and there's no way we can hit them all with a single
abstraction. Shared-heap != unique-heap != constant != alloca !=
borrowed region, and fixed-size != variable-size.
Is the trailing slash absolutely hideous? There are a few other unused
ASCII symbols in the type grammar but this was the nicest-looking I
could see (that didn't collide with something else). It's also possible
to write them as str/10, str@ and str~, say, and similarly [int/10],
[int]~ and [int]@. That's fewer slashes but a bit more visual ambiguity
if you have a leading ~ or @ as well.
-Graydon
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev