On 11-04-15 1:36 AM, Graydon Hoare wrote:
I am replying to you since you replied privately to me; if you meant to
reply to list, say so and I'll send this back there. Not sure if you
wanted this to be a private diversion in the conversation..

No, sorry about that. I blame Thunderbird (or the list setup?) :-)

Yes, this is a problem for both implementations. On Linux at least we
can use a global thread-local variable to hold the C stack for the OS
thread. I have no idea what we should do on OS X here.

No TLS on OSX? I gather they have a rather different threading system
than linux (and windows); I don't know the exact details.

I don't think any released version of OS X has it. There was some activity in LLVM, so I think they are implementing it.
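For illustration only, the per-OS-thread slot described above (holding the C stack pointer) could be sketched like this in today's Rust; `thread_local!` lowers to native TLS where the platform provides it and to emulation elsewhere, and the name `C_STACK_PTR` is hypothetical:

```rust
use std::cell::Cell;
use std::thread;

// One slot per OS thread; each thread sees its own copy.
thread_local! {
    static C_STACK_PTR: Cell<usize> = Cell::new(0);
}

fn set_and_read(v: usize) -> usize {
    C_STACK_PTR.with(|slot| slot.set(v));
    C_STACK_PTR.with(|slot| slot.get())
}

fn main() {
    assert_eq!(set_and_read(42), 42);
    // A freshly spawned thread gets its own zeroed slot, not the 42 above.
    let other = thread::spawn(|| C_STACK_PTR.with(|s| s.get()))
        .join()
        .unwrap();
    assert_eq!(other, 0);
    println!("ok");
}
```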

This is important research to do before jumping to conclusions:
different OSs provide quite different facilities, some cheap, some
expensive, and with very different scalability curves. Win32 looks like
it eats 12kb of kernel stack for each thread -- not free, though not
huge -- but I have no idea what the scheduler does when a few thousand
of them issue sync IO requests at once. OSX will cap an 8gb machine to
12,500 threads (weirdly: http://support.apple.com/kb/HT3854), not sure
if those are posix or mach threads. There are things to study.

We need to quantify costs and qualify portability if we're going to have
such horse-trading arguments about implementation effort and focus.

I suspect we won't get stacks smaller than a page. Tasks can be smaller,
but then if we're recycling stack segments between tasks-or-threads,
pages might make the most sense anyways.

Yes, my gut feeling is that managing space smaller than a page would not be very efficient.

I agree here. It might just be that the decision to have no
user-visible global mutable state makes tasks that can share refcounted
objects too limited to be useful. We might have to choose one.

Yeah. And be careful to differentiate uniquely owned from merely
immutable. Immutable stuff still needs to be collected somehow if it's
DAG-shaped or worse.

I understand the abstraction. In fact, I like it. The problem is that it
is not what we are implementing. You can implement it by giving the
language enough information to move tasks (expensive for all
implementation options I can think of) or by exposing the problem to the
user (undesirable in general, still needed in unsafe blocks).

I disagree. The abstraction is "a coroutine that's properly multiplexed
so long as you don't hit the blocking OS interfaces directly", and it's
properly implemented by soft coroutines with shared memory. Better yet,
you can share immutable DAG-shaped stuff between them, within a thread,
using non-atomic refcounting. And if you want to hit the blocking OS
interfaces directly, use a thread. That is: change

"spawn foo()"

to

"spawn thread foo()"

It's not so bad, eh? If a user sees too much jank or stalled tasks or
something, they can analyze the problem, decide, and possibly insert the
word 'thread'.
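The non-atomic refcounting mentioned above can be sketched with today's `std::rc::Rc` standing in for the within-a-thread sharing being described (modern syntax, not the 2011 language of this thread): the count is a plain integer bump, and such a value deliberately cannot be sent to another thread.

```rust
use std::rc::Rc;

// Two "tasks" multiplexed on one thread can share an immutable
// structure via Rc: cloning bumps a non-atomic counter, no deep copy.
fn shared_total() -> i32 {
    let data: Rc<Vec<i32>> = Rc::new(vec![1, 2, 3]);
    let view_a = Rc::clone(&data); // cheap: plain counter increment
    let view_b = Rc::clone(&data);
    view_a.iter().sum::<i32>() + view_b.iter().sum::<i32>()
}

fn main() {
    assert_eq!(shared_total(), 12);
    println!("ok");
}
```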

OK. We are talking about different abstractions. Mine would be "behaves like a regular thread, but is cheaper to create". Having two "run concurrently" things with different semantics for what can happen in parallel does add to the "cognitive burden", as you like to put it.

Can we just give them better tools? Consider what rust would look like
right now without tasks.

* Creating a thread is more expensive than creating a task.
* Scheduling would work as expected.
* Passing data in a safe way would imply:
  * Immutable data, *or*
  * Copy, *or*
  * Move semantics, if we can figure out how to do it.
* Use of an unsafe block is a natural extension.
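The safe-passing options in the list above can be sketched in today's Rust (a hedged illustration with modern syntax, not what the language looked like when this was written): a move transfers ownership with no deep copy, and `Arc` gives immutable sharing across threads.

```rust
use std::sync::Arc;
use std::thread;

fn moved_and_shared() -> (usize, usize) {
    // Move: ownership of `owned` transfers into the thread; no deep
    // copy, and this function can no longer touch it afterwards.
    let owned = vec![1, 2, 3];
    let h1 = thread::spawn(move || owned.len());

    // Immutable sharing: an atomically refcounted, read-only value.
    let shared = Arc::new(String::from("read-only"));
    let view = Arc::clone(&shared);
    let h2 = thread::spawn(move || view.len());

    let lens = (h1.join().unwrap(), h2.join().unwrap());
    assert_eq!(shared.len(), 9); // the original is still usable here
    lens
}

fn main() {
    assert_eq!(moved_and_shared(), (3, 9));
    println!("ok");
}
```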

That's exactly what "spawn thread foo()" gives you. I'm not suggesting
we take it away. Merely that we maintain a 3rd option that's lighter
still, that gives you:

* Free sharing of immutable substructures between tasks.
* Potentially (though I'll grant, not necessarily) cheaper spawn
  and kill events, other per-task overheads, switching speed
  (probably depends on platform, good to quantify).
* Easier concurrent debugging, if you like, since the scheduler is
  down in userspace. Longer timeslices, manual yielding, etc.

Keep in mind, 3 is not the limit of options I'm hoping to support. A
GCD-like thing that uses frozen-unique values it owns and saturates the
CPU using a queue of worker callbacks is definitely reasonable. As is a
"parallel for loop" that does fork/join, or a vectorizing / SIMD
construct. Parallelism and concurrency structures are a bit
heterogeneous, just like control and data structures :)

It can be an addition, but it hardly looks crucial. I disagree with point 3: using regular threads gives us access to regular debugging tools like helgrind.

I would love to have both at a reasonable price, but the proposed
implementation is not there. It looks like it when you write the code,
but you do not really have multiplexed sequential threads. You can get
basic (not memory-mapped) I/O multiplexing with a lot of code, but that
is it.

It's worth differentiating cases. Not every task is doing mmap'ed I/O,
and that's the only case where this abstraction doesn't hold up.
Everything else we can indirect through an AIO interface, with the added
benefit that we can make the user-facing side of it much less crazy than
the C API.

(And .. I think you're rarely doing mmap'ed network I/O anyways)

So, is it really worth it to add an abstraction that

* blocks you from OS services for threads,
* has corner cases where it can block other threads,
* has a much more expensive regular I/O model (at least two thread switches),
* has hard-to-reason-about cases of which two tasks can run in parallel? This is particularly true if we try to abstract task creation into a library.

Yes, but given our target market (build the best browser), the
implementation cost is a go/no-go.

The best browser may well have lots of internal actions that are not
doing mmap'ed I/O. Coroutines make sense for them.

The argument seems a bit like you want it both ways. You're insisting
that the costs of doing non-mmap'ed I/O (that is, copying buffers) are
always unacceptable, but then saying it's... OK to limit ourselves to a
tasking model that may require deep-copying a lot of messages to
communicate. What gives? Can't we just give the user a vocabulary to
decide when they want which kinds of copying?

It is a model where it is very natural to move from simple copying, to using moves to pass ownership back and forth, to unsafe blocks for the most critical parts.

I would like tasks if they had the abstractions explained above. That way it would be easy to go from tasks to threads or back depending on which is more efficient for each case.

As it stands it is hard for me to see a case where tasks are the correct solution. It would have to be a case of traditional coroutines, where one task does a bit of work, passes it to another task to do a bit more, maybe gets it back, etc., correct?

Now, if it is known that only one "worker" (thread or task) is looking at the data at a time, why wouldn't move semantics apply (and avoid the deep copying)?
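That single-owner hand-off can be sketched in today's Rust (a hedged illustration with modern syntax and a hypothetical `hand_off` helper): the buffer is moved through a channel, so only a small header changes hands, never a deep copy of the contents.

```rust
use std::sync::mpsc;
use std::thread;

fn hand_off() -> usize {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        // Ownership of the buffer arrives here; no copy was made.
        let buf: Vec<u8> = rx.recv().unwrap();
        buf.len()
    });
    // Ownership of the 4 KB buffer leaves here via a move.
    tx.send(vec![0u8; 4096]).unwrap();
    worker.join().unwrap()
}

fn main() {
    assert_eq!(hand_off(), 4096);
    println!("ok");
}
```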

That is why my suggestion right now is to go with just threads (and
stack linking to make creating them cheap). If it turns out that users
have to use unsafe blocks too often because they couldn't get a safe
move, and copying is too expensive, we know we need to provide something else.

I'm not fond of this strategy. I think that if we don't spend the time
making "something else" work from the beginning, nobody ever will.
System architecture changes will get *harder* over time, not easier.

Implementing something might get harder, but at least we will know it is needed.

-Graydon

Cheers,
Rafael
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
