On Wed, 2008-09-17 at 13:44 -0700, Evan Laforge wrote:
systems that don't use an existing user-space thread library (such as
Concurrent Haskell or libthread [1]) emulate user-space threads by
keeping a pool of processes and re-using them (e.g., IIUC Apache does
this).
Your response seems to be yet another argument that processes are too
expensive to be used the same way as threads. In my mind, pooling vs.
creating anew bears on the process-vs-thread question only where
performance is concerned. The fact that people use thread pools means
they think that even thread creation is too expensive. The central
aspect in my mind is default share-everything versus default
share-nothing. The latter is much easier to reason about and encourages
writing systems that have less shared-memory contention.
This is similar to the Plan 9 conception of processes. You have a
generic rfork() call that takes flags that say what to share with your
parent: namespace, environment, heap, etc. Thus the only difference
between a thread and a process is different flags to rfork().
As I mentioned, Plan 9 also has a user-space thread library, similar to
Concurrent Haskell.
Under the covers, I believe Linux is similar, with its clone() call.
The fast context switching part seems orthogonal to me. Why is it
that getting the OS involved for context switches kills the
performance?
Read about CPU architecture.
Is it that the ghc RTS can switch faster because it
knows more about the code it's running (i.e. the OS obviously couldn't
switch on memory allocations like that)? Or is jumping up to kernel
space somehow expensive by nature?
Yes. Kernel code is very different on the bare metal from userspace
code; RTS code of course is not at all different. Switching processes
in the kernel requires an interrupt or a system call. Both of those
require the processor to dump the running process's state so it can be
restored later (userspace thread-switching does the same thing, but it
doesn't dump as much state because it doesn't need to be as conservative
about what it saves).
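As a rough analogy only (this is not how the GHC RTS is implemented), user-space switching can be sketched with Python generators. The names `worker` and `run_round_robin` are hypothetical. Each "thread" hands control back at a `yield`, and the scheduler resumes the next one with an ordinary call: no interrupt, no system call, and the only state kept per task is the generator's own frame.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative task: do one unit of work, then yield to the scheduler."""
    for i in range(steps):
        log.append((name, i))
        yield  # voluntarily hand control back; no kernel involvement


def run_round_robin(tasks):
    """Run tasks in round-robin order until all finish.

    Returns the number of times a task yielded and was re-queued,
    i.e. the number of user-space context switches.
    """
    ready = deque(tasks)
    switches = 0
    while ready:
        task = ready.popleft()
        try:
            next(task)  # resume the task until its next yield
        except StopIteration:
            continue  # task finished; drop it from the run queue
        ready.append(task)
        switches += 1
    return switches


if __name__ == "__main__":
    log = []
    n = run_round_robin([worker("a", 2, log), worker("b", 2, log)])
    print(n, log)
```

A kernel switch cannot assume the task stopped at a convenient point, so it has to save conservatively; here the task chooses where to stop, so very little needs saving.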
And why does the OS need so many
more K to keep track of a thread than the RTS?
An OS thread (Linux/Plan 9) stores:
* Stack (at minimum a stack pointer and saved registers, > 40 bytes on
i686; on Plan 9 it also includes a special set of page tables)
* FD set (even if it's the same as the parent thread, you need to keep a
pointer to it)
* uid/euid/gid/egid (I think Plan 9 omits euid and egid)
* Namespace (Plan 9 only; again, you need at least a pointer even if
it's the same as the parent process)
* Priority
* Possibly other things I can't think of right now
A Concurrent Haskell thread stores:
* Stack
* Allocation area (4KB)