Re: [rust-dev] Idea: Memcpyed stacks for green tasks
On Sun, Jan 26, 2014 at 1:26 PM, Matthieu Monrocq <matthieu.monr...@gmail.com> wrote:

> On Sun, Jan 26, 2014 at 3:31 AM, Brian Anderson <bander...@mozilla.com> wrote:
>
>> On 01/25/2014 08:58 AM, Bill Myers wrote:
>>
>>> Stack management for green tasks has been based in the past first on segmented stacks and then on standard large stacks. However, I just realized that there is a third alternative which might well be better than both of those.
>>>
>>> The idea is very simple: a green task would run on a large stack like now, but when it is descheduled, a memory buffer of the size of its used stack space is allocated, and its stack data is copied there; when it is rescheduled, the stack data is copied back to the original address.
>>
>> That's a really clever idea. There are, though, a number of situations where we unsafely read from or write into other tasks' stacks that would need to be considered carefully.
>
> Would it be possible to statically eliminate this risk by preventing the promotion of such affected tasks to greenlets? Or are there today too many ways to read into a greenlet stack?

We *already* statically prevent tasks from reading/writing each other's stacks; the issue here is code which uses `unsafe { }` and does it anyway, based on particular assumptions about how stacks are implemented which would no longer be valid. The solution would be to (carefully) fix this unsafe code to no longer rely on those assumptions.

> -- Matthieu
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
On Sat, Jan 25, 2014 at 9:26 PM, Vadim <vadi...@gmail.com> wrote:

> Hi Bill,
>
> If I understand this right, what you are proposing is the approach of Python's greenlets library, which I already tried to pitch here: http://mail.mozilla.org/pipermail/rust-dev/2013-November/006544.html. I think that the biggest problem with this approach, in its raw form, is that greenlets are bound to a specific OS thread, which would cause all sorts of headaches. For example:
>
> - It would be difficult to balance greenlet tasks between several OS threads.
> - If async I/O requests for two greenlets bound to the same OS thread were to complete simultaneously, you would not be able to resume them in parallel.
> - If a thread were to become blocked for a significant amount of time, all the greenlets bound to it would become un-resumable. And so on...
>
> Greenlets would be much more palatable if their stacks were position-independent. This might be possible to achieve with a suitable LLVM transform: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069662.html, though I am not yet sure how portable this would be.

Totally not an expert in this area, but I'm wondering two things:

- Once you hoist all variables whose address is taken out into this secondary stack frame, if you have a pointer on the stack to an object on the stack which *also*, in turn, has its address taken, then I believe you would end up with a pointer in the secondary stack frame to an object in the secondary stack frame, i.e. the secondary stack frame itself would not be position-independent. Does the secondary stack frame always stay in place, so we don't really care about this?

- Instead of heap-allocating a secondary stack frame, could another potential solution be to use similar infrastructure as for DWARF-based unwinding and/or precise GC tracing of the stack, in other words to emit a side table of the positions of pointers into the stack in each stack frame, and then, when relocating the stack, to consult this side table and manually fix up all of the affected pointers? In this case, aside from the presence of the side tables, the cost is borne only when actually relocating the stack. (A toy sketch of this kind of fixup follows this message.)

> cheers,
> Vadim
>
> On Sat, Jan 25, 2014 at 8:58 AM, Bill Myers <bill_my...@outlook.com> wrote:
>
>> Stack management for green tasks has been based in the past first on segmented stacks and then on standard large stacks. However, I just realized that there is a third alternative which might well be better than both of those.
>>
>> The idea is very simple: a green task would run on a large stack like now, but when it is descheduled, a memory buffer of the size of its used stack space is allocated, and its stack data is copied there; when it is rescheduled, the stack data is copied back to the original address.
>>
>> The copying can be performed either by memcpy (perhaps with a specialized memcpy that assumes alignment and always uses SIMD instructions to copy with no checks), or perhaps by using a compression algorithm designed for maximum speed, such as Snappy or maybe a new ad-hoc design.
>>
>> This allows the best possible memory efficiency and thus the maximum possible number of tasks (even better than segmented stacks, due to precise sizing and optional compression), while not adding any overhead to code that does not block, either in terms of code generation complexity or runtime cost.
>>
>> There is of course an additional runtime cost when blocking, but blocking already often involves both system calls and context switching, and workloads with lots of blocking green tasks are probably I/O bound anyway; furthermore, the stack data is probably in the CPU cache, so there shouldn't be much penalty.
>>
>> The only significant drawback is that two green tasks running from the same large stack (note that stacks cannot easily be switched, as that would invalidate borrowed pointers on the stack that point into the stack) cannot ever run concurrently; however, by making the number of large stacks a large enough multiple of the number of CPU cores, the probability of reduced parallelism due to this issue can be made as small as desired.
>>
>> In general, one would use an adaptive approach, where this kind of copying would start to kick in only once the number of green tasks becomes large; when not copying, one would just optionally use madvise(MADV_DONTNEED) to trim unused whole pages of the stack. Also, if the data to be copied is large (several OS pages), one may use mremap() or similar facilities to move the address space mappings instead of the data itself.
>>
>> Some unsafe code that assumes that tasks can access each other's stacks will break (probably only the new Mutex implementation), but that can be fixed by putting the data in the task structure instead of the stack.
>>
>> Overall, this should allow writing something like a parallel network server or web crawler in Rust in a natural blocking style, while being fully confident in its scalability, which is something that is not really possible at the moment.
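To make the side-table idea concrete, here is a toy sketch in modern Rust (this thread predates Rust 1.0, so the syntax is anachronistic). The StackImage type, its fields, and the hand-built ptr_slots table are all hypothetical illustrations; in a real system the compiler would emit the table per stack frame, much like DWARF unwind information or GC stack maps, and relocation would run when a copied-out stack image is placed at a new address.

```rust
/// Hypothetical saved stack image plus a side table of pointer slots.
/// The table is written by hand here just to show the fixup step.
struct StackImage {
    bytes: Vec<u8>,        // copied-out stack contents
    old_base: usize,       // address the stack originally lived at
    ptr_slots: Vec<usize>, // byte offsets of slots holding stack-internal pointers
}

impl StackImage {
    /// Rewrite every recorded pointer slot so the image stays valid if it
    /// is placed at `new_base` instead of `old_base`.
    fn relocate(&mut self, new_base: usize) {
        let delta = new_base.wrapping_sub(self.old_base);
        let word = std::mem::size_of::<usize>();
        for &off in &self.ptr_slots {
            let range = off..off + word;
            let old = usize::from_ne_bytes(self.bytes[range.clone()].try_into().unwrap());
            // Only adjust values that actually pointed into the old stack.
            if old >= self.old_base && old < self.old_base + self.bytes.len() {
                let fixed = old.wrapping_add(delta);
                self.bytes[range].copy_from_slice(&fixed.to_ne_bytes());
            }
        }
        self.old_base = new_base;
    }
}

fn main() {
    let word = std::mem::size_of::<usize>();
    // A 64-byte fake stack living at pretend address 0x1000, whose first
    // slot holds a pointer to offset 0x20 within the same stack.
    let mut bytes = vec![0u8; 64];
    bytes[..word].copy_from_slice(&0x1020usize.to_ne_bytes());
    let mut img = StackImage { bytes, old_base: 0x1000, ptr_slots: vec![0] };

    img.relocate(0x2000); // pretend the image was moved to 0x2000
    let fixed = usize::from_ne_bytes(img.bytes[..word].try_into().unwrap());
    assert_eq!(fixed, 0x2020); // the internal pointer followed the move
    println!("relocated pointer: {:#x}", fixed);
}
```

As Gábor notes, the side tables cost some space up front, but the fixup work itself is paid only when a stack actually moves.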
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
On Sun, Jan 26, 2014 at 2:02 PM, Gábor Lehel <glaebho...@gmail.com> wrote:

> On Sat, Jan 25, 2014 at 9:26 PM, Vadim <vadi...@gmail.com> wrote:
>
>> Hi Bill,
>>
>> If I understand this right, what you are proposing is the approach of Python's greenlets library, which I already tried to pitch here: http://mail.mozilla.org/pipermail/rust-dev/2013-November/006544.html. I think that the biggest problem with this approach, in its raw form, is that greenlets are bound to a specific OS thread, which would cause all sorts of headaches. For example:
>>
>> - It would be difficult to balance greenlet tasks between several OS threads.
>> - If async I/O requests for two greenlets bound to the same OS thread were to complete simultaneously, you would not be able to resume them in parallel.
>> - If a thread were to become blocked for a significant amount of time, all the greenlets bound to it would become un-resumable. And so on...
>>
>> Greenlets would be much more palatable if their stacks were position-independent. This might be possible to achieve with a suitable LLVM transform: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069662.html, though I am not yet sure how portable this would be.
>
> Totally not an expert in this area, but I'm wondering two things:
>
> - Once you hoist all variables whose address is taken out into this secondary stack frame, if you have a pointer on the stack to an object on the stack which *also*, in turn, has its address taken, then I believe you would end up with a pointer in the secondary stack frame to an object in the secondary stack frame, i.e. the secondary stack frame itself would not be position-independent. Does the secondary stack frame always stay in place, so we don't really care about this?

(Thinking about this further, this was a silly question, because the primary stack frame itself would have pointers into the secondary one, so _obviously_ the latter cannot move.)
[rust-dev] Idea: Memcpyed stacks for green tasks
Stack management for green tasks has been based in the past first on segmented stacks and then on standard large stacks. However, I just realized that there is a third alternative which might well be better than both of those.

The idea is very simple: a green task would run on a large stack like now, but when it is descheduled, a memory buffer of the size of its used stack space is allocated, and its stack data is copied there; when it is rescheduled, the stack data is copied back to the original address.

The copying can be performed either by memcpy (perhaps with a specialized memcpy that assumes alignment and always uses SIMD instructions to copy with no checks), or perhaps by using a compression algorithm designed for maximum speed, such as Snappy or maybe a new ad-hoc design.

This allows the best possible memory efficiency and thus the maximum possible number of tasks (even better than segmented stacks, due to precise sizing and optional compression), while not adding any overhead to code that does not block, either in terms of code generation complexity or runtime cost.

There is of course an additional runtime cost when blocking, but blocking already often involves both system calls and context switching, and workloads with lots of blocking green tasks are probably I/O bound anyway; furthermore, the stack data is probably in the CPU cache, so there shouldn't be much penalty.

The only significant drawback is that two green tasks running from the same large stack (note that stacks cannot easily be switched, as that would invalidate borrowed pointers on the stack that point into the stack) cannot ever run concurrently; however, by making the number of large stacks a large enough multiple of the number of CPU cores, the probability of reduced parallelism due to this issue can be made as small as desired.

In general, one would use an adaptive approach, where this kind of copying would start to kick in only once the number of green tasks becomes large; when not copying, one would just optionally use madvise(MADV_DONTNEED) to trim unused whole pages of the stack. Also, if the data to be copied is large (several OS pages), one may use mremap() or similar facilities to move the address space mappings instead of the data itself.

Some unsafe code that assumes that tasks can access each other's stacks will break (probably only the new Mutex implementation), but that can be fixed by putting the data in the task structure instead of the stack.

Overall, this should allow writing something like a parallel network server or web crawler in Rust in a natural blocking style, while being fully confident in its scalability, which is something that is not really possible at the moment.

Once this is in place, the best practices for Rust programs would become:

1. Use native tasks for tasks whose number is bounded at compile time and small.
2. Use green tasks for tasks whose number is unbounded or large.

One could even automate this to some extent by spawning by default a native task the first C times a given proc() code address is used to start a task (where C = number of CPU cores), and green tasks afterwards.

What do you think?
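A minimal sketch of the copy-out/copy-in step described above, written in modern Rust rather than the 2014-era dialect of this thread. The SavedStack type and its methods are hypothetical names for illustration only, and the "large stack" in main is an ordinary heap buffer standing in for a real stack mapping so that the example can run safely.

```rust
use std::ptr;

/// A descheduled green task's stack contents, copied out to the heap.
/// `top` is the lowest used address (the task's stack pointer at the
/// moment it blocked); the used region runs from `top` up to the fixed
/// stack base this task was assigned.
struct SavedStack {
    top: *mut u8,  // address the data must be copied back to
    data: Vec<u8>, // snapshot of the used region, sized exactly
}

impl SavedStack {
    /// Copy the used portion of the large stack out to a heap buffer.
    /// Caller must guarantee nothing is executing on this region.
    unsafe fn save(top: *mut u8, used_len: usize) -> SavedStack {
        let mut data = vec![0u8; used_len];
        unsafe { ptr::copy_nonoverlapping(top, data.as_mut_ptr(), used_len) };
        SavedStack { top, data }
    }

    /// Copy the saved bytes back to the *same* addresses before the task
    /// is resumed, so that borrowed pointers into the stack stay valid.
    unsafe fn restore(&self) {
        unsafe { ptr::copy_nonoverlapping(self.data.as_ptr(), self.top, self.data.len()) };
    }
}

fn main() {
    // Stand-in for an 8 KiB large stack: a plain heap buffer, so the
    // sketch runs without touching the real call stack.
    let mut big_stack = vec![0u8; 8 * 1024];
    let len = big_stack.len();
    let used = 1024; // pretend the task's frames occupy the top 1 KiB
    let top = unsafe { big_stack.as_mut_ptr().add(len - used) };

    unsafe {
        for i in 0..used {
            *top.add(i) = (i % 251) as u8; // fake frame contents
        }
        let saved = SavedStack::save(top, used); // task gets descheduled
        ptr::write_bytes(top, 0, used);          // stack reused meanwhile
        saved.restore();                         // task gets rescheduled
        assert_eq!(*top.add(10), 10);
    }
    println!("stack image round-tripped: {} bytes", used);
}
```

A real scheduler would take the snapshot at the point where the task blocks, copy only the region between the task's saved stack pointer and its fixed stack base, and copy it back to exactly the same addresses before resuming, which is what keeps borrowed pointers into the stack valid.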
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
Hi Bill,

If I understand this right, what you are proposing is the approach of Python's greenlets library, which I already tried to pitch here: http://mail.mozilla.org/pipermail/rust-dev/2013-November/006544.html. I think that the biggest problem with this approach, in its raw form, is that greenlets are bound to a specific OS thread, which would cause all sorts of headaches. For example:

- It would be difficult to balance greenlet tasks between several OS threads.
- If async I/O requests for two greenlets bound to the same OS thread were to complete simultaneously, you would not be able to resume them in parallel.
- If a thread were to become blocked for a significant amount of time, all the greenlets bound to it would become un-resumable. And so on...

Greenlets would be much more palatable if their stacks were position-independent. This might be possible to achieve with a suitable LLVM transform: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069662.html, though I am not yet sure how portable this would be.

cheers,
Vadim

On Sat, Jan 25, 2014 at 8:58 AM, Bill Myers <bill_my...@outlook.com> wrote:

> Stack management for green tasks has been based in the past first on segmented stacks and then on standard large stacks. However, I just realized that there is a third alternative which might well be better than both of those.
>
> The idea is very simple: a green task would run on a large stack like now, but when it is descheduled, a memory buffer of the size of its used stack space is allocated, and its stack data is copied there; when it is rescheduled, the stack data is copied back to the original address.
>
> The copying can be performed either by memcpy (perhaps with a specialized memcpy that assumes alignment and always uses SIMD instructions to copy with no checks), or perhaps by using a compression algorithm designed for maximum speed, such as Snappy or maybe a new ad-hoc design.
>
> This allows the best possible memory efficiency and thus the maximum possible number of tasks (even better than segmented stacks, due to precise sizing and optional compression), while not adding any overhead to code that does not block, either in terms of code generation complexity or runtime cost.
>
> There is of course an additional runtime cost when blocking, but blocking already often involves both system calls and context switching, and workloads with lots of blocking green tasks are probably I/O bound anyway; furthermore, the stack data is probably in the CPU cache, so there shouldn't be much penalty.
>
> The only significant drawback is that two green tasks running from the same large stack (note that stacks cannot easily be switched, as that would invalidate borrowed pointers on the stack that point into the stack) cannot ever run concurrently; however, by making the number of large stacks a large enough multiple of the number of CPU cores, the probability of reduced parallelism due to this issue can be made as small as desired.
>
> In general, one would use an adaptive approach, where this kind of copying would start to kick in only once the number of green tasks becomes large; when not copying, one would just optionally use madvise(MADV_DONTNEED) to trim unused whole pages of the stack. Also, if the data to be copied is large (several OS pages), one may use mremap() or similar facilities to move the address space mappings instead of the data itself.
>
> Some unsafe code that assumes that tasks can access each other's stacks will break (probably only the new Mutex implementation), but that can be fixed by putting the data in the task structure instead of the stack.
>
> Overall, this should allow writing something like a parallel network server or web crawler in Rust in a natural blocking style, while being fully confident in its scalability, which is something that is not really possible at the moment.
>
> Once this is in place, the best practices for Rust programs would become:
>
> 1. Use native tasks for tasks whose number is bounded at compile time and small.
> 2. Use green tasks for tasks whose number is unbounded or large.
>
> One could even automate this to some extent by spawning by default a native task the first C times a given proc() code address is used to start a task (where C = number of CPU cores), and green tasks afterwards.
>
> What do you think?
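For the page-trimming fallback mentioned in the quoted proposal, here is a small sketch of what releasing the unused whole pages of a large stack could look like. It assumes Linux and the libc crate (neither of which the runtime of the time necessarily used this way); the 4 KiB page size and the trim_unused_stack helper are illustrative assumptions, not real runtime code.

```rust
// Sketch assuming Linux and the `libc` crate (libc = "0.2" in Cargo.toml).
use std::ptr;

/// Release the physical pages behind the unused part of a large stack,
/// keeping the address range reserved. `stack_low` is the lowest address
/// of the mapping and `unused_len` covers the region below the deepest
/// point the task's frames currently reach.
unsafe fn trim_unused_stack(stack_low: *mut u8, unused_len: usize) -> std::io::Result<()> {
    let page = 4096; // assume 4 KiB pages for the sketch
    let len = unused_len & !(page - 1); // only whole pages can be released
    if len == 0 {
        return Ok(());
    }
    let rc = unsafe { libc::madvise(stack_low as *mut libc::c_void, len, libc::MADV_DONTNEED) };
    if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
}

fn main() -> std::io::Result<()> {
    // Reserve an anonymous 8 MiB mapping as a stand-in for a large stack.
    let size = 8 * 1024 * 1024;
    let addr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(addr, libc::MAP_FAILED);

    // Pretend the task only ever used the top 64 KiB; drop the rest.
    unsafe { trim_unused_stack(addr as *mut u8, size - 64 * 1024)? };

    unsafe { libc::munmap(addr, size) };
    Ok(())
}
```

mremap(), also mentioned above, would be the analogous call for moving whole mappings instead of copying their contents.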
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
On 1/25/14 8:58 AM, Bill Myers wrote:

> Stack management for green tasks has been based in the past first on segmented stacks and then on standard large stacks.

I assume this is incompatible with work stealing and task migration between threads?

Patrick
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
> I assume this is incompatible with work stealing and task migration between threads?

It is compatible, assuming that the number of large stacks is sufficiently larger than the number of threads.

Basically, each green task can only run on a specific large stack, but as long as you aren't so unlucky that every runnable green task you could steal lives on a large stack that already has a running green task, you can steal one. The probability of that obviously decreases the more large stacks you use; idle large stacks only consume address space (if you wipe them with madvise(MADV_DONTNEED) or similar when unused), so one can reasonably have 256 8MB large stacks on 32-bit, for instance.

So, for instance, if you had an HT 4-core machine with 8 system threads running green tasks and 256 large stacks, the probability that a given green task is not blocked yet not executable (because its large stack is in use) is 7/256; obviously, if K tasks are not blocked, the probability of having none executable becomes (7/256)^K, i.e. exponentially lower. On 64-bit, the probability can be made arbitrarily low, although you consume more kernel memory to store the metadata associated with the memory mappings for the large stacks.

It would require rearchitecting the work-stealing code to work with a two-level data structure: first steal a large stack with at least one non-blocked task, and then run a task from it, rather than directly stealing the task.
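A rough sketch of the two-level structure this implies, again in modern Rust: stealing first claims an idle large stack that has at least one runnable task, and only then pops a task from that stack's queue. The LargeStack, GreenTask, and Scheduler names are hypothetical, and the linear scan stands in for whatever victim-selection policy a real work-stealing scheduler would use.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for a green task; a real one would also carry the saved
/// register context and its position on the large stack.
struct GreenTask {
    id: usize,
}

/// One large stack plus the runnable tasks whose frames live on it.
struct LargeStack {
    /// Set while some OS thread is executing a task from this stack;
    /// at most one task per large stack can be running at a time.
    in_use: AtomicBool,
    runnable: Mutex<VecDeque<GreenTask>>,
}

struct Scheduler {
    stacks: Vec<Arc<LargeStack>>,
}

impl Scheduler {
    /// First level: claim an idle large stack that has runnable work.
    /// Second level: pop a task from that stack's queue.
    fn steal(&self) -> Option<(Arc<LargeStack>, GreenTask)> {
        for stack in &self.stacks {
            // Try to claim the stack; skip it if another thread owns it.
            if stack
                .in_use
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                if let Some(task) = stack.runnable.lock().unwrap().pop_front() {
                    return Some((Arc::clone(stack), task));
                }
                // Nothing runnable here after all; release the claim.
                stack.in_use.store(false, Ordering::Release);
            }
        }
        None // every stack with runnable work is currently busy
    }
}

fn main() {
    let stack = Arc::new(LargeStack {
        in_use: AtomicBool::new(false),
        runnable: Mutex::new(VecDeque::from(vec![GreenTask { id: 7 }])),
    });
    let sched = Scheduler { stacks: vec![stack] };
    if let Some((claimed, task)) = sched.steal() {
        println!("running task {} on a claimed large stack", task.id);
        // ...switch to the stack, run the task, then release the claim...
        claimed.in_use.store(false, Ordering::Release);
    }
}
```

The in_use flag encodes the constraint that two green tasks sharing a large stack can never run at the same time; with many more stacks than threads, a steal attempt rarely finds every stack with runnable work already claimed.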
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
Interesting: my proposal does appear to be a generalization of the greenlet approach.

Specifically, while the greenlet proposal seems to use only one large stack per native thread, I'm suggesting using multiple large stacks that can be stolen by other threads, which mitigates the issues stemming from the stacks not being position-independent (they are not eliminated, but they have a low probability of occurring).

It's also indeed possible to fully eliminate those issues by autoboxing everything whose address is taken, but that would have a potentially large performance impact on non-blocking code, whereas merely copying the stack only imposes a performance penalty on code that blocks.
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
The ratio of native threads to stacks, and of stacks to tasks, can actually be used to characterize all the systems discussed. As (stacks per thread, tasks per stack):

- (1, 1) = current Rust native tasks
- (1, N) = Python greenlets
- (N, 1) = current Rust green tasks
- (N, M) = proposal in my original mail
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
To me, issues such as large memcpys/system calls on context switches, tasks not being able to run concurrently with arbitrary other tasks, possibly having to autobox everything, etc. sound like they would spawn equal or greater complexity and performance loss than just using split stacks.

If you really wanted to avoid wasting memory, couldn't you allow new tasks to steal memory below the stack pointer of swapped-out tasks, changing those tasks' stack-bottom pointers in the process?
Re: [rust-dev] Idea: Memcpyed stacks for green tasks
On 01/25/2014 08:58 AM, Bill Myers wrote:

> Stack management for green tasks has been based in the past first on segmented stacks and then on standard large stacks. However, I just realized that there is a third alternative which might well be better than both of those.
>
> The idea is very simple: a green task would run on a large stack like now, but when it is descheduled, a memory buffer of the size of its used stack space is allocated, and its stack data is copied there; when it is rescheduled, the stack data is copied back to the original address.

That's a really clever idea. There are, though, a number of situations where we unsafely read from or write into other tasks' stacks that would need to be considered carefully.