@Araq, @mratsim: As promised, I'm posting some code that shows the current difficulty of bypassing the overly simplistic (and performance-costly) default behaviour of the current channels and threadpool libraries: deepCopy'ing all GC'ed refs so they won't be prematurely destroyed if they go out of scope in the thread where they were created.
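For context, here is a minimal sketch of the kind of wrapping involved (hypothetical code with made-up names like `Payload` and `Transport`, not the linked benchmark; it assumes the old refc GC and `--threads:on`). A GC'ed ref can cross a `Channel` without deepCopy by sending a raw pointer and pinning the ref with `protect` so the owning thread's GC keeps it alive:

```nim
# Minimal sketch (hypothetical): pass a GC'ed ref across a Channel without
# the default deepCopy by sending a raw pointer and pinning the ref.
type
  Payload = ref object
    value: int
  Transport = object
    p: pointer        # raw pointer: channels copy it shallowly, no deepCopy

var chan: Channel[Transport]

proc worker() {.thread.} =
  let t = chan.recv()
  let payload = cast[Payload](t.p)  # unsafe cast back to the ref type
  echo "received: ", payload.value  # read-only access to the foreign heap

proc main() =
  chan.open()
  var th: Thread[void]
  createThread(th, worker)
  let payload = Payload(value: 42)
  # protect returns a ForeignCell that keeps this thread's GC from
  # collecting the cell while the worker still uses it; real code would
  # accumulate these in "toDispose" lists, as described below
  let cell = protect(cast[pointer](payload))
  chan.send(Transport(p: cast[pointer](payload)))
  joinThread(th)   # in this toy example we simply wait for the worker...
  dispose(cell)    # ...then release the pin on the owning thread
  chan.close()

main()
```

Every ref crossing a thread boundary needs this treatment, which is why the real code below wraps everything.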
I was wrong previously in thinking that my difficulties in making this work were due to generic procs and thus nested templates; in fact, I think the problems were due to all the "cruft and compiler magic" not considering recursive algorithms, where threads may be spawned from within threads ad nauseam, which is also likely the problem @mratsim has had in running his benchmarks. Accordingly, this benchmark **wraps absolutely everything that has to do with GC** (just in case) and should handle just about any reasonable level of recursion, although I'm not sure what deeply nested "toDispose" lists generated by the protect/dispose pairs will do to the stack; there isn't much I can do about those without compiler magic anyway, and we certainly don't want to add more of that if it isn't necessary.

The code linked below implements a little benchmark that cycles through spawning 10,000 (trivial) tasks from the threadpool, using a customizable iterator implemented as a manual closure and including a "polymorphic" converter-function closure parameter (a sketch of this technique follows below). It is close to what I require to cleanly implement my version of the "Ultimate Sieve of Eratosthenes in Nim" algorithm, which does require the ability to nest and recursively spawn threads. I've divided the code into modules by functionality (the file tabs across the top of the source code section) for ready reference and so you can see that the actual benchmark is fairly trivial; most of the code is there to make deepCopy unnecessary by preserving the GC'ed refs in the ways Nim currently provides. I've tried to make the code concise and elegant, but the need to do this at all is **UGLY**. However, it is likely the easiest "Plan B" to implement if the "newruntime" doesn't work out, rather than the huge project of implementing a multi-threaded GC: just make the extra support modules available as one or more libraries.

This [link on Wandbox](https://wandbox.org/permlink/TSMrMyVVcikS9Bty) is the runnable code. It runs in full release mode in about 400 milliseconds on an Intel Xeon Sandy Bridge CPU at 2.5 GHz, for which we are given the use of three threads, two of which likely share a core (Hyper-Threading). That works out to roughly 2.5 billion total cycles used across all available threads, which means about 250 thousand cycles, or about 100 microseconds, per thread spawn including overheads. This sounds like a lot but actually isn't bad, considering it takes something like ten to a hundred times as long to do this by "spinning up a new thread" for every task.

Now, if "newruntime" does work out, then for algorithms such as mine where single ownership is adequate, much of this code would just "go away": owned refs would replace "RCRef", no wrappers would be required for closures or for refs because they would be owned, and the channels and threadpool libraries could be rewritten to be much simpler without the "cruft and compiler magic", **not depending on using global resources**, and thus written to be completely recursive if necessary. All that would be left would be the benchmark itself (sketched in stripped-down form below), and even that would be simpler, more concise, and more elegant through not having to call into the extra wrappers. It should also be somewhat (perhaps up to twice) faster, with no GC fighting us in the background and more direct forms of code.
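To illustrate the "manual closure" iterator technique mentioned above, here is a hedged sketch (made-up names such as `Counter` and `next`, not the Wandbox code): the environment a closure iterator would normally capture becomes an explicit state object, and a converter proc parameter makes the produced values "polymorphic":

```nim
# Hedged sketch of a hand-rolled closure iterator: explicit state plus a
# next proc, with a generic converter closure instead of a fixed yield type.
import options

type
  Counter = object
    cur, last: int

proc initCounter(first, last: int): Counter =
  Counter(cur: first, last: last)

# Returns none() when exhausted, otherwise the converted next value.
proc next[T](c: var Counter; conv: proc (i: int): T): Option[T] =
  if c.cur > c.last:
    none(T)
  else:
    let v = conv(c.cur)
    inc c.cur
    some(v)

when isMainModule:
  var c = initCounter(1, 5)
  while true:
    let v = c.next(proc (i: int): string = $(i * i))
    if v.isNone: break
    echo v.get   # prints "1", "4", "9", "16", "25"
```

Because the state is an explicit object rather than compiler-generated closure environment, it can be wrapped and preserved across threads with the same machinery as any other value.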
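And for scale, here is roughly what the stripped-down skeleton of the benchmark looks like with all the wrappers removed (again a hypothetical sketch, not the Wandbox code; under the current libraries this shape only works directly because the task takes and returns plain ints, so no GC'ed memory crosses threads):

```nim
# Rough sketch of the benchmark's skeleton: spawn 10_000 trivial tasks on
# the shared threadpool and block on each FlowVar result.
# Compile with: nim c --threads:on -d:release bench.nim
import threadpool

proc trivialTask(i: int): int =
  i + 1   # stand-in for real work

proc bench() =
  var answers = newSeq[FlowVar[int]](10_000)
  for i in 0 ..< 10_000:
    answers[i] = spawn trivialTask(i)
  var total = 0
  for fv in answers:
    total += ^fv   # `^` blocks until the spawned task has delivered
  echo total

bench()
```

With owned refs, the real tasks could take GC'ed parameters in essentially this same direct style.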
I had started work on converting this to use my own emulation of owned refs (since release builds currently won't run with threading and newruntime enabled at the same time), but I don't think I'll pursue it: emulating the new closures is quite hard without compiler help, and there is little point if newruntime support for threads is as imminent as it appears to be. I'll reserve the effort for when that happens, as I think this work makes it reasonably clear how much easier that could make working across threads!