There was some discussion around here a while back about the possibility of using thread-local heaps in the standard GC. This was rejected largely because of the complexity it would add when casting to shared/immutable.
I'm wondering if it would be a good idea to allow memory to be explicitly allocated as thread-local through a separate GC. Such a GC would be designed from the ground up to assume thread-local data and would never be used to allocate in standard Phobos or Druntime functions. It would simply be a Phobos module, something like std.localgc. The only way to use it would be to explicitly call something like ThreadLocal.malloc, or to pass it as a parameter to something that needs an allocator.
The collector would (unsafely) assume that you always maintain at least one pointer to all thread-locally allocated data in one of three places: the relevant thread's stack, the thread-local heap itself, or thread-local storage. The global heap, __gshared storage and other threads' stacks would not be scanned.
A major issue I see is interfacing such a GC with the regular GC so that pointers from thread-local memory to shared memory are dealt with properly, without being excessively conservative. The thread-local GC would likely use core.stdc.stdlib.malloc() to allocate large blocks of memory, and would need a way to signal to the shared GC which blocks might contain pointers without synchronizing on every update.
If this sounds like a good idea, maybe I'll start prototyping it. Overall, the idea is that thread-local heaps are an optimization that should be done explicitly when/if you need it, not something that needs to be built deep into the language runtime.
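To make the idea a bit more concrete, here is a rough sketch of what the allocation side might look like. Everything in it is an assumption for illustration: the ThreadLocal struct, its malloc/free functions and the mayHoldPointers flag are hypothetical names, and no actual thread-local collector is implemented. It relies only on two existing Druntime calls, GC.addRange and GC.removeRange, as one possible way to tell the shared GC which blocks might contain pointers. Each of those calls synchronizes once per block, not once per pointer update, but the shared GC then scans the whole block conservatively on every collection, so this does not solve the conservatism problem.

```d
import core.memory : GC;
import core.stdc.stdlib : cMalloc = malloc, cFree = free;
import core.stdc.string : memset;

/// Hypothetical API sketch; not an existing Phobos module.
struct ThreadLocal
{
    // Header stored in front of every payload.
    private static struct Block
    {
        Block* next;
        size_t size;
        bool mayHoldPointers;
    }

    // Static variables are thread-local by default in D,
    // so every thread gets its own independent block list.
    private static Block* blocks;

    /// Allocates `size` bytes owned by the calling thread.
    /// The payload is pointer-aligned, not necessarily 16-byte aligned.
    static void* malloc(size_t size, bool mayHoldPointers = true)
    {
        auto raw = cast(Block*) cMalloc(Block.sizeof + size);
        if (raw is null)
            return null;
        *raw = Block(blocks, size, mayHoldPointers);
        blocks = raw;

        void* payload = raw + 1;
        if (mayHoldPointers)
        {
            // Zero the payload so stale bytes are not mistaken for pointers,
            // then let the shared GC scan it for references to shared memory.
            memset(payload, 0, size);
            GC.addRange(payload, size);
        }
        return payload;
    }

    /// Releases a block previously returned by ThreadLocal.malloc
    /// on the same thread. Unknown pointers are ignored.
    static void free(void* p)
    {
        if (p is null)
            return;
        auto target = (cast(Block*) p) - 1;
        for (Block** link = &blocks; *link !is null; link = &(*link).next)
        {
            if (*link is target)
            {
                *link = target.next;
                if (target.mayHoldPointers)
                    GC.removeRange(p);
                cFree(target);
                return;
            }
        }
    }
}

unittest
{
    // A thread-local block holding a pointer into the shared GC heap.
    auto p = cast(int**) ThreadLocal.malloc((int*).sizeof);
    *p = new int;
    **p = 42;
    GC.collect();          // the registered range keeps the int alive
    assert(**p == 42);
    ThreadLocal.free(p);
}
```

A real std.localgc would replace the explicit free with a collector that scans only the owning thread's stack, TLS and the blocks in this list, which is where the (unsafe) reachability assumption described above would come in.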
