Though isn't compilation mostly either single-threaded or highly parallel? Modern CPUs are great at "caching" atomics, but what happens when several cores contend for the same cache line?
I'd be curious to see how this would do with a multi-threaded work queue. Perhaps a synthetic or real-world test on something like Mummy would be good.
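
For what it's worth, a synthetic test could start from something like the rough sketch below. It is only an assumption about what such a test might look like, not anything taken from a real compiler or from Mummy: the function names `contended` and `uncontended`, the thread count, and the iteration count are all made up for illustration. One version has every thread increment a single shared atomic (so the cache line gets bounced between cores), the other gives each thread its own atomic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: every thread increments ONE shared counter,
/// so all cores contend for the same cache line.
fn contended(threads: usize, iters: u64) -> (u64, Duration) {
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    (counter.load(Ordering::Relaxed), start.elapsed())
}

/// Hypothetical helper: each thread increments its OWN counter,
/// so no cache line is shared between cores.
/// Note: the optimizer is free to simplify this loop, so treat the
/// timing as a loose baseline rather than a precise measurement.
fn uncontended(threads: usize, iters: u64) -> (u64, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let local = AtomicU64::new(0);
                for _ in 0..iters {
                    local.fetch_add(1, Ordering::Relaxed);
                }
                local.load(Ordering::Relaxed)
            })
        })
        .collect();
    // Sum the per-thread totals after all threads have finished.
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (total, start.elapsed())
}

fn main() {
    // Arbitrary values chosen for illustration only.
    let (threads, iters) = (4, 1_000_000);
    let (n, t) = contended(threads, iters);
    println!("contended:   {n} increments in {t:?}");
    let (n, t) = uncontended(threads, iters);
    println!("uncontended: {n} increments in {t:?}");
}
```

Build with `--release` and vary the thread count to see how the gap changes. A tight increment loop like this is close to the worst case for contention, so a real work queue would probably show a smaller difference, which is why a real-world test would still be worth having.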