Going to take a wild guess here: since core.atomic.casImpl will never be inlined anywhere by DMD, due to its inline assembly, you pay the cost of building and destroying a stack frame, passing the arguments, moving them into registers, and saving potentially clobbered registers every single time the GC even attempts to acquire a lock -- and the GC uses a single global lock for just about everything. As you can imagine, I suspect this is far from optimal. If I remember right, GDC uses intrinsics for the atomic operations instead.
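To make the per-call overhead concrete, here's a minimal sketch of the spin-on-CAS pattern being described. Note that `gcLock`, `lockGC`, and `unlockGC` are hypothetical stand-ins for illustration, not the actual druntime symbols:

```d
import core.atomic;

shared int gcLock;  // hypothetical stand-in for the GC's single global lock

void lockGC()
{
    // Spin until we swap 0 -> 1. The cas() itself is one lock'd cmpxchg,
    // but without inlining, every iteration of this loop also pays full
    // function-call overhead under DMD: frame setup, argument shuffling,
    // register saves.
    while (!cas(&gcLock, 0, 1)) {}
}

void unlockGC()
{
    atomicStore(gcLock, 0);
}

void main()
{
    lockGC();
    assert(atomicLoad(gcLock) == 1);
    unlockGC();
    assert(atomicLoad(gcLock) == 0);
}
```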
On 5/5/14, Atila Neves via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Sunday, 4 May 2014 at 17:01:23 UTC, safety0ff wrote:
>> On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu wrote:
>>> On 5/3/14, 2:42 PM, Atila Neves wrote:
>>>> gdc gave _very_ different results. I had to use different modules
>>>> because at some point tests started failing, but with gdc the
>>>> threaded version runs ~3x faster.
>>>>
>>>> On my own unit-threaded benchmarks, running the UTs for Cerealed
>>>> over and over again was only slightly slower with threads than
>>>> without. With dmd the threaded version was nearly 3x slower.
>>>
>>> Sounds like a severe bug in dmd or dependents. -- Andrei
>>
>> This reminds me of when I was parallelizing a Project Euler solution:
>> atomic access was so much slower on DMD that it made performance
>> worse than the single-threaded version for one stage of the program.
>>
>> I know that std.parallelism does make use of core.atomic under the
>> hood, so this may be a factor when using DMD.
>
> Funny you should say that: a friend of mine tried porting a lock-free
> algorithm of his from Java to D a few weeks ago. The D version ran
> three orders of magnitude slower. Then I tried gdc and ldc on his
> code. ldc produced code running at around 80% of the speed of the
> Java version, gdc was around 30%. But dmd...
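For anyone who wants to check this on their own machine, here's a minimal sketch of the kind of micro-benchmark that should expose the atomics overhead; the iteration count is arbitrary and the timing scaffolding is just illustrative, not anything from the thread:

```d
import core.atomic;
import core.time : MonoTime;
import std.stdio : writefln;

void main()
{
    enum N = 1_000_000;
    shared long counter;

    // Time N uncontended atomic increments; with non-inlined atomics the
    // per-call overhead dominates, so compilers that inline them (gdc/ldc)
    // should show a large difference versus dmd here.
    immutable start = MonoTime.currTime;
    foreach (i; 0 .. N)
        atomicOp!"+="(counter, 1);
    immutable elapsed = MonoTime.currTime - start;

    writefln("%s atomic increments in %s", N, elapsed);
    assert(atomicLoad(counter) == N);
}
```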