Going to take a wild guess, but since core.atomic.casImpl will never be
inlined anywhere with DMD due to its inline assembly, you pay the cost
of building and destroying a stack frame, passing the args in, moving
them into registers, saving potentially trashed registers, etc. every
time it even attempts to acquire a lock. The GC uses a single global
lock for just about everything, so as you can imagine, I suspect this
is far from optimal. If I remember right, GDC uses intrinsics for the
atomic operations instead.
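For reference, here's a minimal sketch of the kind of CAS retry loop
that ends up going through casImpl on DMD (the spin-increment is just
my own illustration, not code from anyone's benchmark):

    import core.atomic;

    shared int counter;

    void incrementCAS()
    {
        // On DMD, every cas() here is a full out-of-line call into
        // casImpl, so the stack-frame / register-shuffling cost
        // described above is paid on every retry. With intrinsics
        // (as GDC has, if I remember right) the whole loop can be
        // inlined down to a lock cmpxchg.
        int expected;
        do
        {
            expected = atomicLoad(counter);
        } while (!cas(&counter, expected, expected + 1));
    }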

On 5/5/14, Atila Neves via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Sunday, 4 May 2014 at 17:01:23 UTC, safety0ff wrote:
>> On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu
>> wrote:
>>> On 5/3/14, 2:42 PM, Atila Neves wrote:
>>>> gdc gave _very_ different results. I had to use different
>>>> modules
>>>> because at some point tests started failing, but with gdc the
>>>> threaded
>>>> version runs ~3x faster.
>>>>
>>>> On my own unit-threaded benchmarks, running the UTs for
>>>> Cerealed over
>>>> and over again was only slightly slower with threads than
>>>> without. With
>>>> dmd the threaded version was nearly 3x slower.
>>>
>>> Sounds like a severe bug in dmd or dependents. -- Andrei
>>
>> This reminds me of when I was parallelizing a project euler
>> solution: atomic access was so much slower on DMD that it made
>> performance worse than the single threaded version for one
>> stage of the program.
>>
>> I know that std.parallelism does make use of core.atomic under
>> the hood, so this may be a factor when using DMD.
>
> Funny you should say that: a friend of mine tried porting a
> lock-free algorithm of his from Java to D a few weeks ago. The D
> version ran 3 orders of magnitude slower. Then I tried gdc and
> ldc on his code. ldc produced code running at around 80% of the
> speed of the Java version, gdc was around 30%. But dmd...
>
