On Fri, 30 May 2014 15:54:57 +0000, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dl...@gmail.com> wrote:
> On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
> > simplicity. But as soon as I added a single CAS I was already
> > over the time that TCMalloc needs. That way I learned that CAS
> > is not as cheap as it looks and the fastest allocators work
> > thread-local as long as possible.
>
> 22 cycles latency if on a valid cacheline?
> + overhead of going to memory
>
> Did you try to add explicit prefetch, maybe that would help?
>
> Prefetch is expensive on Ivy Bridge (43 cycles throughput, 0.5
> cycles on Haswell). You need instructions to fill the pipeline
> between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM
> and do a lot of testing on different CPUs. Explicit prefetching,
> lock-free strategies etc. are tricky to get right. Get it wrong
> and it is worse than the naive implementation.

I'm on a Core 2 Duo. But this doesn't sound like something I want
to try. core.atomic is as low as I wanted to go. Anyway, I deleted
that code when I realized just how fast allocation is with
TCMalloc already. And that's a general purpose allocator.

-- 
Marco