31-Jul-2013 22:20, Walter Bright wrote:
On 7/31/2013 8:26 AM, Dmitry Olshansky wrote:
Ouch... to boot it's always aligned by word size, so
key % sizeof(size_t) == 0
...
rendering lower 2-3 bits useless, that would make straight slice lower bits approach rather weak :)

Yeah, I realized that, too. Gotta shift it right 3 or 4 bits.

And that helped a bit... Anyhow, after switching to a somewhat more pervasive integer hash, power-of-2 tables live up to their promise.
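To illustrate the idea (this is only a rough sketch, not the code from the pull request): the names naiveIndex, mixKey and mixedIndex are made up for this example, and the shift/xor constants are arbitrary assumptions. It shows why taking the low bits of an aligned pointer directly is weak, and how shifting the alignment bits away before masking with (tableSize - 1) spreads keys across a power-of-2 table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Naive scheme: use the low bits of the key as the bucket index.
// For word-aligned pointers the lowest 2-3 bits are always zero,
// so only a fraction of the buckets can ever be hit.
inline std::size_t naiveIndex(std::uintptr_t key, std::size_t tableSize) {
    assert(tableSize != 0 && (tableSize & (tableSize - 1)) == 0); // power of 2
    return key & (tableSize - 1);
}

// Hypothetical mixing step (constants are illustrative assumptions):
// drop the alignment bits, then fold some higher bits into the low ones.
inline std::size_t mixKey(std::uintptr_t key) {
    key >>= 4;         // discard bits that are always zero due to alignment
    key ^= key >> 7;   // let higher bits influence the bucket choice
    return key;
}

// Power-of-2 table lookup: a mask replaces the modulo by a prime.
inline std::size_t mixedIndex(std::uintptr_t key, std::size_t tableSize) {
    assert(tableSize != 0 && (tableSize & (tableSize - 1)) == 0); // power of 2
    return mixKey(key) & (tableSize - 1);
}
```

With a 16-entry table, two keys that are 16 bytes apart (say 0x1000 and 0x1010) both land in bucket 0 under naiveIndex, while mixedIndex puts them in different buckets.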

The pull request that reaps a minor speed benefit over the original (~2% speed gain!):
https://github.com/D-Programming-Language/dmd/pull/2436

Not bad given that _aaGetRvalue itself accounts for only a small fraction of the total time.

I failed to see much of any improvement on Win32 though; allocations are dominating the picture.

And sharing the joy of having a nice sampling profiler, here is what AMD CodeAnalyst has to say (top X functions by CPU clocks not halted).

Original DMD:

Function                      CPU clocks   DC accesses   DC misses
RTLHeap::Alloc                     49410           520        3624
Obj::ledata                        10300          1308        3166
Obj::fltused                        6464          3218           6
cgcs_term                           4018          1328         626
TemplateInstance::semantic          3362          2396          26
Obj::byte                           3212           506         692
vsprintf                            3030          3060           2
ScopeDsymbol::search                2780          1592         244
_pformat                            2506          2772          16
_aaGetRvalue                        2134           806         304
memmove                             1904          1084          28
strlen                              1804           486          36
malloc                              1282           786          40
Parameter::foreach                  1240           778          34
StringTable::search                  952           220          42
MD5Final                             918           318

Variation of DMD with pow-2 tables:

Function                      CPU clocks   DC accesses   DC misses
RTLHeap::Alloc                     51638           552        3538
Obj::ledata                         9936          1346        3290
Obj::fltused                        7392          2948           6
cgcs_term                           3892          1292         638
TemplateInstance::semantic          3724          2346          20
Obj::byte                           3280           548         676
vsprintf                            3056          3006           4
ScopeDsymbol::search                2648          1706         220
_pformat                            2560          2718          26
memcpy                              2014          1122          46
strlen                              1694           494          32
_aaGetRvalue                        1588           658         278
Parameter::foreach                  1266           658          38
malloc                              1198           758          44
StringTable::search                  970           214          24
MD5Final                             866           274           2


This underlines the point that the DMC RTL allocator is the biggest speed detractor. It is "followed" by ledata (could it be due to a linear search inside?) and, surprisingly, the tiny Obj::fltused is draining lots of cycles (is it called that often?).

--
Dmitry Olshansky
