On 8/2/2013 6:16 AM, Dmitry Olshansky wrote:
31-Jul-2013 22:20, Walter Bright wrote:
On 7/31/2013 8:26 AM, Dmitry Olshansky wrote:
Ouch... to boot, it's always aligned by word size, so
key % sizeof(size_t) == 0
...
rendering the lower 2-3 bits useless. That would make the straight
slice-lower-bits approach rather weak :)
Yeah, I realized that, too. Gotta shift it right 3 or 4 bits.
And that helped a bit... Anyhow, after making the integer hash a bit more
pervasive, power-of-2 tables live up to their promise.
The pull that reaps the minor speed benefit over the original (~2% speed gain!):
https://github.com/D-Programming-Language/dmd/pull/2436
2% is worth taking.
Not bad, given that _aaGetRvalue takes only a fraction of the time itself.
I failed to see much of any improvement on Win32, though; allocations
dominate the picture there.
And, sharing the joy of having a nice sampling profiler, here is what AMD
CodeAnalyst has to say (top X functions by CPU clocks not halted).
Original DMD:
Function                      CPU clocks   DC accesses   DC misses
RTLHeap::Alloc                     49410           520        3624
Obj::ledata                        10300          1308        3166
Obj::fltused                        6464          3218           6
cgcs_term                           4018          1328         626
TemplateInstance::semantic          3362          2396          26
Obj::byte                           3212           506         692
vsprintf                            3030          3060           2
ScopeDsymbol::search                2780          1592         244
_pformat                            2506          2772          16
_aaGetRvalue                        2134           806         304
memmove                             1904          1084          28
strlen                              1804           486          36
malloc                              1282           786          40
Parameter::foreach                  1240           778          34
StringTable::search                  952           220          42
MD5Final                             918           318
Variation of DMD with pow-2 tables:
Function                      CPU clocks   DC accesses   DC misses
RTLHeap::Alloc                     51638           552        3538
Obj::ledata                         9936          1346        3290
Obj::fltused                        7392          2948           6
cgcs_term                           3892          1292         638
TemplateInstance::semantic          3724          2346          20
Obj::byte                           3280           548         676
vsprintf                            3056          3006           4
ScopeDsymbol::search                2648          1706         220
_pformat                            2560          2718          26
memcpy                              2014          1122          46
strlen                              1694           494          32
_aaGetRvalue                        1588           658         278
Parameter::foreach                  1266           658          38
malloc                              1198           758          44
StringTable::search                  970           214          24
MD5Final                             866           274           2
This underscores the point that the DMC RTL allocator is the biggest speed
detractor. It is "followed" by ledata (could it be due to the linear search
inside?), and surprisingly the tiny Obj::fltused is draining lots of cycles
(is it called that often?).
It's not fltused() that is taking up the time, it is the static function
following it. The sampling profiler you're using is unaware of non-global
function names.