On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:

Weird, as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result beforehand?

Same here.

I looked up popcnt and found nothing about it not setting the result when the input is zero. (E.g. https://www.felixcloutier.com/x86/popcnt )
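To illustrate the zero case: a minimal sketch in C (not the code under discussion) of a branch-free software population count. The name popcount64_swar is made up for this example. A zero input simply falls through the same arithmetic and yields 0, so no special handling or pre-zeroing is needed in this formulation.

```c
#include <stdint.h>

/* Hypothetical helper: portable SWAR population count.
   There is no branch and no special case; x == 0 yields 0. */
static unsigned popcount64_swar(uint64_t x) {
    /* 2-bit lanes: each lane holds the count of its two bits */
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    /* 4-bit lanes */
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    /* 8-bit lanes */
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* sum the eight byte lanes into the top byte */
    return (unsigned)((x * UINT64_C(0x0101010101010101)) >> 56);
}
```

On GCC and Clang the result should match __builtin_popcountll for every input, including 0.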

Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version on the web with the stats. The timings for the long string are about the same as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen should be faster in every way?!?), but the short one (40 bytes) is significantly faster. What I don't get is why the assembler version seems systematically faster even for the short code. The generated asm is nearly the same.

Also notable is that on this machine with popcnt (-Cpcoreavx), the popcnt version is as fast as the add function within error margins, so the popcnt instruction is probably faster (lower latency) and thus less of a bottleneck on this machine. Note that the POP() function is half the size, which makes it the better choice for newer machines.
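For readers who have not seen the two variants: a rough sketch in C of the general idea, not the actual routines from this thread. It assumes the approach is to mark UTF-8 continuation bytes (10xxxxxx) eight bytes at a time and then count the marks either with a population count or with a horizontal add. All names (cont_mask, utf8_len_pop, utf8_len_add) are invented for the example, and __builtin_popcountll is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit 0 of each byte lane is set when that byte is a UTF-8
   continuation byte, i.e. bit 7 is set and bit 6 is clear. */
static uint64_t cont_mask(uint64_t x) {
    return (x >> 7) & (~x >> 6) & UINT64_C(0x0101010101010101);
}

/* Variant 1: count the marks with a population count. */
static size_t utf8_len_pop(const char *s, size_t n) {
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);            /* unaligned-safe load */
        cont += (size_t)__builtin_popcountll(cont_mask(w));
    }
    for (; i < n; i++)                   /* tail, byte by byte */
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;                     /* code points = bytes - continuations */
}

/* Variant 2: sum the byte lanes with a multiply (horizontal add). */
static size_t utf8_len_add(const char *s, size_t n) {
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        cont += (size_t)((cont_mask(w) * UINT64_C(0x0101010101010101)) >> 56);
    }
    for (; i < n; i++)
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}
```

Both variants assume valid UTF-8 and take an explicit byte length. The only difference is the reduction step, which is where the latency of popcnt on a given CPU would show up.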

---------

Note that I test on Windows, so the "two times load" might be a difference caused by different code generation on Windows.


----------------------------------------
About UTF8LengthFast()

Well, before I get to this, I noticed something weird...

Two runs, compiled with the same compiler (3.2.3) and the same settings, the only difference being -gw3 or no -gw3, and the speed differed: 600 (with DWARF) vs 700 (no DWARF), reproducibly.

I have also seen this while working on the code, and indeed mainly with the "fast" one. It also explains why the assembler version is always consistent: it suffers less from small code changes, e.g. when I update FPC from git, and thus from different alignment (assuming that the section starts are always aligned).

Alignment: 16 vs 32 byte. Can that make a difference?
According to: https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache

That seems to be a problem on Skylake and later architectures, which I no longer have. The i7 is too old, and the others are AMD.
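As a small aid for checking the alignment theory: a sketch in C with two made-up helpers (offset_in_block, crosses_block). It only does the address arithmetic: where an address falls inside a 16- or 32-byte block, and whether a code range of a given length straddles a block boundary. It assumes the block size is a power of two, and it says nothing about how a particular CPU reacts.

```c
#include <stdint.h>

/* Hypothetical helper: offset of addr inside its block.
   block must be a power of two (e.g. 16 or 32). */
static unsigned offset_in_block(uintptr_t addr, unsigned block) {
    return (unsigned)(addr & (uintptr_t)(block - 1));
}

/* Hypothetical helper: nonzero when the byte range
   [addr, addr + len) spans more than one block. */
static int crosses_block(uintptr_t addr, unsigned len, unsigned block) {
    return len != 0 && (addr / block) != ((addr + len - 1) / block);
}
```

An address such as 0x1010 is 16-byte aligned but sits 16 bytes into its 32-byte block, so a routine can keep its 16-byte alignment and still land differently relative to 32-byte boundaries after an unrelated change elsewhere in the binary.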


_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
