I neglected to include -Cpcoreavx; that was my bad. I'll try again.
According to the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is zero
and cleared otherwise. As for the undefined result, I got confused
with the BSF and BSR instructions, sorry. I guess I was more tired than I
thought! POPCNT returns zero for a zero input.
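For anyone who wants to check the zero-input case quickly, here is a minimal sketch in C. It assumes GCC or Clang (the `__builtin_popcountll` builtin), and the helper name `popcount_u64` is made up for illustration; it is not taken from FPC. Unlike BSF/BSR, whose destination is undefined for a zero source, the population count of zero is simply zero, so no special handling is needed:

```c
#include <assert.h>   /* for the checks mentioned below */
#include <stdint.h>

/* Hypothetical helper: number of set bits in a 64-bit value.
   Assumes GCC/Clang; when built with -mpopcnt this is expected to
   compile down to a single POPCNT instruction. */
static int popcount_u64(uint64_t x)
{
    return __builtin_popcountll(x);
}
```

A check such as `assert(popcount_u64(0) == 0);` should hold whether or not the hardware instruction is used.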
Gareth aka. Kit
On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:
On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
Weird, as mine is inlined with -Cpcoreavx -O4, with no special
handling for 0. But that does put some things on shaky ground. Maybe
zero the result beforehand?
Same here.
I looked up popcnt and found nothing about the result not being set
when the input is zero (e.g. https://www.felixcloutier.com/x86/popcnt ).
Meanwhile I have also run it on my Ryzen 4800H laptop and updated the
version on the web with the stats. The long string runs about as fast
as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen should
be faster in every way?!?), but the short one (40 bytes) is
significantly faster. What I don't get is why the assembler version
seems systematically faster even for the short code. The generated asm
is nearly the same.
Also notable is that on this machine, with popcnt enabled (-Cpcoreavx),
the popcnt version is as fast as the add function within error margins,
so the popcnt instruction is probably faster (lower latency) and thus
less of a bottleneck on this machine. Note that the POP() function is
half the size, which makes it the better choice for newer machines.
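To make the "popcnt vs add" comparison concrete, here is a rough sketch in C of the general technique: count UTF-8 code points by subtracting the number of continuation bytes (10xxxxxx) from the byte length, eight bytes at a time. This is not the FPC/Lazarus UTF8LengthFast code; the names `continuation_mask`, `utf8_length_popcnt` and `utf8_length_add` are hypothetical, and GCC/Clang is assumed for `__builtin_popcountll`. The two variants differ only in how the per-byte flags of a word are summed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES 0x0101010101010101ULL

/* Sets the low bit of each byte of w that is a UTF-8 continuation
   byte, i.e. has bit 7 set and bit 6 clear (10xxxxxx). */
static uint64_t continuation_mask(uint64_t w)
{
    return (w >> 7) & (~w >> 6) & ONES;
}

/* Variant 1: sum the per-byte flags with a population count. */
static size_t utf8_length_popcnt(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);          /* safe unaligned load */
        cont += (size_t)__builtin_popcountll(continuation_mask(w));
    }
    for (; i < n; i++)                 /* remaining tail bytes */
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}

/* Variant 2: sum the per-byte flags with a multiply; the top byte of
   the product holds the total (at most 8, so no byte overflows). */
static size_t utf8_length_add(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        cont += (size_t)((continuation_mask(w) * ONES) >> 56);
    }
    for (; i < n; i++)
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}
```

Both functions should return the same count for any input; which one is faster depends on the popcnt latency of the CPU, which is the effect described above.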
---------
Note that I test on Windows, so the "two times load" difference might
somehow be due to different code generation on Windows.
----------------------------------------
About UTF8LengthFast()
Well, before I get to this, I noticed something weird...
Two runs, compiled with the same compiler (3.2.3) and the same
settings, the only difference being -gw3 or no -gw3.
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF),
reproducibly.
I have also seen this while working on the code, and indeed mainly
with the "fast" one. It also explains why the assembler version is
always consistent: it suffers less from small code changes, e.g. when I
update FPC from git, and thus from different alignment (assuming that
the section starts are always aligned).
Alignment. 16 vs 32 byte. Can that make a difference?
According to:
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
Seems to be a problem of the Skylake and later archs, which I no
longer have. The i7 is too old, and the others are AMD.
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel