Fwd: [Lazarus] Faster than popcnt

J. Gareth Moreton via fpc-devel Tue, 04 Jan 2022 08:16:03 -0800

I neglected to include -Cpcoreavx, that was my bad.  I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol 2B, page 4-391, the zero flag is set if the source is zero, and cleared otherwise. Regarding an undefined result, I got confused with the BSF and BSR instructions, sorry. I guess I was more tired than I thought! POPCNT returns zero for a zero input.
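To make the distinction concrete, here is a minimal C sketch (not code from this thread). It assumes GCC or Clang, whose `__builtin_popcountll` is defined for every input, while `__builtin_ctzll(0)` is documented as undefined, much like BSF leaves its destination undefined for a zero source. The function names `bits_set`, `lowest_set_bit` and `check_zero_input` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Population count: defined for every input, including 0.
   The builtin becomes a POPCNT instruction only when the target
   allows it (e.g. -mpopcnt); otherwise a software fallback is used. */
unsigned bits_set(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}

/* Index of the lowest set bit. __builtin_ctzll(0) is undefined,
   so the zero case is handled explicitly with a caller-chosen value. */
unsigned lowest_set_bit(uint64_t x, unsigned if_zero)
{
    if (x == 0)
        return if_zero;
    return (unsigned)__builtin_ctzll(x);
}

/* Self-check of the zero-input behaviour described above. */
void check_zero_input(void)
{
    assert(bits_set(0) == 0);              /* no special handling needed */
    assert(lowest_set_bit(0, 64) == 64);   /* guard supplies the answer  */
    assert(lowest_set_bit(8, 64) == 3);
}
```

So a popcount-based routine needs no special case for zero, whereas a bit-scan-based one does.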


Gareth aka. Kit

On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:

On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
Weird, as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result beforehand?
Same here.
I looked up popcnt and found nothing about not setting the result if the input is zero. (E.g. https://www.felixcloutier.com/x86/popcnt )
Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version on the web with the stats. The long-string timings are about the same as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen should be faster in any case?!?), but the short one (40 bytes) is significantly faster. What I don't get is why the assembler version seems systematically faster even for the short string. The generated asm is nearly the same.
Also notable is that on this machine with popcnt (-Cpcoreavx), the popcnt version is as fast as the add function within error margins, so the popcnt instruction is probably faster (lower latency) and thus less of a bottleneck on this machine. Note that the POP() function is half the size, so that makes it better for newer machines.
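For readers without the code at hand, the general idea behind a popcount-based UTF-8 length can be sketched in C as follows. This is an illustration under assumptions, not the UTF8LengthFast() or POP() code being benchmarked here: the name `utf8_codepoints` is hypothetical, `__builtin_popcountll` is a GCC/Clang builtin, and the input is assumed to be valid UTF-8. Continuation bytes have the form 10xxxxxx, so the number of code points is the byte length minus the number of continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count UTF-8 code points by counting continuation bytes (10xxxxxx),
   eight bytes at a time, and subtracting them from the byte length. */
size_t utf8_codepoints(const char *s, size_t len)
{
    const uint64_t ones = 0x0101010101010101ULL;
    size_t continuation = 0;
    size_t i = 0;

    assert(s != NULL || len == 0);

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* unaligned-safe load */
        /* Lowest bit of each byte = (bit 7 set) AND (bit 6 clear). */
        uint64_t marks = (w >> 7) & (~w >> 6) & ones;
        continuation += (size_t)__builtin_popcountll(marks);
    }
    /* Tail: fewer than eight bytes left, handle them one by one. */
    for (; i < len; i++)
        continuation += (((unsigned char)s[i] & 0xC0) == 0x80);

    return len - continuation;
}
```

An "add" variant would instead sum the per-byte marks with ordinary additions and fold them at the end; which of the two is faster depends on the CPU, as the timings above suggest.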
---------
Note that I test on Windows, so it might be that the "two times load" is a difference somehow due to different code generation on Windows.
----------------------------------------
About UTF8LengthFast()

Well, before I get to this, I noted something weird.....
2 runs, compiled with the same compiler (3.2.3) and the same settings, with the only difference: -gw3 or not -gw3 => And the speed differed. 600 (with dwarf) vs 700 (no dwarf) / reproducible.
I also have seen this, while working on the code. And indeed mainly with the "fast" one. It also explains why the assembler is always consistent: it suffers less from detail code changes when I e.g. update FPC from git, and thus different alignment. (assuming that the section starts are always aligned)
Alignment. 16 vs 32 byte. Can that make a difference?
According to: https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
Seems to be a problem of the Skylake and later archs, which I no longer have. The i7 is too old, and the others are AMD.
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


