Fwd: [Lazarus] Faster than popcnt

Martin Frb via fpc-devel Tue, 04 Jan 2022 07:31:56 -0800

On 04/01/2022 10:31, Marco van de Voort via fpc-devel wrote:

Weird as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result beforehand?


Same here.

----------------------------------------
About UTF8LengthFast()

Well, before I get to this, I noted something weird.....

2 runs, compiled with the same compiler (3.2.3) and the same settings, with the only difference: -gw3 or not -gw3
=> And the speed differed: 600 (with dwarf) vs 700 (no dwarf), reproducibly.


So then

=> I compiled the app, with the same 3.2.3 (and I compiled WITHOUT -a / though with -al I get the same results)

=> I compiled once with -gw3 , and once without dwarf info
=> I used objdump to dis-assemble the exe.
=> I diffed, and searched for 0x101010.... (only used in the ...Fast code)
--> _*The assembler is identical.*_
Yet one is faster. (the one WITH dwarf)

The calling code in the main body is the same too (as far as I could see), except that the address of the callee is different (but it is just 20 calls per measurement)


I did those runs OUTSIDE the IDE.
So no debugger in the background.

Win64 / 64 bit
Core I7 8600K

Using 3.3.1 the speed is equal, no matter whether dwarf is generated or not.
(I did not compare the asm for that...)

---------------------

Clinging to straws, there is one (maybe) diff in the 3.2.3 with/without dwarf assembler.

*** I am totally out of my depth here ****

Alignment. 16 vs 32 byte. Can that make a difference?

According to: https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache

The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops.
All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within _*the same aligned 32-byte region*_.


So the alignment of the 2 procedures differs by 16 bytes.
The proc entry is at
With dwarf 100001870  // only 16 byte aligned (but faster)
without    100001860  // actually this is 32 byte aligned (but slower)

Yet, maybe it matters which statements in the big loop happen to fall into the same 32 byte block???


The loop starting with
   for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do

With DWARF (faster):
   1000018f0:    49 83 c2 01              add    $0x1,%r10
   1000018f4:    4c 8b 19                 mov    (%rcx),%r11
   1000018f7:    4d 89 d8                 mov    %r11,%r8
   1000018fa:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
   100001901:    80 80 80
   100001904:    49 21 f8                 and    %rdi,%r8
   100001907:    49 c1 e8 07              shr    $0x7,%r8
   10000190b:    49 f7 d3                 not    %r11
   10000190e:    49 c1 eb 06              shr    $0x6,%r11
   100001912:    4d 21 d8                 and    %r11,%r8
   100001915:    4c 89 c3                 mov    %r8,%rbx
   100001918:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
   10000191f:    01 01 01
   100001922:    4d 0f af c3              imul   %r11,%r8
   100001926:    49 c1 e8 38              shr    $0x38,%r8
   10000192a:    4c 01 c6                 add    %r8,%rsi
   10000192d:    48 83 c1 08              add    $0x8,%rcx
   100001931:    4d 39 d1                 cmp    %r10,%r9
   100001934:    7f ba                    jg     1000018f0 <P$PROGRAM_$$_UTF8LENGTHFAST$PCHAR$INT64$$INT64+0x80>


WITHOUT:
   1000018e0:    49 83 c2 01              add    $0x1,%r10
   1000018e4:    4c 8b 19                 mov    (%rcx),%r11
   1000018e7:    4d 89 d8                 mov    %r11,%r8
   1000018ea:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
   1000018f1:    80 80 80
   1000018f4:    49 21 f8                 and    %rdi,%r8
   1000018f7:    49 c1 e8 07              shr    $0x7,%r8
   1000018fb:    49 f7 d3                 not    %r11
   1000018fe:    49 c1 eb 06              shr    $0x6,%r11
   100001902:    4d 21 d8                 and    %r11,%r8
   100001905:    4c 89 c3                 mov    %r8,%rbx
   100001908:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
   10000190f:    01 01 01
   100001912:    4d 0f af c3              imul   %r11,%r8
   100001916:    49 c1 e8 38              shr    $0x38,%r8
   10000191a:    4c 01 c6                 add    %r8,%rsi
   10000191d:    48 83 c1 08              add    $0x8,%rcx
   100001921:    4d 39 d1                 cmp    %r10,%r9
   100001924:    7f ba                    jg     0x1000018e0


----------------------------------------
*** written before I got into the above......
-------

About UTF8LengthFast()

I notice 3 differences.

Though I only compare the O4 result, and have no idea what is pre-peephole and post-peephole.

1) As you say: different registers (but the same statements, in the same order).

No idea if that affects the CPU.

2) One extra statement in 3.3.1

    movq    %r10,%r8 //// <<<<<<<<<<<<<<< not in 3.2.3
    movq    $72340172838076673,%r10
    imulq    %r10,%r8

3) I do have "dwarf" enabled, even though at O4 that is not expected to do any good.

I noted that in 3.3.1 this leads to way more asm-labels than in 3.2.3.

Those labels are only referred to by dwarf line info (some asm statements are reported to be in the "begin" line, even though they clearly are not).

- It could be a result of the peephole opt.

- But it could also be that the peephole is affected by the presence of those labels.
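For reference, this is what the and/shr/not/imul sequence in the listings appears to compute for one 64-bit word, written back as Pascal. It is my reconstruction from the disassembly, not a copy of the LazUTF8 source; the names ContinuationBytes, EightyMask and OneMask are placeholders. The constant 72340172838076673 from point 2 is $0101010101010101.

```pascal
program WordStep;
{$mode objfpc}

const
  EightyMask = QWord($8080808080808080);
  OneMask    = QWord($0101010101010101);  // = 72340172838076673

// Count the bytes of the form 10xxxxxx (UTF-8 continuation bytes)
// in one 64-bit word.
function ContinuationBytes(w: QWord): Integer;
var
  nx: QWord;
begin
  // Bit 0 of each byte lane becomes 1 when bit 7 is set and bit 6 is clear.
  nx := ((w and EightyMask) shr 7) and ((not w) shr 6);
  {$push}{$Q-}{$R-}
  // Multiplying by OneMask sums the eight lanes into the top byte;
  // the low 64 bits of the product are all that is needed.
  Result := Integer((nx * OneMask) shr 56);
  {$pop}
end;

var
  // 'a', U+00E9 (C3 A9), U+20AC (E2 82 AC), 'b', 'c': 8 bytes in total
  s: array[0..7] of Byte = ($61, $C3, $A9, $E2, $82, $AC, $62, $63);
  w: QWord;
begin
  Move(s, w, SizeOf(w));
  WriteLn('continuation bytes: ', ContinuationBytes(w));
  WriteLn('code points:        ', 8 - ContinuationBytes(w));
end.
```

For the sample bytes this should report 3 continuation bytes and 5 code points. The per-lane values are only 0 or 1, so the sum in the top byte cannot exceed 8 and no carry leaks in from the lower lanes.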

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
