Fwd: [Lazarus] Faster than popcnt

Marco van de Voort via fpc-devel Tue, 04 Jan 2022 01:31:20 -0800


On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:

Prepare for a lot of technical rambling!
This is just an analysis of the compilation of utf8lentest.lpr, notany of the System units. Notably, POPCNT isn't called directly, butinstead goes through the System unit via "call fpc_popcnt_qword" onboth 3.2.x and 3.3.1. A future study of "fpc_popcnt_qword" mightyield some interesting information.


Did you pass -Cpcoreavx    ?

# [381] Result += PopCnt(qword(nx));
    popcntq    %r10,%r8
    addq    %r8,%rbx

In this situation, one can replace %r11 with %r13 in the 3rd and 4thinstructions so the massive constant isn't assigned twice:

The large constant(s) should be hoisted out of the loop. This is done inthe assembler code:


.Lj11:
    movq    (%rcx),%r8
    movq    %r8,%r10
    andq    %rdx,%r10
    notq    %r8
    shrq    $7,%r10
    shrq    $6,%r8
    andq    %r8,%r10

subq %r8,%rax // subtract from rax=length(s). Iowalgorithm is slightly different. don't sum

// correction butdirectly subtract from length. Saves bytecount from being a live valuetill the end


     popcntq    %r10,%r8
    addq    $8,%rcx
    decq    %r11
    jne    .Lj11

As far as I can see the assembler version is about 25-33% faster for lowcounts (40 bytes) . For larger strings, the SSE2 code kicks in and thatis a solid factor 2. (270 vs 538 for the fastest "add", see table inthe comments of the source)

Note that I use pop/fast because they are simpler and I only use for lowcounts. The SSE2 version is more like the ADD variant.

For actual reduction of pipeline stalls, I'm not quite sure how smartthe Intel CPU firmware is and if rearranging instructions wouldactually improve throughput or not. For example, on 3.2.x, there arethese four instructions that configure parameters and preserve %rax:

I tried for above code on godbolt, attached is the analysis. (both forthe asm loop, the popcnt loop from my code and the popcnt with themovabs removed to test hoisting)

Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads. With mylimited assembler abilities I'd say it is not the hoisting that makesthe manual asm loop faster as I first thought, but moving the NOT up towhere it is.

Also, POPCNT is handled in the System unit and is not inlined, mainlybecause of how an input of zero is handled (the function has to return255 in the result, whereas Intel's POPCNT instruction sets the zeroflag and leaves the result undefined).

Weird as mine is inlined with -Cpcoreavx -O4, with no special handlingfor 0. But that does put some things on shaky ground. Maybe zero theresult before hand?

manual asm :
--------------------------------------

.Lj47:
                add     rsi,1
                mov     r8,qword [rcx]
                mov     r10,r8
                mov     rdi,-9187201950435737472
                and     r10,rdi
                shr     r10,7
                not     r8
                shr     r8,6
                and     r10,r8
                mov     r11,r10
                popcnt  r8,r10
                add     rbx,r8
                add     rcx,8
                cmp     rsi,r9
                jnge    .Lj47

llvm-mca: -mcpu=ivybridge

Iterations:        100
Instructions:      1200
Total Cycles:      378
Total uOps:        1200

Dispatch Width:    4
uOps Per Cycle:    3.17
IPC:               3.17
Block RThroughput: 3.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   mov   r8, qword ptr [rcx]
 1      1     0.33                        mov   r10, r8
 1      1     0.33                        and   r10, rdx
 1      1     0.33                        not   r8
 1      1     0.50                        shr   r10, 7
 1      1     0.50                        shr   r8, 6
 1      1     0.33                        and   r10, r8
 1      1     0.33                        sub   rax, r8
 1      3     1.00                        popcnt        r8, r10
 1      1     0.33                        add   rcx, 8
 1      1     0.33                        dec   r11
 1      1     1.00                        jne   .Lrestqwords

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  
 -      -     3.66   3.67    -     3.67   0.50   0.50   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr 
[rcx]
 -      -     0.21   0.63    -     0.16    -      -     mov     r10, r8
 -      -     0.35   0.61    -     0.04    -      -     and     r10, rdx
 -      -     0.50   0.16    -     0.34    -      -     not     r8
 -      -     0.52    -      -     0.48    -      -     shr     r10, 7
 -      -     0.51    -      -     0.49    -      -     shr     r8, 6
 -      -     0.87   0.08    -     0.05    -      -     and     r10, r8
 -      -     0.03   0.10    -     0.87    -      -     sub     rax, r8
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     0.47   0.36    -     0.17    -      -     add     rcx, 8
 -      -     0.20   0.73    -     0.07    -      -     dec     r11
 -      -      -      -      -     1.00    -      -     jne     .Lrestqwords


popcnt  trunk -O4 -Cpcoreavx

.Lj47:
                add     rsi,1
                mov     r8,qword ptr [rcx]
                mov     r10,r8
                mov     rdi,-9187201950435737472
                and     r10,rdi
                shr     r10,7
                not     r8
                shr     r8,6
                and     r10,r8
                mov     r11,r10
                popcnt  r8,r10
                add     rbx,r8
                add     rcx,8
                cmp     rsi,r9
                jnge    .Lj47


llvm-mca:

Iterations:        100
Instructions:      1500
Total Cycles:      477
Total uOps:        1500

Dispatch Width:    4
uOps Per Cycle:    3.14
IPC:               3.14
Block RThroughput: 3.8

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        add   rsi, 1
 1      5     0.50    *                   mov   r8, qword ptr [rcx]
 1      1     0.33                        mov   r10, r8
 1      1     0.33                        movabs        rdi, 
-9187201950435737472
 1      1     0.33                        and   r10, rdi
 1      1     0.50                        shr   r10, 7
 1      1     0.33                        not   r8
 1      1     0.50                        shr   r8, 6
 1      1     0.33                        and   r10, r8
 1      1     0.33                        mov   r11, r10
 1      3     1.00                        popcnt        r8, r10
 1      1     0.33                        add   rbx, r8
 1      1     0.33                        add   rcx, 8
 1      1     0.33                        cmp   rsi, r9
 1      1     1.00                        jl    .Lj47

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  
 -      -     4.66   4.67    -     4.67   0.50   0.50   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -     0.09   0.67    -     0.24    -      -     add     rsi, 1
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr 
[rcx]
 -      -     0.08   0.51    -     0.41    -      -     mov     r10, r8
 -      -     0.25   0.74    -     0.01    -      -     movabs  rdi, 
-9187201950435737472
 -      -     0.59   0.16    -     0.25    -      -     and     r10, rdi
 -      -     0.83    -      -     0.17    -      -     shr     r10, 7
 -      -     0.01   0.24    -     0.75    -      -     not     r8
 -      -     0.33    -      -     0.67    -      -     shr     r8, 6
 -      -     0.40    -      -     0.60    -      -     and     r10, r8
 -      -     0.91   0.01    -     0.08    -      -     mov     r11, r10
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     1.00    -      -      -      -      -     add     rbx, r8
 -      -     0.09   0.75    -     0.16    -      -     add     rcx, 8
 -      -     0.08   0.59    -     0.33    -      -     cmp     rsi, r9
 -      -      -      -      -     1.00    -      -     jl      .Lj47


when removing/hoisting the movabs:

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -     0.16   0.32    -     0.52    -      -     add     rsi, 1
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr 
[rcx]
 -      -     0.51   0.01    -     0.48    -      -     mov     r10, r8
 -      -     0.67   0.33    -      -      -      -     and     r10, rdi
 -      -     1.00    -      -      -      -      -     shr     r10, 7
 -      -     0.01    -      -     0.99    -      -     not     r8
 -      -     0.33    -      -     0.67    -      -     shr     r8, 6
 -      -     0.32   0.68    -      -      -      -     and     r10, r8
 -      -     0.64   0.36    -      -      -      -     mov     r11, r10
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     0.01   0.98    -     0.01    -      -     add     rbx, r8
 -      -     0.18   0.49    -     0.33    -      -     add     rcx, 8
 -      -     0.50   0.17    -     0.33    -      -     cmp     rsi, r9
 -      -      -      -      -     1.00    -      -     jl      .Lj47

and block rthroughput decreases from 3.8 to 3.5. The asm has 3.0, probably 
because the NOT r8 
is moved up between the R10 operations

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

Reply via email to