On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:
Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not any of the System units.  Notably, POPCNT isn't called directly, but instead goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x and 3.3.1.  A future study of "fpc_popcnt_qword" might yield some interesting information.

Did you pass -Cpcoreavx    ?

# [381] Result += PopCnt(qword(nx));
    popcntq    %r10,%r8
    addq    %r8,%rbx


In this situation, one can replace %r11 with %r13 in the 3rd and 4th instructions so the massive constant isn't assigned twice:

The large constant(s) should be hoisted out of the loop. This is done in the assembler code:

.Lj11:
    movq    (%rcx),%r8
    movq    %r8,%r10
    andq    %rdx,%r10
    notq    %r8
    shrq    $7,%r10
    shrq    $6,%r8
    andq    %r8,%r10
    subq    %r8,%rax               // subtract from rax=length(s). Iow algorithm is slightly different. don't sum

                                                // correction but directly subtract from length. Saves bytecount from being a live value till the end

     popcntq    %r10,%r8
    addq    $8,%rcx
    decq    %r11
    jne    .Lj11

As far as I can see the assembler version is about 25-33% faster for low counts (40 bytes) . For larger strings, the SSE2 code kicks in and that is a solid factor 2.  (270 vs 538 for the fastest "add", see table in the comments of the source)

Note that I use pop/fast because they are simpler and I only use for low counts. The SSE2 version is more like the ADD variant.

For actual reduction of pipeline stalls, I'm not quite sure how smart the Intel CPU firmware is and if rearranging instructions would actually improve throughput or not.  For example, on 3.2.x, there are these four instructions that configure parameters and preserve %rax:

I tried for above code on godbolt, attached is the analysis. (both for the asm loop, the popcnt loop from my code and the popcnt with the movabs removed to test hoisting)

Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads. With my limited assembler abilities I'd say it is not the hoisting that makes the manual asm loop faster as I first thought, but moving the NOT up to where it is.

Also, POPCNT is handled in the System unit and is not inlined, mainly because of how an input of zero is handled (the function has to return 255 in the result, whereas Intel's POPCNT instruction sets the zero flag and leaves the result undefined).

Weird as mine is inlined with -Cpcoreavx -O4, with no special handling for 0. But that does put some things on shaky ground. Maybe zero the result before hand?

manual asm :
--------------------------------------

.Lj47:
                add     rsi,1
                mov     r8,qword [rcx]
                mov     r10,r8
                mov     rdi,-9187201950435737472
                and     r10,rdi
                shr     r10,7
                not     r8
                shr     r8,6
                and     r10,r8
                mov     r11,r10
                popcnt  r8,r10
                add     rbx,r8
                add     rcx,8
                cmp     rsi,r9
                jnge    .Lj47

llvm-mca: -mcpu=ivybridge

Iterations:        100
Instructions:      1200
Total Cycles:      378
Total uOps:        1200

Dispatch Width:    4
uOps Per Cycle:    3.17
IPC:               3.17
Block RThroughput: 3.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   mov   r8, qword ptr [rcx]
 1      1     0.33                        mov   r10, r8
 1      1     0.33                        and   r10, rdx
 1      1     0.33                        not   r8
 1      1     0.50                        shr   r10, 7
 1      1     0.50                        shr   r8, 6
 1      1     0.33                        and   r10, r8
 1      1     0.33                        sub   rax, r8
 1      3     1.00                        popcnt        r8, r10
 1      1     0.33                        add   rcx, 8
 1      1     0.33                        dec   r11
 1      1     1.00                        jne   .Lrestqwords

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  
 -      -     3.66   3.67    -     3.67   0.50   0.50   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr 
[rcx]
 -      -     0.21   0.63    -     0.16    -      -     mov     r10, r8
 -      -     0.35   0.61    -     0.04    -      -     and     r10, rdx
 -      -     0.50   0.16    -     0.34    -      -     not     r8
 -      -     0.52    -      -     0.48    -      -     shr     r10, 7
 -      -     0.51    -      -     0.49    -      -     shr     r8, 6
 -      -     0.87   0.08    -     0.05    -      -     and     r10, r8
 -      -     0.03   0.10    -     0.87    -      -     sub     rax, r8
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     0.47   0.36    -     0.17    -      -     add     rcx, 8
 -      -     0.20   0.73    -     0.07    -      -     dec     r11
 -      -      -      -      -     1.00    -      -     jne     .Lrestqwords


popcnt  trunk -O4 -Cpcoreavx

.Lj47:
                add     rsi,1
                mov     r8,qword ptr [rcx]
                mov     r10,r8
                mov     rdi,-9187201950435737472
                and     r10,rdi
                shr     r10,7
                not     r8
                shr     r8,6
                and     r10,r8
                mov     r11,r10
                popcnt  r8,r10
                add     rbx,r8
                add     rcx,8
                cmp     rsi,r9
                jnge    .Lj47


llvm-mca:

Iterations:        100
Instructions:      1500
Total Cycles:      477
Total uOps:        1500

Dispatch Width:    4
uOps Per Cycle:    3.14
IPC:               3.14
Block RThroughput: 3.8

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        add   rsi, 1
 1      5     0.50    *                   mov   r8, qword ptr [rcx]
 1      1     0.33                        mov   r10, r8
 1      1     0.33                        movabs        rdi, 
-9187201950435737472
 1      1     0.33                        and   r10, rdi
 1      1     0.50                        shr   r10, 7
 1      1     0.33                        not   r8
 1      1     0.50                        shr   r8, 6
 1      1     0.33                        and   r10, r8
 1      1     0.33                        mov   r11, r10
 1      3     1.00                        popcnt        r8, r10
 1      1     0.33                        add   rbx, r8
 1      1     0.33                        add   rcx, 8
 1      1     0.33                        cmp   rsi, r9
 1      1     1.00                        jl    .Lj47

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  
 -      -     4.66   4.67    -     4.67   0.50   0.50   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -     0.09   0.67    -     0.24    -      -     add     rsi, 1
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr 
[rcx]
 -      -     0.08   0.51    -     0.41    -      -     mov     r10, r8
 -      -     0.25   0.74    -     0.01    -      -     movabs  rdi, 
-9187201950435737472
 -      -     0.59   0.16    -     0.25    -      -     and     r10, rdi
 -      -     0.83    -      -     0.17    -      -     shr     r10, 7
 -      -     0.01   0.24    -     0.75    -      -     not     r8
 -      -     0.33    -      -     0.67    -      -     shr     r8, 6
 -      -     0.40    -      -     0.60    -      -     and     r10, r8
 -      -     0.91   0.01    -     0.08    -      -     mov     r11, r10
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     1.00    -      -      -      -      -     add     rbx, r8
 -      -     0.09   0.75    -     0.16    -      -     add     rcx, 8
 -      -     0.08   0.59    -     0.33    -      -     cmp     rsi, r9
 -      -      -      -      -     1.00    -      -     jl      .Lj47


when removing/hoisting the movabs:

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -     0.16   0.32    -     0.52    -      -     add     rsi, 1
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr 
[rcx]
 -      -     0.51   0.01    -     0.48    -      -     mov     r10, r8
 -      -     0.67   0.33    -      -      -      -     and     r10, rdi
 -      -     1.00    -      -      -      -      -     shr     r10, 7
 -      -     0.01    -      -     0.99    -      -     not     r8
 -      -     0.33    -      -     0.67    -      -     shr     r8, 6
 -      -     0.32   0.68    -      -      -      -     and     r10, r8
 -      -     0.64   0.36    -      -      -      -     mov     r11, r10
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     0.01   0.98    -     0.01    -      -     add     rbx, r8
 -      -     0.18   0.49    -     0.33    -      -     add     rcx, 8
 -      -     0.50   0.17    -     0.33    -      -     cmp     rsi, r9
 -      -      -      -      -     1.00    -      -     jl      .Lj47

and block rthroughput decreases from 3.8 to 3.5. The asm has 3.0, probably 
because the NOT r8 
is moved up between the R10 operations
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to