Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread J. Gareth Moreton via fpc-devel
It's why I like going for optimisations that try to reduce code size 
without sacrificing speed, because they reduce the number of 16-byte or 
32-byte sections.  Anyhow, back to work on optimising!


Gareth aka. Kit

On 04/01/2022 19:33, Martin Frb via fpc-devel wrote:

On 04/01/2022 18:43, Jonas Maebe via fpc-devel wrote:

On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
not sure if this is of interest to you, but I see you do a lot on 
the optimizer


It's very likely unrelated to anything the optimiser does or does not 
do, and more regarding random differences in code layout. Charlie 
posted the following video on core just yesterday, and it touches on 
exactly this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A


Choice quote: code layout and environment variables can produce up to 
40% differences in performance, which is more than what even the best 
optimizing compilers can achieve in most cases.


Interesting...

And yes, see my previous post. It seems to depend on which 
"sub-section" of a loop falls into a 32-byte-aligned 32-byte block.
It's not even the entire loop (nor the start of the loop), but a 
certain code block within it.


This also fits with one optimization that (even if still partly 
chance) improved the timing in my test (both worst and best time, 
though those are only *my* worst/best):

=> reducing the byte size of the loop code.
That way there are fewer 32-byte sections.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel

On 04/01/2022 18:43, Jonas Maebe via fpc-devel wrote:

On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
not sure if this is of interest to you, but I see you do a lot on the 
optimizer


It's very likely unrelated to anything the optimiser does or does not 
do, and more regarding random differences in code layout. Charlie 
posted the following video on core just yesterday, and it touches on 
exactly this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A


Choice quote: code layout and environment variables can produce up to 
40% differences in performance, which is more than what even the best 
optimizing compilers can achieve in most cases.


Interesting...

And yes, see my previous post. It seems to depend on which 
"sub-section" of a loop falls into a 32-byte-aligned 32-byte block.
It's not even the entire loop (nor the start of the loop), but a 
certain code block within it.


This also fits with one optimization that (even if still partly 
chance) improved the timing in my test (both worst and best time, 
though those are only *my* worst/best):

=> reducing the byte size of the loop code.
That way there are fewer 32-byte sections.
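For readers following along, the counting trick being measured can be sketched in C. This is a hedged reconstruction of the SWAR loop visible in the objdump listings later in the thread (count continuation bytes, subtract from the byte count), not the actual FPC source; the function name is mine:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Code-point count for UTF-8 without validity checking: every byte except
   continuation bytes (10xxxxxx) starts a code point.  The 8-byte SWAR step
   mirrors the loop in the disassembly below: a byte is 10xxxxxx iff bit 7
   is set and bit 6 is clear. */
size_t utf8_length_fast(const char *s, size_t n) {
    size_t count = n, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t x;
        memcpy(&x, s + i, 8);               /* unaligned-safe load */
        uint64_t cont = ((x & 0x8080808080808080ULL) >> 7)  /* bit 7 per byte  */
                      & ((~x) >> 6)                         /* ~bit 6 per byte */
                      & 0x0101010101010101ULL;
        count -= (cont * 0x0101010101010101ULL) >> 56;      /* horizontal sum  */
    }
    for (; i < n; i++)                      /* scalar tail */
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            count--;
    return count;
}
```

The multiply-and-shift is the same 0x0101... trick as the `imul` in the dumps; each byte of `cont` is at most 1, so the horizontal sum cannot overflow into the top byte.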


Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Jonas Maebe via fpc-devel

On 03/01/2022 12:54, Martin Frb via fpc-devel wrote:
not sure if this is of interest to you, but I see you do a lot on the 
optimizer


It's very likely unrelated to anything the optimiser does or does not 
do, and more regarding random differences in code layout. Charlie posted 
the following video on core just yesterday, and it touches on exactly 
this subject: https://www.youtube.com/watch?v=r-TLSBdHe1A


Choice quote: code layout and environment variables can produce up to 
40% differences in performance, which is more than what even the best 
optimizing compilers can achieve in most cases.



Jonas


Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel


On 4-1-2022 17:15, J. Gareth Moreton via fpc-devel wrote:

I neglected to include -Cpcoreavx, that was my bad.  I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is 
zero, and cleared otherwise.  Regarding an undefined result, I got 
confused with the BSF and BSR instructions, sorry.  I guess I was more 
tired than I thought!  POPCNT returns zero for a zero input.


Ok, that's what I thought.
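A quick cross-check of that behaviour from C, using the GCC/Clang builtin (the wrapper name is mine; on POPCNT-capable targets, e.g. with -mpopcnt, this lowers to the POPCNT instruction):

```c
#include <stdint.h>

/* POPCNT as clarified above: a zero source yields a zero result (and sets
   ZF); it is BSF/BSR that leave the destination undefined for zero input. */
static int popcount64(uint64_t x) {
    return (int)__builtin_popcountll(x);  /* GCC/Clang builtin */
}
```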

I played a bit with adding code alignments to loops in the SSE code, 
but it only seems to slow down the core loop rather than accelerate it 
(aligning before the branch location and/or branch target).


Did you have any thoughts about moving up the NOT instruction ?


Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread J. Gareth Moreton via fpc-devel

I neglected to include -Cpcoreavx, that was my bad.  I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is 
zero, and cleared otherwise.  Regarding an undefined result, I got 
confused with the BSF and BSR instructions, sorry.  I guess I was more 
tired than I thought!  POPCNT returns zero for a zero input.


Gareth aka. Kit

On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:

On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:


Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
handling for 0. But that does put some things on shaky ground. Maybe 
zero the result beforehand?


Same here.


I looked up popcnt and found nothing about the result being left 
unset for a zero input. (E.g. https://www.felixcloutier.com/x86/popcnt )


I meanwhile also ran on my Ryzen 4800H laptop and updated the version 
on the web with the stats. The stats for the long string are about as 
fast as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen 
should be faster in every way?!?), but the short one (40 bytes) is 
significantly faster. What I don't get is why the assembler version 
seems systematically faster even for the short code. The generated asm 
is nearly the same.


Also notable is that on this machine with popcnt (-Cpcoreavx), the 
popcnt version is as fast as the add function within error margins, so 
the popcnt instruction is probably faster (lower latency) and thus less 
of a bottleneck on this machine.  Note that the POP() function is half 
the size, which makes it better for newer machines.


-

Note that I test on Windows, so it might be that the "two times load" 
difference is somehow due to different code generation on Windows.





About UTF8LengthFast()

Well, before I get to this, I noted something weird.

2 runs, compiled with the same compiler (3.2.3) and the same 
settings, with the only difference: -gw3 or not.
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF), 
reproducibly.


I have also seen this while working on the code, and indeed mainly 
with the "fast" one. It also explains why the assembler is always 
consistent: it suffers less from small code changes (and thus 
different alignment) when I e.g. update FPC from git. (Assuming that 
the section starts are always aligned.)



Alignment: 16 vs 32 bytes. Can that make a difference?
According to: 
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache


Seems to be a problem of the Skylake and later archs, which I no 
longer have. The i7 is too old, and the others are AMD.








Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel

On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:


Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
handling for 0. But that does put some things on shaky ground. Maybe 
zero the result beforehand?


Same here.


I looked up popcnt and found nothing about the result being left 
unset for a zero input. (E.g. https://www.felixcloutier.com/x86/popcnt )


I meanwhile also ran on my Ryzen 4800H laptop and updated the version on 
the web with the stats. The stats for the long string are about as fast 
as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen should 
be faster in every way?!?), but the short one (40 bytes) is significantly 
faster. What I don't get is why the assembler version seems 
systematically faster even for the short code. The generated asm is 
nearly the same.


Also notable is that on this machine with popcnt (-Cpcoreavx), the 
popcnt version is as fast as the add function within error margins, so 
the popcnt instruction is probably faster (lower latency) and thus less 
of a bottleneck on this machine.  Note that the POP() function is half 
the size, which makes it better for newer machines.


-

Note that I test on Windows, so it might be that the "two times load" 
difference is somehow due to different code generation on Windows.





About UTF8LengthFast()

Well, before I get to this, I noted something weird.

2 runs, compiled with the same compiler (3.2.3) and the same 
settings, with the only difference: -gw3 or not.
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF), 
reproducibly.


I have also seen this while working on the code, and indeed mainly with 
the "fast" one. It also explains why the assembler is always consistent: 
it suffers less from small code changes (and thus different alignment) 
when I e.g. update FPC from git. (Assuming that the section starts are 
always aligned.)



Alignment: 16 vs 32 bytes. Can that make a difference?
According to: 
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache


Seems to be a problem of the Skylake and later archs, which I no longer 
have. The i7 is too old, and the others are AMD.





Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel
@Marco: haven't played with popcnt => it could benefit from the "const 
to var" change too.


So I played around a bit...

Of course, all this is Intel-only.

1)
var
  Mask8, Mask1: qword;

  Mask8 := EIGHTYMASK;
  Mask1 := ONEMASK;

And the constants are no longer assigned inside the loop.
That also makes the loop shorter.

=> improves speed

2)
  //for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  //for i := (ByteCount-cnt) div sizeof(PtrInt) - 1 downto 0 do
  i := (ByteCount-cnt) div sizeof(PtrInt) ;
  repeat
    
  dec(i);
  until i = 0;


The orig:
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
// r9 is reserved to hold the upper bound
.Lj26:
    addq    $1,%r10
...
    cmpq    %r10,%r9
    jnle    .Lj26

Since the counter var "i" is not accessed in the loop, its value does 
not matter.

So
  for i := (ByteCount-cnt) div sizeof(PtrInt) - 1 downto 0 do
// no longer needs to store an upper bound, but still has an extra 
"test" since the "subq" is at the start of the loop

.Lj26:
    subq    $1,%r10
...
    testq    %r10,%r10
    jnle    .Lj26

// With "repeat", there no longer is a "test"
And that again reduced the loop size.
And apparently to just below a critical point, as the time gets a 
little better again.


-
WITH the constants moved to var:
orig for: 547
downto:   547
repeat: 516
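The loop shapes measured above can be sketched in C (the loop bodies are elided in the mail, so the summing body here is a stand-in of my own). The point of the repeat/count-down form is that the decrement's flags drive the branch directly, so the separate cmp/test disappears and the loop body shrinks by a few bytes:

```c
#include <stddef.h>

/* "for i := 1 to n do" shape: keeps the upper bound live and needs a
   compare before the backward branch. */
long sum_up(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* "repeat ... until i = 0" shape: the sub's flags feed jne directly,
   so there is no separate test instruction in the loop. */
long sum_repeat(const long *a, size_t n) {
    long s = 0;
    if (n == 0)
        return 0;
    size_t i = n;
    do {
        s += a[n - i];      /* index derived from the down-counter */
    } while (--i != 0);
    return s;
}
```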



Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Martin Frb via fpc-devel

On 04/01/2022 10:31, Marco van de Voort via fpc-devel wrote:



Weird, as mine is inlined with -Cpcoreavx -O4, with no special handling 
for 0. But that does put some things on shaky ground. Maybe zero the 
result beforehand?


Same here.


About UTF8LengthFast()

Well, before I get to this, I noted something weird.

2 runs, compiled with the same compiler (3.2.3) and the same 
settings, with the only difference: -gw3 or not.
=> And the speed differed: 600 (with DWARF) vs 700 (no DWARF), 
reproducibly.


So then:
=> I compiled the app with the same 3.2.3 (and I compiled WITHOUT -a, 
though with -al I get the same results)
=> I compiled once with -gw3, and once without DWARF info
=> I used objdump to disassemble the exe
=> I diffed (and searched for 0x101010, which is only used in the 
...Fast code)
--> _*The assembler is identical.*_
Yet one is faster (the one WITH DWARF).
The calling code in the main body is the same too (as far as I could 
see), except that the address of the callee is different (but it is 
just 20 calls per measurement).


I did those runs OUTSIDE the IDE.
So no debugger in the background.

Win64 / 64 bit
Core I7 8600K

Using 3.3.1 the speed is equal. Never mind if dwarf is generated or not.
(I did not compare the asm for that...)

-
Clutching at straws: there is one (maybe) difference in the 3.2.3 
with/without DWARF assembler.

*** I am totally out of my depth here 

Alignment: 16 vs 32 bytes. Can that make a difference?
According to: 
https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
    The Decoded ICache consists of 32 sets. Each set contains eight 
Ways. Each Way can hold up to six micro-ops.


    All micro-ops in a Way represent instructions which are statically 
contiguous in the code and have their EIPs within _*the same aligned 
32-byte region*_.


So the alignment of the 2 procedures differs by 16 bytes.
The proc entry is at:
with DWARF: 11870
without:    11860  // actually this is 32-byte aligned (but slower)

Yet, maybe it matters which statements in the big loop happen to fall 
into the same 32-byte block???
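To make that concrete: the uop-cache granularity discussed in the Stack Overflow answer quoted above is the aligned 32-byte region an instruction's address falls into, i.e. the address with its low 5 bits cleared (helper name mine):

```c
#include <stdint.h>

/* Aligned 32-byte region containing an address (decoded-uop-cache
   granularity on Skylake-era cores).  For the entry points quoted above:
   0x11870 (with DWARF) sits 16 bytes into its region, while 0x11860
   (without) is itself 32-byte aligned. */
static uintptr_t region32(uintptr_t addr) {
    return addr & ~(uintptr_t)31;
}
```

So two byte-identical loops shifted by 16 bytes can split their instructions across 32-byte regions differently, which changes how they pack into the decoded-uop cache.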


The loop starting with
   for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do

With DWARF (faster):
   118f0:    49 83 c2 01      add    $0x1,%r10
   118f4:    4c 8b 19     mov    (%rcx),%r11
   118f7:    4d 89 d8     mov    %r11,%r8
   118fa:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
   11901:    80 80 80
   11904:    49 21 f8     and    %rdi,%r8
   11907:    49 c1 e8 07      shr    $0x7,%r8
   1190b:    49 f7 d3     not    %r11
   1190e:    49 c1 eb 06      shr    $0x6,%r11
   11912:    4d 21 d8     and    %r11,%r8
   11915:    4c 89 c3     mov    %r8,%rbx
   11918:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
   1191f:    01 01 01
   11922:    4d 0f af c3      imul   %r11,%r8
   11926:    49 c1 e8 38      shr    $0x38,%r8
   1192a:    4c 01 c6     add    %r8,%rsi
   1192d:    48 83 c1 08      add    $0x8,%rcx
   11931:    4d 39 d1     cmp    %r10,%r9
   11934:    7f ba        jg 118f0 



WITHOUT:
   118e0:    49 83 c2 01      add    $0x1,%r10
   118e4:    4c 8b 19     mov    (%rcx),%r11
   118e7:    4d 89 d8     mov    %r11,%r8
   118ea:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
   118f1:    80 80 80
   118f4:    49 21 f8     and    %rdi,%r8
   118f7:    49 c1 e8 07      shr    $0x7,%r8
   118fb:    49 f7 d3     not    %r11
   118fe:    49 c1 eb 06      shr    $0x6,%r11
   11902:    4d 21 d8     and    %r11,%r8
   11905:    4c 89 c3     mov    %r8,%rbx
   11908:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
   1190f:    01 01 01
   11912:    4d 0f af c3      imul   %r11,%r8
   11916:    49 c1 e8 38      shr    $0x38,%r8
   1191a:    4c 01 c6     add    %r8,%rsi
   1191d:    48 83 c1 08      add    $0x8,%rcx
   11921:    4d 39 d1     cmp    %r10,%r9
   11924:    7f ba        jg 0x118e0



*** written before I got into the above..
---

About UTF8LengthFast()

I notice 3 differences.
Though I only compare the O4 result, and have no idea what is 
pre-peephole and post-peephole.


1) As you say: different registers (but the same statements, in the same 
order).

No idea if that affects the CPU.

2) One extra statement in 3.3.1

    movq    %r10,%r8  <<< not in 3.2.3
    movq    $72340172838076673,%r10
    imulq    %r10,%r8

3) I do have "dwarf" enabled. 

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-04 Thread Marco van de Voort via fpc-devel


On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:

Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not 
any of the System units.  Notably, POPCNT isn't called directly, but 
instead goes through the System unit via "call fpc_popcnt_qword" on 
both 3.2.x and 3.3.1.  A future study of "fpc_popcnt_qword" might 
yield some interesting information.


Did you pass -Cpcoreavx?

# [381] Result += PopCnt(qword(nx));
    popcntq    %r10,%r8
    addq    %r8,%rbx



In this situation, one can replace %r11 with %r13 in the 3rd and 4th 
instructions so the massive constant isn't assigned twice:


The large constant(s) should be hoisted out of the loop. This is done in 
the assembler code:


.Lj11:
    movq    (%rcx),%r8
    movq    %r8,%r10
    andq    %rdx,%r10
    notq    %r8
    shrq    $7,%r10
    shrq    $6,%r8
    andq    %r8,%r10
    subq    %r8,%rax   // subtract from rax=length(s). IOW the algorithm is
                       // slightly different: don't sum a correction, but
                       // directly subtract from length. Saves bytecount from
                       // being a live value till the end


 popcntq    %r10,%r8
    addq    $8,%rcx
    decq    %r11
    jne    .Lj11

As far as I can see the assembler version is about 25-33% faster for low 
counts (40 bytes) . For larger strings, the SSE2 code kicks in and that 
is a solid factor 2.  (270 vs 538 for the fastest "add", see table in 
the comments of the source)


Note that I use pop/fast because they are simpler and I only use them 
for low counts. The SSE2 version is more like the ADD variant.
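The SSE2 inner loop shown in the attached source (mask the top two bits, pcmpeqb against $80, reduce to 1 per byte, paddb into a byte accumulator) maps to intrinsics roughly as follows. A hedged sketch, not the shipped asm: it assumes the length is a multiple of 16, and uses PSADBW for the horizontal sum instead of the punpck ladder in the original:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Count UTF-8 continuation bytes (10xxxxxx) in a buffer whose length is a
   multiple of 16.  Byte count minus this result = code-point count. */
size_t count_continuation_sse2(const unsigned char *p, size_t n) {
    const __m128i top2 = _mm_set1_epi8((char)0xC0);   /* top-2-bit mask */
    const __m128i pat  = _mm_set1_epi8((char)0x80);   /* 10xxxxxx value */
    const __m128i one  = _mm_set1_epi8(1);
    __m128i total = _mm_setzero_si128();              /* 2x 64-bit sums */
    size_t i = 0;
    while (i < n) {
        /* byte-wide counters would overflow eventually; the mail caps the
           inner loop at 127 iterations */
        size_t iters = (n - i) / 16;
        if (iters > 127)
            iters = 127;
        __m128i acc = _mm_setzero_si128();
        for (size_t j = 0; j < iters; j++, i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
            v = _mm_and_si128(v, top2);    /* keep top two bits          */
            v = _mm_cmpeq_epi8(v, pat);    /* 0xFF where 10xxxxxx        */
            v = _mm_and_si128(v, one);     /* -> 1 per continuation byte */
            acc = _mm_add_epi8(acc, v);
        }
        /* horizontal byte sum via PSADBW into two 64-bit lanes */
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, _mm_setzero_si128()));
    }
    return (size_t)(_mm_cvtsi128_si64(total) +
                    _mm_cvtsi128_si64(_mm_srli_si128(total, 8)));
}
```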


For actual reduction of pipeline stalls, I'm not quite sure how smart 
the Intel CPU firmware is and if rearranging instructions would 
actually improve throughput or not.  For example, on 3.2.x, there are 
these four instructions that configure parameters and preserve %rax:


I tried the above code on godbolt; attached is the analysis (both for 
the asm loop, the popcnt loop from my code, and the popcnt with the 
movabs removed to test hoisting).


Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads. With my 
limited assembler abilities I'd say it is not the hoisting that makes 
the manual asm loop faster as I first thought, but moving the NOT up to 
where it is.


Also, POPCNT is handled in the System unit and is not inlined, mainly 
because of how an input of zero is handled (the function has to return 
255 in the result, whereas Intel's POPCNT instruction sets the zero 
flag and leaves the result undefined).


Weird as mine is inlined with -Cpcoreavx -O4, with no special handling 
for 0. But that does put some things on shaky ground. Maybe zero the 
result before hand?


manual asm :
--

.Lj47:
add rsi,1
mov r8,qword [rcx]
mov r10,r8
mov rdi,-9187201950435737472
and r10,rdi
shr r10,7
not r8
shr r8,6
and r10,r8
mov r11,r10
popcnt  r8,r10
add rbx,r8
add rcx,8
cmp rsi,r9
jnge.Lj47

llvm-mca: -mcpu=ivybridge

Iterations:100
Instructions:  1200
Total Cycles:  378
Total uOps:1200

Dispatch Width:4
uOps Per Cycle:3.17
IPC:   3.17
Block RThroughput: 3.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1][2][3][4][5][6]Instructions:
 1  5 0.50*   mov   r8, qword ptr [rcx]
 1  1 0.33mov   r10, r8
 1  1 0.33and   r10, rdx
 1  1 0.33not   r8
 1  1 0.50shr   r10, 7
 1  1 0.50shr   r8, 6
 1  1 0.33and   r10, r8
 1  1 0.33sub   rax, r8
 1  3 1.00popcntr8, r10
 1  1 0.33add   rcx, 8
 1  1 0.33dec   r11
 1  1 1.00jne   .Lrestqwords

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0][1][2][3][4][5][6.0]  [6.1]  
 -  - 3.66   3.67- 3.67   0.50   0.50   

Resource pressure by instruction:
[0][1][2][3][4][5][6.0]  [6.1]  Instructions:
 -  -  -  -  -  - 0.50   0.50   mov r8, qword ptr 
[rcx]
 -  - 0.21   0.63- 0.16-  - mov r10, r8
 -  - 0.35   0.61- 0.04-  - and r10, rdx
 -  - 

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread J. Gareth Moreton via fpc-devel

Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not any 
of the System units.  Notably, POPCNT isn't called directly, but instead 
goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x 
and 3.3.1.  A future study of "fpc_popcnt_qword" might yield some 
interesting information.


From what I've seen so far, "for i := 1 to cnt do" is implemented 
differently, although not in a way that makes it slower, and there are a 
few extra MOV instructions that shouldn't cause a speed decrease thanks 
to parallel execution.  Similarly, alignment hints now take the form of 
".p2align 4,,10; .p2align 3" rather than ".balign 8,0x90", which aligns 
code to a 16-byte boundary rather than an 8-byte boundary and uses 
multi-byte NOP instructions rather than a large number of single-byte 
NOP instructions (machine code 0x90), although there's every chance I 
configured things incorrectly.


The difficult part is seeing through the different registers, since 
3.3.1 is assigning registers differently to 3.2.x.  I've attached the 
two dumps if you want to compare them yourself (built under -O4).


From what I can observe, the performance loss seems to be due to 
changes in code generation causing more pipeline stalls rather than any 
particular peephole optimisation.  There is room for improvement though 
- for example, both compilers have snippets of code akin to the following:


    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r11
    andq    %r11,%r10
    movq    %r12,%r11
    movq    $71777214294589695,%r13
    andq    %r13,%r11
    leaq    (%r10,%r11),%r12

In this situation, one can replace %r11 with %r13 in the 3rd and 4th 
instructions so the massive constant isn't assigned twice:


    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r13
    andq    %r13,%r10
    movq    %r12,%r11
    andq    %r13,%r11
    leaq    (%r10,%r11),%r12

And if %r13 isn't used again in its current form (in this particular 
snippet, it isn't... it gets assigned a new constant a few instructions 
later), one can do some clever rearranging:


    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r11
    andq    %r11,%r10
    andq    %r12,%r11
    leaq    (%r10,%r11),%r12

Granted this is finding new peephole optimisations rather than fixing 
the pipeline stalls, but I can see quite a few places where a 
rearrangement of registers can produce smaller and more efficient code 
by removing redundant transfers.


There is a slight inefficiency in a nested function prologue - under 
3.2.x, it's this:


.section .text.n_p$program$_$testasmutf8length_$$_fin$0008,"x"
    .balign 16,0x90
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$0008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$0008
    pushq    %rbp
.seh_pushreg %rbp
    movq    %rcx,%rbp
    leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
    leaq    -8(%rbp),%rcx
    call    fpc_ansistr_decr_ref
    leaq    -24(%rbp),%rcx
    call    fpc_ansistr_decr_ref

... whereas under 3.3.1...

.section .text.n_p$program$_$testasmutf8length_$$_fin$0008,"ax"
    .balign 16,0x90
.globl    P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$0008
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$0008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$0008
    pushq    %rbp
.seh_pushreg %rbp
    movq    %rcx,%rbp
    leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
    subq    $24,%rcx
    call    fpc_ansistr_decr_ref
    leaq    -8(%rbp),%rcx
    call    fpc_ansistr_decr_ref

In this case, the one in 3.2.x is slightly better and could be improved 
by removing "movq %rcx,%rbp" completely, since the value is never used 
and is not recorded in the SEH directives (also, the string objects are 
decremented in the opposite order, with 3.2.x doing -8(%rbp) first, but 
3.3.1 doing -24(%rbp) first instead).  I think this may be due to me 
putting in code in the peephole optimizer that avoids playing around 
with the function prologue. (3.2.x doesn't have the .globl symbol though)


For actual reduction of pipeline stalls, I'm not quite sure how smart 
the Intel CPU firmware is and if rearranging instructions would actually 
improve throughput or not.  For example, on 3.2.x, there are these four 
instructions that configure parameters and preserve %rax:


    movq    %rax,%rsi
    movl    -12(%rbp),%edx
    movl    $2,%r8d
    leaq    -24(%rbp),%rcx

All's okay so far, but on 3.3.1, the ordering is different:

    movq    %rax,%rsi
    movl    -12(%rbp),%edx
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

If one assumes the instructions are executed in-order, there's a greater 
risk of a pipeline stall if there's only one AGU (Address Generation 
Unit) available, in which case, rearranging the instructions to the 
following might produce better results in regards to port allocation:


    movl    -12(%rbp),%edx
    movq    %rax,%rsi
    leaq    -24(%rbp),%rcx
    movl    

Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread J. Gareth Moreton via fpc-devel
Interesting - thank you.  It will be interesting to study the 
assembler output to see what's going on.


I'm honoured that I've become the go-to person when optimisation is 
concerned!


Gareth aka. Kit

On 03/01/2022 11:54, Martin Frb via fpc-devel wrote:

Hi Gareth,

not sure if this is of interest to you, but I see you do a lot on the 
optimizer


While testing the attached, I found that one of the functions was 
notably slower when compiled with 3.3.1 (compared to 3.2.3).

So maybe it is something you are interested in looking at?

The code in "Utf8LengthFast" (fst) went from around 600ms to 700ms.

3.3.1 from Dec 10th
3.2.3 from Dec 9th

Core I7 8700K
-O4 -Cpcoreavx2

fpc 3.2.3 /   fpc 3.3.1

fst 594   fst 688
fst 578   fst 703
fst 578   fst 687
fst 562   fst 688

pop 485   pop 485
pop 500   pop 500
pop 500   pop 484
pop 484   pop 500

add 594   add 422
add 578   add 438
add 578   add 437
add 594   add 453





Re: [fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread Marco van de Voort via fpc-devel


On 3-1-2022 12:54, Martin Frb via fpc-devel wrote:



fpc 3.2.3 /   fpc 3.3.1

fst 594   fst 688
fst 578   fst 703
fst 578   fst 687
fst 562   fst 688




Fyi, the latest asm version (+fst/pop/add/naieve) is at

http://www.stack.nl/~marcov/utf8lentest.lpr



[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

2022-01-03 Thread Martin Frb via fpc-devel

Hi Gareth,

not sure if this is of interest to you, but I see you do a lot on the 
optimizer


While testing the attached, I found that one of the functions was 
notably slower when compiled with 3.3.1 (compared to 3.2.3).

So maybe it is something you are interested in looking at?

The code in "Utf8LengthFast" (fst) went from around 600ms to 700ms.

3.3.1 from Dec 10th
3.2.3 from Dec 9th

Core I7 8700K
-O4 -Cpcoreavx2

fpc 3.2.3 /   fpc 3.3.1

fst 594   fst 688
fst 578   fst 703
fst 578   fst 687
fst 562   fst 688

pop 485   pop 485
pop 500   pop 500
pop 500   pop 484
pop 484   pop 500

add 594   add 422
add 578   add 438
add 578   add 437
add 594   add 453

//
// (C) 2021 Martin Friebe and Marco van de Voort.
// attempt to accelerate utf8lengthfast, which is a length(s) in utf8 codepoints without integrity checking
//
// 4 versions.
// - Original,
// - with popcount and
// - the "add" variant that accumulates 127 iterations of ptrints and only adds
//   the intermediates outside that loop
// - a SSE2 version loosely inspired by the add variant combined with
//   the core of an existing (branchless) binarization routine for the main loop.

{$mode objfpc}{$H+}
{$asmmode intel}
{$coperators on}
{define asmdebug}

uses SysUtils,StrUtils;

const   
  mask3   :  array[0..15] of byte  = (   $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0);

  mask4   :  array[0..15] of byte  = (   $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80);

  
  mask2   :  array[0..15] of byte  = (   $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1);

// Integer arguments are passed in registers RCX, RDX, R8, and R9.
// Floating point arguments are passed in XMM0L, XMM1L, XMM2L, and XMM3L.
// volatile: RAX, RCX, RDX, R8, R9, R10, R11
// nonvolatile: RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15
// volatile xmm0-xmm3 (params) en xmm4,5
// https://msdn.microsoft.com/en-us/library/ms235286.aspx

{$ifdef asmdebug}
function asmutf8length(const s : pchar;len:integer;res:pbyte):int64;
{$else}
function asmutf8length(const s : pchar;len:integer):int64; {$ifndef Windows}assembler; nostackframe;{$endif}
{$endif}

{$ifdef Windows}
begin
{$endif}
 asm
  // tuning for short strings:
  // --
  {$ifndef Windows}
  // we can't use [s] as an alias for the pointer parameter, because the
  // non-assembler procedure on Windows changes that into a stack reference.
  // FPC doesn't support non-volatile frame management for assembler procs
  // like Delphi does.
  mov rcx,s // rdi
  mov edx,len   // rsi
  {$endif}
  test rax,rax
  je @theend
  cmp rdx,128 // threshold between long and short.
  jl @restbytes

  mov rax,rdx
  mov r10,rcx
  and r10,15
  mov r9,16
  sub r9,r10
  and r9,15
  test r9,r9
  je @nopreloop
  sub rdx,r9
@preloop:  // roughly 2 cycles per iteration on ivy bridge
  movzx r11d, byte [rcx]// unaligned bytes after sse loop
  mov r10,r11
  shr r10,7
  not r11
  shr r11,6
  and r10,r11
  sub rax,r10
  inc rcx
  dec r9
  jne @preloop
@nopreloop:

  mov r9,rdx
  and r9,15
  shr rdx,4
  pxor xmm5,xmm5   // always zero
  pxor xmm6,xmm6   // dword counts

  // using broadcast etc raises requirements? -> use constant loads.

  movdqu xmm1,[rip+mask3]
  movdqu xmm2,[rip+mask4]
  movdqu xmm3,[rip+mask2]

  test rdx,rdx
  je @restbytes

@outer:
  mov r10,127   // max iterations per inner loop
  cmp r10,rdx   // more or less left?
  jl @last  // more
  mov r10,rdx   // less
  @last:
  sub rdx,r10// iterations left - iterations to do

  pxor xmm4,xmm4

// process 127 iterations (limit of signed int8)

@inner:   // +/- 2.2 cycles per iteration for 16 bytes on ivy bridge
  movdqu xmm0, [rcx]
  pand  xmm0,xmm1  // mask out top 2 bits
  pcmpeqb xmm0,xmm2// compare with $80. 
  pand  xmm0,xmm3  // change to $1 per byte.
  paddb  xmm4,xmm0 // add to cumulative

  add rcx,16
  dec r10
  jne @inner


  // SSSE3 vertical adds might help this, but increase CPU reqs.

  movdqa xmm0,xmm4

  PUNPCKLBW xmm0,xmm5   // zero extend to words
  PUNPCKHBW xmm4,xmm5
  paddw xmm0,xmm4   // add, now 8 16-bit words.

  movdqa xmm4,xmm0
  PUNPCKLWD xmm0,xmm5   // zero extend to dwords
  paddd xmm6,xmm0
  PUNPCKHWD  xmm4,xmm5
  paddd xmm6,xmm4   // add both L and H to cumulative 4x dword xmm6 reg
  test rdx,rdx
  jne @outer

  MOVHLPS  xmm4,xmm6