Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-12-09 Thread J. Gareth Moreton via fpc-devel
Any word on testing and approval on the Assembly-level Common 
Subexpression Eliminator (using Sliding Window) over at 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/191? I 
know there's room for improvement in the future for sure.  I just hope 
it won't be prone to breaking like the one Jonas mentioned.


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-03-27 Thread J. Gareth Moreton via fpc-devel

It's done!

https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/191

Whitepaper can be found as an attachment in that merge request. Go ham 
on trying to break it!


It will likely need refactoring later, especially as I want to port it 
to AArch64 at some point to see how it affects code generation on that 
platform.


Gareth aka. Kit



--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-03-11 Thread J. Gareth Moreton via fpc-devel
So far I haven't managed to do very accurate timing measurements - I'll 
try to resolve that and get more accurate figures, especially as I'm 
still writing up the whitepaper explaining everything before I make a 
merge request.  Compilation time I'm not hugely concerned about since 
this sliding window feature is only enabled on -O3 and higher, but the 
faster I can get it, the better, especially as this family of Pascal 
compilers pride themselves on compilation speed.


Gareth aka. Kit

On 11/03/2022 12:43, Flávio Etrusco via fpc-devel wrote:

Kudos! In the end, did it make much of a difference in the compilation time?

Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel
 escreveu:

I did it!

After a good week of work and getting things wrong, I finally found a
solution that works nicely and is extensible, at least for x86.  A bit
of refactoring and it can be ported to other platforms.  I'm just
running the test suites to see if I can break things now.  Honestly the
hard part was making sure all the registers were tracked properly.

https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window

Please feel free to test it and break it!

It will likely undergo adjustments over time to refactor bits and
pieces, add extra instructions (especially as it doesn't support SHRX
and SHLX yet, for example) and see if I can make the sliding window more
efficient - I increased it in size from 8 to 16 and then 32 entries,
since even at 16, some optimisations were missed in the RTL, but this
depends on a number of factors.

It's gotten some pretty good optimisations.  On x86_64-win64 under -O4,
the three lazarus binaries... before:

lazarus.exe:

313,990,206
313,873,982

lazbuild.exe

59,704,391
59,725,895

startlazarus.exe

27,471,147
27,461,419

And the compiler itself, ppcx64.exe:

3,777,536
3,766,272

Remember that though I call the component a "sliding window", it's only
because it's very similar to how LZ77 finds start points for run-length
encoding, and the comments and variable names mention run-length
encoding as it scans sequences of identical instructions.  However, at
no point is it actually performing data compression, and the end-result
is common subexpression elimination.  All the space savings are from the
removal of redundant sequences of instructions, giving both a space and
a speed boost.

Almost every source file in the compiler and the RTL shows some kind of
improvement.  A lot of them are just redundant pointer deallocations, so
this will help with cache misses and the like - that aside though, here
are a couple of my favourites... one from dbgdwarf -
TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:

Before:

  ...
.Lj682:
  leaq(,%r13,8),%rcx
  movq120(%rdi),%rax
  cqto
  idivq%rcx
  imulq%r13,%rax
  movq%rax,%r12
  addq56(%rbp),%r12
  leaq(,%r13,8),%rcx
  movq120(%rdi),%rax
  cqto
  idivq%rcx
  movq%rdx,%rsi
  cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

After:

  ...
.Lj682:
  leaq(,%r13,8),%rcx
  movq120(%rdi),%rax
  cqto
  idivq%rcx
  imulq%r13,%rax
  movq%rax,%r12
  addq56(%rbp),%r12
  movq%rdx,%rsi
  cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

This one has successfully removed an IDIV instruction because the
algorithm was able to detect that the subroutine wanted the remainder in
%edx, and it was still available from the first IDIV call because it
hadn't been overwritten, and neither %r13 nor %rdi had changed values so
the references are the same.

This one is from SysUtils' IsLeapYear function and is one I have
personally wanted to optimise further ever since I first saw its
disassembly after I implemented the fast "x mod const = 0" algorithm.

Before:

  ...
  imulw$23593,%cx,%ax
  rorw$2,%ax
  cmpw$655,%ax
  ja.Lj6979
  imulw$23593,%cx,%ax
  rorw$4,%ax
  cmpw$163,%ax
  setbeb%al
  ret
.Lj6979:
  ...

After:

  ...
  imulw$23593,%cx,%ax
  rorw$2,%ax
  cmpw$655,%ax
  ja.Lj6979
  rorw$2,%ax
  cmpw$163,%ax
  setbeb%al
  ret
.Lj6979:
  ...

In this case, the RLE doesn't produce an exact match since the end
result of %ax is different, but the important point to note is that as
long as %cx doesn't change value (which it doesn't). the first sequence
can be transformed into the second sequence via a "rorw $2,%ax"
instruction, so the end result is that "imulw $23593,%cx,%ax; rorw
$4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of
%ax, thus removing a multiplication.

Because this has been a mammoth undertaking and is quite a major
addition to the compiler, I'm going to hold off making a merge request
until I write a document on it, like I've done in the past with some of
the other larger optimisations I've made.  The branch is 

Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-03-11 Thread Flávio Etrusco via fpc-devel
Kudos! In the end, did it make much of a difference in the compilation time?

Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel
 escreveu:
>
> I did it!
>
> After a good week of work and getting things wrong, I finally found a
> solution that works nicely and is extensible, at least for x86.  A bit
> of refactoring and it can be ported to other platforms.  I'm just
> running the test suites to see if I can break things now.  Honestly the
> hard part was making sure all the registers were tracked properly.
>
> https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window
>
> Please feel free to test it and break it!
>
> It will likely undergo adjustments over time to refactor bits and
> pieces, add extra instructions (especially as it doesn't support SHRX
> and SHLX yet, for example) and see if I can make the sliding window more
> efficient - I increased it in size from 8 to 16 and then 32 entries,
> since even at 16, some optimisations were missed in the RTL, but this
> depends on a number of factors.
>
> It's gotten some pretty good optimisations.  On x86_64-win64 under -O4,
> the three lazarus binaries... before:
>
> lazarus.exe:
>
> 313,990,206
> 313,873,982
>
> lazbuild.exe
>
> 59,704,391
> 59,725,895
>
> startlazarus.exe
>
> 27,471,147
> 27,461,419
>
> And the compiler itself, ppcx64.exe:
>
> 3,777,536
> 3,766,272
>
> Remember that though I call the component a "sliding window", it's only
> because it's very similar to how LZ77 finds start points for run-length
> encoding, and the comments and variable names mention run-length
> encoding as it scans sequences of identical instructions.  However, at
> no point is it actually performing data compression, and the end-result
> is common subexpression elimination.  All the space savings are from the
> removal of redundant sequences of instructions, giving both a space and
> a speed boost.
>
> Almost every source file in the compiler and the RTL shows some kind of
> improvement.  A lot of them are just redundant pointer deallocations, so
> this will help with cache misses and the like - that aside though, here
> are a couple of my favourites... one from dbgdwarf -
> TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:
>
> Before:
>
>  ...
> .Lj682:
>  leaq(,%r13,8),%rcx
>  movq120(%rdi),%rax
>  cqto
>  idivq%rcx
>  imulq%r13,%rax
>  movq%rax,%r12
>  addq56(%rbp),%r12
>  leaq(,%r13,8),%rcx
>  movq120(%rdi),%rax
>  cqto
>  idivq%rcx
>  movq%rdx,%rsi
>  cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
>
> After:
>
>  ...
> .Lj682:
>  leaq(,%r13,8),%rcx
>  movq120(%rdi),%rax
>  cqto
>  idivq%rcx
>  imulq%r13,%rax
>  movq%rax,%r12
>  addq56(%rbp),%r12
>  movq%rdx,%rsi
>  cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
>
> This one has successfully removed an IDIV instruction because the
> algorithm was able to detect that the subroutine wanted the remainder in
> %edx, and it was still available from the first IDIV call because it
> hadn't been overwritten, and neither %r13 nor %rdi had changed values so
> the references are the same.
>
> This one is from SysUtils' IsLeapYear function and is one I have
> personally wanted to optimise further ever since I first saw its
> disassembly after I implemented the fast "x mod const = 0" algorithm.
>
> Before:
>
>  ...
>  imulw$23593,%cx,%ax
>  rorw$2,%ax
>  cmpw$655,%ax
>  ja.Lj6979
>  imulw$23593,%cx,%ax
>  rorw$4,%ax
>  cmpw$163,%ax
>  setbeb%al
>  ret
> .Lj6979:
>  ...
>
> After:
>
>  ...
>  imulw$23593,%cx,%ax
>  rorw$2,%ax
>  cmpw$655,%ax
>  ja.Lj6979
>  rorw$2,%ax
>  cmpw$163,%ax
>  setbeb%al
>  ret
> .Lj6979:
>  ...
>
> In this case, the RLE doesn't produce an exact match since the end
> result of %ax is different, but the important point to note is that as
> long as %cx doesn't change value (which it doesn't). the first sequence
> can be transformed into the second sequence via a "rorw $2,%ax"
> instruction, so the end result is that "imulw $23593,%cx,%ax; rorw
> $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of
> %ax, thus removing a multiplication.
>
> Because this has been a mammoth undertaking and is quite a major
> addition to the compiler, I'm going to hold off making a merge request
> until I write a document on it, like I've done in the past with some of
> the other larger optimisations I've made.  The branch is available at
> the link at the top of this e-mail though if anyone wants to take a look
> at it.
>
> One final note is that this optimisation is rather slow and specialist,
> so much so that I've added a new optimizer switch named 'cs_opt_asmcse'
> ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled
> by default with -O3, and 

Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-03-07 Thread J. Gareth Moreton via fpc-devel
Hey everyone.  I'm having a bit of difficulty with writing the 
whitepaper for this feature since I'm struggling to find the words to 
explain what I've done and why I've done it the way I've done it.  Of 
course it could mean it's too complicated, but I'm also possibly very 
out of practice or going senile.  I'll keep trying.


Gareth aka. Kit

On 27/02/2022 23:37, J. Gareth Moreton via fpc-devel wrote:
I will need to refactor this feature later on, especially with names, 
because the data structure is not a true sliding window because it 
only operates over a subset of instructions that are added to a list 
based on whether they fit the criteria of what I call 'seed 
instructions'.  In future, the sliding window may be replaced 
completely.  But for now it's just semantics and hopefully all will be 
explained in the whitepaper once I've finished it.


Gareth aka. Kit




--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-27 Thread J. Gareth Moreton via fpc-devel
I will need to refactor this feature later on, especially with names, 
because the data structure is not a true sliding window because it only 
operates over a subset of instructions that are added to a list based on 
whether they fit the criteria of what I call 'seed instructions'.  In 
future, the sliding window may be replaced completely.  But for now it's 
just semantics and hopefully all will be explained in the whitepaper 
once I've finished it.


Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-27 Thread J. Gareth Moreton via fpc-devel
All tests passed successfully, including an extra addition that shaved 
another 5kb off the compiler!  Now the hard part... writing that 
whitepaper for everyone, since I think this is one of those times where 
it will be necessary.


But if anyone wants to analyse and test it out before I make a merge 
request, here it is: 
https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window


In terms of size, it reduces the compiler size on x86_64-win64 by about 
0.5%, but what will be interesting to hear feedback on is execution 
speed, since it's removing redundant code and reducing the number of 
pointer deallocations.


Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-25 Thread J. Gareth Moreton via fpc-devel
Fixed those bugs.  I forgot to check to see if the input registers (if 
they are registers) were identical when it came across an arithmetic 
instruction (this was because the code was extended from just handling 
shift and rotate instructions, which can only read from %cl).


Gareth aka. Kit

On 25/02/2022 13:08, J. Gareth Moreton via fpc-devel wrote:
Well I'm not out of the woods yet, I've got one failure on 
x86_64-win64 and two on i386-win32:


x86_64-win64:

Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30

i386-win32:

Failed to run test/packages/bzip2/tbzip2streamtest.pp 2021/09/13 08:19:28
Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30

Given the nature of the tests, my instincts tell me it's something 
small and should be easily fixable.


Gareth aka. Kit




--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-25 Thread J. Gareth Moreton via fpc-devel


On 25/02/2022 08:29, Marco Borsari via fpc-devel wrote:

This is very useful, thank you.
I think FPC has an excellent register allocator, but frustrated on 32 bit
by scarce resources and by the lack of reloading check.


Unfortunately the equivalent procedure isn't optimised on i386-win32:

.Lj679:
    movl    %eax,%edx
.Lj680:
    movl    %edx,-832(%ebp)
    leal    (,%edx,8),%ecx
    movl    -824(%ebp),%edx
    movl    76(%edx),%eax
    cltd
    idivl    %ecx
    imull    -832(%ebp),%eax
    movl    %eax,-828(%ebp)
    addl    8(%ebp),%eax
    movl    %eax,-828(%ebp)
    movl    -832(%ebp),%eax
    leal    (,%eax,8),%ecx
    movl    -824(%ebp),%edx
    movl    76(%edx),%eax
    cltd
    idivl    %ecx
    movl    %edx,%esi

The compiler has no way of knowing that -832(%ebp) contains the value of 
%edx at the start and hence loaded into %eax (which is used for the 
initial address instead of %edx, although the optimisation would still 
fail even if they used the same registers) in the repeated sequence.  A 
lot of these optimisations may require a means of adding 'hints' to the 
assembly language list to indicate the state of things.


A more minor example in the same unit (dbgdwarf):

    movl    %eax,%esi
    movl    60(%eax),%edx
    movl    -564(%ebp),%eax
    cmpl    72(%eax),%edx
    jl    .Lj359
    movl    60(%esi),%edx
    movl    -564(%ebp),%eax
    cmpl    76(%eax),%edx

This only gets optimised to:

    movl    %eax,%esi
    movl    60(%eax),%edx
    movl    -564(%ebp),%eax
    cmpl    72(%eax),%edx
    jl    .Lj359
    movl    60(%esi),%edx
    cmpl    76(%eax),%edx

This is because the peephole optimiser changes %esi to %eax in the "movl 
60(%eax),%edx" instruction on account that it will minimise a pipeline 
stall (it doesn't have to wait for %esi to get loaded when %eax is 
definitely loaded).  If there was a means of leaving a hint that %esi = 
%eax at that point, then it might be possible to better optimise it to 
the ideal:


    movl    %eax,%esi
    movl    60(%eax),%edx
    movl    -564(%ebp),%eax
    cmpl    72(%eax),%edx
    jl    .Lj359
    cmpl    76(%eax),%edx

This is what my proposed feature over at 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/74 is 
meant to help with (the showcase uses the "extra optimisation 
information" to store information on the state of the upper 32 bits of 
registers in x86_64, so it can make deeper optimisations knowing whether 
it's set to zero or not).


Some other things might need some deeper thought:

    movl    -16(%ebp),%edx
    movl    (%edx),%eax
    movl    20(%eax),%eax
    movl    20(%eax),%eax
    movzbl    169(%eax),%eax
    pushl    %eax
    movl    -16(%ebp),%edx
    movl    (%edx),%eax

For some reason, the second "movl -16(%ebp),%edx" isn't removed. I'm not 
sure yet whether this is because the sliding window is too small (the 
first one gets removed due to another "movl -16(%ebp),%edx" that appears 
earlier, so this entry does NOT appear in the sliding window, only the 
earlier one) or because the compiler makes some incorrect assumptions 
about PUSH instructions and hence thinks the value of %edx is unreliable.


Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-25 Thread J. Gareth Moreton via fpc-devel
Well I'm not out of the woods yet, I've got one failure on x86_64-win64 
and two on i386-win32:


x86_64-win64:

Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30

i386-win32:

Failed to run test/packages/bzip2/tbzip2streamtest.pp 2021/09/13 08:19:28
Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30

Given the nature of the tests, my instincts tell me it's something small 
and should be easily fixable.


Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-25 Thread Marco Borsari via fpc-devel
On Fri, 25 Feb 2022 05:08:48 +
"J. Gareth Moreton via fpc-devel"  wrote:


> Almost every source file in the compiler and the RTL shows some kind of 
> improvement.  A lot of them are just redundant pointer deallocations, so 
> this will help with cache misses and the like - that aside though, here 
> are a couple of my favourites... one from dbgdwarf - 
> TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:
> 
> Before:
> 
>      ...
> .Lj682:
>      leaq    (,%r13,8),%rcx
>      movq    120(%rdi),%rax
>      cqto
>      idivq    %rcx
>      imulq    %r13,%rax
>      movq    %rax,%r12
>      addq    56(%rbp),%r12
>      leaq    (,%r13,8),%rcx
>      movq    120(%rdi),%rax
>      cqto
>      idivq    %rcx
>      movq    %rdx,%rsi
>      cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
> 
> After:
> 
>      ...
> .Lj682:
>      leaq    (,%r13,8),%rcx
>      movq    120(%rdi),%rax
>      cqto
>      idivq    %rcx
>      imulq    %r13,%rax
>      movq    %rax,%r12
>      addq    56(%rbp),%r12
>      movq    %rdx,%rsi
>      cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
> 
> This one has successfully removed an IDIV instruction because the 
> algorithm was able to detect that the subroutine wanted the remainder in 
> %edx, and it was still available from the first IDIV call because it 
> hadn't been overwritten, and neither %r13 nor %rdi had changed values so 
> the references are the same.

This is very useful, thank you.
I think FPC has an excellent register allocator, but frustrated on 32 bit
by scarce resources and by the lack of reloading check.

-- 
Simplex sigillum veri
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-24 Thread J. Gareth Moreton via fpc-devel

I did it!

After a good week of work and getting things wrong, I finally found a 
solution that works nicely and is extensible, at least for x86.  A bit 
of refactoring and it can be ported to other platforms.  I'm just 
running the test suites to see if I can break things now.  Honestly the 
hard part was making sure all the registers were tracked properly.


https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window

Please feel free to test it and break it!

It will likely undergo adjustments over time to refactor bits and 
pieces, add extra instructions (especially as it doesn't support SHRX 
and SHLX yet, for example) and see if I can make the sliding window more 
efficient - I increased it in size from 8 to 16 and then 32 entries, 
since even at 16, some optimisations were missed in the RTL, but this 
depends on a number of factors.


It's gotten some pretty good optimisations.  On x86_64-win64 under -O4, 
the three lazarus binaries... before:


lazarus.exe:

313,990,206
313,873,982

lazbuild.exe

59,704,391
59,725,895

startlazarus.exe

27,471,147
27,461,419

And the compiler itself, ppcx64.exe:

3,777,536
3,766,272

Remember that though I call the component a "sliding window", it's only 
because it's very similar to how LZ77 finds start points for run-length 
encoding, and the comments and variable names mention run-length 
encoding as it scans sequences of identical instructions.  However, at 
no point is it actually performing data compression, and the end-result 
is common subexpression elimination.  All the space savings are from the 
removal of redundant sequences of instructions, giving both a space and 
a speed boost.


Almost every source file in the compiler and the RTL shows some kind of 
improvement.  A lot of them are just redundant pointer deallocations, so 
this will help with cache misses and the like - that aside though, here 
are a couple of my favourites... one from dbgdwarf - 
TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:


Before:

    ...
.Lj682:
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    imulq    %r13,%rax
    movq    %rax,%r12
    addq    56(%rbp),%r12
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    movq    %rdx,%rsi
    cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

After:

    ...
.Lj682:
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    imulq    %r13,%rax
    movq    %rax,%r12
    addq    56(%rbp),%r12
    movq    %rdx,%rsi
    cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

This one has successfully removed an IDIV instruction because the 
algorithm was able to detect that the subroutine wanted the remainder in 
%edx, and it was still available from the first IDIV call because it 
hadn't been overwritten, and neither %r13 nor %rdi had changed values so 
the references are the same.


This one is from SysUtils' IsLeapYear function and is one I have 
personally wanted to optimise further ever since I first saw its 
disassembly after I implemented the fast "x mod const = 0" algorithm.


Before:

    ...
    imulw    $23593,%cx,%ax
    rorw    $2,%ax
    cmpw    $655,%ax
    ja    .Lj6979
    imulw    $23593,%cx,%ax
    rorw    $4,%ax
    cmpw    $163,%ax
    setbeb    %al
    ret
.Lj6979:
    ...

After:

    ...
    imulw    $23593,%cx,%ax
    rorw    $2,%ax
    cmpw    $655,%ax
    ja    .Lj6979
    rorw    $2,%ax
    cmpw    $163,%ax
    setbeb    %al
    ret
.Lj6979:
    ...

In this case, the RLE doesn't produce an exact match since the end 
result of %ax is different, but the important point to note is that as 
long as %cx doesn't change value (which it doesn't). the first sequence 
can be transformed into the second sequence via a "rorw $2,%ax" 
instruction, so the end result is that "imulw $23593,%cx,%ax; rorw 
$4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of 
%ax, thus removing a multiplication.


Because this has been a mammoth undertaking and is quite a major 
addition to the compiler, I'm going to hold off making a merge request 
until I write a document on it, like I've done in the past with some of 
the other larger optimisations I've made.  The branch is available at 
the link at the top of this e-mail though if anyone wants to take a look 
at it.


One final note is that this optimisation is rather slow and specialist, 
so much so that I've added a new optimizer switch named 'cs_opt_asmcse' 
("Assembly CSE", to differentiate it from "Node CSE"); this is enabled 
by default with -O3, and requires the peephole optimizer to also be 
enabled in order to function.


Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-17 Thread J. Gareth Moreton via fpc-devel
That's useful to know, thanks.  Ideally I'd like to put it behind a kind 
of CSE compiler flag because this will add a degree of overhead to it 
and make compilation slower (currently I'm only doing it for -O3).  
Currently I'm only tracking "mov (ref),%reg" and "lea (ref),%reg" 
instructions and will build from there.


You know I'm never one to turn down a challenge - honestly I'm surprised 
I got this much success already!


I'll take a look at your commits.  At present, to ensure mistakes are 
not made, the sliding window is emptied/reset whenever the following is hit:


- A live label.
- An instruction that modifies the stack pointer (this one may prove 
unnecessary due to register tracking).
- An instruction that writes to memory, since even if it writes to the 
stack or global memory, there's no way to fully predict whether another 
register is pointing to this location and is hence volatile.


Gareth aka. Kit

On 17/02/2022 21:38, Jonas Maebe via fpc-devel wrote:

On 17/02/2022 20:25, J. Gareth Moreton via fpc-devel wrote:
P.S. The term "sliding window" comes from the LZ77 compression 
algorithm and is used to track repeated sequences in ZIP files, among 
others. This prototype optimisation essentially uses the same 
construct, but built for instructions instead of individual bytes.


The usual term for this kind of optimisations is common subexpression 
elimination (CSE). We used to have one at the assembler level for i386 
(my first main contribution to the compiler), but removed it in the 
end because it was extremely prone to breaking — even more so than the 
peephole optimiser. This happened in 
a04cae2c4ba32bc5c7a8461b5851dee8e70c272b (it was supposed to have 
happened in 1b439307490861afd739ba5a48dc175f09727a89, but I made a 
mistake).


It doesn't mean it can't be implemented in a robust way, but I can 
guarantee from experience that it is extremely hard.



Jonas
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Prototype optimisation... Sliding Window

2022-02-17 Thread Jonas Maebe via fpc-devel

On 17/02/2022 20:25, J. Gareth Moreton via fpc-devel wrote:
P.S. The term "sliding window" comes from the LZ77 compression algorithm 
and is used to track repeated sequences in ZIP files, among others. This 
prototype optimisation essentially uses the same construct, but built 
for instructions instead of individual bytes.


The usual term for this kind of optimisations is common subexpression 
elimination (CSE). We used to have one at the assembler level for i386 
(my first main contribution to the compiler), but removed it in the end 
because it was extremely prone to breaking — even more so than the 
peephole optimiser. This happened in 
a04cae2c4ba32bc5c7a8461b5851dee8e70c272b (it was supposed to have 
happened in 1b439307490861afd739ba5a48dc175f09727a89, but I made a mistake).


It doesn't mean it can't be implemented in a robust way, but I can 
guarantee from experience that it is extremely hard.



Jonas
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


[fpc-devel] Prototype optimisation... Sliding Window

2022-02-17 Thread J. Gareth Moreton via fpc-devel

Hi everyone,

So I've started experimenting with a new technique in the peephole 
optimizer for x86 platforms that I've named the Sliding Window.  The 
intention is to use it to help replace common blocks of code within a 
procedure, such as pointer dereferences.  So far I'm having a degree of 
success, although it's pre-alpha currently.


In the aasmcnst disassembly - before:

.section 
.text.n_aasmcnst$_$ttai_typedconstbuilder_$__$$_emit_pooled_shortstring_const_ref$shortstring,"ax"

    ...
    movq    %rax,%rdx
    movq    U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax
    movq    224(%rax),%rcx
    movq    U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax
    movq    224(%rax),%rax
    movq    (%rax),%rax
    call    *232(%rax)
   ...

After:

.section 
.text.n_aasmcnst$_$ttai_typedconstbuilder_$__$$_emit_pooled_shortstring_const_ref$shortstring,"ax"

    ...
    movq    %rax,%rdx
    movq    U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax
    movq    224(%rax),%rcx
    movq    (%rcx),%rax
    call    *232(%rax)
    ...

The peephole optimizer replaces the second " movq 
U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax" and the "movq 224(%rax),%rax" 
instruction with "mov %rcx,%rax", which is then removed by a 
conventional peephole optimisation because %rax is completely 
overwritten on the next instruction ("mov (%rcx),%rax").  I sense I 
might be onto a winner here!


I do have a question though... are there any tai objects that signal a 
memory fence or a synchronisation hint so it doesn't optimise a repeated 
reference? I know instructions like MFENCE can kind of force it.


Gareth aka. Kit

P.S. The term "sliding window" comes from the LZ77 compression algorithm 
and is used to track repeated sequences in ZIP files, among others.  
This prototype optimisation essentially uses the same construct, but 
built for instructions instead of individual bytes.



--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel