Re: [fpc-devel] Prototype optimisation... Sliding Window
Any word on testing and approval on the Assembly-level Common Subexpression Eliminator (using Sliding Window) over at https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/191? I know there's room for improvement in the future for sure. I just hope it won't be prone to breaking like the one Jonas mentioned. Kit ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
It's done! https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/191 Whitepaper can be found as an attachment in that merge request. Go ham on trying to break it! It will likely need refactoring later, especially as I want to port it to AArch64 at some point to see how it affects code generation on that platform. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
So far I haven't managed to do very accurate timing measurements - I'll try to resolve that and get more accurate figures, especially as I'm still writing up the whitepaper explaining everything before I make a merge request. Compilation time I'm not hugely concerned about since this sliding window feature is only enabled on -O3 and higher, but the faster I can get it, the better, especially as this family of Pascal compilers pride themselves on compilation speed. Gareth aka. Kit On 11/03/2022 12:43, Flávio Etrusco via fpc-devel wrote: Kudos! In the end, did it make much of a difference in the compilation time? Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel escreveu: I did it! After a good week of work and getting things wrong, I finally found a solution that works nicely and is extensible, at least for x86. A bit of refactoring and it can be ported to other platforms. I'm just running the test suites to see if I can break things now. Honestly the hard part was making sure all the registers were tracked properly. https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window Please feel free to test it and break it! It will likely undergo adjustments over time to refactor bits and pieces, add extra instructions (especially as it doesn't support SHRX and SHLX yet, for example) and see if I can make the sliding window more efficient - I increased it in size from 8 to 16 and then 32 entries, since even at 16, some optimisations were missed in the RTL, but this depends on a number of factors. It's gotten some pretty good optimisations. On x86_64-win64 under -O4, the three lazarus binaries... before: lazarus.exe: 313,990,206 313,873,982 lazbuild.exe 59,704,391 59,725,895 startlazarus.exe 27,471,147 27,461,419 And the compiler itself, ppcx64.exe: 3,777,536 3,766,272 Remember that though I call the component a "sliding window", it's only because it's very similar to how LZ77 finds start points for run-length encoding, and the comments and variable names mention run-length encoding as it scans sequences of identical instructions. However, at no point is it actually performing data compression, and the end-result is common subexpression elimination. All the space savings are from the removal of redundant sequences of instructions, giving both a space and a speed boost. Almost every source file in the compiler and the RTL shows some kind of improvement. A lot of them are just redundant pointer deallocations, so this will help with cache misses and the like - that aside though, here are a couple of my favourites... one from dbgdwarf - TDebugInfoDwarf.appendsym_fieldvar_with_name_offset: Before: ... .Lj682: leaq(,%r13,8),%rcx movq120(%rdi),%rax cqto idivq%rcx imulq%r13,%rax movq%rax,%r12 addq56(%rbp),%r12 leaq(,%r13,8),%rcx movq120(%rdi),%rax cqto idivq%rcx movq%rdx,%rsi cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) After: ... .Lj682: leaq(,%r13,8),%rcx movq120(%rdi),%rax cqto idivq%rcx imulq%r13,%rax movq%rax,%r12 addq56(%rbp),%r12 movq%rdx,%rsi cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) This one has successfully removed an IDIV instruction because the algorithm was able to detect that the subroutine wanted the remainder in %edx, and it was still available from the first IDIV call because it hadn't been overwritten, and neither %r13 nor %rdi had changed values so the references are the same. This one is from SysUtils' IsLeapYear function and is one I have personally wanted to optimise further ever since I first saw its disassembly after I implemented the fast "x mod const = 0" algorithm. Before: ... imulw$23593,%cx,%ax rorw$2,%ax cmpw$655,%ax ja.Lj6979 imulw$23593,%cx,%ax rorw$4,%ax cmpw$163,%ax setbeb%al ret .Lj6979: ... After: ... imulw$23593,%cx,%ax rorw$2,%ax cmpw$655,%ax ja.Lj6979 rorw$2,%ax cmpw$163,%ax setbeb%al ret .Lj6979: ... In this case, the RLE doesn't produce an exact match since the end result of %ax is different, but the important point to note is that as long as %cx doesn't change value (which it doesn't). the first sequence can be transformed into the second sequence via a "rorw $2,%ax" instruction, so the end result is that "imulw $23593,%cx,%ax; rorw $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of %ax, thus removing a multiplication. Because this has been a mammoth undertaking and is quite a major addition to the compiler, I'm going to hold off making a merge request until I write a document on it, like I've done in the past with some of the other larger optimisations I've made. The branch is
Re: [fpc-devel] Prototype optimisation... Sliding Window
Kudos! In the end, did it make much of a difference in the compilation time? Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel escreveu: > > I did it! > > After a good week of work and getting things wrong, I finally found a > solution that works nicely and is extensible, at least for x86. A bit > of refactoring and it can be ported to other platforms. I'm just > running the test suites to see if I can break things now. Honestly the > hard part was making sure all the registers were tracked properly. > > https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window > > Please feel free to test it and break it! > > It will likely undergo adjustments over time to refactor bits and > pieces, add extra instructions (especially as it doesn't support SHRX > and SHLX yet, for example) and see if I can make the sliding window more > efficient - I increased it in size from 8 to 16 and then 32 entries, > since even at 16, some optimisations were missed in the RTL, but this > depends on a number of factors. > > It's gotten some pretty good optimisations. On x86_64-win64 under -O4, > the three lazarus binaries... before: > > lazarus.exe: > > 313,990,206 > 313,873,982 > > lazbuild.exe > > 59,704,391 > 59,725,895 > > startlazarus.exe > > 27,471,147 > 27,461,419 > > And the compiler itself, ppcx64.exe: > > 3,777,536 > 3,766,272 > > Remember that though I call the component a "sliding window", it's only > because it's very similar to how LZ77 finds start points for run-length > encoding, and the comments and variable names mention run-length > encoding as it scans sequences of identical instructions. However, at > no point is it actually performing data compression, and the end-result > is common subexpression elimination. All the space savings are from the > removal of redundant sequences of instructions, giving both a space and > a speed boost. > > Almost every source file in the compiler and the RTL shows some kind of > improvement. A lot of them are just redundant pointer deallocations, so > this will help with cache misses and the like - that aside though, here > are a couple of my favourites... one from dbgdwarf - > TDebugInfoDwarf.appendsym_fieldvar_with_name_offset: > > Before: > > ... > .Lj682: > leaq(,%r13,8),%rcx > movq120(%rdi),%rax > cqto > idivq%rcx > imulq%r13,%rax > movq%rax,%r12 > addq56(%rbp),%r12 > leaq(,%r13,8),%rcx > movq120(%rdi),%rax > cqto > idivq%rcx > movq%rdx,%rsi > cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) > > After: > > ... > .Lj682: > leaq(,%r13,8),%rcx > movq120(%rdi),%rax > cqto > idivq%rcx > imulq%r13,%rax > movq%rax,%r12 > addq56(%rbp),%r12 > movq%rdx,%rsi > cmpb$0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) > > This one has successfully removed an IDIV instruction because the > algorithm was able to detect that the subroutine wanted the remainder in > %edx, and it was still available from the first IDIV call because it > hadn't been overwritten, and neither %r13 nor %rdi had changed values so > the references are the same. > > This one is from SysUtils' IsLeapYear function and is one I have > personally wanted to optimise further ever since I first saw its > disassembly after I implemented the fast "x mod const = 0" algorithm. > > Before: > > ... > imulw$23593,%cx,%ax > rorw$2,%ax > cmpw$655,%ax > ja.Lj6979 > imulw$23593,%cx,%ax > rorw$4,%ax > cmpw$163,%ax > setbeb%al > ret > .Lj6979: > ... > > After: > > ... > imulw$23593,%cx,%ax > rorw$2,%ax > cmpw$655,%ax > ja.Lj6979 > rorw$2,%ax > cmpw$163,%ax > setbeb%al > ret > .Lj6979: > ... > > In this case, the RLE doesn't produce an exact match since the end > result of %ax is different, but the important point to note is that as > long as %cx doesn't change value (which it doesn't). the first sequence > can be transformed into the second sequence via a "rorw $2,%ax" > instruction, so the end result is that "imulw $23593,%cx,%ax; rorw > $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of > %ax, thus removing a multiplication. > > Because this has been a mammoth undertaking and is quite a major > addition to the compiler, I'm going to hold off making a merge request > until I write a document on it, like I've done in the past with some of > the other larger optimisations I've made. The branch is available at > the link at the top of this e-mail though if anyone wants to take a look > at it. > > One final note is that this optimisation is rather slow and specialist, > so much so that I've added a new optimizer switch named 'cs_opt_asmcse' > ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled > by default with -O3, and
Re: [fpc-devel] Prototype optimisation... Sliding Window
Hey everyone. I'm having a bit of difficulty with writing the whitepaper for this feature since I'm struggling to find the words to explain what I've done and why I've done it the way I've done it. Of course it could mean it's too complicated, but I'm also possibly very out of practice or going senile. I'll keep trying. Gareth aka. Kit On 27/02/2022 23:37, J. Gareth Moreton via fpc-devel wrote: I will need to refactor this feature later on, especially with names, because the data structure is not a true sliding window because it only operates over a subset of instructions that are added to a list based on whether they fit the criteria of what I call 'seed instructions'. In future, the sliding window may be replaced completely. But for now it's just semantics and hopefully all will be explained in the whitepaper once I've finished it. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
I will need to refactor this feature later on, especially with names, because the data structure is not a true sliding window because it only operates over a subset of instructions that are added to a list based on whether they fit the criteria of what I call 'seed instructions'. In future, the sliding window may be replaced completely. But for now it's just semantics and hopefully all will be explained in the whitepaper once I've finished it. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
All tests passed successfully, including an extra addition that shaved another 5kb off the compiler! Now the hard part... writing that whitepaper for everyone, since I think this is one of those times where it will be necessary. But if anyone wants to analyse and test it out before I make a merge request, here it is: https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window In terms of size, it reduces the compiler size on x86_64-win64 by about 0.5%, but what will be interesting to hear feedback on is execution speed, since it's removing redundant code and reducing the number of pointer deallocations. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
Fixed those bugs. I forgot to check to see if the input registers (if they are registers) were identical when it came across an arithmetic instruction (this was because the code was extended from just handling shift and rotate instructions, which can only read from %cl). Gareth aka. Kit On 25/02/2022 13:08, J. Gareth Moreton via fpc-devel wrote: Well I'm not out of the woods yet, I've got one failure on x86_64-win64 and two on i386-win32: x86_64-win64: Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30 i386-win32: Failed to run test/packages/bzip2/tbzip2streamtest.pp 2021/09/13 08:19:28 Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30 Given the nature of the tests, my instincts tell me it's something small and should be easily fixable. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
On 25/02/2022 08:29, Marco Borsari via fpc-devel wrote: This is very useful, thank you. I think FPC has an excellent register allocator, but frustrated on 32 bit by scarce resources and by the lack of reloading check. Unfortunately the equivalent procedure isn't optimised on i386-win32: .Lj679: movl %eax,%edx .Lj680: movl %edx,-832(%ebp) leal (,%edx,8),%ecx movl -824(%ebp),%edx movl 76(%edx),%eax cltd idivl %ecx imull -832(%ebp),%eax movl %eax,-828(%ebp) addl 8(%ebp),%eax movl %eax,-828(%ebp) movl -832(%ebp),%eax leal (,%eax,8),%ecx movl -824(%ebp),%edx movl 76(%edx),%eax cltd idivl %ecx movl %edx,%esi The compiler has no way of knowing that -832(%ebp) contains the value of %edx at the start and hence loaded into %eax (which is used for the initial address instead of %edx, although the optimisation would still fail even if they used the same registers) in the repeated sequence. A lot of these optimisations may require a means of adding 'hints' to the assembly language list to indicate the state of things. A more minor example in the same unit (dbgdwarf): movl %eax,%esi movl 60(%eax),%edx movl -564(%ebp),%eax cmpl 72(%eax),%edx jl .Lj359 movl 60(%esi),%edx movl -564(%ebp),%eax cmpl 76(%eax),%edx This only gets optimised to: movl %eax,%esi movl 60(%eax),%edx movl -564(%ebp),%eax cmpl 72(%eax),%edx jl .Lj359 movl 60(%esi),%edx cmpl 76(%eax),%edx This is because the peephole optimiser changes %esi to %eax in the "movl 60(%eax),%edx" instruction on account that it will minimise a pipeline stall (it doesn't have to wait for %esi to get loaded when %eax is definitely loaded). If there was a means of leaving a hint that %esi = %eax at that point, then it might be possible to better optimise it to the ideal: movl %eax,%esi movl 60(%eax),%edx movl -564(%ebp),%eax cmpl 72(%eax),%edx jl .Lj359 cmpl 76(%eax),%edx This is what my proposed feature over at https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/74 is meant to help with (the showcase uses the "extra optimisation information" to store information on the state of the upper 32 bits of registers in x86_64, so it can make deeper optimisations knowing whether it's set to zero or not). Some other things might need some deeper thought: movl -16(%ebp),%edx movl (%edx),%eax movl 20(%eax),%eax movl 20(%eax),%eax movzbl 169(%eax),%eax pushl %eax movl -16(%ebp),%edx movl (%edx),%eax For some reason, the second "movl -16(%ebp),%edx" isn't removed. I'm not sure yet whether this is because the sliding window is too small (the first one gets removed due to another "movl -16(%ebp),%edx" that appears earlier, so this entry does NOT appear in the sliding window, only the earlier one) or because the compiler makes some incorrect assumptions about PUSH instructions and hence thinks the value of %edx is unreliable. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
Well I'm not out of the woods yet, I've got one failure on x86_64-win64 and two on i386-win32: x86_64-win64: Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30 i386-win32: Failed to run test/packages/bzip2/tbzip2streamtest.pp 2021/09/13 08:19:28 Failed to run webtbs/tw16040.pp 2021/09/13 08:19:30 Given the nature of the tests, my instincts tell me it's something small and should be easily fixable. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
On Fri, 25 Feb 2022 05:08:48 + "J. Gareth Moreton via fpc-devel" wrote: > Almost every source file in the compiler and the RTL shows some kind of > improvement. A lot of them are just redundant pointer deallocations, so > this will help with cache misses and the like - that aside though, here > are a couple of my favourites... one from dbgdwarf - > TDebugInfoDwarf.appendsym_fieldvar_with_name_offset: > > Before: > > ... > .Lj682: > leaq (,%r13,8),%rcx > movq 120(%rdi),%rax > cqto > idivq %rcx > imulq %r13,%rax > movq %rax,%r12 > addq 56(%rbp),%r12 > leaq (,%r13,8),%rcx > movq 120(%rdi),%rax > cqto > idivq %rcx > movq %rdx,%rsi > cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) > > After: > > ... > .Lj682: > leaq (,%r13,8),%rcx > movq 120(%rdi),%rax > cqto > idivq %rcx > imulq %r13,%rax > movq %rax,%r12 > addq 56(%rbp),%r12 > movq %rdx,%rsi > cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) > > This one has successfully removed an IDIV instruction because the > algorithm was able to detect that the subroutine wanted the remainder in > %edx, and it was still available from the first IDIV call because it > hadn't been overwritten, and neither %r13 nor %rdi had changed values so > the references are the same. This is very useful, thank you. I think FPC has an excellent register allocator, but frustrated on 32 bit by scarce resources and by the lack of reloading check. -- Simplex sigillum veri ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
I did it! After a good week of work and getting things wrong, I finally found a solution that works nicely and is extensible, at least for x86. A bit of refactoring and it can be ported to other platforms. I'm just running the test suites to see if I can break things now. Honestly the hard part was making sure all the registers were tracked properly. https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window Please feel free to test it and break it! It will likely undergo adjustments over time to refactor bits and pieces, add extra instructions (especially as it doesn't support SHRX and SHLX yet, for example) and see if I can make the sliding window more efficient - I increased it in size from 8 to 16 and then 32 entries, since even at 16, some optimisations were missed in the RTL, but this depends on a number of factors. It's gotten some pretty good optimisations. On x86_64-win64 under -O4, the three lazarus binaries... before: lazarus.exe: 313,990,206 313,873,982 lazbuild.exe 59,704,391 59,725,895 startlazarus.exe 27,471,147 27,461,419 And the compiler itself, ppcx64.exe: 3,777,536 3,766,272 Remember that though I call the component a "sliding window", it's only because it's very similar to how LZ77 finds start points for run-length encoding, and the comments and variable names mention run-length encoding as it scans sequences of identical instructions. However, at no point is it actually performing data compression, and the end-result is common subexpression elimination. All the space savings are from the removal of redundant sequences of instructions, giving both a space and a speed boost. Almost every source file in the compiler and the RTL shows some kind of improvement. A lot of them are just redundant pointer deallocations, so this will help with cache misses and the like - that aside though, here are a couple of my favourites... one from dbgdwarf - TDebugInfoDwarf.appendsym_fieldvar_with_name_offset: Before: ... .Lj682: leaq (,%r13,8),%rcx movq 120(%rdi),%rax cqto idivq %rcx imulq %r13,%rax movq %rax,%r12 addq 56(%rbp),%r12 leaq (,%r13,8),%rcx movq 120(%rdi),%rax cqto idivq %rcx movq %rdx,%rsi cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) After: ... .Lj682: leaq (,%r13,8),%rcx movq 120(%rdi),%rax cqto idivq %rcx imulq %r13,%rax movq %rax,%r12 addq 56(%rbp),%r12 movq %rdx,%rsi cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip) This one has successfully removed an IDIV instruction because the algorithm was able to detect that the subroutine wanted the remainder in %edx, and it was still available from the first IDIV call because it hadn't been overwritten, and neither %r13 nor %rdi had changed values so the references are the same. This one is from SysUtils' IsLeapYear function and is one I have personally wanted to optimise further ever since I first saw its disassembly after I implemented the fast "x mod const = 0" algorithm. Before: ... imulw $23593,%cx,%ax rorw $2,%ax cmpw $655,%ax ja .Lj6979 imulw $23593,%cx,%ax rorw $4,%ax cmpw $163,%ax setbeb %al ret .Lj6979: ... After: ... imulw $23593,%cx,%ax rorw $2,%ax cmpw $655,%ax ja .Lj6979 rorw $2,%ax cmpw $163,%ax setbeb %al ret .Lj6979: ... In this case, the RLE doesn't produce an exact match since the end result of %ax is different, but the important point to note is that as long as %cx doesn't change value (which it doesn't). the first sequence can be transformed into the second sequence via a "rorw $2,%ax" instruction, so the end result is that "imulw $23593,%cx,%ax; rorw $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of %ax, thus removing a multiplication. Because this has been a mammoth undertaking and is quite a major addition to the compiler, I'm going to hold off making a merge request until I write a document on it, like I've done in the past with some of the other larger optimisations I've made. The branch is available at the link at the top of this e-mail though if anyone wants to take a look at it. One final note is that this optimisation is rather slow and specialist, so much so that I've added a new optimizer switch named 'cs_opt_asmcse' ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled by default with -O3, and requires the peephole optimizer to also be enabled in order to function. Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
That's useful to know, thanks. Ideally I'd like to put it behind a kind of CSE compiler flag because this will add a degree of overhead to it and make compilation slower (currently I'm only doing it for -O3). Currently I'm only tracking "mov (ref),%reg" and "lea (ref),%reg" instructions and will build from there. You know I'm never one to turn down a challenge - honestly I'm surprised I got this much success already! I'll take a look at your commits. At present, to ensure mistakes are not made, the sliding window is emptied/reset whenever the following is hit: - A live label. - An instruction that modifies the stack pointer (this one may prove unnecessary due to register tracking). - An instruction that writes to memory, since even if it writes to the stack or global memory, there's no way to fully predict whether another register is pointing to this location and is hence volatile. Gareth aka. Kit On 17/02/2022 21:38, Jonas Maebe via fpc-devel wrote: On 17/02/2022 20:25, J. Gareth Moreton via fpc-devel wrote: P.S. The term "sliding window" comes from the LZ77 compression algorithm and is used to track repeated sequences in ZIP files, among others. This prototype optimisation essentially uses the same construct, but built for instructions instead of individual bytes. The usual term for this kind of optimisations is common subexpression elimination (CSE). We used to have one at the assembler level for i386 (my first main contribution to the compiler), but removed it in the end because it was extremely prone to breaking — even more so than the peephole optimiser. This happened in a04cae2c4ba32bc5c7a8461b5851dee8e70c272b (it was supposed to have happened in 1b439307490861afd739ba5a48dc175f09727a89, but I made a mistake). It doesn't mean it can't be implemented in a robust way, but I can guarantee from experience that it is extremely hard. Jonas ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Prototype optimisation... Sliding Window
On 17/02/2022 20:25, J. Gareth Moreton via fpc-devel wrote: P.S. The term "sliding window" comes from the LZ77 compression algorithm and is used to track repeated sequences in ZIP files, among others. This prototype optimisation essentially uses the same construct, but built for instructions instead of individual bytes. The usual term for this kind of optimisations is common subexpression elimination (CSE). We used to have one at the assembler level for i386 (my first main contribution to the compiler), but removed it in the end because it was extremely prone to breaking — even more so than the peephole optimiser. This happened in a04cae2c4ba32bc5c7a8461b5851dee8e70c272b (it was supposed to have happened in 1b439307490861afd739ba5a48dc175f09727a89, but I made a mistake). It doesn't mean it can't be implemented in a robust way, but I can guarantee from experience that it is extremely hard. Jonas ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[fpc-devel] Prototype optimisation... Sliding Window
Hi everyone, So I've started experimenting with a new technique in the peephole optimizer for x86 platforms that I've named the Sliding Window. The intention is to use it to help replace common blocks of code within a procedure, such as pointer dereferences. So far I'm having a degree of success, although it's pre-alpha currently. In the aasmcnst disassembly - before: .section .text.n_aasmcnst$_$ttai_typedconstbuilder_$__$$_emit_pooled_shortstring_const_ref$shortstring,"ax" ... movq %rax,%rdx movq U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax movq 224(%rax),%rcx movq U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax movq 224(%rax),%rax movq (%rax),%rax call *232(%rax) ... After: .section .text.n_aasmcnst$_$ttai_typedconstbuilder_$__$$_emit_pooled_shortstring_const_ref$shortstring,"ax" ... movq %rax,%rdx movq U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax movq 224(%rax),%rcx movq (%rcx),%rax call *232(%rax) ... The peephole optimizer replaces the second " movq U_$AASMDATA_$$_CURRENT_ASMDATA(%rip),%rax" and the "movq 224(%rax),%rax" instruction with "mov %rcx,%rax", which is then removed by a conventional peephole optimisation because %rax is completely overwritten on the next instruction ("mov (%rcx),%rax"). I sense I might be onto a winner here! I do have a question though... are there any tai objects that signal a memory fence or a synchronisation hint so it doesn't optimise a repeated reference? I know instructions like MFENCE can kind of force it. Gareth aka. Kit P.S. The term "sliding window" comes from the LZ77 compression algorithm and is used to track repeated sequences in ZIP files, among others. This prototype optimisation essentially uses the same construct, but built for instructions instead of individual bytes. -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel