Re: [fpc-devel] generate assembler with no clear purpose MOV
P.S. I tested the execution speed and there is no measurable difference.

> asm code
> # [109] bit:= longint(1) shl k;
>         movslq  %ecx,%rdx
> # Register r8d allocated
>         movl    $1,%r8d
> # Register edx,edx allocated
>         shlx    %edx,%r8d,%edx
> # Register r8d released
> # Register edx allocated
>         movl    %edx,%esi
> # Peephole Optimization: %esi = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
> # Peephole Optimization: Mov2Nop 4 done
>
> What purpose does movslq %ecx,%rdx serve?
> movl %edx,%esi also seems unnecessary, when this would be enough:
>         movl    $1,%esi
>         shlx    %ecx,%esi,%esi

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[fpc-devel] generate assembler with no clear purpose MOV
Hi,

Example code:

    function roo(lk:longint):byte;
    var k   : longint;
        bit : longint;
        num : byte;
    begin
      num:=0;
      for k:=0 to 25 do
      begin
        bit:= longint(1) shl k;
        if (lk and bit) <> 0 then
        begin
          lk:=lk xor bit;
          inc(num);
        end;
      end;
      roo:=num;
    end;

    begin
    end.

asm code:

    # [109] bit:= longint(1) shl k;
            movslq  %ecx,%rdx
    # Register r8d allocated
            movl    $1,%r8d
    # Register edx,edx allocated
            shlx    %edx,%r8d,%edx
    # Register r8d released
    # Register edx allocated
            movl    %edx,%esi
    # Peephole Optimization: %esi = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
    # Peephole Optimization: Mov2Nop 4 done

What purpose does movslq %ecx,%rdx serve?
movl %edx,%esi also seems unnecessary, when this would be enough:

            movl    $1,%esi
            shlx    %ecx,%esi,%esi
Re: [fpc-devel] generate assembler with no clear purpose MOV
- Reply to message -
Subject: Re: [fpc-devel] generate assembler with no clear purpose MOV
Date: Tue, 4 Feb 2020, 22:24
From: J. Gareth Moreton
To:

> To hazard a guess, it's sign-extending to the CPU word size as an
> intermediate step. It's not something the peephole optimizer can easily
> eliminate. Do the register allocations give any clues before that
> instruction?

    # Var k located in register ecx
    # Var bit located in register esi

It does seem to be a sign extension: if I change the variables "k" and "bit" to dword, there is a simple movl %ecx,%edx instead.

The SHLX instruction (as well as SHRX) seems to be treated as if its operands were always memory variables, so the value is first read into a temporary register and written back afterwards. Also, SHL and SHR are logical operators, so there is no need for sign extension.

While those MOV instructions do not hurt much, there is a benefit to resolving this issue: one or two more registers are left free for other purposes.

> On 04/02/2020 18:50, Marģers . via fpc-devel wrote:
> > P.S. I tested the execution speed and there is no measurable difference.
> >
> >> asm code
> >> # [109] bit:= longint(1) shl k;
> >>         movslq  %ecx,%rdx
> >> # Register r8d allocated
> >>         movl    $1,%r8d
> >> # Register edx,edx allocated
> >>         shlx    %edx,%r8d,%edx
> >> # Register r8d released
> >> # Register edx allocated
> >>         movl    %edx,%esi
> >> # Peephole Optimization: %esi = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
> >> # Peephole Optimization: Mov2Nop 4 done
> >>
> >> What purpose does movslq %ecx,%rdx serve?
> >> movl %edx,%esi also seems unnecessary, when this would be enough:
> >>         movl    $1,%esi
> >>         shlx    %ecx,%esi,%esi
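The point about sign extension can be checked with a small model. The sketch below is not compiler code; it is a Python model under two assumptions taken from the x86 instruction set: movslq sign-extends a 32-bit value to 64 bits, and a 32-bit shlx uses only the low 5 bits of its count operand. The function names are made up for illustration. Under those assumptions, reading the count back through %edx after the movslq gives the same result as using %ecx directly.

```python
MASK32 = 0xFFFFFFFF


def movslq(value32):
    """Model of movslq: sign-extend a 32-bit value to a 64-bit bit pattern."""
    value32 &= MASK32
    if value32 & 0x80000000:
        return value32 | 0xFFFFFFFF00000000
    return value32


def shlx32(count, src):
    """Model of 32-bit shlx: the count is masked to 5 bits, the result to 32 bits."""
    return ((src & MASK32) << (count & 31)) & MASK32


def shift_via_sign_extension(k):
    """Sequence from the dump: movslq %ecx,%rdx, then shlx with %edx as count."""
    rdx = movslq(k)
    edx = rdx & MASK32  # shlx reads only the low 32 bits of the register
    return shlx32(edx, 1)


def shift_direct(k):
    """Suggested sequence: shlx with %ecx as count, no extension first."""
    return shlx32(k & MASK32, 1)


if __name__ == "__main__":
    # The loop in roo() runs k from 0 to 25.
    for k in range(26):
        assert shift_via_sign_extension(k) == shift_direct(k) == 1 << k
```

The model says nothing about why the code generator emits the extension; it only shows that the result of the shift does not depend on it.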
Re: [fpc-devel] generate assembler with no clear purpose MOV
Are you able to dump the nodes as well with -an? (You'll need to define -dEXTDEBUG though.) That might give some clues behind the presence of that movslq instruction.

Gareth aka. Kit

On 04/02/2020 21:15, Marģers . via fpc-devel wrote:
> - Reply to message -
> Subject: Re: [fpc-devel] generate assembler with no clear purpose MOV
> Date: Tue, 4 Feb 2020, 22:24
> From: J. Gareth Moreton
> To:
>
>> To hazard a guess, it's sign-extending to the CPU word size as an
>> intermediate step. It's not something the peephole optimizer can easily
>> eliminate. Do the register allocations give any clues before that
>> instruction?
>
>     # Var k located in register ecx
>     # Var bit located in register esi
>
> It does seem to be a sign extension: if I change the variables "k" and "bit" to dword, there is a simple movl %ecx,%edx instead.
>
> The SHLX instruction (as well as SHRX) seems to be treated as if its operands were always memory variables, so the value is first read into a temporary register and written back afterwards. Also, SHL and SHR are logical operators, so there is no need for sign extension.
>
> While those MOV instructions do not hurt much, there is a benefit to resolving this issue: one or two more registers are left free for other purposes.
>
>> On 04/02/2020 18:50, Marģers . via fpc-devel wrote:
>>> P.S. I tested the execution speed and there is no measurable difference.
>>>
>>>> asm code
>>>> # [109] bit:= longint(1) shl k;
>>>>         movslq  %ecx,%rdx
>>>> # Register r8d allocated
>>>>         movl    $1,%r8d
>>>> # Register edx,edx allocated
>>>>         shlx    %edx,%r8d,%edx
>>>> # Register r8d released
>>>> # Register edx allocated
>>>>         movl    %edx,%esi
>>>> # Peephole Optimization: %esi = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
>>>> # Peephole Optimization: Mov2Nop 4 done
>>>>
>>>> What purpose does movslq %ecx,%rdx serve?
>>>> movl %edx,%esi also seems unnecessary, when this would be enough:
>>>>         movl    $1,%esi
>>>>         shlx    %ecx,%esi,%esi
Re: [fpc-devel] generate assembler with no clear purpose MOV
From: J. Gareth Moreton
To:

> Are you able to dump the nodes as well with -an? (You'll need to define
> -dEXTDEBUG though) That might give some clues behind the presence of
> that movslq instruction.

Building the compiler with -dEXTDEBUG does not work for me:

    make singlezipinstall OS_TARGET=linux CPU_TARGET=x86_64 OPT="-dEXTDEBUG -CpCOREAVX2 -OpCOREAVX2 -Fu/home/user/fpc304/lib/fpc/3.0.4/units/x86_64-linux/rtl/"

    constexp.pas(125,13) Warning: Location (LOC_CSSETREG) not equal to expectloc (LOC_REG): typeconvn
    constexp.pas(594) Fatal: There were 1 errors compiling module, stopping
[fpc-devel] Peephole Pass 1 Optimisation Suggestion
Hi everyone,

I have an idea with regard to improving compilation speed. It mostly applies to the x86 family, but I see no reason why it cannot be platform-agnostic. The idea is basically this:

- The optimisation level selected (-O1, -O2, -O3/-O4) dictates the MAXIMUM number of times Pass 1 is executed for a block of code. The maximum count will be 1 for -O1, 2 for -O2 and 5 for -O3 and -O4.
- Pass 1 optimisation is stopped if the maximum pass count is reached or if no changes were made (no functions returned True for that iteration).

Currently, at least for x86, at least two runs of Pass 1 are performed, even if the first iteration did not change anything. Under -O3 and -O4, Pass 1 is run as many times as it needs to until all individual optimisation methods return False, but then a final iteration of Pass 1 is run anyway. The main reason for this is that some changes may forget to set the Result to True (assembler comparisons under -O2 will detect some of these).

In terms of benefits:

- -O1, being the quick, debugger-friendly option, will compile faster because an entire iteration of Pass 1 is dropped, at the cost of slightly less efficient code (but such code shouldn't be used for a release build, only for the debugging of high-level code, so this is acceptable in my eyes).
- -O2 will be approximately equal in speed, except for the simplest of routines (which will be slightly faster).
- -O3 and -O4 will be faster because these will drop at least one run-through of Pass 1. There is a chance that the most complex of routines will be less optimal, but after 5 iterations the vast majority of code blocks should be optimal. If not, then I'd argue that some of the optimisation routines could be improved to do more in a single pass.

Also, from a safety perspective, if there is a faulty optimisation that causes an infinite loop (e.g. two optimisations that 'fight' each other, of which at least one partial example exists in x86), the maximum pass count ensures the compiler can still progress even under the highest optimisation settings. Originally, -O3 used to run Pass 1 a maximum of 4 times (not including the 2nd call to Pass 1 afterwards, hence why I selected 5 as the maximum count), but this was removed at some point in the past, admittedly by myself, under the mistaken belief that optimisations wouldn't produce buggy code or otherwise get caught in an infinite loop.

For testing and comparison, since this only involves the number of runs of Pass 1 and not what Pass 1 actually does, side-by-side analysis of assembler dumps using a directory comparison tool will confirm that output code is unchanged for -O2 and higher, and measuring compilation time will determine that there is indeed a saving.

That's my plan... how does it sound?

Gareth aka. Kit
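The stopping rule proposed above can be sketched as follows. This is a minimal Python model, not the compiler's actual code: the names MAX_PASS1_RUNS, run_pass1, drop_one_nop and make_rewrite are hypothetical, and a "block" is just a list of instruction strings. Only the pass counts (1, 2, 5, 5) and the two stop conditions come from the proposal.

```python
# Hypothetical cap on Pass 1 iterations per optimisation level, as proposed.
MAX_PASS1_RUNS = {1: 1, 2: 2, 3: 5, 4: 5}


def run_pass1(block, optimisations, opt_level):
    """Repeat Pass 1 until nothing changes or the cap for opt_level is hit.

    Each optimisation takes the block, may modify it in place, and returns
    True if it changed anything. Returns the number of iterations performed.
    """
    runs = 0
    while runs < MAX_PASS1_RUNS[opt_level]:
        runs += 1
        changed = False
        for optimise in optimisations:
            if optimise(block):
                changed = True
        if not changed:
            break  # no function returned True for this iteration
    return runs


def drop_one_nop(block):
    """Toy optimisation: remove a single 'nop' per call."""
    if "nop" in block:
        block.remove("nop")
        return True
    return False


def make_rewrite(old, new):
    """Toy optimisation factory: rewrite the first `old` instruction to `new`."""
    def rewrite(block):
        if old in block:
            block[block.index(old)] = new
            return True
        return False
    return rewrite


if __name__ == "__main__":
    code = ["nop", "movl $1,%esi", "nop", "nop"]
    print(run_pass1(code, [drop_one_nop], 3), code)

    # Two rewrites that undo each other would loop forever without the cap.
    fighting = [make_rewrite("a", "b"), make_rewrite("b", "a")]
    print(run_pass1(["a"], fighting, 4))
```

In the second call the two rewrites 'fight' each other, so every iteration reports a change and only the cap ends the loop, which is the safety argument made above.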