Re: [fpc-devel] generate assembler with no clear purpose MOV

2020-02-04 Thread Marģers . via fpc-devel
P.S. I tested the execution speed and there is no measurable difference.


> asm code
> # [109] bit:= longint(1) shl k;
>     movslq    %ecx,%rdx
>     # Register r8d allocated
>     movl    $1,%r8d
>     # Register edx,edx allocated
>     shlx    %edx,%r8d,%edx
>     # Register r8d released
>     # Register edx allocated
>     movl    %edx,%esi
> # Peephole Optimization: %esi = %edx; changed to minimise pipeline stall 
> (MovXXX2MovXXX)
> # Peephole Optimization: Mov2Nop 4 done


> what purpose serve: movslq    %ecx,%rdx   ?

> movl    %edx,%esi seems unnecessary,
> when just enough would be
> movl    $1,%esi
> shlx    %ecx,%esi,%esi

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


[fpc-devel] generate assembler with no clear purpose MOV

2020-02-04 Thread Marģers . via fpc-devel
Hi,

Example code:
function roo(lk:longint):byte;
var k : longint;
    bit : longint;
    num : byte;
begin
 num:=0;
 for k:=0 to 25 do
 begin
  bit:= longint(1) shl k;
  if (lk and bit) <> 0 then
  begin
   lk:=lk xor bit;
   inc(num);
  end;
 end;
 roo:=num;
end;
begin
end.

Generated asm code:
# [109] bit:= longint(1) shl k;
    movslq    %ecx,%rdx
    # Register r8d allocated
    movl    $1,%r8d
    # Register edx,edx allocated
    shlx    %edx,%r8d,%edx
    # Register r8d released
    # Register edx allocated
    movl    %edx,%esi
# Peephole Optimization: %esi = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: Mov2Nop 4 done


What purpose does movslq %ecx,%rdx serve?

movl %edx,%esi also seems unnecessary, when this would be enough:
    movl    $1,%esi
    shlx    %ecx,%esi,%esi


Re: [fpc-devel] generate assembler with no clear purpose MOV

2020-02-04 Thread Marģers . via fpc-devel
 

- Reply to message -
Subject: Re: [fpc-devel] generate assembler with no clear purpose MOV
Date: Tue, 4 Feb 2020, 22:24
From:  J. Gareth Moreton 
To:  
> To hazard a guess, it's sign-extending to the CPU word size as an
> intermediate step.  It's not something the peephole optimizer can easily
> eliminate.  Do the register allocations give any clues before that
> instruction?


# Var k located in register ecx
# Var bit located in register esi

It does seem to be a sign extension, but if I change the variables "k" and
"bit" to dword, a plain movl %ecx,%edx is emitted instead.
SHLX (as well as SHRX) seems to be handled as if its operands were always
memory variables: the value is first read into a temporary register and the
result is written back afterwards.
Also, SHL and SHR are logical operators, so no sign extension is needed.
While those MOV instructions do not hurt much, resolving this issue has a
benefit: one or two registers are freed for other purposes.
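
For reference, a minimal sketch of the dword variant mentioned above, so the two assembler listings can be compared by compiling both versions with the same options. The names roodemo and RooDword are hypothetical, and changing the parameter lk to dword as well is an assumption made here only to keep the expressions free of mixed signed/unsigned arithmetic. The sketch makes no claim about which instructions a given compiler version emits.

```pascal
program roodemo;
{$mode objfpc}

{ Same loop as roo, but with k and bit declared as dword.
  lk is also dword here (an assumption, see the note above). }
function RooDword(lk: dword): byte;
var
  k, bit: dword;
  num: byte;
begin
  num := 0;
  for k := 0 to 25 do
  begin
    bit := dword(1) shl k;      { unsigned shift count and operand }
    if (lk and bit) <> 0 then
    begin
      lk := lk xor bit;         { clear the bit that was just counted }
      inc(num);
    end;
  end;
  RooDword := num;
end;

begin
  { Bits 0..25 are all set in $03FFFFFF, so this prints 26. }
  writeln(RooDword($03FFFFFF));
end.
```

Compiling this and the longint version with -al (keep the assembler file) and the same -Cp/-Op settings should make any difference in the generated shift sequence visible.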




Re: [fpc-devel] generate assembler with no clear purpose MOV

2020-02-04 Thread J. Gareth Moreton
Are you able to dump the nodes as well with -an? (You'll need to define 
-dEXTDEBUG though) That might give some clues behind the presence of 
that movslq instruction.


Gareth aka. Kit




Re: [fpc-devel] generate assembler with no clear purpose MOV

2020-02-04 Thread Marģers . via fpc-devel
 

From:  J. Gareth Moreton 
To:  
> Are you able to dump the nodes as well with -an? (You'll need to define
> -dEXTDEBUG though) That might give some clues behind the presence of
> that movslq instruction.

Building the compiler with -dEXTDEBUG does not work for me:

make singlezipinstall OS_TARGET=linux CPU_TARGET=x86_64 OPT="-dEXTDEBUG -CpCOREAVX2 -OpCOREAVX2 -Fu/home/user/fpc304/lib/fpc/3.0.4/units/x86_64-linux/rtl/"

constexp.pas(125,13) Warning: Location (LOC_CSSETREG) not equal to expectloc (LOC_REG): typeconvn
constexp.pas(594) Fatal: There were 1 errors compiling module, stopping



[fpc-devel] Peephole Pass 1 Optimisation Suggestion

2020-02-04 Thread J. Gareth Moreton
Hi everyone,

I have an idea for improving compilation speed.  It mostly applies to the
x86 family, but I see no reason why it cannot be platform-agnostic.  The
idea is basically this:

- The optimisation level selected (-O1, -O2, -O3/-O4) dictates the MAXIMUM
  number of times Pass 1 is executed for a block of code.  The maximum count
  will be 1 for -O1, 2 for -O2 and 5 for -O3 and -O4.
- Pass 1 optimisation stops when the maximum pass count is reached or when
  an iteration makes no changes (no optimisation function returned True for
  that iteration).
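
The two rules above can be sketched as a small stand-alone program. This is only an illustration, not the compiler's actual code: TPassFunc, FakePass1, AlwaysChanges, MaxPass1Runs and RunPass1 are hypothetical names, and each "pass" is a stand-in function that merely reports whether it changed anything.

```pascal
program pass1demo;
{$mode objfpc}

type
  { Hypothetical stand-in for one full run of Pass 1 over a code block;
    returns True if any optimisation changed something. }
  TPassFunc = function: Boolean;

var
  ChangesLeft: Integer = 3;   { pretend three runs still find work to do }

function FakePass1: Boolean;
begin
  Result := ChangesLeft > 0;
  if Result then
    Dec(ChangesLeft);
end;

{ Models a faulty pass that reports a change on every run. }
function AlwaysChanges: Boolean;
begin
  Result := True;
end;

{ Proposed caps: 1 for -O1, 2 for -O2, 5 for -O3 and -O4. }
function MaxPass1Runs(OptLevel: Integer): Integer;
begin
  case OptLevel of
    1: Result := 1;
    2: Result := 2;
    3, 4: Result := 5;
  else
    Result := 0;              { no peephole pass at other levels (assumption) }
  end;
end;

{ Runs Pass1 until it reports no change or the cap is reached.
  Returns the number of runs performed. }
function RunPass1(Pass1: TPassFunc; OptLevel: Integer): Integer;
var
  MaxRuns: Integer;
begin
  MaxRuns := MaxPass1Runs(OptLevel);
  Result := 0;
  while Result < MaxRuns do
  begin
    Inc(Result);
    if not Pass1() then
      Break;
  end;
end;

begin
  ChangesLeft := 3;
  WriteLn(RunPass1(@FakePass1, 1));      { prints 1: capped by -O1 }
  ChangesLeft := 3;
  WriteLn(RunPass1(@FakePass1, 2));      { prints 2: capped by -O2 }
  ChangesLeft := 3;
  WriteLn(RunPass1(@FakePass1, 3));      { prints 4: fourth run finds nothing }
  WriteLn(RunPass1(@AlwaysChanges, 3));  { prints 5: the cap ends the loop }
end.
```

In this sketch the last call shows why a cap matters: a pass that always claims to have changed something would otherwise never let the loop finish.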

Currently, at least for x86, at least two runs of Pass 1 are performed, even
if the first iteration did not change anything.  Under -O3 and -O4, Pass 1
is run as many times as it takes until all individual optimisation methods
return False, but then a final iteration of Pass 1 is run anyway.  The main
reason for this is that some changes may forget to set the Result to True
(assembler comparisons under -O2 will detect some of these).

In terms of benefits:

- -O1, being the quick, debugger-friendly option, will compile faster
  because an entire iteration of Pass 1 is dropped, at the cost of slightly
  less efficient code.  Such code shouldn't be used for a release build,
  only for debugging high-level code, so this is acceptable in my eyes.
- -O2 will compile at approximately the same speed, except for the simplest
  of routines (which will be slightly faster).
- -O3 and -O4 will be faster because they drop at least one run-through of
  Pass 1.

There is a chance that the most complex routines will be less optimal, but
after 5 iterations the vast majority of code blocks should be optimal.  If
not, then I'd argue that some of the optimisation routines could be improved
to do more in a single pass.

Also, from a safety perspective, if a faulty optimisation causes an infinite
loop (e.g. two optimisations that 'fight' each other, of which at least one
partial example exists in x86), the maximum pass count ensures the compiler
can still progress even under the highest optimisation settings.
Originally, -O3 ran Pass 1 a maximum of 4 times (not including the 2nd call
to Pass 1 afterwards, which is why I selected 5 as the maximum count), but
this limit was removed at some point in the past, admittedly by myself,
under the mistaken belief that optimisations wouldn't produce buggy code or
otherwise get caught in an infinite loop.

For testing and comparison, since this only changes the number of runs of
Pass 1 and not what Pass 1 actually does, a side-by-side analysis of
assembler dumps using a directory comparison tool will confirm that the
output code is unchanged for -O2 and higher, and measuring compilation time
will show whether there is indeed a saving.

That's my plan... how does it sound?

Gareth aka. Kit