I'm interested in trying to improve the code generated for shift operations
(in particular involving a compile time constant shift) for the AVR
target.  The AVR processor doesn't have a barrel shifter, instead it can
only shift a single bit position per clock cycle. Currently the compiler by
default generates a bit shift loop where the loop is executed n times to
push a value by n bits.  The only optimization I noticed for the case of
shifting a value by a compile time constant is in Tcg.a_op_const_reg_reg
where an 8 bit shift of a 16 bit value is converted to copying the low byte
of the left operand into the high byte of the result, and setting the low
byte of the result to 0.

I would like to extend this type of optimization to cover more cases - the
obvious extension is to convert all shifts by 8 bit multiples by
corresponding byte moves. A more general approach (which I've got working
for shl as concept) is to at least convert all 8 bit multiples as byte
moves, then just do the last few bit shifts (if any) either as an unrolled
loop (e.g. as implemented in tcgavr.a_op_const_reg_internal) or by
generating the conventional shift loop (as implemented in
tcgavr.a_op_reg_reg_internal). At the moment I've implemented the this
logic in tcgavr.a_op_const_reg_reg.  I first check if I can generate
smaller code compared to a shift loop and if not, the code calls the
inherited method  a_op_const_reg_reg which basically follows the existing
path (see also attached patch)

The code generator is complex to follow for me, since the functionality is
kind of normalized and distributed across generic and CPU specific parts,
and code flow jumps around up and down the inheritance chain.  I therefore
have some questions around this proposed modification and implementation in
the code generator for which I would like some guidance on:
* Is tcgavr.a_op_const_reg_reg the correct place for this type of
functionality?
* Am I messing up something else by effectively moving most of the code
generation for this case higher up the call chain?
* Am I missing some other path which could also benefit from this
optimization?
* Should I try and generate different code depending on whether -Os is
specified or not (e.g. perform more loop unrolling if -Os is not specified)?
* Any comments on the patch, which is a work in progress?

best wishes,
Christo
diff --git compiler/avr/cgcpu.pas compiler/avr/cgcpu.pas
index 4be4f4823e..355e204bf7 100644
--- compiler/avr/cgcpu.pas
+++ compiler/avr/cgcpu.pas
@@ -438,6 +438,9 @@ unit cgcpu;
 
 
      procedure tcgavr.a_op_const_reg_reg(list: TAsmList; op: TOpCg; size: tcgsize; a: tcgint; src, dst: tregister);
+       var
+         tmpSrc, tmpDst: TRegister;
+         b, i, j: byte;
        begin
          if (op in [OP_MUL,OP_IMUL]) and (size in [OS_16,OS_S16]) and (a in [2,4,8]) then
            begin
@@ -451,6 +454,43 @@ unit cgcpu;
                  a:=a shr 1;
                end;
            end
+         // Word size wins up to remainder = 3
+         // Dword size wins up to remainder = 2
+         else if (op = OP_SHL) and (a > 0) // a = 0 get eliminated later by tcg.optimize_op_const
+              and ((a mod 8) * tcgsize2size[size] <= 8) then // remaining shifts short enough for unrolled loop
+           begin
+             b := a div 8;  // number of bytes to shift
+             // first fill LSB with 0
+             tmpSrc := src;
+             tmpDst := dst;
+             for i := 1 to b do
+             begin
+               a_load_const_reg(list,OS_8,0,tmpDst);
+               tmpDst := GetNextReg(tmpDst);
+             end;
+
+             for i := b+1 to tcgsize2size[size] do
+             begin
+               a_load_reg_reg(list,OS_8,OS_8,tmpSrc,tmpDst);
+               if i < tcgsize2size[size] then // don't retrieve next reg if on last iteration
+               begin
+                 tmpSrc := GetNextReg(tmpSrc);
+                 tmpDst := GetNextReg(tmpDst);
+               end;
+             end;
+
+             b := a mod 8;
+             if b > 0 then
+             begin
+               for j := 1 to b do
+               begin
+                 list.concat(taicpu.op_reg(A_LSL, dst));
+                 if not(size in [OS_8, OS_S8]) then
+                   for i := 2 to tcgsize2size[size] do
+                     list.concat(taicpu.op_reg(A_ROL,GetOffsetReg64(dst, NR_NO, i-1)));
+               end;
+             end;
+           end
          else
            inherited a_op_const_reg_reg(list,op,size,a,src,dst);
        end;
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to