I'm interested in trying to improve the code generated for shift operations (in particular involving a compile time constant shift) for the AVR target. The AVR processor doesn't have a barrel shifter, instead it can only shift a single bit position per clock cycle. Currently the compiler by default generates a bit shift loop where the loop is executed n times to push a value by n bits. The only optimization I noticed for the case of shifting a value by a compile time constant is in Tcg.a_op_const_reg_reg where an 8 bit shift of a 16 bit value is converted to copying the low byte of the left operand into the high byte of the result, and setting the low byte of the result to 0.
I would like to extend this type of optimization to cover more cases - the obvious extension is to convert all shifts by 8 bit multiples by corresponding byte moves. A more general approach (which I've got working for shl as concept) is to at least convert all 8 bit multiples as byte moves, then just do the last few bit shifts (if any) either as an unrolled loop (e.g. as implemented in tcgavr.a_op_const_reg_internal) or by generating the conventional shift loop (as implemented in tcgavr.a_op_reg_reg_internal). At the moment I've implemented the this logic in tcgavr.a_op_const_reg_reg. I first check if I can generate smaller code compared to a shift loop and if not, the code calls the inherited method a_op_const_reg_reg which basically follows the existing path (see also attached patch) The code generator is complex to follow for me, since the functionality is kind of normalized and distributed across generic and CPU specific parts, and code flow jumps around up and down the inheritance chain. I therefore have some questions around this proposed modification and implementation in the code generator for which I would like some guidance on: * Is tcgavr.a_op_const_reg_reg the correct place for this type of functionality? * Am I messing up something else by effectively moving most of the code generation for this case higher up the call chain? * Am I missing some other path which could also benefit from this optimization? * Should I try and generate different code depending on whether -Os is specified or not (e.g. perform more loop unrolling if -Os is not specified)? * Any comments on the patch, which is a work in progress? best wishes, Christo
diff --git compiler/avr/cgcpu.pas compiler/avr/cgcpu.pas index 4be4f4823e..355e204bf7 100644 --- compiler/avr/cgcpu.pas +++ compiler/avr/cgcpu.pas @@ -438,6 +438,9 @@ unit cgcpu; procedure tcgavr.a_op_const_reg_reg(list: TAsmList; op: TOpCg; size: tcgsize; a: tcgint; src, dst: tregister); + var + tmpSrc, tmpDst: TRegister; + b, i, j: byte; begin if (op in [OP_MUL,OP_IMUL]) and (size in [OS_16,OS_S16]) and (a in [2,4,8]) then begin @@ -451,6 +454,43 @@ unit cgcpu; a:=a shr 1; end; end + // Word size wins up to remainder = 3 + // Dword size wins up to remainder = 2 + else if (op = OP_SHL) and (a > 0) // a = 0 get eliminated later by tcg.optimize_op_const + and ((a mod 8) * tcgsize2size[size] <= 8) then // remaining shifts short enough for unrolled loop + begin + b := a div 8; // number of bytes to shift + // first fill LSB with 0 + tmpSrc := src; + tmpDst := dst; + for i := 1 to b do + begin + a_load_const_reg(list,OS_8,0,tmpDst); + tmpDst := GetNextReg(tmpDst); + end; + + for i := b+1 to tcgsize2size[size] do + begin + a_load_reg_reg(list,OS_8,OS_8,tmpSrc,tmpDst); + if i < tcgsize2size[size] then // don't retrieve next reg if on last iteration + begin + tmpSrc := GetNextReg(tmpSrc); + tmpDst := GetNextReg(tmpDst); + end; + end; + + b := a mod 8; + if b > 0 then + begin + for j := 1 to b do + begin + list.concat(taicpu.op_reg(A_LSL, dst)); + if not(size in [OS_8, OS_S8]) then + for i := 2 to tcgsize2size[size] do + list.concat(taicpu.op_reg(A_ROL,GetOffsetReg64(dst, NR_NO, i-1))); + end; + end; + end else inherited a_op_const_reg_reg(list,op,size,a,src,dst); end;
_______________________________________________ fpc-devel maillist - email@example.com https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel