Am 19.08.2019 um 22:20 schrieb Christo Crause: > I'm interested in trying to improve the code generated for shift > operations (in particular involving a compile time constant shift) for > the AVR target. The AVR processor doesn't have a barrel shifter, > instead it can only shift a single bit position per clock cycle. > Currently the compiler by default generates a bit shift loop where the > loop is executed n times to push a value by n bits. The only > optimization I noticed for the case of shifting a value by a compile > time constant is in Tcg.a_op_const_reg_reg where an 8 bit shift of a 16 > bit value is converted to copying the low byte of the left operand into > the high byte of the result, and setting the low byte of the result to 0. > > I would like to extend this type of optimization to cover more cases - > the obvious extension is to convert all shifts by 8 bit multiples by > corresponding byte moves. A more general approach (which I've got > working for shl as concept) is to at least convert all 8 bit multiples > as byte moves, then just do the last few bit shifts (if any) either as > an unrolled loop (e.g. as implemented in tcgavr.a_op_const_reg_internal) > or by generating the conventional shift loop (as implemented in > tcgavr.a_op_reg_reg_internal). At the moment I've implemented the this > logic in tcgavr.a_op_const_reg_reg. I first check if I can generate > smaller code compared to a shift loop and if not, the code calls the > inherited method a_op_const_reg_reg which basically follows the > existing path (see also attached patch) > > The code generator is complex to follow for me, since the functionality > is kind of normalized and distributed across generic and CPU specific > parts, and code flow jumps around up and down the inheritance chain.
Yes, nice OOP ;) > I > therefore have some questions around this proposed modification and > implementation in the code generator for which I would like some > guidance on: > * Is tcgavr.a_op_const_reg_reg the correct place for this type of > functionality? > * Am I messing up something else by effectively moving most of the code > generation for this case higher up the call chain? > * Am I missing some other path which could also benefit from this > optimization? > * Should I try and generate different code depending on whether -Os is > specified or not (e.g. perform more loop unrolling if -Os is not specified)? > * Any comments on the patch, which is a work in progress? For me the idea looks good, actually, I planned once to do the same, but got stuck for time reasons with the beginnings you discovered. If such a patch really works, can normaly only checked by testing. Besides simple benchmarking I often do another thing: I build without the patch with -al, copy all .s files to a tmp dir, then I do a build with the patch with -al, copy all resulting .s files to another temp. dir and then I look at the differences with WinMerge. It becomes pretty fast obvious if such a patch works good. _______________________________________________ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel