That's where the first stage of my deep optimizer might be able to help, since it explicitly starts at a MOV instruction (say, mov %reg_source, %reg_dest) and scans forward to see if %reg_dest can be replaced with %reg_source. It stops when it hits a jump, a call or a non-skippable label, when either register changes value, or when %reg_dest cannot be replaced. If all references to %reg_dest were replaced before its value is completely overwritten, it can then delete the original MOV.
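The forward scan described above can be sketched on a toy instruction model. This is only an illustration in Python, not the actual Free Pascal code: the `Instr` record, the `BARRIERS` set and the `forward_mov` function are hypothetical names, every instruction is assumed to list its read and written registers explicitly, and the sketch is all-or-nothing (it leaves the code untouched unless the MOV can be deleted). The "%reg_dest cannot be replaced" case, such as an instruction that needs one specific register, is not modelled.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# Hypothetical toy model: each instruction names the registers it reads and writes.
@dataclass(frozen=True)
class Instr:
    op: str
    reads: Tuple[str, ...] = ()
    writes: Tuple[str, ...] = ()

# Assumption: these opcodes stand in for "jump, call, non-skippable label".
BARRIERS = {"jmp", "call", "label"}

def forward_mov(code: List[Instr], i: int) -> List[Instr]:
    """Try to replace the destination of the mov at code[i] with its source.

    Returns a new list with the mov deleted if every read of the destination
    was replaced before the destination was overwritten; otherwise returns
    the original list unchanged.
    """
    mov = code[i]
    if mov.op != "mov":
        return code
    src, dest = mov.reads[0], mov.writes[0]
    out = list(code)
    for j in range(i + 1, len(out)):
        ins = out[j]
        if ins.op in BARRIERS:
            return code  # control flow: dest might still be needed elsewhere
        # Reads happen before writes, so replacing reads here is still safe.
        out[j] = replace(ins, reads=tuple(src if r == dest else r for r in ins.reads))
        if dest in ins.writes:
            del out[i]   # dest is overwritten: the original mov is now dead
            return out
        if src in ins.writes:
            return code  # src changed value, so it no longer equals dest
    return code          # reached the end without proving dest is dead

code = [
    Instr("mov", ("%src1",), ("%dest1",)),
    Instr("add", ("%dest1", "%src2"), ("%dest2",)),
    Instr("mov", ("%src2",), ("%dest1",)),  # overwrites %dest1
]
optimized = forward_mov(code, 0)
```

In this example the `add` ends up reading `%src1` directly and the first `mov` disappears. Stopping at the first barrier keeps the scan short, which matches the point about where the search stops keeping the pass fast.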
While it will only be appropriate to run at -O2 and -O3, the restrictions surrounding the register replacement, and especially where it stops searching, help ensure that it runs relatively quickly.

By the way, now that the debug strip patch has been approved (thanks Florian!), I can safely submit my prototype for the Deep Optimizer: https://bugs.freepascal.org/view.php?id=33871

As specified in the notes, compile the compiler with the DEBUG_AOPTCPU directive and compile projects with the -a flag if you wish to see where the Deep Optimizer has made savings in the intermediate assemblies.

Note that this version works during post-peephole optimisations. I'm looking at ways to move it to the pre-allocation phase in a way that's extensible, especially with regard to tracking virtual registers. It's a little trickier because it's very easy to leave a dangling pointer.

Gareth aka. Kit

On Sun 17/06/18 09:56 , Florian Klämpfl [email protected] sent:

On 16.06.2018 at 23:21, J. Gareth Moreton wrote:
> Note that I speak mostly from an x86_64 perspective, since this is where I have almost universal exposure.
>
> So I've been pondering a few things after researching Florian's prototype patch for optimisations done prior to register
> allocation, when the pre-compiled assembly language utilises imaginary (virtual) registers pretty much everywhere other
> than where distinct registers are required (e.g. function parameters). My question is... how much can be moved to the
> pre-allocation stage?

A lot, basically everything which reduces register pressure. The only problem is that, at this stage, the code contains a lot of moves (compile with -sr to see what it looks like), so the optimizer must be able to handle this. It might even be possible to build a generic optimizer pass at this stage.
Example: A typical sequence FPC often generates is:

mov %src1,%dest1
add %dest1,%src2,%dest2

If src1 is not released after the mov but dest1 is released, src1 and dest1 still cannot be coalesced because they interfere, so an extra register is allocated. The move will be removed by the peephole optimizer, but the register was already allocated and increased register pressure. Such optimizations could be done generically (for all CPUs): if the destination of a mov is only read afterwards (this information is already generically available), the mov can be removed, and in this case dest1 can be replaced by src1.

_______________________________________________
fpc-devel maillist - [email protected]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
