Re: [fpc-devel] Optimization of redundant mov's
Am 20.03.2017 um 08:50 schrieb Jonas Maebe: >> And again, I've seen this happen more than once on i386 code, where it even >> creates "fake" register pressure (by using 2 or more registers to hold >> exactly >> the same temporary) > > That's again something that needs to be solved at the register allocator > level (with SSA). SSA is still on my todo list :) ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Optimization of redundant mov's
On 19/03/17 21:28, Martok wrote: It's called register spilling: once there are no registers left to hold values, the compiler has to pick registers whose value will be kept in memory instead. I thought it would be something like that... Still, my main issue was with the repeated fetches. I'd (naively!) say that it should be relatively easy for an assembly-level optimizer to detect that these are repeated loads of the same thing, with nothing that could affect the outcome inbetween. It's not even a CSE in the technical sense, not a sub-expression but the entire thing... It is trivial to create a peephole optimization for that particular pattern. At least if it's just two loads, because after you've optimized the second load into a register move, the third load no longer fits the pattern... Unless you create a special peephole optimizer pass that goes over the code backwards to apply this specific optimization, or you first match the pattern as many times as possible before changing it. But then it will still fail if there is at least one other instruction in between. So then you have to slightly generalise it, and in the end you do end up with a full-blown assembler CSE optimizer, like the one we removed for 3.0. I'm a staunch believer in not wasting time on stuff like that, it's just not worth it. Especially since a better register allocator, or SSA, can probably achieve the same thing in this case. E.g. those memory loads are probably optimised by the processor itself (not necessarily coming even from the L1 cache, but possibly from the write-back buffer). Not as well as one might believe, manually fixing (by forcing @CurrentHash into a register with a local variable) just those 4 lines gives a ~2% increase in MB/s for this hash. Which is quite a lot, given this is the part *without* actual computations. You cannot attribute those 2% exclusively to keeping the values in registers. E.g. removing them can change branch target alignments. Even adding random nops can get you 10% due to changed code layout. And again, I've seen this happen more than once on i386 code, where it even creates "fake" register pressure (by using 2 or more registers to hold exactly the same temporary) That's again something that needs to be solved at the register allocator level (with SSA). Freeing up registers anymore afterwards is useless, since only the register allocator can keep stuff in them permanently. Jonas ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Optimization of redundant mov's
Hi, > It's called register spilling: once there are no registers left to hold > values, the compiler has to pick registers whose value will be kept in > memory instead. I thought it would be something like that... Still, my main issue was with the repeated fetches. I'd (naively!) say that it should be relatively easy for an assembly-level optimizer to detect that these are repeated loads of the same thing, with nothing that could affect the outcome inbetween. It's not even a CSE in the technical sense, not a sub-expression but the entire thing... > E.g. those memory loads > are probably optimised by the processor itself (not necessarily coming > even from the L1 cache, but possibly from the write-back buffer). Not as well as one might believe, manually fixing (by forcing @CurrentHash into a register with a local variable) just those 4 lines gives a ~2% increase in MB/s for this hash. Which is quite a lot, given this is the part *without* actual computations. And again, I've seen this happen more than once on i386 code, where it even creates "fake" register pressure (by using 2 or more registers to hold exactly the same temporary) that makes the rest of the code worse than it could be. As a ballpark: the same change as above results in a 10% speedup by freeing up 2 registers (all-int64 operations on i386, so 2 regs needed for everything, having one more is very noticeable...) It just strikes me as odd to have some rather good local code but then just pointlessly add the second-most expensive operation in between ;-) Regards, Martok ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Optimization of redundant mov's
Martok wrote: > a:= CurrentHash[0]; b:= CurrentHash[1]; c:= CurrentHash[2]; d:= > CurrentHash[3]; > 000100074943 488b8424a002 mov0x2a0(%rsp),%rax > 00010007494B 4c8b5038 mov0x38(%rax),%r10 > 00010007494F 488b8424a002 mov0x2a0(%rsp),%rax > 000100074957 4c8b5840 mov0x40(%rax),%r11 > 00010007495B 488b9424a002 mov0x2a0(%rsp),%rdx > 000100074963 488b4248 mov0x48(%rdx),%rax > 000100074967 488b9424a002 mov0x2a0(%rsp),%rdx > 00010007496F 488b6a50 mov0x50(%rdx),%rbp > > Every single one of the "mov 0x2a0(%rsp), %rxx" instructions except the first > is > redundant and causes another memory round-trip. At the same time, more > registers > are used, which probably makes other optimizations more difficult, especially > when something similar happens on i386. > > Now, the fun part: I haven't been able to build a simple test that causes the > same issue (the self-pointer already is in %rcx and not fetched from the stack > each time), so I have a feeling this may be a side effect of some other part > of > the code. It's called register spilling: once there are no registers left to hold values, the compiler has to pick registers whose value will be kept in memory instead. Register allocation is an NP-complete problem, so the result will never be 100% optimal (at least if you don't want to wait forever while the compiler checks out all possible assignments). One possible heuristic, which is used by FPC's register allocator, is to spill the register that conflicts with the largest number of other registers (to minimise the number of registers spilled to memory). There are techniques to more optimally spill (e.g. live range splitting), and there are also other kinds of optimisations that could be run after register allocation to make the code more optimal. CSE at the assembler level could be used in this case. That's a very complex undertaking for relatively little gain though. E.g. those memory loads are probably optimised by the processor itself (not necessarily coming even from the L1 cache, but possibly from the write-back buffer). Jonas ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel