Re: [fpc-devel] Optimization of redundant mov's

2017-03-24 Thread Florian Klämpfl
Am 20.03.2017 um 08:50 schrieb Jonas Maebe:
>> And again, I've seen this happen more than once on i386 code, where it even
>> creates "fake" register pressure (by using 2 or more registers to hold 
>> exactly
>> the same temporary)
> 
> That's again something that needs to be solved at the register allocator 
> level (with SSA).

SSA is still on my todo list :)

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Optimization of redundant mov's

2017-03-20 Thread Jonas Maebe

On 19/03/17 21:28, Martok wrote:



It's called register spilling: once there are no registers left to hold
values, the compiler has to pick registers whose value will be kept in
memory instead.

I thought it would be something like that...

Still, my main issue was with the repeated fetches. I'd (naively!) say that it
should be relatively easy for an assembly-level optimizer to detect that these
are repeated loads of the same thing, with nothing that could affect the outcome
inbetween. It's not even a CSE in the technical sense, not a sub-expression but
the entire thing...


It is trivial to create a peephole optimization for that particular 
pattern. At least if it's just two loads, because after you've optimized 
the second load into a register move, the third load no longer fits the 
pattern... Unless you create a special peephole optimizer pass that goes 
over the code backwards to apply this specific optimization, or you 
first match the pattern as many times as possible before changing it. 
But then it will still fail if there is at least one other instruction 
in between.


So then you have to slightly generalise it, and in the end you do end up 
with a full-blown assembler CSE optimizer, like the one we removed for 
3.0. I'm a staunch believer in not wasting time on stuff like that, it's 
just not worth it. Especially since a better register allocator, or SSA, 
can probably achieve the same thing in this case.



E.g. those memory loads
are probably optimised by the processor itself (not necessarily coming
even from the L1 cache, but possibly from the write-back buffer).

Not as well as one might believe, manually fixing (by forcing @CurrentHash into
a register with a local variable) just those 4 lines gives a ~2% increase in
MB/s for this hash. Which is quite a lot, given this is the part *without*
actual computations.


You cannot attribute those 2% exclusively to keeping the values in 
registers. E.g. removing them can change branch target alignments. Even 
adding random nops can get you 10% due to changed code layout.



And again, I've seen this happen more than once on i386 code, where it even
creates "fake" register pressure (by using 2 or more registers to hold exactly
the same temporary)


That's again something that needs to be solved at the register allocator 
level (with SSA). Freeing up registers anymore afterwards is useless, 
since only the register allocator can keep stuff in them permanently.



Jonas
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Optimization of redundant mov's

2017-03-20 Thread Martok
Hi,

> It's called register spilling: once there are no registers left to hold
> values, the compiler has to pick registers whose value will be kept in
> memory instead.
I thought it would be something like that...

Still, my main issue was with the repeated fetches. I'd (naively!) say that it
should be relatively easy for an assembly-level optimizer to detect that these
are repeated loads of the same thing, with nothing that could affect the outcome
inbetween. It's not even a CSE in the technical sense, not a sub-expression but
the entire thing...

> E.g. those memory loads
> are probably optimised by the processor itself (not necessarily coming
> even from the L1 cache, but possibly from the write-back buffer).
Not as well as one might believe, manually fixing (by forcing @CurrentHash into
a register with a local variable) just those 4 lines gives a ~2% increase in
MB/s for this hash. Which is quite a lot, given this is the part *without*
actual computations.

And again, I've seen this happen more than once on i386 code, where it even
creates "fake" register pressure (by using 2 or more registers to hold exactly
the same temporary) that makes the rest of the code worse than it could be.
As a ballpark: the same change as above results in a 10% speedup by freeing up 2
registers (all-int64 operations on i386, so 2 regs needed for everything, having
one more is very noticeable...)

It just strikes me as odd to have some rather good local code but then just
pointlessly add the second-most expensive operation in between ;-)


Regards,

Martok



___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Optimization of redundant mov's

2017-03-19 Thread Jonas Maebe

Martok wrote:

>  a:= CurrentHash[0]; b:= CurrentHash[1]; c:= CurrentHash[2]; d:= 
> CurrentHash[3];
> 000100074943 488b8424a002 mov0x2a0(%rsp),%rax
> 00010007494B 4c8b5038 mov0x38(%rax),%r10
> 00010007494F 488b8424a002 mov0x2a0(%rsp),%rax
> 000100074957 4c8b5840 mov0x40(%rax),%r11
> 00010007495B 488b9424a002 mov0x2a0(%rsp),%rdx
> 000100074963 488b4248 mov0x48(%rdx),%rax
> 000100074967 488b9424a002 mov0x2a0(%rsp),%rdx
> 00010007496F 488b6a50 mov0x50(%rdx),%rbp
> 
> Every single one of the "mov 0x2a0(%rsp), %rxx" instructions except the first 
> is
> redundant and causes another memory round-trip. At the same time, more 
> registers
> are used, which probably makes other optimizations more difficult, especially
> when something similar happens on i386.
> 
> Now, the fun part: I haven't been able to build a simple test that causes the
> same issue (the self-pointer already is in %rcx and not fetched from the stack
> each time), so I have a feeling this may be a side effect of some other part 
> of
> the code.

It's called register spilling: once there are no registers left to hold
values, the compiler has to pick registers whose value will be kept in
memory instead. Register allocation is an NP-complete problem, so the
result will never be 100% optimal (at least if you don't want to wait
forever while the compiler checks out all possible assignments). One
possible heuristic, which is used by FPC's register allocator, is to
spill the register that conflicts with the largest number of other
registers (to minimise the number of registers spilled to memory).

There are techniques to more optimally spill (e.g. live range
splitting), and there are also other kinds of optimisations that could
be run after register allocation to make the code more optimal. CSE at
the assembler level could be used in this case. That's a very complex
undertaking for relatively little gain though. E.g. those memory loads
are probably optimised by the processor itself (not necessarily coming
even from the L1 cache, but possibly from the write-back buffer).


Jonas
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel