Re: [fpc-devel] x86_64.inc CompareByte

2017-10-31 Thread C Western
On 31/10/17 11:47, Florian Klämpfl wrote: On 30.10.2017 at 19:46, C Western wrote: On 29/10/17 22:18, Florian Klämpfl wrote: I have committed your latest patch with a few changes: the loop entry is now aligned to 16 bytes, I used movb instead of movzbl and inc instead of add. For me

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-31 Thread Florian Klämpfl
On 30.10.2017 at 19:46, C Western wrote:
> On 29/10/17 22:18, Florian Klämpfl wrote:
>> I have committed your latest patch with a few changes: the loop entry is now aligned to 16 bytes, I
>> used movb instead of movzbl and inc instead of add. For me (Haswell CPU) this works better. I

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-30 Thread C Western
On 29/10/17 22:18, Florian Klämpfl wrote: I have committed your latest patch with a few changes: the loop entry is now aligned to 16 bytes, I used movb instead of movzbl and inc instead of add. For me (Haswell CPU) this works better. I also think these changes are better on average.

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-29 Thread Florian Klämpfl
On 23.10.2017 at 22:58, Markus Beth wrote:
> Here are the numbers for an Ivy Bridge CPU:
> The output for [1] using the current RTL CompareByte is:
>    9.001.275.281   cycles:u                                  ( +-  0,00% )
>   28.000.560.462   instructions:u   #   3,11  insn per cycle ( +-  0,00% )

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-23 Thread Markus Beth
Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
   9.001.275.281   cycles:u                                  ( +- 0,00% )
  28.000.560.462   instructions:u   #   3,11 insn per cycle  ( +- 0,00% )
     2,654735815   seconds time elapsed                      ( +- 0,00% )
The

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-23 Thread Martok
Using the code given below as "inner", I measure this:
Current Trunk:
O0 compare-byte-1 : 196065.112 +/- 896.754 cycles/inner [0.5 %CV 1.6 %R]
O1 compare-byte-1 : 196510.158 +/- 577.976 cycles/inner [0.3 %CV 1.1 %R]
O3 compare-byte-1 : 187540.922 +/- 706.167 cycles/inner [0.4 %CV 1.5 %R]
Patch

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-22 Thread Markus Beth
I used 2 different benchmarks. One for (very) short buffers [1] and one for rather large buffers [2].
[1]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  key := 'A';
  key2 := 'A';
  for i := 0 to 10 do
  begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
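When benchmarking a rewritten CompareByte like this, it can help to first confirm that the result still has the right sign. The following is only a minimal sketch, assuming the documented contract that CompareByte returns a negative value, zero, or a positive value depending on the first differing byte. NaiveCompareByte, SignOf, BufA and BufB are hypothetical names introduced for illustration and are not taken from the thread or the patch.

```pascal
program CompareByteReference;
{$mode objfpc}

const
  { Illustrative inputs, not taken from the thread. }
  BufA: array[0..3] of Byte = (1, 2, 3, 4);
  BufB: array[0..3] of Byte = (1, 2, 9, 4);

{ Hypothetical helper: naive byte-by-byte comparison used as a reference. }
function NaiveCompareByte(const a, b; len: SizeInt): SizeInt;
var
  pa, pb: PByte;
  i: SizeInt;
begin
  pa := @a;
  pb := @b;
  for i := 0 to len - 1 do
    if pa[i] <> pb[i] then
      Exit(SizeInt(pa[i]) - SizeInt(pb[i]));
  Result := 0;
end;

{ Reduce a result to -1, 0 or 1 so that only the sign is compared. }
function SignOf(v: SizeInt): Integer;
begin
  if v < 0 then
    Result := -1
  else if v > 0 then
    Result := 1
  else
    Result := 0;
end;

begin
  { Compare the RTL routine against the naive reference on one input pair. }
  if SignOf(CompareByte(BufA, BufB, 4)) = SignOf(NaiveCompareByte(BufA, BufB, 4)) then
    WriteLn('signs agree')
  else
    WriteLn('signs differ');
end.
```

Only the sign is compared here, on the assumption that the exact magnitude of the result is not something callers should rely on.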

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-22 Thread Florian Klämpfl
On 21.10.2017 at 01:24, Markus Beth wrote:
> Find attached the already announced version of CompareByte.
What benchmark did you use? In my tests it is slightly slower than the one in fpc 3.0.x? I used the following test program: var buf1,buf2 : array[0..127] of byte; pos,len,i,j :

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-21 Thread Marco van de Voort
In our previous episode, Markus Beth said:
> Find attached the already announced version of CompareByte.
>
> BTW: If you really like to see a PCMPSTR based implementation, have a look at Agner Fog's Subroutine library asmlib.zip (http://agner.org/optimize/).
And then you see GPL licensed,

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-20 Thread Markus Beth
Find attached the already announced version of CompareByte. BTW: If you really like to see a PCMPSTR based implementation, have a look at Agner Fog's Subroutine library asmlib.zip (http://agner.org/optimize/). On 16.10.2017 23:08, Markus Beth wrote: On 16.10.2017 22:41, Florian Klämpfl wrote:

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-17 Thread Florian Klämpfl
On 16.10.2017 at 23:08, Markus Beth wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>> P.S.: I am currently working on another version of CompareByte that might have a slightly higher
>>> latency for very small len but a higher throughput (2 cycles per iteration vs. 3 cycles on an

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-17 Thread Sven Barth via fpc-devel
On 16.10.2017 at 23:04, "Markus Beth" wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>> BTW: I would really like to see a PCMPSTR based implementation :)
>
> PCMPSTR is (at the moment) out of my scope.
I thought PCMPSTR is part of SSE4.2. How would you deal with Intel

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Markus Beth
On 16.10.2017 22:41, Florian Klämpfl wrote: P.S.: I am currently working on another version of CompareByte that might have a slightly higher latency for very small len but a higher throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere microarchitecture)). But this

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Florian Klämpfl
On 16.10.2017 at 22:33, Markus Beth wrote:
> Sorry for the late reply. I had a weekend off(line).
>
> The instructions were chosen on purpose and Sergey already cited the part of the Intel documentation
> that explains why this is correct. You can find a similar part in AMD "AMD64

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Markus Beth
Sorry for the late reply. I had a weekend off(line). The instructions were chosen on purpose and Sergey already cited the part of the Intel documentation that explains why this is correct. You can find a similar part in AMD "AMD64 Architecture Programmer’s Manual Volume 1: Application

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Сергей Сергеенко
On 15 Oct 2017 Florian Klämpfl wrote:
> I had a look and tested it and it worked, I didn't notice the problem below either.
Sorry for the false alarm. I cannot provide any example where my concern actually applies. The reason is described on page 3-13 of Vol. 1 of the Intel 64 and IA-32 Architectures

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-15 Thread Florian Klämpfl
On 12.10.2017 at 20:37, sserg...@gmail.com wrote:
> Hi.
>
> Sorry for the late message, but nobody has yet mentioned a possible problem with
> the suggested patch.
Well, it's always very hard to review such highly optimized code. I had a look and tested it and it worked, I didn't notice the

Re: [fpc-devel] x86_64.inc CompareByte

2017-10-12 Thread sserg . me
Hi. Sorry for the late message, but nobody has yet mentioned a possible problem with the suggested patch, so I want to point out that the proposed code may, IMHO, be incorrect under some circumstances. The instruction on line 657, subq %rcx, %rax, decreases the value in %rax by %rcx, but previous code

Re: [fpc-devel] x86_64.inc CompareByte

2017-09-30 Thread J. Gareth Moreton
Oops, I just realised now that that byte sequence is your workaround, not actually already present in the code! My bad, but yes, that is the correct byte sequence (although it's worth putting a comment in to actually state what they are). If the bug is known to still be present in GAS, then

Re: [fpc-devel] x86_64.inc CompareByte

2017-09-30 Thread J. Gareth Moreton
Hi Markus, Nice to see there's more than one person working to improve compiled code on x86-64! I can answer one question... the byte sequence 0F B6 01 is the direct machine code representation of movzbl (%rcx),%eax - this might be due to a bug with the assembler or movzbl not being
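The byte sequence mentioned above can be checked by hand: 0F B6 is the two-byte opcode for a zero-extending byte load (MOVZX r32, r/m8), and the third byte is a ModRM byte that selects the operands. The following is only a minimal sketch of that decoding. ModField, RegField, RmField and the Code constant are hypothetical names introduced for illustration, not code from the patch.

```pascal
program DecodeMovzx;
{$mode objfpc}

const
  { Byte sequence quoted in the thread. }
  Code: array[0..2] of Byte = ($0F, $B6, $01);

{ Hypothetical helpers that split a ModRM byte into its three fields. }
function ModField(modrm: Byte): Byte;
begin
  Result := modrm shr 6;           { bits 7..6: addressing mode }
end;

function RegField(modrm: Byte): Byte;
begin
  Result := (modrm shr 3) and 7;   { bits 5..3: register operand }
end;

function RmField(modrm: Byte): Byte;
begin
  Result := modrm and 7;           { bits 2..0: register or memory operand }
end;

begin
  { 0F B6 /r is MOVZX r32, r/m8; the byte after the opcode is ModRM. }
  if (Code[0] = $0F) and (Code[1] = $B6) then
  begin
    WriteLn('mod = ', ModField(Code[2]));
    WriteLn('reg = ', RegField(Code[2]));
    WriteLn('rm  = ', RmField(Code[2]));
  end;
end.
```

For the ModRM byte 01, mod 0 means a memory operand without displacement, reg 0 selects %eax as the destination, and rm 1 selects %rcx as the base register, which matches movzbl (%rcx),%eax.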

[fpc-devel] x86_64.inc CompareByte

2017-09-30 Thread Markus Beth
I made some changes to CompareByte in rtl/x86_64/x86_64.inc to reduce the code size and make it run faster (see attached patch). I was successful with the code size reduction (47 bytes vs. 62 bytes) and also with the speed (according to a micro benchmark [1] run on an Ivy Bridge desktop). To