Re: [fpc-devel] x86_64.inc CompareByte
On 31/10/17 11:47, Florian Klämpfl wrote:
> On 30.10.2017 at 19:46, C Western wrote:
>> On 29/10/17 22:18, Florian Klämpfl wrote:
>>> I have committed your latest patch with a few changes: the loop entry is
>>> now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
>>> add. For me (Haswell CPU) this works better. I think these changes are
>>> also better on average.
>>
>> With this patch on x86_64 linux lazarus crashes at random places, but quite
>> frequently, and
>
> My mistake, I fixed it.
>
>> CompareByte seems to be implicated. Should the zero exit be:
>>
>>     xorl %rax, %rax
>
> No, this is fine. This also clears the upper 32 bits.

You can probably tell I haven't done much assembler programming recently. I am happy to confirm that lazarus now seems much more stable.

Thanks
Colin
___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] x86_64.inc CompareByte
On 30.10.2017 at 19:46, C Western wrote:
> On 29/10/17 22:18, Florian Klämpfl wrote:
>> I have committed your latest patch with a few changes: the loop entry is
>> now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
>> add. For me (Haswell CPU) this works better. I think these changes are
>> also better on average.
>
> With this patch on x86_64 linux lazarus crashes at random places, but quite
> frequently, and

My mistake, I fixed it.

> CompareByte seems to be implicated. Should the zero exit be:
>
>     xorl %rax, %rax

No, this is fine. This also clears the upper 32 bits.
Re: [fpc-devel] x86_64.inc CompareByte
On 29/10/17 22:18, Florian Klämpfl wrote:
> I have committed your latest patch with a few changes: the loop entry is
> now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
> add. For me (Haswell CPU) this works better. I think these changes are
> also better on average.

With this patch on x86_64 linux lazarus crashes at random places, but quite frequently, and CompareByte seems to be implicated. Should the zero exit be:

    xorl %rax, %rax

Colin
Re: [fpc-devel] x86_64.inc CompareByte
On 23.10.2017 at 22:58, Markus Beth wrote:
> Here are the numbers for an Ivy Bridge CPU:
> [...]
> With Florian's benchmark I also observe that the patched version is
> slightly slower than the original. But I have no idea why this is so.

I have committed your latest patch with a few changes: the loop entry is now aligned to 16 bytes, I used movb instead of movzbl and inc instead of add. For me (Haswell CPU) this works better. I think these changes are also better on average.
Re: [fpc-devel] x86_64.inc CompareByte
Here are the numbers for an Ivy Bridge CPU:

The output for [1] using the current RTL CompareByte is:

      9.001.275.281  cycles:u                                ( +- 0,00% )
     28.000.560.462  instructions:u  # 3,11 insn per cycle   ( +- 0,00% )
        2,654735815  seconds time elapsed                    ( +- 0,00% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:

      9.002.038.628  cycles:u                                ( +- 0,01% )
     26.000.559.441  instructions:u  # 2,89 insn per cycle   ( +- 0,00% )
        2,655002891  seconds time elapsed                    ( +- 0,01% )

The output for [2] using the current RTL CompareByte is:

    227.941.173.371  cycles:u                                ( +- 0,00% )
    734.077.388.160  instructions:u  # 3,22 insn per cycle   ( +- 0,00% )
       67,215188648  seconds time elapsed                    ( +- 0,00% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:

    210.694.292.040  cycles:u                                ( +- 0,00% )
    524.341.215.569  instructions:u  # 2,49 insn per cycle   ( +- 0,00% )
       62,129294243  seconds time elapsed                    ( +- 0,00% )

With Florian's benchmark I also observe that the patched version is slightly slower than the original. But I have no idea why this is so.

On 23.10.2017 00:25, Markus Beth wrote:
> I used 2 different benchmarks. One for (very) short buffers [1] and one
> for rather large buffers [2].
> [...]
Re: [fpc-devel] x86_64.inc CompareByte
Using the code given below as "inner", I measure this:

Current Trunk:
    O0 compare-byte-1 : 196065.112 +/- 896.754 cycles/inner [0.5 %CV 1.6 %R]
    O1 compare-byte-1 : 196510.158 +/- 577.976 cycles/inner [0.3 %CV 1.1 %R]
    O3 compare-byte-1 : 187540.922 +/- 706.167 cycles/inner [0.4 %CV 1.5 %R]

Patch from 2017-10-21:
    O0 compare-byte-2 : 175831.632 +/- 965.972 cycles/inner [0.5 %CV 2.1 %R]
    O1 compare-byte-2 : 176039.560 +/- 527.141 cycles/inner [0.3 %CV 1.0 %R]
    O3 compare-byte-2 : 158527.167 +/- 661.690 cycles/inner [0.4 %CV 1.5 %R]

(%CV: coefficient of variation * 100%. %R: span as % of mean)

CPU: Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz, Family 6 Model 60 Stepping 3 (Haswell), true single core clock (measured) 2.83 GHz

So the new version is a bit faster, but not by a large margin (10-15%). It is statistically significant though.

While I'm at it, i386 could use some love:

    O1 compare-byte-1 : 755247.183 +/- 8125.671 cycles/inner [1.1 %CV 4.5 %R]

That's 3.8 times slower than x64 for exactly the same code.

Code:

    len:=random(100);
    for j:=0 to len-1 do
      begin
        buf1[j]:=random(256);
        buf2[j]:=random(256);
      end;
    for j:=0 to random(10) do
      buf2[j]:=buf1[j];
    for j:=1 to 1 do
      CompareBytePatch(buf1,buf2,len); // or System.CompareByte

--
Regards, Martok

Ceterum censeo b32079 esse sanandam.
Re: [fpc-devel] x86_64.inc CompareByte
I used 2 different benchmarks. One for (very) short buffers [1] and one for rather large buffers [2].

[1]:

    var
      key, key2: string;
      res: LongWord;
      i: SizeInt;
    begin
      key := 'A';
      key2 := 'A';
      for i := 0 to 10 do
      begin
        res := CompareByte(key[1], key2[1], Length(key));
      end;
    end.

[2]:

    var
      key, key2: string;
      res: LongWord;
      i: SizeInt;
    begin
      SetLength(key, 10240 * 1024);
      SetLength(key2, 10240 * 1024);
      for i := 0 to 1 do
      begin
        res := CompareByte_RTL(key[1], key2[1], Length(key));
      end;
    end.

The measurement takes place on an Intel Core i5 CPU M520 @ 2.40GHz, which has a Westmere microarchitecture. The programs are run on an otherwise idle Linux (OpenSuse Tumbleweed) system via

    perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte

after the following setup:

    cpupower frequency-set -g performance
    echo 0 > /sys/devices/system/cpu/cpufreq/boost            (westmere)
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo    (ivy bridge)

The output for [1] using the current RTL CompareByte is:

     11.336.449.124  cycles                                  ( +- 0,05% )
     28.077.280.776  instructions    # 2,48 insn per cycle   ( +- 0,00% )
        4,736782553  seconds time elapsed                    ( +- 0,05% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:

     10.293.397.316  cycles                                  ( +- 0,01% )
     26.070.305.490  instructions    # 2,53 insn per cycle   ( +- 0,00% )
        4,301081734  seconds time elapsed                    ( +- 0,01% )

The output for [2] using the current RTL CompareByte is:

    325.526.707.243  cycles                                  ( +- 0,31% )
    736.237.912.850  instructions    # 2,26 insn per cycle   ( +- 0,00% )
      136,013215979  seconds time elapsed                    ( +- 0,31% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:

    224.621.009.410  cycles                                  ( +- 0,95% )
    525.832.575.056  instructions    # 2,34 insn per cycle   ( +- 0,00% )
       93,851685247  seconds time elapsed                    ( +- 0,95% )

I can hopefully come up with the corresponding numbers for an Ivy Bridge CPU tomorrow.

On 22.10.2017 20:55, Florian Klämpfl wrote:
> What benchmark did you use? In my tests it is slightly slower than that one
> of fpc 3.0.x?
> [...]
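To put the Westmere cycle counts above into relative terms, here is a small sketch that only redoes the arithmetic on the figures quoted in this message; the program and procedure names are made up for illustration. It should come out at roughly 9% fewer cycles for [1] and roughly 31% fewer for [2].

```pascal
program cycle_ratio;
{$mode objfpc}

{ Hypothetical helper: prints the relative reduction in cycles. }
procedure Report(const name: string; before, after: Int64);
begin
  WriteLn(name, ': ', (1.0 - after / before) * 100.0:0:1, '% fewer cycles');
end;

begin
  { cycle counts copied from the perf stat output above (Westmere) }
  Report('[1] short buffers', 11336449124, 10293397316);
  Report('[2] large buffers', 325526707243, 224621009410);
end.
```

The same helper can be fed the Ivy Bridge numbers from the follow-up message, where [1] is practically unchanged.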
Re: [fpc-devel] x86_64.inc CompareByte
On 21.10.2017 at 01:24, Markus Beth wrote:
> Find attached the already announced version of CompareByte.

What benchmark did you use? In my tests it is slightly slower than that one of fpc 3.0.x? I used the following test program:

    var
      buf1,buf2 : array[0..127] of byte;
      pos,len,i,j : longint;
    begin
      for i:=1 to 100 do
        begin
          len:=random(100);
          for j:=0 to len-1 do
            begin
              buf1[j]:=random(256);
              buf2[j]:=random(256);
            end;
          for j:=0 to random(10) do
            buf2[j]:=buf1[j];
          for j:=1 to 100 do
            CompareByte(buf1,buf2,len);
        end;
    end.

> On 16.10.2017 23:08, Markus Beth wrote:
>> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>> BTW: I would really like to see a PCMPSTR based implementation :)
>> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
>> SSE4.2. How would you deal with Intel core microarchitecture CPUs that
>> don't have it?
> [...]
Re: [fpc-devel] x86_64.inc CompareByte
In our previous episode, Markus Beth said:
> Find attached the already announced version of CompareByte.
>
> BTW: If you really like to see a PCMPSTR based implementation, have a
> look at Agner Fog's Subroutine library asmlib.zip
> (http://agner.org/optimize/).

And then you see it is GPL licensed, and move on :-) GPL is not suitable for the RTL.
Re: [fpc-devel] x86_64.inc CompareByte
Find attached the already announced version of CompareByte.

BTW: If you really like to see a PCMPSTR based implementation, have a look at Agner Fog's Subroutine library asmlib.zip (http://agner.org/optimize/).

On 16.10.2017 23:08, Markus Beth wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>> BTW: I would really like to see a PCMPSTR based implementation :)
> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
> SSE4.2. How would you deal with Intel core microarchitecture CPUs that
> don't have it?
> [...]

Index: trunk/rtl/x86_64/x86_64.inc
===
--- trunk/rtl/x86_64/x86_64.inc (revision 37497)
+++ trunk/rtl/x86_64/x86_64.inc (working copy)
@@ -640,27 +640,36 @@
         mov     %rsi, %rdx
         mov     %rdi, %rcx
 {$endif win64}
-        testq   %r8,%r8
-        je      .LCmpbyteZero
+        negq    %r8
+        jz      .LCmpbyteZero
+        subq    %r8, %rcx
+        subq    %r8, %rdx
+
         .balign 8
 .LCmpbyteLoop:
-        movb    (%rcx),%r9b
-        cmpb    (%rdx),%r9b
-        leaq    1(%rcx),%rcx
-        leaq    1(%rdx),%rdx
+{$ifdef oldbinutils}
+// for the reason why this alternate coding of movzbl is given here
+// see the comments in FillChar above
+        .byte 0x42,0x0F,0xB6,0x04,0x01
+{$else}
+        movzbl  (%rcx,%r8), %eax
+{$endif}
+        cmpb    (%rdx,%r8), %al
         jne     .LCmpbyteExitFast
-        decq    %r8
+        addq    $1, %r8
         jne     .LCmpbyteLoop
+.LCmpbyteZero:
+        xorl    %eax, %eax
+        retq
+
 .LCmpbyteExitFast:
-        movzbq  -1(%rdx),%r8    { Compare last position }
-        movzbq  %r9b,%rax
-        subq    %r8,%rax
-        ret
-
-.LCmpbyteZero:
-        movq    $0,%rax
-        ret
+{$ifdef oldbinutils}
+        .byte 0x42,0x0F,0xB6,0x0C,0x02
+{$else}
+        movzbl  (%rdx,%r8), %ecx        { Compare last position }
+{$endif}
+        subq    %rcx, %rax
 end;
 {$endif FPC_SYSTEM_HAS_COMPAREBYTE}
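The loop shape in the patch above (negate the length, move both pointers to the end of the buffers, then count a negative index up to zero) may be easier to follow in Pascal. The following is only a sketch of that idea, not the RTL code: CompareByteNegIdx is a made-up name, and unlike the assembler version it simply returns 0 for len <= 0.

```pascal
program comparebyte_negidx;
{$mode objfpc}

{ Hypothetical Pascal rendering of the negative-index loop; illustration only. }
function CompareByteNegIdx(const buf1, buf2; len: SizeInt): SizeInt;
var
  e1, e2: PByte;
  i: SizeInt;
begin
  Result := 0;
  if len <= 0 then
    Exit;
  e1 := PByte(@buf1) + len;   // one past the last byte (negq + subq in the patch)
  e2 := PByte(@buf2) + len;
  i := -len;                  // index runs from -len up to 0
  repeat
    if e1[i] <> e2[i] then
      Exit(SizeInt(e1[i]) - SizeInt(e2[i]));  // difference at first mismatch
    Inc(i);                   // one increment serves as index and loop counter
  until i = 0;
end;

var
  a: array[0..3] of Byte = (1, 2, 3, 4);
  b: array[0..3] of Byte = (1, 2, 9, 4);
begin
  WriteLn(CompareByteNegIdx(a, b, 4));  // 3 - 9 = -6
  WriteLn(CompareByteNegIdx(a, b, 2));  // 0: the first two bytes match
end.
```

The point of the arrangement is that a single register update per iteration replaces the two leaq instructions and the decq of the old loop, which matches the lower instruction counts reported for the patched version.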
Re: [fpc-devel] x86_64.inc CompareByte
On 16.10.2017 at 23:08, Markus Beth wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>> Small lengths in terms of matching string or overall lengths?
>
> It is small length in terms of matching string as there is some setup work
> before the loop.
>
>> BTW: I would really like to see a PCMPSTR based implementation :)
> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
> SSE4.2. How would you deal with Intel core microarchitecture CPUs that
> don't have it?

Just set a flag at startup if it is supported and then branch on the flag. As the flag never changes, branch prediction most likely will work very well.
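A minimal sketch of the "flag at startup, branch on the flag" idea, for x86-64 only. All names here (CpuHasSSE42, HasSSE42Flag, CompareByteDispatch) are made up for illustration and are not RTL identifiers. The SSE4.2 test uses CPUID leaf 1, ECX bit 20. No PCMPxSTRx routine is shown: both branches call System.CompareByte, and the first one is only a placeholder for where such a routine would go.

```pascal
program comparebyte_dispatch;
{$mode objfpc}
{$asmmode att}

{ Hypothetical helper: CPUID leaf 1 reports SSE4.2 in ECX bit 20. }
function CpuHasSSE42: Boolean; assembler; nostackframe;
asm
        pushq   %rbx            // CPUID overwrites RBX, which is callee-saved
        movl    $1, %eax
        cpuid
        shrl    $20, %ecx
        andl    $1, %ecx
        movl    %ecx, %eax      // result: 0 or 1
        popq    %rbx
end;

var
  HasSSE42Flag: Boolean;        // hypothetical flag, written once at startup

function CompareByteDispatch(const buf1, buf2; len: SizeInt): SizeInt;
begin
  if HasSSE42Flag then
    // placeholder: an SSE4.2 based routine would be called here
    Result := CompareByte(buf1, buf2, len)
  else
    Result := CompareByte(buf1, buf2, len);
end;

var
  a: array[0..3] of Byte = (1, 2, 3, 4);
  b: array[0..3] of Byte = (1, 2, 9, 4);
begin
  HasSSE42Flag := CpuHasSSE42;
  WriteLn('SSE4.2 available: ', HasSSE42Flag);
  WriteLn(CompareByteDispatch(a, b, 4) < 0);   // TRUE: 3 < 9 at the first mismatch
end.
```

Because the flag is written once and never changes afterwards, the branch always goes the same way on a given machine, which is the property Florian relies on for branch prediction.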
Re: [fpc-devel] x86_64.inc CompareByte
On 16.10.2017 at 23:04, "Markus Beth" wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>> BTW: I would really like to see a PCMPSTR based implementation :)
>
> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
> SSE4.2. How would you deal with Intel core microarchitecture CPUs that
> don't have it?

It could be selected at runtime; after all, CPUID can always be checked.

Regards,
Sven
Re: [fpc-devel] x86_64.inc CompareByte
On 16.10.2017 22:41, Florian Klämpfl wrote:
>> P.S.: I am currently working on another version of CompareByte that might
>> have a slightly higher latency for very small len but a higher throughput
>> (2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere
>> microarchitecture)). But this would need some more testing and
>> benchmarking. I can come up with it here again if this would be of any
>> interest.
>
> Small lengths in terms of matching string or overall lengths?

It is small length in terms of matching string as there is some setup work before the loop.

> BTW: I would really like to see a PCMPSTR based implementation :)

PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of SSE4.2. How would you deal with Intel core microarchitecture CPUs that don't have it?
Re: [fpc-devel] x86_64.inc CompareByte
On 16.10.2017 at 22:33, Markus Beth wrote:
> Sorry for the late reply. I had a weekend off(line).
>
> The instructions were chosen on purpose and Sergey already cited the part
> of the Intel documentation that explains why this is correct.
> [...]

Yes, Sergey is of course right, it was too late yesterday :)

> [...]
> P.S.: I am currently working on another version of CompareByte that might
> have a slightly higher latency for very small len but a higher throughput
> (2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere
> microarchitecture)). But this would need some more testing and
> benchmarking. I can come up with it here again if this would be of any
> interest.

Small lengths in terms of matching string or overall lengths?

BTW: I would really like to see a PCMPSTR based implementation :)
Re: [fpc-devel] x86_64.inc CompareByte
Sorry for the late reply. I had a weekend off(line).

The instructions were chosen on purpose and Sergey already cited the part of the Intel documentation that explains why this is correct. You can find a similar part in AMD "AMD64 Architecture Programmer's Manual Volume 1: Application Programming":

> 3.4.5 High 32 Bits
> In 64-bit mode, the following rules apply to extension of results into
> the high 32 bits when results smaller than 64 bits are written:
>
> * Zero-Extension of 32-Bit Results: 32-bit results are zero-extended
>   into the high 32 bits of 64-bit GPR destination registers.

I think other x86_64 CPU manufacturers also adhere to this rule, as I know gcc also relies on it.

I generally prefer the instructions operating on 32 bit operands over those operating on 64 bit operands where appropriate, because they are typically encoded in fewer bytes as they do not need a REX prefix.

I have updated the patch (attached) to include a code path for 'oldbinutils' as Gareth suggested. In addition I switched the tails (.LCmpbyteZero and .LCmpbyteExitFast): when we leave the loop because the loop count reaches zero, we already know that the last bytes were the same and do not need to subq them.

Markus

P.S.: I am currently working on another version of CompareByte that might have a slightly higher latency for very small len but a higher throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere microarchitecture)). But this would need some more testing and benchmarking. I can come up with it here again if this would be of any interest.

On 16.10.2017 19:41, Сергей Сергеенко wrote:
> [...]
> So, instructions
>     movzbl (%rcx),%eax
> and
>     movzbl -1(%rdx),%ecx
> and
>     xorl %eax,%eax
> should put zero into 32 high bits of appropriate registers.
> [...]

Index: trunk/rtl/x86_64/x86_64.inc
===
--- trunk/rtl/x86_64/x86_64.inc (revision 37477)
+++ trunk/rtl/x86_64/x86_64.inc (working copy)
@@ -645,22 +645,30 @@
         .balign 8
 .LCmpbyteLoop:
-        movb    (%rcx),%r9b
-        cmpb    (%rdx),%r9b
+{$ifdef oldbinutils}
+// for the reason why this alternate coding of movzbl is given here
+// see the comments in FillChar above
+        .byte 0x0F,0xB6,0x01
+{$else}
+        movzbl  (%rcx),%eax
+{$endif}
+        cmpb    (%rdx),%al
         leaq    1(%rcx),%rcx
         leaq    1(%rdx),%rdx
         jne     .LCmpbyteExitFast
         decq    %r8
         jne     .LCmpbyteLoop
+.LCmpbyteZero:
+        xorl    %eax,%eax
+        retq
+
 .LCmpbyteExitFast:
-        movzbq  -1(%rdx),%r8    { Compare last position }
-        movzbq  %r9b,%rax
-        subq    %r8,%rax
-        ret
-
-.LCmpbyteZero:
-        movq    $0,%rax
-        ret
+{$ifdef oldbinutils}
+        .byte 0x0F,0xB6,0x4A,0xFF
+{$else}
+        movzbl  -1(%rdx),%ecx   { Compare last position }
+{$endif}
+        subq    %rcx,%rax
 end;
 {$endif FPC_SYSTEM_HAS_COMPAREBYTE}
Re: [fpc-devel] x86_64.inc CompareByte
On 15 Oct 2017 Florian Klämpfl wrote:
> I had a look and tested it and it worked, I didn't notice the problem below
> either.

Sorry for the wrong warning. I cannot provide any example where my suggestion is actually needed. The reason is described on page Vol. 1 3-13 of the Intel 64 and IA-32 Architectures Software Developer's Manual:

> When in 64-bit mode, operand size determines the number of valid bits in
> the destination general-purpose register:
>
> [...]
>
> 32-bit operands generate a 32-bit result, zero-extended to a 64-bit
> result in the destination general-purpose
>
> [...]

So, the instructions

    movzbl (%rcx),%eax

and

    movzbl -1(%rdx),%ecx

and

    xorl %eax,%eax

should put zeros into the high 32 bits of the respective registers.

> I think also the final xor should be a xorq %rax,%rax, right?

As I said above, xorl %eax, %eax should be enough.

--
With best regards
Sergey
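The rule quoted from the Intel manual can be checked directly. The sketch below (x86-64 only, function names made up for illustration) fills RAX with ones and then writes to it once through a 32-bit and once through an 8-bit destination. Only the 32-bit write clears the upper half; an 8-bit write such as movb leaves the remaining bits untouched.

```pascal
program zeroext_demo;
{$mode objfpc}
{$asmmode att}

{ Hypothetical demo routines; the QWord result is returned in RAX. }
function After32BitWrite: QWord; assembler; nostackframe;
asm
        movq    $-1, %rax       // RAX = $FFFFFFFFFFFFFFFF
        xorl    %eax, %eax      // 32-bit destination: bits 63..32 are zeroed too
end;

function After8BitWrite: QWord; assembler; nostackframe;
asm
        movq    $-1, %rax
        movb    $5, %al         // 8-bit destination: bits 63..8 keep their value
end;

begin
  WriteLn(HexStr(After32BitWrite, 16));  // 0000000000000000
  WriteLn(HexStr(After8BitWrite, 16));   // FFFFFFFFFFFFFF05
end.
```

This is why a 64-bit subq is safe after two movzbl loads, while the same subtraction after plain movb loads would depend on whatever the upper register bits held before.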
Re: [fpc-devel] x86_64.inc CompareByte
On 12.10.2017 at 20:37, sserg...@gmail.com wrote:
> Hi.
>
> Sorry for late message. But nobody still have said about possible problem
> with suggested patch.

Well, it's always very hard to review such highly optimized code. I had a look and tested it and it worked, I didn't notice the problem below either.

> So I decide to pay attention on that proposed code may be incorrect under
> some circumstances IMHO.
>
> Instruction on line 657
>
>     subq %rcx, %rax
>
> decreases value in %rax on %rcx, but previous code doesn't set any value to
> 32 high bits of %rax and 32 high bits of %rcx still contain 32 high bits of
> buf1 address. So I think that correct result is not guarantied.
>
> I suggest to use movzbq instead of movzbl to fix this issue.

I think also the final xor should be a xorq %rax,%rax, right?
Re: [fpc-devel] x86_64.inc CompareByte
Hi.

Sorry for the late message, but nobody has yet mentioned a possible problem with the suggested patch. So I would like to point out that the proposed code may be incorrect under some circumstances, IMHO.

The instruction on line 657

    subq %rcx, %rax

subtracts %rcx from %rax, but the preceding code does not set the high 32 bits of %rax, and the high 32 bits of %rcx still contain the high 32 bits of the buf1 address. So I think that a correct result is not guaranteed.

I suggest using movzbq instead of movzbl to fix this issue.

--
With best regards
Sergey
Re: [fpc-devel] x86_64.inc CompareByte
Oops, I just realised now that that byte sequence is your workaround, not actually already present in the code! My bad, but yes, that is the correct byte sequence (although it's worth putting a comment in to actually state what they are).

If the bug is known to still be present in GAS, then it might be a necessity - otherwise is it possible to flag it up to get it fixed in GAS rather than having to do the dangerous task of encoding direct machine code to get around it?

Gareth

On Sun 01/10/17 00:00, "J. Gareth Moreton" gar...@moreton-family.com sent:
> I can answer one question... the byte sequence 0F B6 01 is the direct
> machine code representation of movzbl (%rcx),%eax - this might be due to a
> bug with the assembler or movzbl not being recognised (I had to do the
> same thing with xgetbv once).
> [...]
Re: [fpc-devel] x86_64.inc CompareByte
Hi Markus,

Nice to see there's more than one person working to improve compiled code on x86-64!

I can answer one question... the byte sequence 0F B6 01 is the direct machine code representation of movzbl (%rcx),%eax - this might be due to a bug with the assembler or movzbl not being recognised (I had to do the same thing with xgetbv once).

Gareth Moreton

On Sat 30/09/17 23:24, Markus Beth markus.b...@zkrd.de sent:
> To achieve this I used movzbl twice. But then I came across the comment
> in FillChar (also in rtl/x86_64/x86_64.inc) about movzbl breaking
> targets using external GAS (Mantis #19188). As this Mantis issue is
> dated back in 2011 my question is: Is this still valid? And what would
> be the preferred way to overcome this issue?
>
>     {$ifdef oldbinutils}
>     .byte 0x0F,0xb6,0x01
>     {$else}
>     movzbl (%rcx),%eax
>     {$endif}
> [...]
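For readers who want to check the hand-encoded bytes themselves: 0F B6 is the opcode of MOVZX r32, r/m8, and the byte after it is the ModRM byte, which splits into mod (bits 7..6), reg (bits 5..3) and rm (bits 2..0). The sketch below only does that split; ShowModRM is a made-up helper, and the register names in the comments follow the standard numbering (0 = EAX/RAX, 1 = ECX/RCX, 2 = EDX/RDX).

```pascal
program modrm_decode;
{$mode objfpc}

{ Hypothetical helper: splits a ModRM byte into its three fields. }
procedure ShowModRM(b: Byte);
begin
  WriteLn('ModRM $', HexStr(b, 2),
          ': mod=', b shr 6,
          ' reg=', (b shr 3) and 7,
          ' rm=', b and 7);
end;

begin
  ShowModRM($01);  // mod=0 reg=0 rm=1: destination EAX, source (%rcx)
  ShowModRM($4A);  // mod=1 reg=1 rm=2: destination ECX, source disp8(%rdx)
end.
```

With mod=1 a one-byte displacement follows, so 0F B6 4A FF reads as movzbl -1(%rdx),%ecx, the second hand-encoded sequence used in the later revision of the patch.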
[fpc-devel] x86_64.inc CompareByte
I made some changes to CompareByte in rtl/x86_64/x86_64.inc to reduce the code size and make it run faster (see attached patch). I was successful with the code size reduction (47 bytes vs. 62 bytes) and also with the speed (according to a micro benchmark [1] run on an Ivy Bridge desktop).

To achieve this I used movzbl twice. But then I came across the comment in FillChar (also in rtl/x86_64/x86_64.inc) about movzbl breaking targets using external GAS (Mantis #19188). As this Mantis issue dates back to 2011, my question is: Is this still valid? And what would be the preferred way to overcome this issue?

    {$ifdef oldbinutils}
    .byte 0x0F,0xb6,0x01
    {$else}
    movzbl (%rcx),%eax
    {$endif}

Markus

[1] the benchmark compares a 10 MB memory block with itself 1 times

Index: trunk/rtl/x86_64/x86_64.inc
===
--- trunk/rtl/x86_64/x86_64.inc (revision 37365)
+++ trunk/rtl/x86_64/x86_64.inc (working copy)
@@ -645,8 +645,8 @@
         .balign 8
 .LCmpbyteLoop:
-        movb    (%rcx),%r9b
-        cmpb    (%rdx),%r9b
+        movzbl  (%rcx),%eax
+        cmpb    (%rdx),%al
         leaq    1(%rcx),%rcx
         leaq    1(%rdx),%rdx
         jne     .LCmpbyteExitFast
@@ -653,14 +653,12 @@
         decq    %r8
         jne     .LCmpbyteLoop
 .LCmpbyteExitFast:
-        movzbq  -1(%rdx),%r8    { Compare last position }
-        movzbq  %r9b,%rax
-        subq    %r8,%rax
-        ret
+        movzbl  -1(%rdx),%ecx   { Compare last position }
+        subq    %rcx,%rax
+        retq

 .LCmpbyteZero:
-        movq    $0,%rax
-        ret
+        xorl    %eax,%eax
 end;
 {$endif FPC_SYSTEM_HAS_COMPAREBYTE}
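As a reference point for what both the old and the patched assembler compute, here is a plain Pascal sketch of the same contract as far as it can be read from the diff: 0 if the first len bytes are equal, otherwise the byte of buf1 minus the byte of buf2 at the first mismatch. CompareByteRef is a made-up name and not part of the RTL; callers would normally rely only on the sign of the result.

```pascal
program comparebyte_ref;
{$mode objfpc}

{ Hypothetical reference routine; mirrors the behaviour read from the diff. }
function CompareByteRef(const buf1, buf2; len: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  p1 := @buf1;
  p2 := @buf2;
  for i := 0 to len - 1 do
    if p1[i] <> p2[i] then
      Exit(SizeInt(p1[i]) - SizeInt(p2[i]));  // difference at first mismatch
  Result := 0;                                // all bytes equal, or len = 0
end;

var
  a: array[0..3] of Byte = (1, 2, 3, 4);
  b: array[0..3] of Byte = (1, 2, 9, 4);
begin
  WriteLn(CompareByteRef(a, b, 4));  // 3 - 9 = -6
  WriteLn(CompareByteRef(a, b, 2));  // 0
  WriteLn(CompareByteRef(a, a, 0));  // 0
end.
```

A routine like this is handy as an oracle when testing a rewritten assembler version against random buffers.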