Re: [fpc-devel] x86_64.inc CompareByte

2017-10-31 Thread C Western

On 31/10/17 11:47, Florian Klämpfl wrote:
> On 30.10.2017 at 19:46, C Western wrote:
>> On 29/10/17 22:18, Florian Klämpfl wrote:
>>> I have committed your latest patch with a few changes: the loop entry is
>>> now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
>>> add. For me (Haswell CPU) this works better. I think these changes are
>>> also better on average.
>>
>> With this patch on x86_64 Linux, Lazarus crashes at random places, but
>> quite frequently, and
>
> My mistake, I fixed it.
>
>> CompareByte seems to be implicated. Should the zero exit be:
>>
>> xorl    %rax, %rax
>
> No, this is fine. This also clears the upper 32 bits.

You can probably tell I haven't done much assembler programming
recently. I am happy to confirm that Lazarus now seems much more stable.


Thanks

Colin

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-31 Thread Florian Klämpfl
On 30.10.2017 at 19:46, C Western wrote:
> On 29/10/17 22:18, Florian Klämpfl wrote:
>>
>> I have committed your latest patch with a few changes: the loop entry is
>> now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
>> add. For me (Haswell CPU) this works better. I think these changes are
>> also better on average.
>>
> With this patch on x86_64 Linux, Lazarus crashes at random places, but quite
> frequently, and

My mistake, I fixed it.

> CompareByte seems to be implicated. Should the zero exit be:
>
> xorl    %rax, %rax

No, this is fine. This also clears the upper 32 bits.


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-30 Thread C Western

On 29/10/17 22:18, Florian Klämpfl wrote:
> I have committed your latest patch with a few changes: the loop entry is
> now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
> add. For me (Haswell CPU) this works better. I think these changes are
> also better on average.

With this patch on x86_64 Linux, Lazarus crashes at random places, but
quite frequently, and CompareByte seems to be implicated. Should the
zero exit be:

xorl    %rax, %rax

Colin


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-29 Thread Florian Klämpfl
On 23.10.2017 at 22:58, Markus Beth wrote:
> Here are the numbers for an Ivy Bridge CPU:
> The output for [1] using the current RTL CompareByte is:
>   9.001.275.281   cycles:u    ( +-  0,00% )
>  28.000.560.462   instructions:u #   3,11  insn per cycle ( +-  0,00% )
>   2,654735815 seconds time elapsed    ( +-  0,00% )
> 
> The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
>   9.002.038.628   cycles:u    ( +-  0,01% )
>  26.000.559.441   instructions:u #   2,89  insn per cycle ( +-  0,00% )
>   2,655002891 seconds time elapsed    ( +-  0,01% )
> 
> The output for [2] using the current RTL CompareByte is:
> 227.941.173.371   cycles:u    ( +-  0,00% )
> 734.077.388.160   instructions:u #   3,22  insn per cycle ( +-  0,00% )
>  67,215188648 seconds time elapsed    ( +-  0,00% )
> 
> The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
> 210.694.292.040   cycles:u    ( +-  0,00% )
> 524.341.215.569   instructions:u #   2,49  insn per cycle ( +-  0,00% )
>  62,129294243 seconds time elapsed    ( +-  0,00% )
> 
> 
> With Florian's benchmark I also observe that the patched version is
> slightly slower than the original. But I have no idea why this is so.

I have committed your latest patch with a few changes: the loop entry is
now aligned to 16 bytes, I used movb instead of movzbl and inc instead of
add. For me (Haswell CPU) this works better. I think these changes are
also better on average.


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-23 Thread Markus Beth

Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
  9.001.275.281   cycles:u    ( +-  0,00% )
 28.000.560.462   instructions:u #   3,11  insn per cycle ( +-  0,00% )
  2,654735815 seconds time elapsed    ( +-  0,00% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
  9.002.038.628   cycles:u    ( +-  0,01% )
 26.000.559.441   instructions:u #   2,89  insn per cycle ( +-  0,00% )
  2,655002891 seconds time elapsed    ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
227.941.173.371   cycles:u    ( +-  0,00% )
734.077.388.160   instructions:u #   3,22  insn per cycle ( +-  0,00% )
 67,215188648 seconds time elapsed    ( +-  0,00% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
210.694.292.040   cycles:u    ( +-  0,00% )
524.341.215.569   instructions:u #   2,49  insn per cycle ( +-  0,00% )
 62,129294243 seconds time elapsed    ( +-  0,00% )


With Florian's benchmark I also observe that the patched version is
slightly slower than the original. But I have no idea why this is so.



Re: [fpc-devel] x86_64.inc CompareByte

2017-10-23 Thread Martok
Using the code given below as "inner", I measure this:

Current Trunk:
O0 compare-byte-1 : 196065.112 +/- 896.754 cycles/inner [0.5 %CV 1.6 %R]
O1 compare-byte-1 : 196510.158 +/- 577.976 cycles/inner [0.3 %CV 1.1 %R]
O3 compare-byte-1 : 187540.922 +/- 706.167 cycles/inner [0.4 %CV 1.5 %R]
Patch from 2017-10-21:
O0 compare-byte-2 : 175831.632 +/- 965.972 cycles/inner [0.5 %CV 2.1 %R]
O1 compare-byte-2 : 176039.560 +/- 527.141 cycles/inner [0.3 %CV 1.0 %R]
O3 compare-byte-2 : 158527.167 +/- 661.690 cycles/inner [0.4 %CV 1.5 %R]
(%CV: coefficient of variation * 100%. %R: span as % of mean)

CPU:
 Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz Family 6 Model 60 Stepping 3 (Haswell)
 true single core clock (measured) 2.83 GHz


So the new version is a bit faster, but not by a large margin (10-15%). It is
statistically significant though.
While I'm at it, i386 could use some love:
O1 compare-byte-1 :  755247.183 +/- 8125.671 cycles/inner [1.1 %CV 4.5 %R]
That's 3.8 times slower than x64 for exactly the same code.

Code:
len:=random(100);
for j:=0 to len-1 do
  begin
    buf1[j]:=random(256);
    buf2[j]:=random(256);
  end;

for j:=0 to random(10) do
  buf2[j]:=buf1[j];

for j:=1 to 1 do
  CompareBytePatch(buf1,buf2,len);  // or System.CompareByte


-- 
Regards,
Martok

Ceterum censeo b32079 esse sanandam.



Re: [fpc-devel] x86_64.inc CompareByte

2017-10-22 Thread Markus Beth

I used 2 different benchmarks. One for (very) short buffers [1] and one
for rather large buffers [2].

[1]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;

begin
  key  := 'A';
  key2 := 'A';
  for i:= 0 to 10 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.

[2]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;

begin
  SetLength(key,10240 * 1024);
  SetLength(key2,10240 * 1024);
  for i:= 0 to 1 do begin
    res := CompareByte_RTL(key[1], key2[1], Length(key));
  end;
end.


The measurements were taken on an Intel Core i5 CPU M520@2.40GHz, which
has the Westmere microarchitecture. The programs are run on an otherwise
idle Linux (OpenSuse Tumbleweed) system via

perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte

after the following setup:
 cpupower frequency-set -g performance
 echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
 echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)


The output for [1] using the current RTL CompareByte is:
 11.336.449.124   cycles  ( +-  0,05% )
 28.077.280.776   instructions   #   2,48  insn per cycle ( +-  0,00% )
  4,736782553 seconds time elapsed    ( +-  0,05% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
 10.293.397.316   cycles  ( +-  0,01% )
 26.070.305.490   instructions   #   2,53  insn per cycle ( +-  0,00% )
  4,301081734 seconds time elapsed    ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
325.526.707.243   cycles  ( +-  0,31% )
736.237.912.850   instructions   #   2,26  insn per cycle ( +-  0,00% )
136,013215979 seconds time elapsed    ( +-  0,31% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410   cycles  ( +-  0,95% )
525.832.575.056   instructions   #   2,34  insn per cycle ( +-  0,00% )
 93,851685247 seconds time elapsed    ( +-  0,95% )


I hope to have the corresponding numbers for an Ivy Bridge CPU
tomorrow.




Re: [fpc-devel] x86_64.inc CompareByte

2017-10-22 Thread Florian Klämpfl
On 21.10.2017 at 01:24, Markus Beth wrote:
> Find attached the already announced version of CompareByte.
>

What benchmark did you use? In my tests it is slightly slower than the one in
FPC 3.0.x.

I used the following test program:

var
  buf1,buf2 : array[0..127] of byte;
  pos,len,i,j : longint;

begin
  for i:=1 to 100 do
    begin
      len:=random(100);
      for j:=0 to len-1 do
        begin
          buf1[j]:=random(256);
          buf2[j]:=random(256);
        end;

      for j:=0 to random(10) do
        buf2[j]:=buf1[j];

      for j:=1 to 100 do
        CompareByte(buf1,buf2,len);
    end;
end.



Re: [fpc-devel] x86_64.inc CompareByte

2017-10-21 Thread Marco van de Voort
In our previous episode, Markus Beth said:
> Find attached the already announced version of CompareByte.
> 
> BTW: If you really like to see a PCMPSTR based implementation, have a
> look at Agner Fog's Subroutine library asmlib.zip
> (http://agner.org/optimize/).

And then you see GPL licensed, and move on :-)

GPL is not suitable for the RTL.


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-20 Thread Markus Beth

Find attached the already announced version of CompareByte.

BTW: If you really like to see a PCMPSTR based implementation, have a
look at Agner Fog's Subroutine library asmlib.zip
(http://agner.org/optimize/).


Index: trunk/rtl/x86_64/x86_64.inc
===================================================================
--- trunk/rtl/x86_64/x86_64.inc	(revision 37497)
+++ trunk/rtl/x86_64/x86_64.inc	(working copy)
@@ -640,27 +640,36 @@
 mov     %rsi, %rdx
 mov     %rdi, %rcx
 {$endif win64}
-testq   %r8,%r8
-je      .LCmpbyteZero
+negq    %r8
+jz      .LCmpbyteZero
 
+subq    %r8, %rcx
+subq    %r8, %rdx
+
 .balign 8
 .LCmpbyteLoop:
-movb    (%rcx),%r9b
-cmpb    (%rdx),%r9b
-leaq    1(%rcx),%rcx
-leaq    1(%rdx),%rdx
+{$ifdef oldbinutils}
+// for the reason why this alternate coding of movzbl is given here
+// see the comments in FillChar above
+.byte 0x42,0x0F,0xB6,0x04,0x01
+{$else}
+movzbl  (%rcx,%r8), %eax
+{$endif}
+cmpb    (%rdx,%r8), %al
 jne     .LCmpbyteExitFast
-decq    %r8
+addq    $1, %r8
 jne     .LCmpbyteLoop
+.LCmpbyteZero:
+ xorl    %eax, %eax
+ retq
+
 .LCmpbyteExitFast:
- movzbq  -1(%rdx),%r8 { Compare last position }
- movzbq  %r9b,%rax
- subq    %r8,%rax
- ret
-
-.LCmpbyteZero:
- movq    $0,%rax
- ret
+{$ifdef oldbinutils}
+.byte 0x42,0x0F,0xB6,0x0C,0x02
+{$else}
+ movzbl  (%rdx,%r8), %ecx { Compare last position }
+{$endif}
+ subq    %rcx, %rax
 end;
 {$endif FPC_SYSTEM_HAS_COMPAREBYTE}
 
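For readers following the patch: the rewritten loop moves both pointers to the
end of their buffers and then walks a single negative index up to zero, so one
register serves as offset and as loop counter. A minimal Pascal sketch of that
idea follows. It is an illustration only, not the RTL code; the name
CompareByteSketch is made up here, and the difference-of-bytes return value
mirrors the final subq in the patch.

```pascal
program NegIndexCompare;
{$mode objfpc}

{ Hypothetical helper, not the RTL routine: compares len bytes and returns
  0 if they are equal, otherwise the difference of the first mismatching
  bytes (buf1 byte minus buf2 byte). }
function CompareByteSketch(const buf1, buf2; len: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  Result := 0;
  if len <= 0 then
    Exit;
  { Point both pointers one past the end, as the two subq instructions do. }
  p1 := PByte(@buf1) + len;
  p2 := PByte(@buf2) + len;
  { One index runs from -len up to 0: it is both offset and loop counter. }
  i := -len;
  repeat
    if p1[i] <> p2[i] then
      Exit(SizeInt(p1[i]) - SizeInt(p2[i]));
    Inc(i);
  until i = 0;
end;

var
  a: array[0..3] of Byte = (1, 2, 3, 4);
  b: array[0..3] of Byte = (1, 2, 9, 4);
begin
  WriteLn(CompareByteSketch(a, a, 4));
  WriteLn(CompareByteSketch(a, b, 4));
end.
```

With the sample data, the first call compares a buffer with itself and yields
0; the second stops at index 2 and yields 3 - 9 = -6.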


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-17 Thread Florian Klämpfl
On 16.10.2017 at 23:08, Markus Beth wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>> P.S.: I am currently working on another version of CompareByte that might 
>>> have a slightly higher
>>> latency for very small len but a higher throughput (2 cycles per iteration 
>>> vs. 3 cycles on an Intel
>>> Arrandale CPU (Westmere microarchitecture)). But this would need some more 
>>> testing and benchmarking.
>>> I can come up with it here again if this would be of any interest.
>>
>> Small lengths in terms of matching string or overall lengths?
> 
> It is small length in terms of matching string as there is some setup work 
> before the loop.
> 
>> BTW: I would really like to see a PCMPSTR based implementation :)
> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of 
> SSE4.2. How would you deal
> with Intel core microarchitecture CPUs that don't have it?

Just set a flag at startup if it is supported and then branch on the flag. As
the flag never changes, branch prediction will most likely work very well.
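
The flag-and-branch approach could look roughly like the sketch below.
Everything in it is an assumption made for illustration: the names HasSSE42,
CpuidLeaf1Ecx, CompareScalar, CompareSSE42 and CompareDispatch are
hypothetical, and the SSE4.2 path simply reuses System.CompareByte because no
PCMPSTR routine exists in this thread. The only hardware fact relied on is
that CPUID leaf 1 reports SSE4.2 in bit 20 of ECX.

```pascal
program DispatchSketch;
{$mode objfpc}

var
  { Set once at startup and never changed afterwards. }
  HasSSE42: Boolean = False;

{$ifdef CPUX86_64}
{$asmmode att}
{ Returns ECX of CPUID leaf 1; bit 20 reports SSE4.2. }
function CpuidLeaf1Ecx: LongWord; assembler; nostackframe;
asm
  movq  %rbx, %r8        { CPUID overwrites RBX, which is callee-saved }
  movl  $1, %eax
  cpuid
  movl  %ecx, %eax
  movq  %r8, %rbx
end;
{$endif}

{ Hypothetical stand-ins for the two code paths. }
function CompareScalar(const a, b; len: SizeInt): SizeInt;
begin
  Result := CompareByte(a, b, len);
end;

function CompareSSE42(const a, b; len: SizeInt): SizeInt;
begin
  { A PCMPSTR based loop would go here; the sketch reuses the scalar path. }
  Result := CompareByte(a, b, len);
end;

{ The branch on a never-changing flag that the message describes. }
function CompareDispatch(const a, b; len: SizeInt): SizeInt;
begin
  if HasSSE42 then
    Result := CompareSSE42(a, b, len)
  else
    Result := CompareScalar(a, b, len);
end;

var
  x: array[0..2] of Byte = (1, 2, 3);
  y: array[0..2] of Byte = (1, 2, 4);
begin
{$ifdef CPUX86_64}
  HasSSE42 := (CpuidLeaf1Ecx and (LongWord(1) shl 20)) <> 0;
{$endif}
  WriteLn('SSE4.2 available: ', HasSSE42);
  WriteLn(CompareDispatch(x, y, 3));
end.
```

A procedural variable assigned once at startup would be an alternative to the
branch; which of the two is cheaper would have to be measured.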



Re: [fpc-devel] x86_64.inc CompareByte

2017-10-17 Thread Sven Barth via fpc-devel
On 16.10.2017 at 23:04, "Markus Beth" wrote:
>
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>> BTW: I would really like to see a PCMPSTR based implementation :)
>
> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
> SSE4.2. How would you deal with Intel core microarchitecture CPUs that
> don't have it?

It could be selected at runtime, after all CPUID can always be checked.

Regards,
Sven


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Markus Beth

On 16.10.2017 22:41, Florian Klämpfl wrote:
>> P.S.: I am currently working on another version of CompareByte that might
>> have a slightly higher latency for very small len but a higher throughput
>> (2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere
>> microarchitecture)). But this would need some more testing and benchmarking.
>> I can come up with it here again if this would be of any interest.
>
> Small lengths in terms of matching string or overall lengths?

It is small length in terms of matching string as there is some setup
work before the loop.

> BTW: I would really like to see a PCMPSTR based implementation :)

PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
SSE4.2. How would you deal with Intel core microarchitecture CPUs that
don't have it?



Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Florian Klämpfl
On 16.10.2017 at 22:33, Markus Beth wrote:
> Sorry for the late reply. I had a weekend off(line).
> 
> The instructions were chosen on purpose and Sergey already cited the part of 
> the Intel documentation
> that explains why this is correct. You can find a similar part in AMD "AMD64 
> Architecture
> Programmer’s Manual Volume 1: Application Programming":

Yes, Sergey is of course right, it was too late yesterday :)

> 
>> 3.4.5 High 32 Bits
>> In 64-bit mode, the following rules apply to extension of results into
>> the high 32 bits when results smaller than 64 bits are written:
>>
>> * Zero-Extension of 32-Bit Results: 32-bit results are zero-extended
>>   into the high 32 bits of 64-bit GPR destination registers.
> 
> I think other x86_64 CPU manufacturers also adhere to this rule as I know gcc 
> also relies on this.
> 
> I generally prefer the instructions operating on 32 bit operands over those
> operating on 64 bit operands where appropriate because they are typically
> encoded in fewer bytes as they do not need a REX prefix.
> 
> I have updated the patch (attached) to include a code path for 'oldbinutils' 
> as Gareth suggested. In
> addition I switched the tails (.LCmpbyteZero and .LCmpbyteExitFast) as when 
> we leave the loop
> because the loop count reaches zero, we know already that the last bytes were 
> the same and do not
> need to subq them.
> 
> Markus
> 
> P.S.: I am currently working on another version of CompareByte that might 
> have a slightly higher
> latency for very small len but a higher throughput (2 cycles per iteration 
> vs. 3 cycles on an Intel
> Arrandale CPU (Westmere microarchitecture)). But this would need some more 
> testing and benchmarking.
> I can come up with it here again if this would be of any interest.

Small lengths in terms of matching string or overall lengths?

BTW: I would really like to see a PCMPSTR based implementation :)



Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Markus Beth

Sorry for the late reply. I had a weekend off(line).

The instructions were chosen on purpose and Sergey already cited the 
part of the Intel documentation that explains why this is correct. You 
can find a similar part in AMD "AMD64 Architecture Programmer’s Manual 
Volume 1: Application Programming":


> 3.4.5 High 32 Bits
> In 64-bit mode, the following rules apply to extension of results into
> the high 32 bits when results smaller than 64 bits are written:
>
> * Zero-Extension of 32-Bit Results: 32-bit results are zero-extended
>   into the high 32 bits of 64-bit GPR destination registers.

I think other x86_64 CPU manufacturers also adhere to this rule as I 
know gcc also relies on this.


I generally prefer the instructions operating on 32 bit operands over
those operating on 64 bit operands where appropriate because they are
typically encoded in fewer bytes as they do not need a REX prefix.
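
To put numbers on the size argument, the sketch below lists the encodings of
three ways to zero RAX and prints their lengths. The byte values are the usual
encodings emitted by GNU as, where 48 is the REX.W prefix; the constant names
are made up for the example and nothing here is taken from the patch itself.

```pascal
program EncodingSizes;
{$mode objfpc}

const
  { $48 is the REX.W prefix that selects 64-bit operand size. }
  XorL: array[0..1] of Byte = ($31, $C0);                           { xorl %eax,%eax }
  XorQ: array[0..2] of Byte = ($48, $31, $C0);                      { xorq %rax,%rax }
  MovQ: array[0..6] of Byte = ($48, $C7, $C0, $00, $00, $00, $00);  { movq $0,%rax }

begin
  WriteLn('xorl %eax,%eax : ', Length(XorL), ' bytes');
  WriteLn('xorq %rax,%rax : ', Length(XorQ), ' bytes');
  WriteLn('movq $0,%rax   : ', Length(MovQ), ' bytes');
end.
```

All three leave RAX at zero; the 32-bit form is the shortest at two bytes,
against three and seven for the other two.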


I have updated the patch (attached) to include a code path for 
'oldbinutils' as Gareth suggested. In addition I switched the tails 
(.LCmpbyteZero and .LCmpbyteExitFast) as when we leave the loop because 
the loop count reaches zero, we know already that the last bytes were 
the same and do not need to subq them.


Markus

P.S.: I am currently working on another version of CompareByte that 
might have a slightly higher latency for very small len but a higher 
throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale 
CPU (Westmere microarchitecture)). But this would need some more testing 
and benchmarking. I can come up with it here again if this would be of 
any interest.



Index: trunk/rtl/x86_64/x86_64.inc
===================================================================
--- trunk/rtl/x86_64/x86_64.inc	(revision 37477)
+++ trunk/rtl/x86_64/x86_64.inc	(working copy)
@@ -645,22 +645,30 @@
 
 .balign 8
 .LCmpbyteLoop:
-movb    (%rcx),%r9b
-cmpb    (%rdx),%r9b
+{$ifdef oldbinutils}
+// for the reason why this alternate coding of movzbl is given here
+// see the comments in FillChar above
+.byte 0x0F,0xB6,0x01
+{$else}
+movzbl  (%rcx),%eax
+{$endif}
+cmpb    (%rdx),%al
 leaq    1(%rcx),%rcx
 leaq    1(%rdx),%rdx
 jne     .LCmpbyteExitFast
 decq    %r8
 jne     .LCmpbyteLoop
+.LCmpbyteZero:
+ xorl    %eax,%eax
+ retq
+
 .LCmpbyteExitFast:
- movzbq  -1(%rdx),%r8 { Compare last position }
- movzbq  %r9b,%rax
- subq    %r8,%rax
- ret
-
-.LCmpbyteZero:
- movq    $0,%rax
- ret
+{$ifdef oldbinutils}
+.byte 0x0F,0xB6,0x4A,0xFF
+{$else}
+ movzbl  -1(%rdx),%ecx { Compare last position }
+{$endif}
+ subq    %rcx,%rax
 end;
 {$endif FPC_SYSTEM_HAS_COMPAREBYTE}
 


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-16 Thread Сергей Сергеенко
On 15 Oct 2017 Florian Klämpfl wrote:
> I had a look and tested it and it worked, I didn't notice the problem below
> either.

Sorry for the false alarm. I cannot provide any example where my concern
applies. The reason is described on page 3-13 of Vol. 1 of the Intel 64
and IA-32 Architectures Software Developer's Manual:

> When in 64-bit mode, operand size determines the number of valid bits in
> the destination general-purpose register:
>
> [...]
>
>  32-bit operands generate a 32-bit result, zero-extended to a 64-bit
>  result in the destination general-purpose register.
>
> [...]

So the instructions
> movzbl  (%rcx),%eax
and
> movzbl  -1(%rdx),%ecx
and
> xorl    %eax,%eax
should zero the upper 32 bits of the respective registers.

> I think also the final xor should be a xorq %rax,%rax, right?

As I said above, xorl %eax, %eax should be enough.
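
The rule can be checked directly with two tiny assembler functions. This is a
sketch for x86_64 targets only, and the function names XorLow32 and XorLow8
are made up for the demonstration; neither comes from the patch.

```pascal
program ZeroExtendDemo;
{$mode objfpc}

{$ifdef CPUX86_64}
{$asmmode att}
{ Fills RAX with ones, then writes only its low 32 bits. }
function XorLow32: QWord; assembler; nostackframe;
asm
  movq  $-1, %rax        { RAX = $FFFFFFFFFFFFFFFF }
  xorl  %eax, %eax       { 32-bit write: the upper half is zeroed as well }
end;

{ Same start, but an 8-bit write leaves bits 8..63 untouched. }
function XorLow8: QWord; assembler; nostackframe;
asm
  movq  $-1, %rax
  xorb  %al, %al
end;
{$endif}

begin
{$ifdef CPUX86_64}
  WriteLn(HexStr(XorLow32, 16));
  WriteLn(HexStr(XorLow8, 16));
{$else}
  WriteLn('This sketch needs an x86_64 target.');
{$endif}
end.
```

On x86_64 the first line printed is 0000000000000000 and the second is
FFFFFFFFFFFFFF00, since only writes to a 32-bit register clear the upper half.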

--
With best regards
Sergey


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-15 Thread Florian Klämpfl
On 12.10.2017 at 20:37, sserg...@gmail.com wrote:
> Hi.
>
> Sorry for the late message, but nobody has yet mentioned a possible problem
> with the suggested patch.

Well, it's always very hard to review such highly optimized code. I had a look
and tested it and it worked, I didn't notice the problem below either.

> So I want to point out that the proposed code may be incorrect under some
> circumstances, IMHO.
>
> The instruction on line 657
>
> subq %rcx, %rax
>
> subtracts %rcx from %rax, but the preceding code does not set the upper 32
> bits of %rax, and the upper 32 bits of %rcx still contain the upper 32 bits
> of the buf1 address. So I think a correct result is not guaranteed.
>
> I suggest using movzbq instead of movzbl to fix this issue.

I think also the final xor should be a xorq %rax,%rax, right?


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-12 Thread sserg . me
Hi.

Sorry for the late message, but nobody has yet mentioned a possible problem
with the suggested patch. So I want to point out that the proposed code may be
incorrect under some circumstances, IMHO.

The instruction on line 657

subq %rcx, %rax

subtracts %rcx from %rax, but the preceding code does not set the upper 32
bits of %rax, and the upper 32 bits of %rcx still contain the upper 32 bits of
the buf1 address. So I think a correct result is not guaranteed.

I suggest using movzbq instead of movzbl to fix this issue.

--
With best regards
Sergey


Re: [fpc-devel] x86_64.inc CompareByte

2017-09-30 Thread J. Gareth Moreton
Oops, I just realised now that that byte sequence is your workaround, not 
actually already present in the 
code!  My bad, but yes, that is the correct byte sequence (although it's worth 
putting a comment in to 
actually state what they are).

If the bug is known to still be present in GAS, then it might be a necessity - 
otherwise is it possible to 
flag it up to get it fixed in GAS rather than having to do the dangerous task 
of encoding direct machine 
code to get around it?

Gareth


On Sun 01/10/17 00:00 , "J. Gareth Moreton" gar...@moreton-family.com sent:
> Hi Markus,
> 
> 
> 
> Nice to see there's more than one person working to improve compiled code
> on x86-64!
> 
> 
> I can answer one question... the byte sequence 0F B6 01 is the direct
> machine code representation of movzbl 
> (%rcx),%eax - this might be due to a bug with the assembler or movzbl not
> being recognised (I had to do the 
> same thing with xgetbv once).
> 
> 
> 
> Gareth Moreton
> 
> 
> 
> 
> 
> On Sat 30/09/17 23:24 , Markus Beth markus.b...@zkrd.de sent:
> > I made some changes to CompareByte in rtl/x86_64/x86_64.inc to reduce
> > the code size and make it run faster (see attached patch). I was
> > successful with the code size reduction (47 bytes vs. 62 bytes) and also
> > with the speed (according to a micro benchmark [1] run on an Ivy Bridge
> > desktop).
> > 
> > To achieve this I used movzbl twice. But then I came across the comment
> > in FillChar (also in rtl/x86_64/x86_64.inc) about movzbl breaking
> > targets using external GAS (Mantis #19188). As this Mantis issue
> > dates back to 2011, my question is: Is this still valid? And what would
> > be the preferred way to overcome this issue?
> > {$ifdef oldbinutils}
> > .byte 0x0F,0xb6,0x01
> > {$else}
> > movzbl (%rcx),%eax
> > {$endif}
> > 
> > Markus
> > 
> > [1] the benchmark compares a 10 MB memory block with itself 1 times



Re: [fpc-devel] x86_64.inc CompareByte

2017-09-30 Thread J. Gareth Moreton
Hi Markus,

Nice to see there's more than one person working to improve compiled code on
x86-64!

I can answer one question... the byte sequence 0F B6 01 is the direct machine
code representation of movzbl (%rcx),%eax - this might be due to a bug with
the assembler or movzbl not being recognised (I had to do the same thing with
xgetbv once).

Gareth Moreton


On Sat 30/09/17 23:24 , Markus Beth markus.b...@zkrd.de sent:
> I made some changes to CompareByte in rtl/x86_64/x86_64.inc to reduce 
> the code size and make it run faster (see attached patch). I was 
> successful with the code size reduction (47 bytes vs. 62 bytes) and also 
> with the speed (according to a micro benchmark [1] run on an Ivy Bridge 
> desktop).
> 
> To achieve this I used movzbl twice. But then I came across the comment 
> in FillChar (also in rtl/x86_64/x86_64.inc) about movzbl breaking 
> targets using external GAS (Mantis #19188). As this Mantis issue 
> dates back to 2011, my question is: Is this still valid? And what would 
> be the preferred way to overcome this issue?
> {$ifdef oldbinutils}
> .byte 0x0F,0xb6,0x01
> {$else}
> movzbl (%rcx),%eax
> {$endif}
> 
> Markus
> 
> [1] the benchmark compares a 10 MB memory block with itself 1 times



[fpc-devel] x86_64.inc CompareByte

2017-09-30 Thread Markus Beth
I made some changes to CompareByte in rtl/x86_64/x86_64.inc to reduce 
the code size and make it run faster (see attached patch). I was 
successful with the code size reduction (47 bytes vs. 62 bytes) and also 
with the speed (according to a micro benchmark [1] run on an Ivy Bridge 
desktop).


To achieve this I used movzbl twice. But then I came across the comment 
in FillChar (also in rtl/x86_64/x86_64.inc) about movzbl breaking 
targets using external GAS (Mantis #19188). As this Mantis issue 
dates back to 2011, my question is: Is this still valid? And what would 
be the preferred way to overcome this issue?

{$ifdef oldbinutils}
   .byte 0x0F,0xb6,0x01
{$else}
   movzbl (%rcx),%eax
{$endif}


Markus

[1] the benchmark compares a 10 MB memory block with itself 1 times
Index: trunk/rtl/x86_64/x86_64.inc
===================================================================
--- trunk/rtl/x86_64/x86_64.inc	(revision 37365)
+++ trunk/rtl/x86_64/x86_64.inc	(working copy)
@@ -645,8 +645,8 @@
 
 .balign 8
 .LCmpbyteLoop:
-        movb    (%rcx),%r9b
-        cmpb    (%rdx),%r9b
+        movzbl  (%rcx),%eax
+        cmpb    (%rdx),%al
         leaq    1(%rcx),%rcx
         leaq    1(%rdx),%rdx
         jne     .LCmpbyteExitFast
@@ -653,14 +653,12 @@
         decq    %r8
         jne     .LCmpbyteLoop
 .LCmpbyteExitFast:
-        movzbq  -1(%rdx),%r8    { Compare last position }
-        movzbq  %r9b,%rax
-        subq    %r8,%rax
-        ret
+        movzbl  -1(%rdx),%ecx   { Compare last position }
+        subq    %rcx,%rax
+        retq
 
 .LCmpbyteZero:
-        movq    $0,%rax
-        ret
+        xorl    %eax,%eax
 end;
 {$endif FPC_SYSTEM_HAS_COMPAREBYTE}
 