Re: [fpc-devel] x86_64.inc CompareByte

2017-10-23 Thread Markus Beth

Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
  9.001.275.281   cycles:u( +-  0,00% )
 28.000.560.462   instructions:u #   3,11  insn per cycle ( +-  0,00% )
  2,654735815 seconds time elapsed( +-  0,00% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
  9.002.038.628   cycles:u( +-  0,01% )
 26.000.559.441   instructions:u #   2,89  insn per cycle ( +-  0,00% )
  2,655002891 seconds time elapsed( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
227.941.173.371   cycles:u( +-  0,00% )
734.077.388.160   instructions:u #   3,22  insn per cycle ( +-  0,00% )
 67,215188648 seconds time elapsed( +-  0,00% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
210.694.292.040   cycles:u( +-  0,00% )
524.341.215.569   instructions:u #   2,49  insn per cycle ( +-  0,00% )
 62,129294243 seconds time elapsed( +-  0,00% )


With Florian's benchmark I also observe that the patched version is
slightly slower than the original. But I have no idea why this is so.
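
For reference, the relative savings implied by the perf output for [2] above can be worked out with a few lines of Pascal. This is only a sketch: the helper name ReductionPercent is made up for illustration, and the four constants are copied from the Ivy Bridge figures above. With those inputs it reports roughly 7.6 % fewer cycles and 28.6 % fewer instructions for the patched version.

```pascal
program perfdelta;
{$mode objfpc}

{ Hypothetical helper: relative reduction in percent when a counter
  drops from Before to After. }
function ReductionPercent(Before, After: Int64): Double;
begin
  Result := 100.0 * (Before - After) / Before;
end;

begin
  { Benchmark [2] on Ivy Bridge: current RTL vs. x86_64_comparebyte3.patch }
  WriteLn('cycles:       ',
          ReductionPercent(227941173371, 210694292040):0:2, ' % fewer');
  WriteLn('instructions: ',
          ReductionPercent(734077388160, 524341215569):0:2, ' % fewer');
end.
```

The instruction count drops far more than the cycle count, which matches the lower insn-per-cycle figure reported for the patched run.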


On 23.10.2017 00:25, Markus Beth wrote:

I used 2 different benchmarks. One for (very) short buffers [1] and one
for rather large buffers [2].

[1]:
var
   key, key2: string;
   res: LongWord;
   i: SizeInt;

begin
   key  := 'A';
   key2 := 'A';
   for i:= 0 to 10 do begin
     res := CompareByte(key[1], key2[1], Length(key));
   end;
end.

[2]:
var
   key, key2: string;
   res: LongWord;
   i: SizeInt;

begin
   SetLength(key,10240 * 1024);
   SetLength(key2,10240 * 1024);
   for i:= 0 to 1 do begin
     res := CompareByte_RTL(key[1], key2[1], Length(key));
   end;
end.


The measurement takes place on an Intel Core i5 CPU M520@2.40GHz, which
has the Westmere microarchitecture. The programs are run on an otherwise
idle Linux (openSUSE Tumbleweed) system via

perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte

after the following setup:
  cpupower frequency-set -g performance
  echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)


The output for [1] using the current RTL CompareByte is:
  11.336.449.124   cycles  ( +-  0,05% )
  28.077.280.776   instructions   #   2,48  insn per cycle ( +-  0,00% )
   4,736782553 seconds time elapsed    ( +-  0,05% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
  10.293.397.316   cycles  ( +-  0,01% )
  26.070.305.490   instructions   #   2,53  insn per cycle ( +-  0,00% )
   4,301081734 seconds time elapsed    ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
325.526.707.243   cycles  ( +-  0,31% )
736.237.912.850   instructions   #   2,26  insn per cycle ( +-  0,00% )
136,013215979 seconds time elapsed    ( +-  0,31% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410   cycles  ( +-  0,95% )
525.832.575.056   instructions   #   2,34  insn per cycle ( +-  0,00% )
  93,851685247 seconds time elapsed    ( +-  0,95% )


I hope to come up with the corresponding numbers for an Ivy Bridge
CPU tomorrow.


On 22.10.2017 20:55, Florian Klämpfl wrote:

Am 21.10.2017 um 01:24 schrieb Markus Beth:

Find attached the already announced version of CompareByte.



What benchmark did you use? In my tests it is slightly slower than
the one in FPC 3.0.x.


I used the following test program:

var
   buf1,buf2 : array[0..127] of byte;
   pos,len,i,j : longint;

begin
   for i:=1 to 100 do
     begin
       len:=random(100);
       for j:=0 to len-1 do
         begin
           buf1[j]:=random(256);
           buf2[j]:=random(256);
         end;

       for j:=0 to random(10) do
         buf2[j]:=buf1[j];

       for j:=1 to 100 do
         CompareByte(buf1,buf2,len);
     end;
end.




On 16.10.2017 23:08, Markus Beth wrote:

On 16.10.2017 22:41, Florian Klämpfl wrote:
P.S.: I am currently working on another version of CompareByte that
might have a slightly higher latency for very small len but a higher
throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale
CPU (Westmere microarchitecture)). But this would need some more
testing and benchmarking.
I can come up with it here again if this would be of any interest.


Small lengths in terms of matching string or overall lengths?


It is small length in terms of matching string, as there is some
setup work before the loop.
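
To make "length in terms of matching string" concrete, here is a plain Pascal reference for the comparison semantics. It is a sketch only, not the RTL routine and not the patched assembler version; the name RefCompareByte is made up for illustration. The loop stops at the first mismatch, so the time spent in it depends on the length of the matching prefix rather than on len, and any fixed setup cost weighs most when the mismatch comes early.

```pascal
program refcompare;
{$mode objfpc}

{ Hypothetical byte-wise reference, NOT the RTL or patched implementation. }
function RefCompareByte(const buf1; const buf2; len: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  p1 := @buf1;
  p2 := @buf2;
  for i := 0 to len - 1 do
    if p1[i] <> p2[i] then
      { the sign tells which buffer holds the smaller byte }
      Exit(SizeInt(p1[i]) - SizeInt(p2[i]));
  Result := 0;  { all len bytes match }
end;

var
  a: array[0..3] of Byte = (1, 2, 3, 4);
  b: array[0..3] of Byte = (1, 2, 9, 4);
begin
  WriteLn(RefCompareByte(a, b, 2));  { prints 0: the first two bytes match }
  WriteLn(RefCompareByte(a, b, 4));  { prints -6: mismatch at index 2 (3 vs. 9) }
end.
```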



BTW: I would really like to see a PCMPSTR based implementation :)

PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is
part of SSE4.2. How would you deal with Intel Core microarchitecture
CPUs that don't have it?


Re: [fpc-devel] rdtscp

2017-10-23 Thread Nikolay Nikolov



On 10/23/2017 02:21 AM, Wolf wrote:



On 23/10/17 02:53, Nikolay Nikolov wrote:


FPC trunk supports rdtscp. And if you're using an FPC version that
doesn't support an instruction, you can always hardcode it with 'db'
(make sure you add a comment with the real instruction to keep your
code readable), e.g.:


db 0fh, 01h, 0f9h  { rdtscp - not supported by FPC 3.0's inline assembler }


Nikolay

db 0fh, 01h, 0f9h is a Delphi instruction, as far as I can make it 
out. It does not work with FPC 3.00 - at least with the settings 
Lazarus provides by default running under Linux.

What does work is
  .byte 0x0F, 0x01, 0xF9    // read the Time-Stamp Counter rdtscp (as op-code format,
                            // requires setting the compiler switch -aas
                            // in Lazarus: Options/Custom Options/All Options )

Yes, that's AT&T asm syntax. FPC supports both AT&T and Intel syntax
modes with the {$asmmode} directive:


https://www.freepascal.org/docs-html/prog/progsu3.html

AT&T is the default for historical reasons, but Intel also works fine.

Nikolay
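
Putting the two points together, a hardcoded rdtscp in Intel syntax might look like the sketch below. The function name ReadTSCP is made up for illustration, and the code assumes an x86_64 target and a CPU that actually implements rdtscp; it has not been checked against every FPC version.

```pascal
program tscp;
{$mode objfpc}
{$asmmode intel}

{ Hypothetical wrapper; x86_64 only, assumes the CPU supports rdtscp. }
function ReadTSCP: QWord; assembler; nostackframe;
asm
  db 0fh, 01h, 0f9h   { rdtscp: EDX:EAX = time-stamp counter, ECX = IA32_TSC_AUX }
  shl rdx, 32
  or  rax, rdx        { combine both halves into the 64-bit result in RAX }
end;

var
  t0, t1: QWord;
begin
  t0 := ReadTSCP;
  t1 := ReadTSCP;
  WriteLn('delta: ', t1 - t0, ' ticks');
end.
```

With {$asmmode att} the same three bytes would be written as a .byte line instead, as shown in the quoted message.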
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] x86_64.inc CompareByte

2017-10-23 Thread Martok
Using the code given below as "inner", I measure this:

Current Trunk:
O0 compare-byte-1 : 196065.112 +/- 896.754 cycles/inner [0.5 %CV 1.6 %R]
O1 compare-byte-1 : 196510.158 +/- 577.976 cycles/inner [0.3 %CV 1.1 %R]
O3 compare-byte-1 : 187540.922 +/- 706.167 cycles/inner [0.4 %CV 1.5 %R]
Patch from 2017-10-21:
O0 compare-byte-2 : 175831.632 +/- 965.972 cycles/inner [0.5 %CV 2.1 %R]
O1 compare-byte-2 : 176039.560 +/- 527.141 cycles/inner [0.3 %CV 1.0 %R]
O3 compare-byte-2 : 158527.167 +/- 661.690 cycles/inner [0.4 %CV 1.5 %R]
(%CV: coefficient of variation * 100%. %R: span as % of mean)
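
For readers unfamiliar with the two spread figures, the sketch below shows one way to compute them. It assumes the sample standard deviation (divisor n-1) and takes "span" to mean max minus min; the thread does not say which variants were actually used. The procedure name Spread and the four sample values are made up for illustration and are not measurements from this thread.

```pascal
program cvspan;
{$mode objfpc}

{ Hypothetical helper: returns %CV (sample standard deviation / mean * 100)
  and %R ((max - min) / mean * 100). Needs at least two samples. }
procedure Spread(const samples: array of Double; out cv, span: Double);
var
  i: SizeInt;
  sum, avg, sq, minV, maxV: Double;
begin
  sum := 0.0;
  minV := samples[0];
  maxV := samples[0];
  for i := 0 to High(samples) do
  begin
    sum := sum + samples[i];
    if samples[i] < minV then minV := samples[i];
    if samples[i] > maxV then maxV := samples[i];
  end;
  avg := sum / Length(samples);

  sq := 0.0;
  for i := 0 to High(samples) do
    sq := sq + Sqr(samples[i] - avg);

  cv := 100.0 * Sqrt(sq / (Length(samples) - 1)) / avg;
  span := 100.0 * (maxV - minV) / avg;
end;

const
  { made-up cycle counts, for illustration only }
  runs: array[0..3] of Double = (98.0, 100.0, 102.0, 100.0);
var
  cv, span: Double;
begin
  Spread(runs, cv, span);
  WriteLn('%CV = ', cv:0:2);    { prints %CV = 1.63 }
  WriteLn('%R  = ', span:0:2);  { prints %R  = 4.00 }
end.
```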

CPU:
 Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz Family 6 Model 60 Stepping 3 (Haswell)
 true single core clock (measured) 2.83 GHz


So the new version is a bit faster, but not by a large margin (10-15%). It is
statistically significant though.
While I'm at it, i386 could use some love:
O1 compare-byte-1 :  755247.183 +/- 8125.671 cycles/inner [1.1 %CV 4.5 %R]
That's 3.8 times slower than x64 for exactly the same code.

Code:
len:=random(100);
for j:=0 to len-1 do
  begin
    buf1[j]:=random(256);
    buf2[j]:=random(256);
  end;

for j:=0 to random(10) do
  buf2[j]:=buf1[j];

for j:=1 to 1 do
  CompareBytePatch(buf1,buf2,len);  // or System.CompareByte


-- 
Regards,
Martok

Ceterum censeo b32079 esse sanandam.
