I used two different benchmarks: one for (very) short buffers [1] and
one for rather large buffers [2].

[1]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;

begin
  key  := 'A';
  key2 := 'A';
  for i:= 0 to 1000000000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.

[2]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;

begin
  SetLength(key,10240 * 1024);
  SetLength(key2,10240 * 1024);
  for i:= 0 to 10000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.


The measurements were taken on an Intel Core i5 M520 CPU @ 2.40GHz,
which has the Westmere microarchitecture. The programs were run on an
otherwise idle Linux (openSUSE Tumbleweed) system via

perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte

after the following setup:
 cpupower frequency-set -g performance
 echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
 echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)


The output for [1] using the current RTL CompareByte is:
 11.336.449.124   cycles                      ( +-  0,05% )
 28.077.280.776   instructions   #   2,48  insn per cycle ( +-  0,00% )
  4,736782553 seconds time elapsed            ( +-  0,05% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
 10.293.397.316   cycles                      ( +-  0,01% )
 26.070.305.490   instructions   #   2,53  insn per cycle ( +-  0,00% )
  4,301081734 seconds time elapsed            ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
325.526.707.243   cycles                      ( +-  0,31% )
736.237.912.850   instructions   #   2,26  insn per cycle ( +-  0,00% )
136,013215979 seconds time elapsed            ( +-  0,31% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410   cycles                      ( +-  0,95% )
525.832.575.056   instructions   #   2,34  insn per cycle ( +-  0,00% )
 93,851685247 seconds time elapsed            ( +-  0,95% )
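
As a rough cross-check, the cycle counts above can be normalised per
call for [1] and per compared byte for [2]. The sketch below assumes
that the benchmark loops dominate the run time (program start-up and
SetLength are ignored) and that "for i := 0 to N" executes N + 1 times.
The program and all names in it are illustrative only.

```pascal
program cyclesper;

const
  { Cycle counts copied from the perf stat output above. }
  Cycles1Rtl   = 11336449124.0;
  Cycles1Patch = 10293397316.0;
  Cycles2Rtl   = 325526707243.0;
  Cycles2Patch = 224621009410.0;

  { Benchmark [1]: the loop runs from 0 to 1000000000 inclusive. }
  Calls1 = 1000000001.0;
  { Benchmark [2]: 10001 calls, each over 10240 * 1024 bytes. }
  Bytes2 = 10001.0 * 10240.0 * 1024.0;

{ Prints cycles per work item for both versions and the relative saving. }
procedure Report(const Title, ItemName: string; Rtl, Patch, Items: Double);
begin
  WriteLn(Title, ': ',
          Rtl / Items:0:3, ' vs. ', Patch / Items:0:3,
          ' cycles per ', ItemName, ', ',
          (1.0 - Patch / Rtl) * 100.0:0:1, '% fewer cycles');
end;

begin
  Report('[1]', 'call', Cycles1Rtl, Cycles1Patch, Calls1);
  Report('[2]', 'byte', Cycles2Rtl, Cycles2Patch, Bytes2);
end.
```

Under these assumptions, [2] works out to roughly 3.1 cycles per byte
for the current RTL version and roughly 2.1 for the patch, while [1]
comes to roughly 11.3 versus 10.3 cycles per call, loop overhead
included.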


I hope to come up with the corresponding numbers for an Ivy Bridge
CPU tomorrow.


On 22.10.2017 20:55, Florian Klämpfl wrote:
On 21.10.2017 at 01:24, Markus Beth wrote:
Find attached the already announced version of CompareByte.


What benchmark did you use? In my tests it is slightly slower than the
one in FPC 3.0.x.

I used the following test program:

var
   buf1,buf2 : array[0..127] of byte;
   pos,len,i,j : longint;

begin
   for i:=1 to 100 do
     begin
       len:=random(100);
       for j:=0 to len-1 do
         begin
           buf1[j]:=random(256);
           buf2[j]:=random(256);
         end;

       for j:=0 to random(10) do
         buf2[j]:=buf1[j];

       for j:=1 to 1000000 do
         CompareByte(buf1,buf2,len);
     end;
end.



On 16.10.2017 23:08, Markus Beth wrote:
On 16.10.2017 22:41, Florian Klämpfl wrote:
P.S.: I am currently working on another version of CompareByte that
might have a slightly higher latency for very small len but a higher
throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale
CPU (Westmere microarchitecture)). But this would need some more
testing and benchmarking.
I can come up with it here again if this would be of any interest.

Small lengths in terms of matching string or overall lengths?

It is small length in terms of the matching string, as there is some
setup work before the loop.
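
To illustrate why the length of the matching part matters rather than
the overall length: a byte comparison can stop at the first mismatch,
so the loop only runs over the matching prefix plus one byte. The
sketch below uses a hypothetical reference routine, NaiveCompareByte
(neither the RTL code nor the patch), that counts how many bytes it
examines. It assumes only the documented contract of the RTL
CompareByte, namely that the sign of the result reflects the first
differing byte.

```pascal
program prefixdemo;

{ Hypothetical reference loop for illustration; not the RTL or patched code. }
function NaiveCompareByte(const buf1; const buf2; len: SizeInt;
                          var examined: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  p1 := @buf1;
  p2 := @buf2;
  examined := 0;
  NaiveCompareByte := 0;
  for i := 0 to len - 1 do
  begin
    Inc(examined);
    if p1[i] <> p2[i] then
    begin
      { Stop at the first mismatch; later bytes are never read. }
      NaiveCompareByte := SizeInt(p1[i]) - SizeInt(p2[i]);
      Exit;
    end;
  end;
end;

var
  a, b: array[0..63] of Byte;
  examined, r: SizeInt;

begin
  FillChar(a, SizeOf(a), 0);
  FillChar(b, SizeOf(b), 0);
  b[3] := 1;  { the buffers match for three bytes and differ at index 3 }

  r := NaiveCompareByte(a, b, SizeOf(a), examined);
  WriteLn('overall length: ', SizeOf(a), ', bytes examined: ', examined);

  { The RTL routine should agree with the reference loop on the sign. }
  WriteLn('sign agrees with CompareByte: ',
          (CompareByte(a, b, SizeOf(a)) < 0) = (r < 0));
end.
```

With a short matching prefix, any fixed setup cost before the loop
makes up a larger share of the total time per call.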

BTW: I would really like to see a PCMPSTR based implementation :)
PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part
of SSE4.2. How would you deal with Intel Core microarchitecture CPUs
that don't have it?
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
