Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
  9.001.275.281   cycles:u                    ( +-  0,00% )
 28.000.560.462   instructions:u #   3,11  insn per cycle ( +-  0,00% )
  2,654735815 seconds time elapsed            ( +-  0,00% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
  9.002.038.628   cycles:u                    ( +-  0,01% )
 26.000.559.441   instructions:u #   2,89  insn per cycle ( +-  0,00% )
  2,655002891 seconds time elapsed            ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
227.941.173.371   cycles:u                    ( +-  0,00% )
734.077.388.160   instructions:u #   3,22  insn per cycle ( +-  0,00% )
 67,215188648 seconds time elapsed            ( +-  0,00% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
210.694.292.040   cycles:u                    ( +-  0,00% )
524.341.215.569   instructions:u #   2,49  insn per cycle ( +-  0,00% )
 62,129294243 seconds time elapsed            ( +-  0,00% )
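
To put the figures for [2] in relation to each other, the relative change
from the current RTL version to the patched version can be computed as
(patched - baseline) / baseline. The following is only a small sketch of
that arithmetic; the program name and the helper PercentChange are made up
for illustration, and the constants are copied from the perf output above.

```pascal
program perf_delta;

{$mode objfpc}

{ Percentage change from the baseline value to the patched value.
  A negative result means the patched version needed less. }
function PercentChange(baseline, patched: Double): Double;
begin
  Result := (patched - baseline) / baseline * 100.0;
end;

begin
  { Benchmark [2] on Ivy Bridge: current RTL first, patched second. }
  WriteLn('cycles:       ',
    PercentChange(227941173371.0, 210694292040.0):0:1, ' %');
  WriteLn('instructions: ',
    PercentChange(734077388160.0, 524341215569.0):0:1, ' %');
  WriteLn('elapsed:      ',
    PercentChange(67.215188648, 62.129294243):0:1, ' %');
end.
```

The same helper can be fed with the figures for [1] or with the Westmere
numbers to compare both machines.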


With Florian's benchmark I also observe that the patched version is
slightly slower than the original, but I have no idea why.


On 23.10.2017 00:25, Markus Beth wrote:
I used 2 different benchmarks. One for (very) short buffers [1] and one
for rather large buffers [2].

[1]:
var
   key, key2: string;
   res: LongWord;
   i: SizeInt;

begin
   key  := 'A';
   key2 := 'A';
   for i:= 0 to 1000000000 do begin
     res := CompareByte(key[1], key2[1], Length(key));
   end;
end.

[2]:
var
   key, key2: string;
   res: LongWord;
   i: SizeInt;

begin
   SetLength(key,10240 * 1024);
   SetLength(key2,10240 * 1024);
   for i:= 0 to 10000 do begin
     res := CompareByte_RTL(key[1], key2[1], Length(key));
   end;
end.
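
One caveat with [2] as written: as far as I know, SetLength does not
guarantee any particular contents for the newly allocated characters, so
whether the comparison runs over the whole buffer depends on what the
memory manager returns. The following variant is only a sketch of how the
buffers could be made deterministic; it assumes that both buffers should be
identical so that every call scans the full length, and it calls the RTL's
CompareByte directly. The program name and the fill value are arbitrary.

```pascal
program comparebyte_large;

{$mode objfpc}{$H+}

var
  key, key2: string;
  res: SizeInt;
  i: SizeInt;

begin
  SetLength(key, 10240 * 1024);
  SetLength(key2, 10240 * 1024);

  { Give both buffers the same, known contents so that CompareByte
    has to walk the full length on every call. }
  FillChar(key[1], Length(key), Ord('A'));
  FillChar(key2[1], Length(key2), Ord('A'));

  res := 0;
  for i := 0 to 10000 do
    res := res + CompareByte(key[1], key2[1], Length(key));

  { Print the accumulated result so the calls have a visible effect. }
  WriteLn(res);
end.
```

Changing a single byte near the end of key2 before the loop would give a
"mismatch at the end" variant of the same measurement.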


The measurements were taken on an Intel Core i5 M520 CPU @ 2.40GHz, which
has the Westmere microarchitecture. The programs are run on an otherwise
idle Linux (openSUSE Tumbleweed) system via

perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte

after the following setup:
  cpupower frequency-set -g performance
  echo 0 > /sys/devices/system/cpu/cpufreq/boost (Westmere)
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (Ivy Bridge)


The output for [1] using the current RTL CompareByte is:
  11.336.449.124   cycles                      ( +-  0,05% )
  28.077.280.776   instructions   #   2,48  insn per cycle ( +-  0,00% )
   4,736782553 seconds time elapsed            ( +-  0,05% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
  10.293.397.316   cycles                      ( +-  0,01% )
  26.070.305.490   instructions   #   2,53  insn per cycle ( +-  0,00% )
   4,301081734 seconds time elapsed            ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
325.526.707.243   cycles                      ( +-  0,31% )
736.237.912.850   instructions   #   2,26  insn per cycle ( +-  0,00% )
136,013215979 seconds time elapsed            ( +-  0,31% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410   cycles                      ( +-  0,95% )
525.832.575.056   instructions   #   2,34  insn per cycle ( +-  0,00% )
  93,851685247 seconds time elapsed            ( +-  0,95% )


I hope to come up with the corresponding numbers for an Ivy Bridge
CPU tomorrow.


On 22.10.2017 20:55, Florian Klämpfl wrote:
On 21.10.2017 at 01:24, Markus Beth wrote:
Find attached the already announced version of CompareByte.


What benchmark did you use? In my tests it is slightly slower than the one in FPC 3.0.x.

I used the following test program:

var
   buf1,buf2 : array[0..127] of byte;
   pos,len,i,j : longint;

begin
   for i:=1 to 100 do
     begin
       len:=random(100);
       for j:=0 to len-1 do
         begin
           buf1[j]:=random(256);
           buf2[j]:=random(256);
         end;

       for j:=0 to random(10) do
         buf2[j]:=buf1[j];

       for j:=1 to 1000000 do
         CompareByte(buf1,buf2,len);
     end;
end.



On 16.10.2017 23:08, Markus Beth wrote:
On 16.10.2017 22:41, Florian Klämpfl wrote:
P.S.: I am currently working on another version of CompareByte that might
have a slightly higher latency for very small len but a higher throughput
(2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere
microarchitecture)). But this would need some more testing and
benchmarking. I can bring it up here again if it is of any interest.

Small lengths in terms of matching string or overall lengths?

It is a small length in terms of the matching string, as there is some setup work before the loop.

BTW: I would really like to see a PCMPSTR based implementation :)
PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of SSE4.2. How would you
deal with Intel Core microarchitecture CPUs that don't have it?
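
One common way to handle CPUs without a given instruction set extension is
to select the implementation once at startup through a function pointer.
The following is only a sketch of that dispatch pattern, not a proposal for
how the RTL does or should do it. All names (HasSSE42, CompareByteBaseline,
CompareByteSSE42, CompareByteImpl) are hypothetical; HasSSE42 is a stub that
always returns False where a real version would query CPUID, and the
"SSE4.2" routine is a placeholder that simply forwards to the RTL's
CompareByte.

```pascal
program dispatch_sketch;

{$mode objfpc}

type
  TCompareByteFunc = function(const buf1, buf2; len: SizeInt): SizeInt;

{ Hypothetical feature probe. A real version would query CPUID;
  this stub always reports "not available". }
function HasSSE42: Boolean;
begin
  Result := False;
end;

{ Baseline path that works on every x86_64 CPU. }
function CompareByteBaseline(const buf1, buf2; len: SizeInt): SizeInt;
begin
  Result := CompareByte(buf1, buf2, len);
end;

{ Placeholder for an SSE4.2 based routine. It only forwards to the
  RTL here, so that the sketch stays runnable. }
function CompareByteSSE42(const buf1, buf2; len: SizeInt): SizeInt;
begin
  Result := CompareByte(buf1, buf2, len);
end;

var
  CompareByteImpl: TCompareByteFunc;
  a, b: array[0..15] of Byte;

begin
  { Select the implementation once, based on the feature probe. }
  if HasSSE42 then
    CompareByteImpl := @CompareByteSSE42
  else
    CompareByteImpl := @CompareByteBaseline;

  FillChar(a, SizeOf(a), 0);
  FillChar(b, SizeOf(b), 0);
  b[7] := 1;

  { a is smaller than b at index 7, so the result should be negative. }
  WriteLn(CompareByteImpl(a, b, SizeOf(a)) < 0);
end.
```

The indirect call adds a small cost per invocation, which could matter for
the very short buffers of benchmark [1].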
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
