I used two different benchmarks: one for (very) short buffers [1] and one
for rather large buffers [2].
[1]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  key := 'A';
  key2 := 'A';
  for i := 0 to 1000000000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.
[2]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  SetLength(key, 10240 * 1024);
  SetLength(key2, 10240 * 1024);
  for i := 0 to 10000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.
The measurements were taken on an Intel Core i5 CPU M520 @ 2.40GHz, which
has the Westmere microarchitecture. The programs are run on an otherwise
idle Linux (openSUSE Tumbleweed) system via
perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte
after the following setup:
cpupower frequency-set -g performance
echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)
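The two turbo switches above are alternatives; which one is present depends on the cpufreq driver in use. As a minimal sketch, the choice can be wrapped in a small helper that writes to whichever file exists. The function name `disable_turbo` and its optional directory argument are hypothetical and only serve this illustration; the sysfs paths and the values written are the ones listed above.

```shell
#!/bin/sh
# disable_turbo [CPU_SYSFS_DIR]
# Hypothetical helper: disables turbo through whichever of the two
# sysfs files named above is writable. CPU_SYSFS_DIR defaults to the
# real sysfs location and can be overridden for a dry run.
disable_turbo() {
    cpu_dir="${1:-/sys/devices/system/cpu}"
    if [ -w "$cpu_dir/intel_pstate/no_turbo" ]; then
        # intel_pstate: writing 1 turns turbo off
        echo 1 > "$cpu_dir/intel_pstate/no_turbo"
        echo "turbo disabled via intel_pstate/no_turbo"
    elif [ -w "$cpu_dir/cpufreq/boost" ]; then
        # generic cpufreq boost switch: writing 0 turns boost off
        echo 0 > "$cpu_dir/cpufreq/boost"
        echo "turbo disabled via cpufreq/boost"
    else
        echo "no writable turbo switch found under $cpu_dir" >&2
        return 1
    fi
}
```

Called without arguments as root, this would replace the two `echo` lines; the `cpupower` call still has to be run separately.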
The output for [1] using the current RTL CompareByte is:
11.336.449.124 cycles ( +- 0,05% )
28.077.280.776 instructions # 2,48 insn per cycle ( +- 0,00% )
4,736782553 seconds time elapsed ( +- 0,05% )
The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
10.293.397.316 cycles ( +- 0,01% )
26.070.305.490 instructions # 2,53 insn per cycle ( +- 0,00% )
4,301081734 seconds time elapsed ( +- 0,01% )
The output for [2] using the current RTL CompareByte is:
325.526.707.243 cycles ( +- 0,31% )
736.237.912.850 instructions # 2,26 insn per cycle ( +- 0,00% )
136,013215979 seconds time elapsed ( +- 0,31% )
The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410 cycles ( +- 0,95% )
525.832.575.056 instructions # 2,34 insn per cycle ( +- 0,00% )
93,851685247 seconds time elapsed ( +- 0,95% )
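To make the raw counters easier to compare, the cycle counts can be normalised per call for [1] and per compared byte for [2]. The following is only a sketch: the function name `cycles_report` is made up for this illustration, and the per-byte figure assumes that every call in [2] scans the whole 10240 * 1024 byte buffer and that the loop overhead is negligible.

```shell
#!/bin/sh
# cycles_report BASE_CYCLES PATCHED_CYCLES UNITS
# UNITS is the number of calls (benchmark [1]) or the number of
# compared bytes (benchmark [2]). LC_ALL=C forces a "." decimal point.
cycles_report() {
    LC_ALL=C awk -v base="$1" -v patched="$2" -v units="$3" 'BEGIN {
        printf "%.2f -> %.2f cycles/unit, %.1f%% fewer cycles\n",
               base / units, patched / units,
               100 * (base - patched) / base
    }'
}

# [1]: "for i := 0 to 1000000000" makes 1000000001 calls on 1 byte
cycles_report 11336449124 10293397316 1000000001
# [2]: 10001 calls, each assumed to scan 10240 * 1024 bytes
cycles_report 325526707243 224621009410 $((10001 * 10240 * 1024))
```

Under these assumptions the patch uses about 9% fewer cycles on [1] and about 31% fewer on [2], where the cost drops from roughly 3.1 to roughly 2.1 cycles per byte.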
I can hopefully come up with the corresponding numbers for an Ivy Bridge
CPU tomorrow.
On 22.10.2017 20:55, Florian Klämpfl wrote:
On 21.10.2017 at 01:24, Markus Beth wrote:
Find attached the already announced version of CompareByte.
What benchmark did you use? In my tests it is slightly slower than the one in
FPC 3.0.x.
I used the following test program:
var
  buf1,buf2 : array[0..127] of byte;
  pos,len,i,j : longint;
begin
  for i:=1 to 100 do
    begin
      len:=random(100);
      for j:=0 to len-1 do
        begin
          buf1[j]:=random(256);
          buf2[j]:=random(256);
        end;
      for j:=0 to random(10) do
        buf2[j]:=buf1[j];
      for j:=1 to 1000000 do
        CompareByte(buf1,buf2,len);
    end;
end.
On 16.10.2017 23:08, Markus Beth wrote:
On 16.10.2017 22:41, Florian Klämpfl wrote:
P.S.: I am currently working on another version of CompareByte that might
have a slightly higher latency for very small len but a higher throughput
(2 cycles per iteration vs. 3 cycles on an Intel Arrandale CPU (Westmere
microarchitecture)). But this would need some more testing and
benchmarking.
I can come up with it here again if this would be of any interest.
Small lengths in terms of matching string or overall lengths?
It is a small length in terms of the matching string, as there is some setup
work before the loop.
BTW: I would really like to see a PCMPSTR based implementation :)
PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
SSE4.2. How would you deal with Intel Core microarchitecture CPUs that
don't have it?
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel