Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
9.001.275.281 cycles:u ( +- 0,00% )
28.000.560.462 instructions:u # 3,11 insn per cycle ( +- 0,00% )
2,654735815 seconds time elapsed ( +- 0,00% )
The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
9.002.038.628 cycles:u ( +- 0,01% )
26.000.559.441 instructions:u # 2,89 insn per cycle ( +- 0,00% )
2,655002891 seconds time elapsed ( +- 0,01% )
The output for [2] using the current RTL CompareByte is:
227.941.173.371 cycles:u ( +- 0,00% )
734.077.388.160 instructions:u # 3,22 insn per cycle ( +- 0,00% )
67,215188648 seconds time elapsed ( +- 0,00% )
The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
210.694.292.040 cycles:u ( +- 0,00% )
524.341.215.569 instructions:u # 2,49 insn per cycle ( +- 0,00% )
62,129294243 seconds time elapsed ( +- 0,00% )
With Florian's benchmark I also observe that the patched version is
slightly slower than the original. But I have no idea why this is so.
On 23.10.2017 00:25, Markus Beth wrote:
I used 2 different benchmarks. One for (very) short buffers [1] and one
for rather large buffers [2].
[1]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  key := 'A';
  key2 := 'A';
  for i := 0 to 1000000000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.
[2]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  SetLength(key, 10240 * 1024);
  SetLength(key2, 10240 * 1024);
  for i := 0 to 10000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.
The measurements were taken on an Intel Core i5 CPU M520@2.40GHz, which
has the Westmere microarchitecture. The programs are run on an otherwise
idle Linux (openSUSE Tumbleweed) system via
perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte
after the following setup:
cpupower frequency-set -g performance
echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)
The output for [1] using the current RTL CompareByte is:
11.336.449.124 cycles ( +- 0,05% )
28.077.280.776 instructions # 2,48 insn per cycle ( +- 0,00% )
4,736782553 seconds time elapsed ( +- 0,05% )
The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
10.293.397.316 cycles ( +- 0,01% )
26.070.305.490 instructions # 2,53 insn per cycle ( +- 0,00% )
4,301081734 seconds time elapsed ( +- 0,01% )
The output for [2] using the current RTL CompareByte is:
325.526.707.243 cycles ( +- 0,31% )
736.237.912.850 instructions # 2,26 insn per cycle ( +- 0,00% )
136,013215979 seconds time elapsed ( +- 0,31% )
The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410 cycles ( +- 0,95% )
525.832.575.056 instructions # 2,34 insn per cycle ( +- 0,00% )
93,851685247 seconds time elapsed ( +- 0,95% )
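As a rough cross-check of the figures for [2], the totals can be reduced to per-byte costs. The following is only a sketch. It assumes that every call scans the entire 10240 * 1024 byte buffer (the benchmark never fills the strings, so this holds only if both buffers happen to compare equal) and that startup and page-fault overhead is negligible. The program name and constant names are made up for illustration:

```pascal
program cyclesperbyte;
{$mode objfpc}

const
  // Parameters of benchmark [2].
  BufferBytes = 10240 * 1024;  // SetLength(key, 10240 * 1024)
  Calls       = 10001;         // "for i := 0 to 10000" runs 10001 times

  // Westmere totals copied from the perf stat output for [2].
  CyclesRtl     = 325526707243.0;
  CyclesPatched = 224621009410.0;
  InsnRtl       = 736237912850.0;
  InsnPatched   = 525832575056.0;

var
  TotalBytes: Double;
begin
  // Assumption: each call compares all BufferBytes bytes.
  TotalBytes := BufferBytes * 1.0 * Calls;

  WriteLn('RTL:     ', CyclesRtl / TotalBytes:0:2, ' cycles/byte, ',
          InsnRtl / TotalBytes:0:2, ' insn/byte');
  WriteLn('patched: ', CyclesPatched / TotalBytes:0:2, ' cycles/byte, ',
          InsnPatched / TotalBytes:0:2, ' insn/byte');
end.
```

Under those assumptions this works out to roughly 3.1 versus 2.1 cycles and about 7 versus 5 instructions per byte on Westmere.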
I hope to have the corresponding numbers for an Ivy Bridge CPU
tomorrow.
On 22.10.2017 20:55, Florian Klämpfl wrote:
Am 21.10.2017 um 01:24 schrieb Markus Beth:
Find attached the already announced version of CompareByte.
What benchmark did you use? In my tests it is slightly slower than
the one in FPC 3.0.x.
I used the following test program:
var
  buf1,buf2 : array[0..127] of byte;
  pos,len,i,j : longint;
begin
  for i:=1 to 100 do
    begin
      len:=random(100);
      for j:=0 to len-1 do
        begin
          buf1[j]:=random(256);
          buf2[j]:=random(256);
        end;
      for j:=0 to random(10) do
        buf2[j]:=buf1[j];
      for j:=1 to 1000000 do
        CompareByte(buf1,buf2,len);
    end;
end.
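The same random-buffer setup can also serve as a quick correctness check for a candidate CompareByte. The sketch below is an illustration, not part of the patch: CompareByteRef and SignOf are hypothetical helpers, and only the sign of the result is compared, on the assumption that implementations may differ in the magnitude they return.

```pascal
program comparebyte_check;
{$mode objfpc}

// Hypothetical reference: plain byte-by-byte comparison that returns the
// difference of the first mismatching pair, or 0 if all len bytes match.
function CompareByteRef(const buf1, buf2; len: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  p1 := @buf1;
  p2 := @buf2;
  for i := 0 to len - 1 do
    if p1[i] <> p2[i] then
      Exit(SizeInt(p1[i]) - SizeInt(p2[i]));
  Result := 0;
end;

// Reduce a result to -1, 0 or 1 so that only the ordering is compared.
function SignOf(v: SizeInt): Integer;
begin
  if v < 0 then
    Result := -1
  else if v > 0 then
    Result := 1
  else
    Result := 0;
end;

var
  buf1, buf2: array[0..127] of Byte;
  len, i, j, mismatches: LongInt;
begin
  RandSeed := 1;  // fixed seed so that runs are repeatable
  mismatches := 0;
  for i := 1 to 100000 do
  begin
    len := Random(100);
    for j := 0 to len - 1 do
    begin
      buf1[j] := Random(256);
      buf2[j] := Random(256);
    end;
    // Force a common prefix of 1 to 10 bytes, as in the benchmark.
    for j := 0 to Random(10) do
      buf2[j] := buf1[j];
    // CompareByte here is whichever implementation is linked in.
    if SignOf(CompareByte(buf1, buf2, len)) <>
       SignOf(CompareByteRef(buf1, buf2, len)) then
      Inc(mismatches);
  end;
  WriteLn('mismatches: ', mismatches);
end.
```

A non-zero count would point to a disagreement between the implementation under test and the byte-wise reference.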
On 16.10.2017 23:08, Markus Beth wrote:
On 16.10.2017 22:41, Florian Klämpfl wrote:
P.S.: I am currently working on another version of CompareByte that
might have a slightly higher latency for very small len but a higher
throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale
CPU (Westmere microarchitecture)). But this would need some more
testing and benchmarking.
I can come up with it here again if this would be of any interest.
Small lengths in terms of matching string or overall lengths?
It is small length in terms of matching string as there is some
setup work before the loop.
BTW: I would really like to see a PCMPSTR based implementation :)
PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part
of SSE4.2. How would you deal with Intel Core microarchitecture CPUs
that don't have it?
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel