On 31/10/17 11:47, Florian Klämpfl wrote:
On 30.10.2017 at 19:46, C Western wrote:
On 29/10/17 22:18, Florian Klämpfl wrote:
I have committed your latest patch with a few changes: the loop entry is
now aligned to 16 bytes, and I
used movb instead of movzbl and inc instead of add. For me
On 30.10.2017 at 19:46, C Western wrote:
> On 29/10/17 22:18, Florian Klämpfl wrote:
>>
>> I have committed your latest patch with a few changes: the loop entry is
>> now aligned to 16 bytes, and I
>> used movb instead of movzbl and inc instead of add. For me (Haswell CPU)
>> this works better. I
On 29/10/17 22:18, Florian Klämpfl wrote:
I have committed your latest patch with a few changes: the loop entry is
now aligned to 16 bytes, and I
used movb instead of movzbl and inc instead of add. For me (Haswell CPU) this
works better. I also think
these changes are better on average.
On 23.10.2017 at 22:58, Markus Beth wrote:
> Here are the numbers for an Ivy Bridge CPU:
> The output for [1] using the current RTL CompareByte is:
> 9.001.275.281 cycles:u ( +- 0,00% )
> 28.000.560.462 instructions:u # 3,11 insn per cycle ( +- 0,00% )
>
Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
9.001.275.281 cycles:u ( +- 0,00% )
28.000.560.462 instructions:u # 3,11 insn per cycle ( +- 0,00% )
2,654735815 seconds time elapsed ( +- 0,00% )
The
Using the code given below as "inner", I measure this:
Current Trunk:
O0 compare-byte-1 : 196065.112 +/- 896.754 cycles/inner [0.5 %CV 1.6 %R]
O1 compare-byte-1 : 196510.158 +/- 577.976 cycles/inner [0.3 %CV 1.1 %R]
O3 compare-byte-1 : 187540.922 +/- 706.167 cycles/inner [0.4 %CV 1.5 %R]
Patch
I used 2 different benchmarks. One for (very) short buffers [1] and one
for rather large buffers [2].
[1]:
var
key, key2: string;
res: LongWord;
i: SizeInt;
begin
key := 'A';
key2 := 'A';
for i:= 0 to 10 do begin
res := CompareByte(key[1], key2[1], Length(key));
end;
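For readers unfamiliar with the routine being benchmarked, here is a minimal sketch of what CompareByte is assumed to compute, written in plain Pascal. RefCompareByte is a hypothetical name used only for illustration; it is not the RTL code and not the patch discussed in this thread, and it makes no attempt to be fast.

```pascal
program CompareByteSketch;
{$mode objfpc}

{ Hypothetical reference version (an assumption, not the RTL routine):
  returns 0 if the first len bytes are equal, otherwise the difference
  between the first pair of bytes that differ. }
function RefCompareByte(const buf1; const buf2; len: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  p1 := @buf1;
  p2 := @buf2;
  for i := 0 to len - 1 do
    if p1[i] <> p2[i] then
      Exit(SizeInt(p1[i]) - SizeInt(p2[i]));
  Result := 0;
end;

var
  a: array[0..3] of Byte = (1, 2, 3, 4);
  b: array[0..3] of Byte = (1, 2, 9, 4);
begin
  WriteLn(RefCompareByte(a, b, 2)); { first two bytes match: prints 0 }
  WriteLn(RefCompareByte(a, b, 4)); { 3 versus 9 at index 2: prints -6 }
end.
```

Only the sign of a nonzero result should be relied on when comparing against the real routine; the exact magnitude shown here is a property of this sketch.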
On 21.10.2017 at 01:24, Markus Beth wrote:
> Find attached the already announced version of CompareByte.
>
What benchmark did you use? In my tests it is slightly slower than the one in
FPC 3.0.x.
I used the following test program:
var
buf1,buf2 : array[0..127] of byte;
pos,len,i,j :
In our previous episode, Markus Beth said:
> Find attached the already announced version of CompareByte.
>
> BTW: If you really like to see a PCMPSTR based implementation, have a
> look at Agner Fog's Subroutine library asmlib.zip
> (http://agner.org/optimize/).
And then you see GPL licensed,
Find attached the already announced version of CompareByte.
BTW: If you really like to see a PCMPSTR based implementation, have a
look at Agner Fog's Subroutine library asmlib.zip
(http://agner.org/optimize/).
On 16.10.2017 23:08, Markus Beth wrote:
On 16.10.2017 22:41, Florian Klämpfl wrote:
On 16.10.2017 at 23:08, Markus Beth wrote:
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>> P.S.: I am currently working on another version of CompareByte that might
>>> have a slightly higher
>>> latency for very small len but a higher throughput (2 cycles per iteration
>>> vs. 3 cycles on an
On 16.10.2017 at 23:04, "Markus Beth" wrote:
>
> On 16.10.2017 22:41, Florian Klämpfl wrote:
>> BTW: I would really like to see a PCMPSTR based implementation :)
>
> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of
SSE4.2. How would you deal with Intel
On 16.10.2017 22:41, Florian Klämpfl wrote:
P.S.: I am currently working on another version of CompareByte that might have
a slightly higher
latency for very small len but a higher throughput (2 cycles per iteration vs.
3 cycles on an Intel
Arrandale CPU (Westmere microarchitecture)). But this
On 16.10.2017 at 22:33, Markus Beth wrote:
> Sorry for the late reply. I had a weekend off(line).
>
> The instructions were chosen on purpose and Sergey already cited the part of
> the Intel documentation
> that explains why this is correct. You can find a similar part in AMD "AMD64
>
Sorry for the late reply. I had a weekend off(line).
The instructions were chosen on purpose and Sergey already cited the
part of the Intel documentation that explains why this is correct. You
can find a similar part in AMD "AMD64 Architecture Programmer’s Manual
Volume 1: Application
On 15 Oct 2017 Florian Klämpfl wrote:
> I had a look and tested it and it worked, I didn't notice the problem below
> either.
Sorry for the false alarm. I cannot provide any example where my concern
actually applies. The reason for it is described on page Vol. 1 3-13 of Intel 64
and IA-32 Architectures
On 12.10.2017 at 20:37, sserg...@gmail.com wrote:
> Hi.
>
> Sorry for the late message, but nobody has yet mentioned a possible problem
> with
> the suggested patch.
Well, it's always very hard to review such highly optimized code. I had a look
and tested it and it
worked, I didn't notice the
Hi.
Sorry for the late message, but nobody has yet mentioned a possible problem with
the suggested patch. So I would like to point out that the proposed code may,
IMHO, be incorrect under some circumstances.
The instruction on line 657
subq %rcx, %rax
decreases the value in %rax by %rcx, but the previous code
Oops, I just realised now that that byte sequence is your workaround, not
actually already present in the
code! My bad, but yes, that is the correct byte sequence (although it's worth
putting a comment in to
actually state what they are).
If the bug is known to still be present in GAS, then
Hi Markus,
Nice to see there's more than one person working to improve compiled code on
x86-64!
I can answer one question... the byte sequence 0F B6 01 is the direct machine
code representation of movzbl
(%rcx),%eax - this might be due to a bug with the assembler or movzbl not being
I made some changes to CompareByte in rtl/x86_64/x86_64.inc to reduce
the code size and make it run faster (see attached patch). I was
successful with the code size reduction (47 bytes vs. 62 bytes) and also
with the speed (according to a micro benchmark [1] run on an Ivy Bridge
desktop).
To