On 23.10.2017 at 22:58, Markus Beth wrote:
> Here are the numbers for an Ivy Bridge CPU:
> The output for [1] using the current RTL CompareByte is:
>       9.001.275.281 cycles:u ( +- 0,00% )
>      28.000.560.462 instructions:u # 3,11 insn per cycle ( +- 0,00% )
>         2,654735815 seconds time elapsed ( +- 0,00% )
>
> The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
>       9.002.038.628 cycles:u ( +- 0,01% )
>      26.000.559.441 instructions:u # 2,89 insn per cycle ( +- 0,00% )
>         2,655002891 seconds time elapsed ( +- 0,01% )
>
> The output for [2] using the current RTL CompareByte is:
>     227.941.173.371 cycles:u ( +- 0,00% )
>     734.077.388.160 instructions:u # 3,22 insn per cycle ( +- 0,00% )
>        67,215188648 seconds time elapsed ( +- 0,00% )
>
> The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
>     210.694.292.040 cycles:u ( +- 0,00% )
>     524.341.215.569 instructions:u # 2,49 insn per cycle ( +- 0,00% )
>        62,129294243 seconds time elapsed ( +- 0,00% )
>
>
> With Florian's benchmark I also observe that the patched version is
> slightly slower than the original. But I have no idea why this is so.
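
As a side note on the quoted numbers: the relative changes can be worked out with a few lines of Python. This is only a rough sketch; the helper name `relative_change` and the `RESULTS` table are made up for illustration, and the values are copied from the perf output above with the thousands separators removed. For [2] the patched version retires roughly 28.6% fewer instructions but needs only about 7.6% fewer cycles, which is why the insn-per-cycle figure drops even though the run is faster.

```python
# Hypothetical helper (not part of perf or the RTL): percentage change
# from the current RTL CompareByte to the patched one.
def relative_change(old, new):
    """Return the change from old to new in percent (negative = fewer)."""
    return (new - old) / old * 100.0


# Counter values copied from the quoted perf output:
# (current RTL, x86_64_comparebyte3.patch)
RESULTS = {
    "[1]": {
        "cycles": (9_001_275_281, 9_002_038_628),
        "instructions": (28_000_560_462, 26_000_559_441),
    },
    "[2]": {
        "cycles": (227_941_173_371, 210_694_292_040),
        "instructions": (734_077_388_160, 524_341_215_569),
    },
}

if __name__ == "__main__":
    for bench, counters in RESULTS.items():
        for name, (old, new) in counters.items():
            print(f"{bench} {name}: {relative_change(old, new):+.2f}%")
```

For [1] the cycle count is essentially unchanged, so the saved instructions do not translate into a shorter run time there.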

I have committed your latest patch with a few changes: the loop entry is now aligned to 16 bytes, and I used movb instead of movzbl and inc instead of add. For me (Haswell CPU) this works better, and I think these changes are also better on average.

_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel