On 20 Nov 2009, at 19:03, Sergei Gorelkin wrote:

> I did, but using Linux+valgrind rather than cygwin+gprof. IMHO valgrind (in 
> its callgrind flavour) outputs more useful profile information.
> Some time ago I was able to optimize away about 20% of executed CPU 
> instructions in the compiler, which however didn't decrease its execution 
> time by any noticeable amount. So, going for another 20% will be a much more 
> complicated task, beware.

Most of the time spent in the compiler goes to waiting for memory. Below are 
some numbers I collected with Shark (a sampling-based profiler for Mac OS X) 
when compiling the compiler with itself, with DWARF debug info enabled (DWARF 
adds a lot of individual data elements to the assembler output), on Mac OS X/i386:

Before r14137:
6.1%    ppn19sl AGGAS_TGNUASSEMBLER_$__WRITETREE$TASMLIST
5.7%    ppn19sl SYSTEM_SYSGETMEM_FIXED$LONGWORD$$POINTER
3.5%    libSystem.B.dylib       __bzero
2.7%    ppn19sl CCLASSES_TFPHASHLIST_$__INTERNALFIND$LONGWORD$SHORTSTRING$LONGINT$$LONGINT
2.6%    ppn19sl SYSTEM_TOBJECT_$__CLEANUPINSTANCE
2.2%    libSystem.B.dylib       __memcpy
2.1%    ppn19sl SYSTEM_SYSFREEMEM_FIXED$PFREELISTS$PMEMCHUNK_FIXED$$LONGWORD
1.7%    ppn19sl fpc_shortstr_to_shortstr
1.7%    ppn19sl SYSTEM_SYSFREEMEM$POINTER$$LONGWORD
1.5%    ppn19sl CCLASSES_TLINKEDLIST_$__CLEAR

After r14137:
6.4%    ppn19sl SYSTEM_SYSGETMEM_FIXED$LONGWORD$$POINTER
4.9%    ppn19sl AGGAS_TGNUASSEMBLER_$__WRITETREE$TASMLIST
3.3%    libSystem.B.dylib       __bzero
2.7%    ppn19sl CCLASSES_TFPHASHLIST_$__INTERNALFIND$LONGWORD$SHORTSTRING$LONGINT$$LONGINT
2.6%    ppn19sl SYSTEM_TOBJECT_$__CLEANUPINSTANCE
2.3%    libSystem.B.dylib       __memcpy
2.0%    ppn19sl SYSTEM_SYSFREEMEM_FIXED$PFREELISTS$PMEMCHUNK_FIXED$$LONGWORD
1.9%    ppn19sl fpc_shortstr_to_shortstr
1.7%    ppn19sl CCLASSES_TLINKEDLIST_$__CLEAR
1.6%    ppn19sl SYSTEM_SYSFREEMEM$POINTER$$LONGWORD

The only thing that changed in r14137 was the addition of a prefetch statement 
to tgnuassembler.writetree (on i386 you have to compile with -Cppentium4 or 
higher for the prefetch statement to do anything, though). As you can see, that 
function's share of the total samples dropped by 1.2 percentage points, from 
6.1% to 4.9%. The remaining samples are also no longer concentrated right after 
the instruction that loads the assembler instruction type field of the new 
instruction for the huge case statement.
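
For readers who have not used software prefetching, the idea behind the 
writetree change can be sketched in C. This is only an illustration under 
stated assumptions, not the code from r14137: the node layout, the names 
(`struct node`, `walk`, `walk_demo`) and the three kinds are all hypothetical, 
and GCC/Clang's `__builtin_prefetch` stands in for the compiler's prefetch 
statement. The loop asks for the next list entry before it starts working on 
the current one, so that the load of the next entry's type field is less likely 
to stall.

```c
#include <assert.h>
#include <stddef.h>

/* Portable wrapper: a prefetch is only a hint, so a no-op fallback is valid. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical stand-in for an assembler list entry (not FPC's real type). */
enum kind { KIND_LABEL, KIND_INSTRUCTION, KIND_DATA };

struct node {
    struct node *next;
    enum kind    kind;
    int          payload;
};

/* Walk the list and dispatch on each entry's kind. The next entry is
 * prefetched first, so its cache line can be fetched while the switch
 * for the current entry runs. Prefetching a NULL pointer does not fault. */
long walk(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        PREFETCH(n->next);
        switch (n->kind) {
        case KIND_LABEL:       total += 1;              break;
        case KIND_INSTRUCTION: total += n->payload;     break;
        case KIND_DATA:        total += 2 * n->payload; break;
        default:               assert(!"unknown kind"); break;
        }
    }
    return total;
}

/* Usage example: a three-entry list built on the stack. */
long walk_demo(void)
{
    struct node c = { NULL, KIND_DATA, 5 };
    struct node b = { &c, KIND_INSTRUCTION, 7 };
    struct node a = { &b, KIND_LABEL, 0 };
    return walk(&a);   /* 1 + 7 + 2*5 */
}
```

The prefetch does not change the result of the loop, only (possibly) how long 
the loads take, which is why its effect only shows up in a profile.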

I've also tried to optimize sysgetmem_fixed with some prefetch statements (the 
numbers above already include those: although I only committed them in r14197, 
I have had them applied locally for quite a while), but it still takes up quite 
a bit of time. The prefetches didn't help that much there; they helped more in 
freemem (I believe it went from 3-4% to its current 1.6-1.7%).
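
As a rough illustration of where a prefetch can sit in a fixed-size allocator, 
here is a minimal C sketch of a free-list pool. It is not the FPC heap manager 
and does not claim to show where the prefetches in r14197 were placed: the 
layout and all names (`struct pool`, `pool_init`, `pool_alloc`, `pool_free`, 
`pool_demo`) are hypothetical, and `__builtin_prefetch` is the GCC/Clang 
builtin. The assumption made here is that the chunk that becomes the new head 
of the free list is the one the next allocation will have to read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)
#endif

/* A free chunk stores the link to the next free chunk in its own memory. */
struct chunk { struct chunk *next_free; };

struct pool { struct chunk *free_list; };

/* Split `storage` into `count` chunks of `size` bytes and link them so
 * that the chunk at the lowest address ends up at the head of the list. */
void pool_init(struct pool *p, void *storage, size_t size, size_t count)
{
    assert(size >= sizeof(struct chunk) && size % sizeof(void *) == 0);
    char *base = storage;
    p->free_list = NULL;
    for (size_t i = count; i-- > 0; ) {
        struct chunk *c = (struct chunk *)(base + i * size);
        c->next_free = p->free_list;
        p->free_list = c;
    }
}

/* Pop the head of the free list; returns NULL when the pool is empty. */
void *pool_alloc(struct pool *p)
{
    struct chunk *c = p->free_list;
    if (c == NULL)
        return NULL;
    p->free_list = c->next_free;
    /* Hint: the new head is what the next pool_alloc will dereference. */
    PREFETCH(p->free_list);
    return c;
}

/* Push a chunk back onto the free list. */
void pool_free(struct pool *p, void *ptr)
{
    struct chunk *c = ptr;
    c->next_free = p->free_list;
    p->free_list = c;
}

/* Usage example: count how many chunks a 4-chunk pool hands out. */
int pool_demo(void)
{
    enum { CHUNK_SIZE = 32, CHUNK_COUNT = 4 };
    void *storage = malloc(CHUNK_SIZE * CHUNK_COUNT);
    if (storage == NULL)
        return -1;
    struct pool p;
    pool_init(&p, storage, CHUNK_SIZE, CHUNK_COUNT);
    int n = 0;
    while (pool_alloc(&p) != NULL)
        n++;
    free(storage);
    return n;
}
```

In a sketch like this the allocation path touches a chunk that may not have 
been used for a long time, so a hint issued only one call ahead may arrive too 
late to hide the miss.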


Jonas
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
