On Sun, 2006-08-20 at 00:12 +0000, Aivars Kalvans wrote:
> That would be an improvement on ARM, but on x86 branching costs more
> than doing the math and simple inlining will perform better:
>
> Athlon XP
> [EMAIL PROTECTED] ~/dev/poo $ ./cycles
> branching: 24 cycles
> function call: 698 cycles
> inline multiply: 15 cycles

Out of interest, where are you getting these numbers from? They
certainly don't tally with my understanding of processors, assembler,
branch prediction, etc. __builtin_expect, of course, can be used to
straighten all the common paths so there is no icache penalty on the
fast paths, if we know unit matrices are the common case.

Also - *698* cycles per function call seems amazingly high; of course,
if you include all the one-off lazy linking overhead and you run the
function only a handful of times, I guess you could get that number ;-)
but ...

> P.S. Does anyone know if it's possible to inline __adddf3() and
> __muldf3() when compiling with -msoft-float ? It might be that function
> call overhead is the bottleneck and most problems can be solved by
> compiler flags.

Just read that code: cf. gcc/gcc/gcc/config/arm/ieee754-df.S,
interestingly it has a nice check for:

	@ Convert mantissa to unsigned integer.
	@ If power of two, branch to a separate path.

some interesting assembler for sure; if you inline it you may find you
bloat the code substantially, and make it slower.

Either way - IMHO you're confused wrt. the cost of a function call:
that is unless you're passing a ton of huge in-line arguments on the
stack or something :-) When you read the generated assembler, about the
only really odd inefficiency to be seen in PIC code is the ebx fixup:

	8b 1c 24	mov    (%esp),%ebx
	c3		ret

necessary to access any functions / variables [ and not present on
x86_64 ].

HTH,

	Michael.

-- 
[EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot

_______________________________________________
Performance-list mailing list
Performance-list@gnome.org
http://mail.gnome.org/mailman/listinfo/performance-list
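
For illustration only, here is a minimal sketch of the __builtin_expect
idea mentioned above: hint to the compiler that the unit-matrix case is
the common one, so the fast path falls straight through. It assumes a
GCC-compatible compiler. The affine_t layout, the LIKELY macro and the
function names are hypothetical, made up for this example, and are not
taken from any particular library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 2x3 affine matrix; the field layout is an assumption. */
typedef struct {
    double xx, yx, xy, yy, x0, y0;
} affine_t;

/* GCC/Clang builtin: tell the compiler the condition is usually true. */
#define LIKELY(x) __builtin_expect(!!(x), 1)

/* True when the matrix is exactly the identity (unit) matrix. */
static bool affine_is_identity(const affine_t *m)
{
    return m->xx == 1.0 && m->yx == 0.0 &&
           m->xy == 0.0 && m->yy == 1.0 &&
           m->x0 == 0.0 && m->y0 == 0.0;
}

/* Transform the point (*x, *y) in place. */
void affine_transform_point(const affine_t *m, double *x, double *y)
{
    assert(m != NULL && x != NULL && y != NULL);

    /* Common case, assumed here to be the unit matrix: no multiplies. */
    if (LIKELY(affine_is_identity(m)))
        return;

    /* Slow path: the full affine transform. */
    double nx = m->xx * *x + m->xy * *y + m->x0;
    double ny = m->yx * *x + m->yy * *y + m->y0;
    *x = nx;
    *y = ny;
}
```

Whether the hint pays off depends on how often the matrix really is the
identity and on the target; on a soft-float target the skipped multiplies
are themselves function calls, so it is worth measuring rather than
assuming.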