On Oct 23, 2006, at 6:44 PM, David Balmain wrote:
The cost of a double-dereference to dispatch a method is
insignificant. However, the syntax is godawful...
len = instream->m->length_i(instream);
... which is why Dave has macro-fied it:
#define is_length(mis) mis->m->length_i(mis)
How about if we implement every method call in Lucy that way?
Sounds good to me. I would have implemented all of Ferret's classes
this way except that I was naively worrying about the performance
detriment of the extra pointer dereference. I thought the extra speed
might be worth the cost in memory.
I've pondered this a little more, and I think you have a point. The
extra deref is a cost in this design, and for a very tight loop, it
could make a difference. In general, though, pointer dereferencing is
such a cheap operation that it's not worth worrying about. But this is
a major design decision. We're not talking about a localized area of
code. This is everywhere, so it's important to choose correctly.
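For concreteness, here's roughly the layout that the
instream->m->length_i(instream) syntax implies. This is just a sketch --
the type and member names below are invented for illustration, not
lifted from Ferret or KinoSearch:

typedef struct InStream InStream;

typedef struct InStreamMethods {
    long (*length_i)(InStream *self);  /* one function pointer per method */
    int  (*read_vint)(InStream *self);
} InStreamMethods;

struct InStream {
    InStreamMethods *m;   /* first deref: fetch the shared method table */
    /* ... instance data ... */
};

/* second deref: fetch the function pointer, then call through it */
#define is_length(mis) ((mis)->m->length_i(mis))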
Maybe we can treat the i/o classes as a special case, since I think
that's where the majority of our function calls occur. Potentially,
we could just have the Streams' macros alias functions directly...
#define InStream_Write_VInt(instream, value) InStream_write_vint(instream, value)
... which is akin to declaring InStream_Write_VInt a "final" method.
(I assume that at the compiler level, that's how "final" methods are
implemented.)
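Put another way, the same call site can expand either to an indirect
call through the method table or to a plain call that the compiler is
free to inline -- which is the "final" effect. Sticking with the
invented 'm' member from the sketch above, and assuming a write_vint
slot in the table, the two flavors would be:

/* "virtual" flavor: double-dereference through the method table */
#define InStream_Write_VInt(instream, value) \
    ((instream)->m->write_vint((instream), (value)))

/* "final" flavor: alias straight to the concrete function */
#define InStream_Write_VInt(instream, value) \
    InStream_write_vint((instream), (value))

Only one of those definitions would be active at a time, of course.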
That's an option if Lucy adopts the same architecture that Ferret and
KinoSearch have: no subclasses for InStream and OutStream.
If the function calls are all macrofied, it's easy enough to switch
up the macro definition and see if there's any effect. My guess is
that there will be a small but measurable difference.
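Something like the following would make that experiment a one-flag
change at build time -- LUCY_FINAL_STREAMS is a made-up name, purely
for illustration:

#ifdef LUCY_FINAL_STREAMS
  /* direct call -- the "final" version */
# define InStream_Read_VInt(instream) InStream_read_vint(instream)
#else
  /* dispatch through the method table */
# define InStream_Read_VInt(instream) ((instream)->m->read_vint(instream))
#endif

Then it's just a matter of building both ways and timing the same
indexing or search benchmark against each binary.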
This being C, we can also copy a function pointer out of the vtable
and use it directly if we identify a really tight loop somewhere as a
bottleneck.
InStream_read_vint_t read_vint = instream->m->read_vint;
long i;
for (i = 0; i < a_lot; i++) {
    *foo++ = read_vint(instream);
}
However, I suspect that if we treat the i/o classes as special cases
and call their functions directly everywhere, we'll have addressed
90% of the bottlenecks.
Anyway, I changed the Streams to test
the difference and I mean to refactor the rest of the classes like
this when I have time.
Below my sig you'll find the output of a simple benchmarking app
"speed.c" written by a guy named Michael Lee, run on my G4 Laptop and
then also on a Pentium 4. It shows the relative costs of some simple
operations. You can download it from
<http://www.vmlinux.org/jocke/programming/www.ontek.com/mikey/speed.c>.
His original 1997 article on C optimization (which has a broken link to
speed.c) is at <http://www.prism.uvsq.fr/~cedb/local_copies/lee.html>.
Since that article is nearly a decade old, it doesn't spend a lot of
time on CPU caching, which supposedly is becoming more important
these days. I don't have any practical experience doing those kinds
of optimizations, though. Maybe we could stuff all the vtables in
one giant struct and hope that with all our function pointers jammed
up against one another we'd get less cache thrash. But I think
that's chasing our tail.
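If we ever did want to try it, the "one giant struct" idea might look
something like this -- all of the names are invented and I haven't
measured anything:

/* every class's method table lives in one global struct, so the
 * tables end up adjacent in memory */
struct LucyVTables {
    InStreamMethods  instream;
    OutStreamMethods outstream;  /* invented name for OutStream's table */
    /* ... one member per class ... */
};
extern struct LucyVTables LUCY_VTABLES;

/* instances still carry their vtable pointer; it just points into
 * the shared struct (assigned in the constructor): */
instream->m = &LUCY_VTABLES.instream;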
The bottom line for me is that I'd prefer that the method calls be
macrofied because it's going to clean up the source quite a bit. The
best way to optimize is to find a better algorithm, and simpler,
cleaner, smaller source code makes refactoring easier.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
================================================
PowerMac G4 1.67 Laptop
slothbear:~/Desktop marvin$ ./speed
clocks_per_sec = 100
worst case resolution = 500.0000 usec
precision = 0 decimal digits
(cache & vm load) 0.100 usec 10.044 Mhz
(loop overhead) 0.100 usec 10.047 Mhz
empty 0.000 usec 25391.792 Mhz
/* comments */ 0.000 usec 21680.308 Mhz
#define 0.000 usec 23079.365 Mhz
declaration 0.000 usec 27676.983 Mhz
array[] 0.003 usec 376.814 Mhz
*pointer 0.003 usec 366.292 Mhz
int = 0.003 usec 377.055 Mhz
empty func() 0.010 usec 102.779 Mhz
bit shift 0.002 usec 412.078 Mhz
if-then-else 0.000 usec n/a
int + int 0.004 usec 280.621 Mhz
int - int 0.001 usec 1450.834 Mhz
int ^ int 0.001 usec 1361.626 Mhz
int * int 0.004 usec 262.790 Mhz
int / int 0.015 usec 65.456 Mhz
(int) float 0.029 usec 33.933 Mhz
float + float 0.006 usec 171.687 Mhz
float * float 0.006 usec 158.656 Mhz
float / float 0.015 usec 66.674 Mhz
strcpy() 0.036 usec 28.010 Mhz
strcmp() 0.000 usec 7761.078 Mhz
rand() 0.032 usec 31.578 Mhz
sqrt() 0.108 usec 9.296 Mhz
malloc/free 0.392 usec 2.553 Mhz
fopen/fclose 7.405 usec 0.135 Mhz
system() 280.327 usec 0.004 Mhz
slothbear:~/Desktop marvin$
================================================
FreeBSD 5.3
Intel(R) Pentium(R) 4 CPU 3.20GHz
$ ./speed
clocks_per_sec = 128
worst case resolution = 390.6250 usec
precision = 0 decimal digits
(cache & vm load) 0.121 usec 8.235 Mhz
(loop overhead) 0.134 usec 7.455 Mhz
empty 0.000 usec n/a
/* comments */ 0.000 usec n/a
#define 0.000 usec n/a
declaration 0.000 usec n/a
array[] 0.001 usec 1203.097 Mhz
*pointer 0.001 usec 1374.304 Mhz
int = 0.001 usec 1153.964 Mhz
empty func() 0.000 usec n/a
bit shift 0.000 usec n/a
if-then-else 0.000 usec n/a
int + int 0.000 usec n/a
int - int 0.001 usec 1279.043 Mhz
int ^ int 0.001 usec 1052.546 Mhz
int * int 0.001 usec 1102.379 Mhz
int / int 0.008 usec 118.546 Mhz
(int) float 0.001 usec 1141.288 Mhz
float + float 0.001 usec 1173.412 Mhz
float * float 0.001 usec 1343.009 Mhz
float / float 0.006 usec 167.827 Mhz
strcpy() 0.044 usec 22.755 Mhz
strcmp() 0.000 usec n/a
rand() 0.020 usec 50.176 Mhz
sqrt() 0.034 usec 29.739 Mhz
malloc/free 0.218 usec 4.586 Mhz
fopen/fclose 3.499 usec 0.286 Mhz
system() 111.848 usec 0.009 Mhz
$