On Oct 23, 2006, at 6:44 PM, David Balmain wrote:
The cost of a double-dereference to dispatch a method is
insignificant. However, the syntax is godawful...
len = instream->m->length_i(instream);
... which is why Dave has macro-fied it:
#define is_length(mis) mis->m->length_i(mis)
How about if we implement every method call in Lucy that way?
Sounds good to me. I would have implemented all of Ferret's classes
this way except that I was naively worrying about the performance
detriment of the extra pointer dereference. I thought the extra speed
might be worth the cost in memory.
I've pondered this a little more, and I think you have a point. The
extra deref is a cost in this design, and for a very tight loop, it
could make a difference. In general, though, pointer dereferencing is
such a cheap operation that it's not worth worrying about. But this is
a major design decision. We're not talking about a localized area of
code. This is everywhere, so it's important to choose correctly.
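For concreteness, here's roughly the layout that the
instream->m->length_i(instream) syntax implies. This is just a sketch --
the type and member names below are invented for illustration, not
lifted from Ferret or KinoSearch:

typedef struct InStream InStream;

typedef struct InStreamMethods {
    long (*length_i)(InStream *self);  /* one function pointer per method */
    int  (*read_vint)(InStream *self);
} InStreamMethods;

struct InStream {
    InStreamMethods *m;   /* first deref: fetch the shared method table */
    /* ... instance data ... */
};

/* second deref: fetch the function pointer, then call through it */
#define is_length(mis) ((mis)->m->length_i(mis))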
Maybe we can treat the i/o classes as a special case, since I think
that's where the majority of our function calls occur. Potentially,
we could just have the Streams' macros alias functions directly...
#define InStream_Write_VInt(instream, value) InStream_write_vint(instream, value)
... which is akin to declaring InStream_Write_VInt a "final" method.
(I assume that at the compiler level, that's how "final" methods are
implemented.)
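Put another way, the same call site can expand either to an indirect
call through the method table or to a plain call that the compiler is
free to inline -- which is the "final" effect. Sticking with the
invented 'm' member from the sketch above, and assuming a write_vint
slot in the table, the two flavors would be:

/* "virtual" flavor: double-dereference through the method table */
#define InStream_Write_VInt(instream, value) \
    ((instream)->m->write_vint((instream), (value)))

/* "final" flavor: alias straight to the concrete function */
#define InStream_Write_VInt(instream, value) \
    InStream_write_vint((instream), (value))

Only one of those definitions would be active at a time, of course.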
That's an option if Lucy adopts the same architecture that Ferret and
KinoSearch have: no subclasses for InStream and OutStream.
If the function calls are all macrofied, it's easy enough to switch
up the macro definition and see if there's any effect. My guess is
that there will be a small but measurable difference.
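Something like the following would make that experiment a one-flag
change at build time -- LUCY_FINAL_STREAMS is a made-up name, purely
for illustration:

#ifdef LUCY_FINAL_STREAMS
  /* direct call -- the "final" version */
# define InStream_Read_VInt(instream) InStream_read_vint(instream)
#else
  /* dispatch through the method table */
# define InStream_Read_VInt(instream) ((instream)->m->read_vint(instream))
#endif

Then it's just a matter of building both ways and timing the same
indexing or search benchmark against each binary.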
This being C, we can also copy a function pointer out of the vtable
and use it directly if we identify a really tight loop somewhere as a
bottleneck.
InStream_read_vint_t read_vint = instream->m->read_vint;
long i;
for (i = 0; i < a_lot; i++) {
    *foo++ = read_vint(instream);
}
However, I suspect that if we treat the i/o classes as special cases
and call their functions directly everywhere, we'll have addressed
90% of the bottlenecks.
Anyway, I changed the Streams to test
the difference and I mean to refactor the rest of the classes like
this when I have time.
Below my sig you'll find the output of a simple benchmarking app
"speed.c" written by a guy named Michael Lee, run on my G4 Laptop and
then also on a Pentium 4. It shows the relative costs of some simple
operations. You can download it from
<http://www.vmlinux.org/jocke/programming/www.ontek.com/mikey/speed.c>.
His original 1997 article on C optimization (which has a broken link to
speed.c) is at <http://www.prism.uvsq.fr/~cedb/local_copies/lee.html>.
Since that article is nearly a decade old, it doesn't spend a lot of
time on CPU caching, which supposedly is becoming more important
these days. I don't have any practical experience doing those kinds
of optimizations, though. Maybe we could stuff all the vtables in
one giant struct and hope that with all our function pointers jammed
up against one another we'd get less cache thrash. But I think
that's chasing our tail.
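If we ever did want to try it, the "one giant struct" idea might look
something like this -- all of the names are invented and I haven't
measured anything:

/* every class's method table lives in one global struct, so the
 * tables end up adjacent in memory */
struct LucyVTables {
    InStreamMethods  instream;
    OutStreamMethods outstream;  /* invented name for OutStream's table */
    /* ... one member per class ... */
};
extern struct LucyVTables LUCY_VTABLES;

/* instances still carry their vtable pointer; it just points into
 * the shared struct (assigned in the constructor): */
instream->m = &LUCY_VTABLES.instream;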
The bottom line for me is that I'd prefer that the method calls be
macrofied because it's going to clean up the source quite a bit. The
best way to optimize is to find a better algorithm, and simpler,
cleaner, smaller source code makes refactoring easier.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
================================================
PowerMac G4 1.67 Laptop
slothbear:~/Desktop marvin$ ./speed
clocks_per_sec = 100
worst case resolution = 500.0000 usec
precision = 0 decimal digits
(cache & vm load) 0.100 usec 10.044 Mhz
(loop overhead) 0.100 usec 10.047 Mhz
empty 0.000 usec 25391.792 Mhz
/* comments */ 0.000 usec 21680.308 Mhz
#define 0.000 usec 23079.365 Mhz
declaration 0.000 usec 27676.983 Mhz
array[] 0.003 usec 376.814 Mhz
*pointer 0.003 usec 366.292 Mhz
int = 0.003 usec 377.055 Mhz
empty func() 0.010 usec 102.779 Mhz
bit shift 0.002 usec 412.078 Mhz
if-then-else 0.000 usec n/a
int + int 0.004 usec 280.621 Mhz
int - int 0.001 usec 1450.834 Mhz
int ^ int 0.001 usec 1361.626 Mhz
int * int 0.004 usec 262.790 Mhz
int / int 0.015 usec 65.456 Mhz
(int) float 0.029 usec 33.933 Mhz
float + float 0.006 usec 171.687 Mhz
float * float 0.006 usec 158.656 Mhz
float / float 0.015 usec 66.674 Mhz
strcpy() 0.036 usec 28.010 Mhz
strcmp() 0.000 usec 7761.078 Mhz
rand() 0.032 usec 31.578 Mhz
sqrt() 0.108 usec 9.296 Mhz
malloc/free 0.392 usec 2.553 Mhz
fopen/fclose 7.405 usec 0.135 Mhz
system() 280.327 usec 0.004 Mhz
slothbear:~/Desktop marvin$
================================================
FreeBSD 5.3
Intel(R) Pentium(R) 4 CPU 3.20GHz
$ ./speed
clocks_per_sec = 128
worst case resolution = 390.6250 usec
precision = 0 decimal digits
(cache & vm load) 0.121 usec 8.235 Mhz
(loop overhead) 0.134 usec 7.455 Mhz
empty 0.000 usec n/a
/* comments */ 0.000 usec n/a
#define 0.000 usec n/a
declaration 0.000 usec n/a
array[] 0.001 usec 1203.097 Mhz
*pointer 0.001 usec 1374.304 Mhz
int = 0.001 usec 1153.964 Mhz
empty func() 0.000 usec n/a
bit shift 0.000 usec n/a
if-then-else 0.000 usec n/a
int + int 0.000 usec n/a
int - int 0.001 usec 1279.043 Mhz
int ^ int 0.001 usec 1052.546 Mhz
int * int 0.001 usec 1102.379 Mhz
int / int 0.008 usec 118.546 Mhz
(int) float 0.001 usec 1141.288 Mhz
float + float 0.001 usec 1173.412 Mhz
float * float 0.001 usec 1343.009 Mhz
float / float 0.006 usec 167.827 Mhz
strcpy() 0.044 usec 22.755 Mhz
strcmp() 0.000 usec n/a
rand() 0.020 usec 50.176 Mhz
sqrt() 0.034 usec 29.739 Mhz
malloc/free 0.218 usec 4.586 Mhz
fopen/fclose 3.499 usec 0.286 Mhz
system() 111.848 usec 0.009 Mhz
$