Thomas Fischbacher <[EMAIL PROTECTED]> writes:

 > On Tue, 14 Dec 2004, Harvey J. Stein wrote:
 > 
 > > Why do I get no notes on the fixnum version, but get a cost 13 float
 > > to pointer coercion in the double-float version?
 > 
 > A fixnum is just an ordinary 32-bit value whose type tag in the two 
 > lowest bits is 00. A floating-point value cannot be represented as a 
 > 32-bit value, hence has to be boxed - the function will return a pointer 
 > to a freshly consed number.
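
A toy pair of functions illustrates the distinction (my own sketch, not
from the original code; fix-id and dbl-id are made-up names):

```lisp
(declaim (optimize (speed 3)))

;; A fixnum result fits in a tagged machine word and is returned
;; directly in a register.
(defun fix-id (x)
  (declare (fixnum x))
  (the fixnum (* x 2)))

;; A double-float result doesn't fit in a tagged word, so at a normal
;; function boundary it comes back as a pointer to a freshly consed
;; (boxed) float -- this is the "cost 13 float to pointer coercion".
(defun dbl-id (x)
  (declare (double-float x))
  (the double-float (* x 2d0)))
```

Compiling dbl-id at high speed should produce an efficiency note about
the coercion; fix-id should compile without one.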

Your comments, plus a more careful reading of the fixnum section of
the manual, helped.  But it seems to be subtle.  With bsmc, the
compute time is around 4.8 seconds, with over 10% of the time in
gc:

   * (time (bsmc 100d0 .3d0 1d0 100d0 .1d0 100 100000))
   Compiling LAMBDA NIL: 
   Compiling Top-Level Form: 

   Evaluation took:
     4.82 seconds of real time
     4.46 seconds of user run time
     0.36 seconds of system run time
     [Run times include 0.57 seconds GC run time]
     0 page faults and
     165795696 bytes consed.
   16.76700488226663d0

This is substantially slower than C, which is 2.8 to 3.6 seconds,
depending on the random number generator.

Requesting inlining loses the consing and speeds up the function
substantially:

   * (time (bsmc 100d0 .3d0 1d0 100d0 .1d0 100 100000))
   Compiling LAMBDA NIL: 
   Compiling Top-Level Form: 

   Evaluation took:
     3.44 seconds of real time
     3.44 seconds of user run time
     0.0 seconds of system run time
     0 page faults and
     0 bytes consed.
   16.675575968106124d0
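
For the record, the inlining request itself is just a proclamation
(sketch only - the actual function bodies aren't shown here, and note
that the declaim has to appear before the defuns for the compiler to
record their definitions for inline expansion):

```lisp
;; Request full inline expansion of the two helpers at every call site.
(declaim (inline one-normal-rand clalpay))

;; (defun one-normal-rand ...)   ; definitions follow the declaim
;; (defun clalpay ...)
```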

It's now faster than the slowest C version, but that version generates
53 bits of randomness per double by combining two 32-bit random
numbers from the Mersenne twister generator.  cmucl seems to use the
same generator, but is probably putting only 32 bits of randomness
into each double, so it should probably be compared against the 2.8
second run; I'll have to check.  If that's the case, it makes the lisp
~20% slower.

I can understand why the functions have to return boxed floats.  And I
guess I understand why the compiler can't optimize that away.  It
would presumably break common lisp semantics for the compiler to use a
function's current definition when compiling another function - if the
function were redefined at a later date, everything would break.  I
guess this is what block compilation is for.
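
If I'm reading the manual right, block compilation can be requested
for a whole file, or delimited within one (sketch, assuming CMUCL's
extensions package):

```lisp
;; Whole-file block compilation: all functions in the file become
;; local calls to each other, and only bsmc remains a normal entry.
(compile-file "bsmc.lisp" :block-compile t :entry-points '(bsmc))

;; Or, delimiting a region within a file:
;; (declaim (ext:start-block bsmc))
;; ... definitions ...
;; (declaim (ext:end-block))
```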

But what about in a bigger system?  It's one thing to use block
compilation to get this within a file, and block compilation can be
used across files, but it's a heavy thing to do just to get rid of a
little boxing, and it packs everything into one .o, which seems
problematic.

The compiler manual mentions semi-inline functions, which seem to do
this, but I'm still a little fuzzy on them.  Note, however, that this
saved 10% of the runtime!  Proclaiming one-normal-rand and clalpay to
be extensions:maybe-inline instead of inline gives the fastest code to
date:

   * (time (bsmc 100d0 .3d0 1d0 100d0 .1d0 100 100000))
   Compiling LAMBDA NIL: 
   Compiling Top-Level Form: 

   Evaluation took:
     3.07 seconds of real time
     3.06 seconds of user run time
     0.0 seconds of system run time
     0 page faults and
     0 bytes consed.
   16.733940092255875d0
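
The change that produced that timing is a one-line swap of the
proclamation (my understanding, from the manual, is that semi-inline
expansion creates one shared local copy of each function per compiled
component rather than expanding the body at every call site):

```lisp
;; Semi-inline: callers in the same compilation unit use fast local
;; calls to a single shared copy, instead of duplicating the body.
(declaim (extensions:maybe-inline one-normal-rand clalpay))
```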

Now we've gotten within 10%.  But can someone clarify what
maybe-inline does?  It looks like it makes the specified functions
available as local calls.  Is that its only effect?  Does it also
mean that when they're redefined, the redefinition won't affect code
that was compiled and loaded prior to the redefinition?  Most
importantly, why did it end up being faster than inline?

It seems like declarations can get rid of boxing and unboxing inside
of functions, but in general not across function calls, unless
functions are inlined, maybe-inlined, or block compiled.  Is this the
case?
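
For contrast, within a single function the declarations alone do the
job - something like this sketch (my own example) should run entirely
in unboxed float arithmetic, consing only for the final return value:

```lisp
(defun poly-eval (x)
  (declare (double-float x)
           (optimize (speed 3) (safety 0)))
  ;; a and b stay in raw double-float registers; no boxing happens
  ;; until the result crosses the full-call boundary on return.
  (let ((a (* x x))
        (b (+ x 1d0)))
    (declare (double-float a b))
    (+ (* a b) x)))
```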

If so, it would be nice to be able to tell the compiler that a
particular function can be redefined, but that it's always going to
return a double-float, so don't bother boxing and unboxing the result.
Similarly for function arguments.  This would be lighter weight than
the above methods, but more flexible, while potentially giving just as
much optimization.  Alternatively, the function itself could be
compiled with unboxed arguments and return value, with the call sites
doing the boxing and unboxing if necessary.  This would be like
hoisting the unboxing out of the function.  Any thoughts on the
viability of such an extension?
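
The closest existing mechanism I know of is a global ftype
proclamation (sketch, with a guessed zero-argument signature for
one-normal-rand), which declares the signature to all callers without
freezing the definition - though as far as I can tell the return value
still comes back boxed through the standard full-call convention:

```lisp
;; Tell callers the signature up front; redefinition is still allowed,
;; but the result of a full call is still a boxed double-float.
(declaim (ftype (function () double-float) one-normal-rand))
```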

-- 
Harvey Stein
Bloomberg LP
[EMAIL PROTECTED]