Improve "busy" numeric code's performance [was: Re: Big Integers]

Peter Bex Wed, 22 May 2024 05:43:20 -0700

Hello Doug and CHICKEN hackers,

Thanks for the benchmarking program.  Somehow your e-mail is a bit
garbled, hence the top-posting, for which I apologize.


The benchmark in question deals mostly with small-sized bignums, which
means we're not even trigger the fancier division algorithms.  This is
all well within the range where the classical Knuth / schoolbook
division algorithm performs fine (which Chez also uses in this case).

In particular, the algorithm calls modulo many many times, which causes
some thrashing in the "scratch space" that we set aside for temporary
calculations when the code in question cannot allocate in the heap.
You can see this when you run your program with -:d (for debug).
Note: the code gets compiled to almost optimal code - it results in a
tight C loop that only occasionally jumps out to display something.
So that particular loop is very "busy" in that it calls out to numeric
code (which allocates in the scratch space) a lot.

I've had a look and I think we can improve this considerably by not
deallocating the scratch space memory area on every minor GC.  I think I
originally did this because I didn't fully trust the scratch space code
and didn't want it to eat up too much memory.  But I think the code is
stable enough now that we can postpone the deallocation (or who knows,
we could even get rid of it entirely?  This would be at the cost of
slightly higher memory use than strictly necessary).

Attached are two patches, one which has this bigger improvement, and
another which is a minor improvement which translates to shaving about
a second of runtime off your program (at least on my machine).

These are for -hackers to sign off on and push to git, but you should
also be able to apply them to the runtime.c of a downloaded copy of
CHICKEN so you don't need to go through the bootstrapping procedure if
you don't want to build from git.

For me, together this reduces the runtime from 58 seconds to 39 seconds,
which is about a 30% speedup.  If that translates to the same kind of
speedup on your machine, we'll be a lot closer to Racket.  I think the
rest is probably overhead which we might be able to shave off but it'll
be a lot more effort (though I could be wrong of course).

Finally, you can squeeze out a little bit more performance by wrapping
the body of "rho" in an assume, like so:

(define (rho n u v c iter prod)
  (assume ((n integer)
           (u integer)
           (v integer)
           (c integer)
           (iter integer)
           (prod integer))
   (let* ( [u1 (modulo (+ (* u u) c) n)]
     ....))))

This will avoid emitting generic code which has to do a type check every
single time it calls modulo, + or *.  The somewhat limited flow analysis
we have apparently is unable to figure out that these integer constants
make their way into rho's arguments, even in block compilation mode.

Cheers,
Peter

On Tue, May 21, 2024 at 09:35:33PM +0000, T.D. Telford wrote:
>  Hello Peter,
> Thanks for the reply.  The elapsed timings for the program rho3rec are:
> chicken 5.3.0:  33.6 secondsRacket v8.2 [cs]  : 18.1 secondsDr Racket : 20.6 
> seconds  (10000 MB memory)
> The program uses the Pollard rho algorithm to find a factor of 2^257 -1.  The 
> program is highly recursive, but I have a non recursive version that has the 
> same timing as the above.  I am using an AMD Ryzen 5600 cpu and 32 GB of main 
> memory.
> I tried all of the csc options that appeared related to speed, and the 
> maximum improvement was 0.1 seconds. 
> The one option I did not get to work was    -heap-size 1000M
> where I varied the size from 1000M to 10000M.  In all cases I would get the 
> run time message    [panic] out of memory - heap full - execution terminated
> 
> I have include the program listing below and also as an attachment.
> Regards,Doug
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;#lang
>  racket        ;; uncomment for racket
> 
> (define (rho n u v c iter prod)  (let* ( [u1 (modulo (+ (* u u) c) n)]        
>   [v1 (modulo (+ (* v v) c) n)]          [v2 (modulo (+ (* v1 v1) c) n)]      
>     [done #f]          [max_iter 30000000]          [iter2 (+ iter 1)]        
>   [prod2 (modulo (* prod (- u1 v2)) n)] )
>     (if (= (modulo iter2 150) 0)      (begin    ; modulo true        (let ( 
> [g (gcd prod2 n) ] )          (if (and (> g 1) (< g n))            (begin ; 
> factor              (display "factor = ") (display g) (newline)              
> (display "iterations = ") (display iter2) (newline)              (set! done 
> #t)            )            (set! prod2 1) ; no factor          ) ; end if 
> factor        ) ; end let       ) ; end begin for modulo true      #f ;action 
> for modulo false    ) ; end major if
>     (if (and (< iter2 max_iter) (not done))       (rho n u1 v2 c iter2 prod2) 
>      (if done ; either found factor or max iterations        (display "normal 
> termination \n")        #f      ) ; if done    ) ; if and   ) ; end let*)
> (let ( [n (- (expt 2 257) 1)] [u 2] [v 11] [c 7] [iter 1] [prod 1] )    
> (display "factor n = ") (display n) (newline)    (time (rho n u v c iter 
> prod)))
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>     On Tuesday, May 21, 2024 at 12:13:55 AM MDT, Peter Bex 
> <pe...@more-magic.net> wrote:  
>  
>  (sending again, forgot to CC the users list)
> 
> On Mon, May 20, 2024 at 03:23:54PM +0000, T.D. Telford wrote:
> > With the csc compiler and the -f or -fixnum-arithmetic option (Assume all 
> > numbers are fixnums) my benchmarks appear to be quite fast compared to 
> > racket of chez scheme.  When running a benchmark that uses big integers 
> > (such as the pollard rho), execution times are almost twice as long as 
> > racket or chez.
> 
> Hello there!
> 
> When we initially added bignum support in core (it used to be an egg),
> we spent quite a bit of effort optimizing various benchmarks, and
> on several of these (but not all of course), at the end CHICKEN
> performed *better* than several other Schemes, including Racket (which
> wasn't Chez-based at the time).  It's quite possible that Chez Scheme
> has some optimized routines that other Schemes don't have, which we
> could learn from.
> 
> If you have a specific benchmark that's slow, it would be helpful if you
> could share the code so we can have a look what it is exactly that slows
> things down.
> 
> > Is there an egg or alternate code to improve the efficiency of big 
> > integers?  I would have thought there would be interface to gmp.
> 
> While GMP is optimized to the hilt, it won't necessarily improve speed.
> The original "numbers" egg which provided add-on support for bignums
> was based on GMP and slow as molasses.  Mediocre numeric code that's
> deeply integrated with the garbage collector and compilation strategy
> can easily outperform supremely bummed code that has to go through the
> FFI on every call.
> 
> Again, if you can share some benchmarking code we can have a look if
> there are any obvious bottlenecks.
> 
> Cheers,
> Peter
>   

> ;; recursive
> ;;#lang racket        ;; uncomment for racket
> 
> 
> (define (rho n u v c iter prod)
>   (let* ( [u1 (modulo (+ (* u u) c) n)]
>           [v1 (modulo (+ (* v v) c) n)]
>           [v2 (modulo (+ (* v1 v1) c) n)]
>           [done #f]
>           [max_iter 30000000]
>           [iter2 (+ iter 1)]
>           [prod2 (modulo (* prod (- u1 v2)) n)] )
> 
>     (if (= (modulo iter2 150) 0)
>       (begin    ; modulo true
>         (let ( [g (gcd prod2 n) ] )
>           (if (and (> g 1) (< g n))
>             (begin ; factor
>               (display "factor = ") (display g) (newline)
>               (display "iterations = ") (display iter2) (newline)
>               (set! done #t)
>             )
>             (set! prod2 1) ; no factor
>           ) ; end if factor
>         ) ; end let 
>       ) ; end begin for modulo true
>       #f ;action for modulo false
>     ) ; end major if
> 
>     (if (and (< iter2 max_iter) (not done)) 
>       (rho n u1 v2 c iter2 prod2)
>       (if done ; either found factor or max iterations
>         (display "normal termination \n")
>         #f
>       ) ; if done
>     ) ; if and 
>   ) ; end let*
> )
> 
> (let ( [n (- (expt 2 257) 1)] [u 2] [v 11] [c 7] [iter 1] [prod 1] )
>     (display "factor n = ") (display n) (newline)
>     (time (rho n u v c iter prod))
> )
>

>From d2999bb77a6ba8665be9ca997a85481b670705ea Mon Sep 17 00:00:00 2001
From: Peter Bex <pe...@more-magic.net>
Date: Wed, 22 May 2024 13:53:28 +0200
Subject: [PATCH 1/2] Don't free the memory for the scratchspace on minor GC

This is too costly if we're in a loop that's operating on the
scratchspace, because it will just reallocate the scratchspace, and
use a small scratchspace which will cause excessive reallocation.
---
 runtime.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/runtime.c b/runtime.c
index 9becd2aa..5116ed37 100644
--- a/runtime.c
+++ b/runtime.c
@@ -3675,13 +3675,16 @@ C_regparm void C_fcall C_reclaim(void *trampoline, 
C_word c)
   }
 
   /* GC will have copied any live objects out of scratch space: clear it */
-  if (C_scratchspace_start != NULL) {
-    C_free(C_scratchspace_start);
-    C_scratchspace_start = NULL;
-    C_scratchspace_top = NULL;
-    C_scratchspace_limit = NULL;
+  if (C_scratchspace_start != C_scratchspace_top) {
+    /* And drop the scratchspace in case of a major or reallocating collection 
*/
+    if (gc_mode != GC_MINOR) {
+      C_free(C_scratchspace_start);
+      C_scratchspace_start = NULL;
+      C_scratchspace_limit = NULL;
+      scratchspace_size = 0;
+    }
+    C_scratchspace_top = C_scratchspace_start;
     C_scratch_usage = 0;
-    scratchspace_size = 0;
   }
 
   if(gc_mode == GC_MAJOR) {
-- 
2.42.0

>From 7c44693437240bb38e781fde3a5e7f7901968832 Mon Sep 17 00:00:00 2001
From: Peter Bex <pe...@more-magic.net>
Date: Wed, 22 May 2024 13:58:11 +0200
Subject: [PATCH 2/2] Call integer_divrem straight from the modulo operations

These would call C_s_a_i_remainder initially, which performs many of
the same checks that we've just already done before actually calling
integer_divrem.  It also allocates more memory on the stack which is
not necessary.

In the case of the generic operator, doing so requires duplicating the
code which converts floating-point integers to exact integers and back
again.

In the case of the integer-specific operator, there's no point in
calling the *generic* remainder function in the first place!
---
 runtime.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/runtime.c b/runtime.c
index 5116ed37..07be36e5 100644
--- a/runtime.c
+++ b/runtime.c
@@ -9243,7 +9243,8 @@ C_s_a_u_i_integer_remainder(C_word **ptr, C_word n, 
C_word x, C_word y)
 C_regparm C_word C_fcall
 C_s_a_i_modulo(C_word **ptr, C_word n, C_word x, C_word y)
 {
-  C_word ab[C_SIZEOF_FIX_BIGNUM], *a = ab, r;
+  C_word ab[C_SIZEOF_FIX_BIGNUM], *a = ab, r,
+         nx = C_SCHEME_FALSE, ny = C_SCHEME_FALSE;
 
   if (!C_truep(C_i_integerp(x)))
     barf(C_BAD_ARGUMENT_TYPE_NO_INTEGER_ERROR, "modulo", x);
@@ -9251,13 +9252,36 @@ C_s_a_i_modulo(C_word **ptr, C_word n, C_word x, C_word 
y)
     barf(C_BAD_ARGUMENT_TYPE_NO_INTEGER_ERROR, "modulo", y);
   if (C_truep(C_i_zerop(y))) C_div_by_zero_error("modulo");
 
-  r = C_s_a_i_remainder(&a, 2, x, y);
+  if (C_truep(C_i_flonump(x))) {
+    if C_truep(C_i_flonump(y)) {
+      double dx = C_flonum_magnitude(x), dy = C_flonum_magnitude(y), tmp;
+
+      C_modf(dx / dy, &tmp);
+      return C_flonum(ptr, dx - tmp * dy);
+    }
+    x = nx = C_s_a_u_i_flo_to_int(&a, 1, x);
+  }
+  if (C_truep(C_i_flonump(y))) {
+    y = ny = C_s_a_u_i_flo_to_int(&a, 1, y);
+  }
+
+  integer_divrem(&a, x, y, NULL, &r);
   if (C_i_positivep(y) != C_i_positivep(r) && !C_truep(C_i_zerop(r))) {
     C_word m = C_s_a_i_plus(ptr, 2, r, y);
     m = move_buffer_object(ptr, ab, m);
     clear_buffer_object(ab, r);
     r = m;
   }
+
+  if (C_truep(nx) || C_truep(ny)) {
+    C_word newr = C_a_i_exact_to_inexact(ptr, 1, r);
+    clear_buffer_object(ab, r);
+    r = newr;
+
+    clear_buffer_object(ab, nx);
+    clear_buffer_object(ab, ny);
+  }
+
   return move_buffer_object(ptr, ab, r);
 }
 
@@ -9267,7 +9291,7 @@ C_s_a_u_i_integer_modulo(C_word **ptr, C_word n, C_word 
x, C_word y)
   C_word ab[C_SIZEOF_FIX_BIGNUM], *a = ab, r;
   if (y == C_fix(0)) C_div_by_zero_error("modulo");
 
-  r = C_s_a_i_remainder(&a, 2, x, y);
+  integer_divrem(&a, x, y, NULL, &r);
   if (C_i_positivep(y) != C_i_positivep(r) && r != C_fix(0)) {
     C_word m = C_s_a_u_i_integer_plus(ptr, 2, r, y);
     m = move_buffer_object(ptr, ab, m);
-- 
2.42.0

Improve "busy" numeric code's performance [was: Re: Big Integers]

Reply via email to