On Sun, Jun 28, 2015 at 09:47:30PM +0200, Peter Bex wrote:
> So far it seems my implementation of Burnikel/Ziegler division is rather
> unstable, performance-wise. If I disable burnikel/ziegler so it falls
> back to the traditional gradebook method, the benchmark finishes in
> a quarter of the time it takes to run with BZ enabled.
I have gotten a handle on this problem now. It looks a lot like the problem lies either in the original paper's description of the algorithm or in my interpretation of it: the algorithm does some shifting magic so that the dividend becomes exactly twice the length of the divisor, then passes along the length of the smaller number, which is used in the recursion (read the paper if you want the details). In the recursion, the size is always taken from this value that's passed around, divided by two or four depending on which part of the algorithm you're in.

However, if you have a number with sparse digits, this is completely unrealistic: splitting the number up into parts may cause very small numbers to be passed to the recursion. If you follow the paper strictly, you will perform the divisions on these small numbers assuming they are exactly half the size of the original number. So you'd be performing a lot of work all the way down the recursion until you hit the cutoff point (which is 300 in our case), instead of just jumping back into the regular gradebook division. At least, that's my (admittedly very) limited understanding of what may be going "wrong" here.

I had a quick look at GMP, but of course that code is inscrutable; it *looks* like it also ignores the size propagation, instead relying on the numbers' _actual_ size. I read up on the various descriptions of the algorithm and found that its description in "Modern Computer Arithmetic" by Brent & Zimmermann is completely different (and a little bit simpler): it also looks at the numbers' *actual* size. Furthermore, it uses the difference in sizes as the determining factor for when to stop. I tried using Brent & Zimmermann's algorithm as-is, but unfortunately that made the performance of another benchmark rather worse, and it would anyway require me to convert that pile of code to C again for CHICKEN 5 (so it's inlineable).
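To make the contrast concrete, here's a rough Python sketch of the two base-case checks. This is purely illustrative: `DIV_CUTOFF`, `digit_length`, and both functions are made-up stand-ins (with Python's built-in divmod playing the part of the schoolbook fallback), not the actual CHICKEN or numbers code.

```python
DIV_CUTOFF = 300  # base-case threshold; the cutoff mentioned above

def digit_length(n, base=2**32):
    """Number of base-2^32 bignum 'digits' in n (0 has zero digits)."""
    d = 0
    while n:
        d += 1
        n //= base
    return d

def divide_paper_style(x, y, n):
    """Base case keyed on the *propagated* size n, as in the paper.

    n is halved at every level regardless of how big x and y actually
    are, so even tiny operands descend all the way until n <= cutoff."""
    if n <= DIV_CUTOFF:
        return divmod(x, y)  # schoolbook fallback
    # ... the real Burnikel/Ziegler recursion steps would go here ...
    return divide_paper_style(x, y, n // 2)

def divide_by_actual_size(x, y):
    """Base case keyed on the operands' *actual* size difference,
    roughly in the spirit of Brent & Zimmermann."""
    if digit_length(x) - digit_length(y) <= DIV_CUTOFF:
        return divmod(x, y)  # schoolbook fallback
    # ... recursive steps driven by the numbers' real lengths ...
    return divmod(x, y)
```

With the paper-style check, `divide_paper_style(100, 7, 100000)` still recurses until the propagated n drops below the cutoff, even though both operands are single digits; the actual-size check bails out to the schoolbook method immediately.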
Then I tried applying the difference check instead of the size check and *bingo*, the algorithm performed as expected.

While investigating, I figured out that others have had the same basic problem, but nobody seems to have written down exactly what's going wrong (neither have I, but at least this mail is an attempt at documenting my findings). The following blog post also mentions extremely bad performance until they switched to using a simple "split the numbers in half" approach:

https://deamentiaemundi.wordpress.com/2014/11/16/multiple-precision-in-javascript-fast-division-i/

There's a patch for Tcl to use Burnikel-Ziegler division that was never applied due to "disappointing results":

http://sourceforge.net/p/tcl/patches/589/

Then, I also tried a Python implementation against the Python version of the pidigits benchmark (disabling the GMP stuff):

http://bugs.python.org/issue3451 (file fast_div.py)

This had the same problem: the "fast" division was much, much slower than the straightforward gradebook/school division.

Anyway, I've now applied a patch to numbers trunk (I'm awaiting feedback on another bug report that I may have fixed, before releasing) and to CHICKEN 5 master. The performance is now a lot closer to what one might reasonably expect from a Scheme that doesn't use GMP. On my 64-bit laptop, the result of CHICKEN 5 on pidigits is:

7.764s CPU time, 1.796s GC time (major), 208425/208209 mutations (total/tracked), 3628/4917 GCs (major/minor)

Versus Guile 2 (with GMP):

clock utime stime cutime cstime gctime
 6.05  5.81  0.24   0.00   0.00   4.51

Not too bad, I'd say :)  Numbers is a little slower, as is to be expected:

9.428s CPU time, 5.196s GC time (major), 208425 mutations, 8391/797 GCs (major/minor)

Unfortunately, this is the best of all possible cases.
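As an aside on why pidigits is such a good stress test for all of this: every digit of pi comes out of an exact-integer division whose operands grow without bound. Here's a minimal Python sketch of one well-known unbounded spigot formulation; it is a stand-in for the benchmark's inner loop, not the actual pidigits source used above.

```python
def pi_digits(count):
    """First `count` decimal digits of pi via an unbounded spigot.

    Each digit costs a big-integer floor division, and q, r, t keep
    growing -- exactly the workload that hammers bignum divmod."""
    digits = []
    q, r, t, j = 1, 180, 60, 2
    while len(digits) < count:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)  # the big division
        digits.append(y)
        q, r, t, j = (10 * q * j * (2 * j - 1),
                      10 * u * (q * (5 * j - 2) + r - y * t),
                      t * u,
                      j + 1)
    return digits
```

For example, `pi_digits(5)` yields [3, 1, 4, 1, 5]. Note how quickly t blows up: after only a few dozen digits the divisions already involve numbers hundreds of digits long, which is where the choice of division algorithm starts to matter.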
On my 32-bit VPS the performance gap between Guile and CHICKEN is much bigger. CHICKEN 5 on 32 bits:

37.734s CPU time, 16.917s GC time (major), 209346/209296 mutations (total/tracked), 17038/14731 GCs (major/minor)

Guile 2 on 32 bits:

clock utime stime cutime cstime gctime
12.96 12.74  0.20   0.00   0.00   0.00

Numbers is again somewhat slower than CHICKEN 5:

46.806s CPU time, 30.346s GC time (major), 209346 mutations, 22827/12045 GCs (major/minor)

I'd still be interested in seeing comparisons with other Schemes. Remember, the improved version of numbers is not released yet, so you'll have to get it from Subversion for the time being:

https://code.call-cc.org/svn/chicken-eggs/release/4/numbers/trunk
(username: anonymous, password: <the empty string>)

If anyone else wants to take a stab at whittling down the performance difference even further, please be my guest! For now I need a break from working on bignum code... though I don't know if I can keep down the obsessive urge to tweak it even more :)

Cheers,
Peter
_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users
