On Sun, Jun 28, 2015 at 09:47:30PM +0200, Peter Bex wrote:
> So far it seems my implementation of Burnikel/Ziegler division is rather
> unstable, performance-wise. If I disable burnikel/ziegler so it falls
> back to the traditional gradebook method, the benchmark finishes in
> a quarter of the time it takes to run with BZ enabled.
I have gotten a handle on this problem now. It looks a lot like the problem lies either in the original paper's description of the algorithm or in my interpretation of it: the algorithm does some shifting magic so that the dividend becomes exactly twice the length of the divisor, then passes along the length of the smaller number, which is used in the recursion (read the paper if you want the details). In the recursion, the size is always taken from this value that's passed around, divided by two or four depending on which part of the algorithm you're in.

However, if you have a number with sparse digits, this is completely unrealistic: splitting the number up into parts may cause very small numbers to be passed to the recursion. If you follow the paper strictly, you will perform the divisions on these small numbers assuming they are exactly half the size of the original number. So you'd be performing a lot of work all the way down the recursion until you hit the cutoff point (which is 300 in our case), instead of just jumping back into the regular gradebook division. At least, that's my (admittedly very) limited understanding of what may be going "wrong" here.

I had a quick look at GMP, but of course that code is inscrutable; it *looks* like it also ignores the size propagation, instead relying on the numbers' _actual_ size. I read up on the various descriptions of the algorithm and found that its description in "Modern Computer Arithmetic" by Brent & Zimmermann is completely different (and a little bit simpler): it also looks at the numbers' *actual* size. Furthermore, it uses the difference in sizes as the determining factor for when to stop. I tried using Brent & Zimmermann's algorithm as-is, but unfortunately that made the performance of another benchmark rather worse, and it would anyway require me to convert that pile of code to C again for CHICKEN 5 (so it's inlineable).
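To make the contrast concrete, here's a rough Python sketch of the two base-case checks. This is purely illustrative: `DIV_CUTOFF`, `digit_length`, and both functions are made-up stand-ins (with Python's built-in divmod playing the part of the schoolbook fallback), not the actual CHICKEN or numbers code.

```python
DIV_CUTOFF = 300  # base-case threshold; the cutoff mentioned above

def digit_length(n, base=2**32):
    """Number of base-2^32 bignum 'digits' in n (0 has zero digits)."""
    d = 0
    while n:
        d += 1
        n //= base
    return d

def divide_paper_style(x, y, n):
    """Base case keyed on the *propagated* size n, as in the paper.

    n is halved at every level regardless of how big x and y actually
    are, so even tiny operands descend all the way until n <= cutoff."""
    if n <= DIV_CUTOFF:
        return divmod(x, y)  # schoolbook fallback
    # ... the real Burnikel/Ziegler recursion steps would go here ...
    return divide_paper_style(x, y, n // 2)

def divide_by_actual_size(x, y):
    """Base case keyed on the operands' *actual* size difference,
    roughly in the spirit of Brent & Zimmermann."""
    if digit_length(x) - digit_length(y) <= DIV_CUTOFF:
        return divmod(x, y)  # schoolbook fallback
    # ... recursive steps driven by the numbers' real lengths ...
    return divmod(x, y)
```

With the paper-style check, `divide_paper_style(100, 7, 100000)` still recurses until the propagated n drops below the cutoff, even though both operands are single digits; the actual-size check bails out to the schoolbook method immediately.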
Then I tried applying the difference check instead of the size check and *bingo*, the algorithm performed as expected.

While investigating, I figured out that others have had the same basic problem, but nobody seems to have written down exactly what's going wrong (neither have I, but at least this mail is an attempt at documenting my findings). The following blog post also mentions extremely bad performance until they switched to using a simple "split the numbers in half" approach:

https://deamentiaemundi.wordpress.com/2014/11/16/multiple-precision-in-javascript-fast-division-i/

There's a patch for Tcl to use Burnikel-Ziegler division that was never applied due to "disappointing results":

http://sourceforge.net/p/tcl/patches/589/

Then, I also tried a Python implementation against the Python version of the pidigits benchmark (disabling the GMP stuff):

http://bugs.python.org/issue3451 (file fast_div.py)

This had the same problem: the "fast" division was much, much slower than the straightforward gradebook/school division.

Anyway, I've now applied a patch to numbers trunk (I'm awaiting feedback on another bug report that I may have fixed, before releasing) and to CHICKEN 5 master. The performance is now a lot closer to what one might reasonably expect from a Scheme that doesn't use GMP. On my 64-bit laptop, the result of CHICKEN 5 on pidigits is:

7.764s CPU time, 1.796s GC time (major), 208425/208209 mutations (total/tracked), 3628/4917 GCs (major/minor)

Versus Guile 2 (with GMP):

clock utime stime cutime cstime gctime
 6.05  5.81  0.24   0.00   0.00   4.51

Not too bad, I'd say :)  Numbers is a little slower, as is to be expected:

9.428s CPU time, 5.196s GC time (major), 208425 mutations, 8391/797 GCs (major/minor)

Unfortunately, this is the best of all possible cases.
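As an aside on why pidigits is such a good stress test for all of this: every digit of pi comes out of an exact-integer division whose operands grow without bound. Here's a minimal Python sketch of one well-known unbounded spigot formulation; it is a stand-in for the benchmark's inner loop, not the actual pidigits source used above.

```python
def pi_digits(count):
    """First `count` decimal digits of pi via an unbounded spigot.

    Each digit costs a big-integer floor division, and q, r, t keep
    growing -- exactly the workload that hammers bignum divmod."""
    digits = []
    q, r, t, j = 1, 180, 60, 2
    while len(digits) < count:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)  # the big division
        digits.append(y)
        q, r, t, j = (10 * q * j * (2 * j - 1),
                      10 * u * (q * (5 * j - 2) + r - y * t),
                      t * u,
                      j + 1)
    return digits
```

For example, `pi_digits(5)` yields [3, 1, 4, 1, 5]. Note how quickly t blows up: after only a few dozen digits the divisions already involve numbers hundreds of digits long, which is where the choice of division algorithm starts to matter.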
On my 32-bit VPS the performance gap between Guile and CHICKEN is much bigger. CHICKEN 5 on 32 bits:

37.734s CPU time, 16.917s GC time (major), 209346/209296 mutations (total/tracked), 17038/14731 GCs (major/minor)

Guile 2 on 32 bits:

clock utime stime cutime cstime gctime
12.96 12.74  0.20   0.00   0.00   0.00

Numbers is again somewhat slower than CHICKEN 5:

46.806s CPU time, 30.346s GC time (major), 209346 mutations, 22827/12045 GCs (major/minor)

I'd still be interested in seeing comparisons with other Schemes. Remember, the improved version of numbers is not released yet, so you'll have to get it from Subversion for the time being:

https://code.call-cc.org/svn/chicken-eggs/release/4/numbers/trunk
(username: anonymous, password: <the empty string>)

If anyone else wants to take a stab at whittling down the performance difference even further, please be my guest! For now I need a break from working on bignum code... though I don't know if I can keep down the obsessive urge to tweak it even more :)

Cheers,
Peter
_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users
