Re: [Chicken-users] performance of bignums
On Sun, Jun 28, 2015 at 09:47:30PM +0200, Peter Bex wrote: So far it seems my implementation of Burnikel/Ziegler division is rather unstable, performance-wise. If I disable burnikel/ziegler so it falls back to the traditional gradebook method, the benchmark finishes in a quarter of the time it takes to run with BZ enabled. I have gotten a handle on this problem now: It looks a lot like the problem is either in the original paper's description of the algorithm or in my interpretation of it: It does some shifting magic to make the two numbers get a length so that the dividend is twice as large as the divisor, then it passes the length of the smaller one, which is used in the recursion (read the paper if you want the details). In the recursion, the size is always taken from this value that's passed around, which is divided by two or four, depending on the part of the algorithm you're in. However, if you have a number with sparse digits, this is completely unrealistic. Splitting the number up into parts may cause very small numbers to be passed to the recursion. If you follow the paper strictly, you will perform the divisions on these small numbers assuming they are exactly half the size of the original number. So you'd be performing a lot of work down into the recursion until you hit the cutoff point (which is 300 in our case), instead of just jumping back into the regular gradebook division. At least, that's my (admittedly very) limited understanding of what may be going wrong here. I had a quick look at GMP, but of course that code is inscrutable; it *looks* like it is also ignoring the size propagation, instead relying on the numbers' _actual_ size. I read up on the various descriptions of the algorithm and found that the algorithm's description in Modern Computer Arithmetic by Brent Zimmermann is completely different (and a little bit simpler): it also looks at the number's *actual* size. Furthermore, it uses the difference in sizes as the determining factor when to stop. I tried using Brent Zimmermann's algorithm as-is, but unfortunately that caused the performance of another benchmark to be rather worse, and it would anyway require me to convert that pile of code to C again for CHICKEN 5 (so it's inlineable). Then I tried applying the difference check instead of the size check and *bingo*, the algorithm performed as expected. While investigating, I figured out that others have had the same basic problem, but nobody seems to have written down exactly what's going wrong (neither have I, but at least this mail is an attempt at documenting my findings): The following blog post mentions also having extremely bad performance until they switched to using a simple split the numbers in half approach: https://deamentiaemundi.wordpress.com/2014/11/16/multiple-precision-in-javascript-fast-division-i/ There's a patch for Tcl to use Burnikel-Ziegler division that never was applied due to disappointing results: http://sourceforge.net/p/tcl/patches/589/ Then, I also tried a Python implementation against the Python version of the pidigits benchmark (disabling the GMP stuff): http://bugs.python.org/issue3451 (file fast_div.py) This had the same problem that the fast division was much, much slower than the straightforward gradebook/school division. Anyway, I've now applied a patch to numbers trunk (I am awaiting feedback on another bug report that I may have fixed before releasing) and to CHICKEN 5 master. The performance is now a lot closer to what one might reasonably expect from a Scheme that doesn't use GMP. On my 64-bit laptop, the results of CHICKEN 5 on pidigits is: 7.764s CPU time, 1.796s GC time (major), 208425/208209 mutations (total/tracked), 3628/4917 GCs (major/minor) Versus Guile 2 (with GMP): clock utime stime cutime cstime gctime 6.05 5.81 0.24 0.00 0.00 4.51 Not too bad, I'd say :) Numbers is a little slower, as is expected: 9.428s CPU time, 5.196s GC time (major), 208425 mutations, 8391/797 GCs (major/minor) Unfortunately, this is the best of all possible cases. On my 32-bit VPS the performance gap between Guile and CHICKEN is much bigger: CHICKEN 5 on 32-bits: 37.734s CPU time, 16.917s GC time (major), 209346/209296 mutations (total/tracked), 17038/14731 GCs (major/minor) Guile 2 on 32-bits: clock utime stime cutime cstime gctime 12.96 12.74 0.20 0.00 0.00 0.00 Numbers is again somewhat slower than CHICKEN 5: 46.806s CPU time, 30.346s GC time (major), 209346 mutations, 22827/12045 GCs (major/minor) I'd still be interested in seeing comparisons with other Schemes. Remember, the improved version of numbers is not released yet, so you'll have to get it from Subversion for the time being: https://code.call-cc.org/svn/chicken-eggs/release/4/numbers/trunk (username: anonymous, password: the empty string) If anyone else wants to take a stab at whittling down the performance difference even further, please be my guest! For now I need a break from
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 10:09:19PM +0200, Peter Bex wrote: I already removed the use of the format egg (the code contains a commented-out version that relies only on display), and even completely disable the println call. It's hard to be sure, but it *looks* like the majority of the time is being spent in the division procedures. So far it seems my implementation of Burnikel/Ziegler division is rather unstable, performance-wise. If I disable burnikel/ziegler so it falls back to the traditional gradebook method, the benchmark finishes in a quarter of the time it takes to run with BZ enabled. It's not just that the BZ implementation is really bad: it does improve performance on Chudnovsky's digits of pi calculation benchmark by a factor of four or so. I will do further experimentation. I also noticed a bug in the gradebook method, which I will have to address as well. Cheers, Peter signature.asc Description: Digital signature ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Sun, Jun 28, 2015 at 09:47:30PM +0200, Peter Bex wrote: So far it seems my implementation of Burnikel/Ziegler division is rather unstable, performance-wise. If I disable burnikel/ziegler so it falls back to the traditional gradebook method, the benchmark finishes in a quarter of the time it takes to run with BZ enabled. It's not just that the BZ implementation is really bad: it does improve performance on Chudnovsky's digits of pi calculation benchmark by a factor of four or so. I will do further experimentation. I also noticed a bug in the gradebook method, which I will have to address as well. Thanks a lot for putting so much effort in this! If you need anything please do not hesitate to ask, I’m really excited about that feature in CHICKEN 5 (and surely I’m not alone). :) -- Envoyé depuis ma GameBoy. ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
SBCL is Public Domain/MIT/BSD, depending on the component. GMP is dual licensed as LGPL3 and GPL2. The licensing issues would impact those who distribute binaries built with SBCL, but only if they are statically linked to GMP. -Dan Stephen Eilert spedr...@gmail.com writes: On Thu, Jun 25, 2015 at 4:50 PM, Peter Bex pe...@more-magic.net wrote: On Thu, Jun 25, 2015 at 11:39:50AM -0700, Martin DeMello wrote: Post to /r/scheme about chicken's bignum performance. (Not my post, just figured it could use some eyeballs.) http://www.reddit.com/r/scheme/comments/3b1ujw/performance_of_chicken_scheme_numbers_bignums/ Hello Martin, Thanks for posting this. We had already been discussing it earlier today in #chicken. I had another look at the code but I can't really find any obvious inefficiencies. It is indeed a bit faster with CHICKEN 5, but not by much. Of course, Guile is cheating by using GMP. If I compare it to another Scheme which has its own bignum implementation like Gauche, we perform about the same. It's interesting that sbcl is doing so well. Maybe I'm overlooking something seemingly minor but important? Cheers, Peter Not sure about the status of this particular GSOC, but SBCL could also be cheating. http://www.sbcl.org/gsoc2013/ideas/#sec-1.2 Now, I thought GMP were GPL'd and SBCL not, so I'm unsure about the legal implications, if it does bundle GMP now. — Stephen ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users -- -Dan Leslie ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 11:39:50AM -0700, Martin DeMello wrote: Post to /r/scheme about chicken's bignum performance. (Not my post, just figured it could use some eyeballs.) http://www.reddit.com/r/scheme/comments/3b1ujw/performance_of_chicken_scheme_numbers_bignums/ Hello Martin, Thanks for posting this. We had already been discussing it earlier today in #chicken. I had another look at the code but I can't really find any obvious inefficiencies. It is indeed a bit faster with CHICKEN 5, but not by much. Of course, Guile is cheating by using GMP. If I compare it to another Scheme which has its own bignum implementation like Gauche, we perform about the same. It's interesting that sbcl is doing so well. Maybe I'm overlooking something seemingly minor but important? Cheers, Peter signature.asc Description: Digital signature ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
Peter Bex scripsit: Thanks for posting this. We had already been discussing it earlier today in #chicken. I had another look at the code but I can't really find any obvious inefficiencies. It is indeed a bit faster with CHICKEN 5, but not by much. Potential confounders are the I/O and the call to format. When I get a chance, I'll try running the core of the code on the various implementations on my Scheme machine (which I only have access to nights and weekends now, thanks to $EMPLOYER's paranoia). -- John Cowan http://www.ccil.org/~cowanco...@ccil.org An observable characteristic is not necessarily a functional requirement. --John Hudson ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 10:26:53PM +0200, Peter Bex wrote: On Thu, Jun 25, 2015 at 05:04:17PM -0300, Stephen Eilert wrote: Not sure about the status of this particular GSOC, but SBCL could also be cheating. http://www.sbcl.org/gsoc2013/ideas/#sec-1.2 Now, I thought GMP were GPL'd and SBCL not, so I'm unsure about the legal implications, if it does bundle GMP now. Thanks for posting that; I hadn't looked at the sbcl implementation of the pi-digits benchmark, but indeed that page says Pi digits on the benchmarks game is currently a bunch of calls to GMP. This would *really* be cheating in my opinion, as it's not testing the language implementation but using nonstandard extensions that replace stuff that's already part of the core. I had just assumed it would be using the sbcl native implementation (which indeed is pretty damn good and almost 100% pure Lisp code). In fact, there are two implementations, one of which uses native operations. It is still faster than CHICKEN and very close to Guile/GMP: http://benchmarksgame.alioth.debian.org/u32/program.php?test=pidigitslang=sbclid=2#log It does use some type annotations, though. Adding some to the CHICKEN version makes it go a tiny bit faster, but not by much. Cheers, Peter signature.asc Description: Digital signature ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 4:50 PM, Peter Bex pe...@more-magic.net wrote: On Thu, Jun 25, 2015 at 11:39:50AM -0700, Martin DeMello wrote: Post to /r/scheme about chicken's bignum performance. (Not my post, just figured it could use some eyeballs.) http://www.reddit.com/r/scheme/comments/3b1ujw/performance_of_chicken_scheme_numbers_bignums/ Hello Martin, Thanks for posting this. We had already been discussing it earlier today in #chicken. I had another look at the code but I can't really find any obvious inefficiencies. It is indeed a bit faster with CHICKEN 5, but not by much. Of course, Guile is cheating by using GMP. If I compare it to another Scheme which has its own bignum implementation like Gauche, we perform about the same. It's interesting that sbcl is doing so well. Maybe I'm overlooking something seemingly minor but important? Cheers, Peter Not sure about the status of this particular GSOC, but SBCL could also be cheating. http://www.sbcl.org/gsoc2013/ideas/#sec-1.2 Now, I thought GMP were GPL'd and SBCL not, so I'm unsure about the legal implications, if it does bundle GMP now. — Stephen ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 03:59:59PM -0400, co...@ccil.org wrote: Peter Bex scripsit: Thanks for posting this. We had already been discussing it earlier today in #chicken. I had another look at the code but I can't really find any obvious inefficiencies. It is indeed a bit faster with CHICKEN 5, but not by much. Potential confounders are the I/O and the call to format. When I get a chance, I'll try running the core of the code on the various implementations on my Scheme machine (which I only have access to nights and weekends now, thanks to $EMPLOYER's paranoia). I already removed the use of the format egg (the code contains a commented-out version that relies only on display), and even completely disable the println call. It's hard to be sure, but it *looks* like the majority of the time is being spent in the division procedures. Seeing how it fares on other Schemes would be interesting though! Cheers, Peter signature.asc Description: Digital signature ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 05:04:17PM -0300, Stephen Eilert wrote: Not sure about the status of this particular GSOC, but SBCL could also be cheating. http://www.sbcl.org/gsoc2013/ideas/#sec-1.2 Now, I thought GMP were GPL'd and SBCL not, so I'm unsure about the legal implications, if it does bundle GMP now. Thanks for posting that; I hadn't looked at the sbcl implementation of the pi-digits benchmark, but indeed that page says Pi digits on the benchmarks game is currently a bunch of calls to GMP. This would *really* be cheating in my opinion, as it's not testing the language implementation but using nonstandard extensions that replace stuff that's already part of the core. I had just assumed it would be using the sbcl native implementation (which indeed is pretty damn good and almost 100% pure Lisp code). Cheers, Peter signature.asc Description: Digital signature ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
Peter Bex scripsit: Of course, Guile is cheating by using GMP. If I compare it to another Scheme which has its own bignum implementation like Gauche, we perform about the same. It's interesting that sbcl is doing so well. Maybe I'm overlooking something seemingly minor but important? Here's a remark from the SBLC ideas page that may be helpful: Integer division is notorious for being slow. However, it is also known that the divisor is constant in the vast majority of cases, and serious compilers exploit that fact to simplify divisions into sequences of simpler multiplications, shifts, and additions. SBCL implements such a simplification only for truncated division of unsigned machine words. Floor and ceiling are less commonly supported natively in programming languages, and there is a dearth of literature describing how to simplify them. However, it is possible to do so, for both signed and unsigned machine integers. It is also possible to specialise the routines for tagged arithmetic. A complete execution of this project would include the development of simplification routines for signed and unsigned truncate, floor and ceiling divisions by integer constants. Some of the simplifications, particularly those concerning tagged integers, will be widely applicable and likely novel. ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] performance of bignums
On Thu, Jun 25, 2015 at 5:53 PM, Peter Bex pe...@more-magic.net wrote: On Thu, Jun 25, 2015 at 10:26:53PM +0200, Peter Bex wrote: On Thu, Jun 25, 2015 at 05:04:17PM -0300, Stephen Eilert wrote: Not sure about the status of this particular GSOC, but SBCL could also be cheating. http://www.sbcl.org/gsoc2013/ideas/#sec-1.2 Now, I thought GMP were GPL'd and SBCL not, so I'm unsure about the legal implications, if it does bundle GMP now. Thanks for posting that; I hadn't looked at the sbcl implementation of the pi-digits benchmark, but indeed that page says Pi digits on the benchmarks game is currently a bunch of calls to GMP. This would *really* be cheating in my opinion, as it's not testing the language implementation but using nonstandard extensions that replace stuff that's already part of the core. I had just assumed it would be using the sbcl native implementation (which indeed is pretty damn good and almost 100% pure Lisp code). In fact, there are two implementations, one of which uses native operations. It is still faster than CHICKEN and very close to Guile/GMP: http://benchmarksgame.alioth.debian.org/u32/program.php?test=pidigitslang=sbclid=2#log It does use some type annotations, though. Adding some to the CHICKEN version makes it go a tiny bit faster, but not by much. Cheers, Peter That compilation output is amazing! ; note: forced to do GENERIC-* (cost 30) ; unable to do inline fixnum arithmetic (cost 4) because: ; The first argument is a INTEGER, not a FIXNUM. ; The result is a (VALUES INTEGER OPTIONAL), not a (VALUES FIXNUM REST T). ; unable to do inline (signed-byte 32) arithmetic (cost 5) because: ; The first argument is a INTEGER, not a (SIGNED-BYTE 32). ; The result is a (VALUES INTEGER OPTIONAL), not a (VALUES ; (SIGNED-BYTE 32) REST ; T). From the benchmark page ( http://benchmarksgame.alioth.debian.org/u32/performance.php?test=pidigits#about ): In addition to language specific multiprecision arithmetic, we will accept programs that use GMP http://gmplib.org/. So, the cheating is official. Let's cheat too! *I fail to see the point though. If it's a programming language shootout, does it make sense to offload the meat of the processing to GMP?* *Do they intend to measure the FFI performance?* — Stephen ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
[Chicken-users] performance of bignums
Post to /r/scheme about chicken's bignum performance. (Not my post, just figured it could use some eyeballs.) http://www.reddit.com/r/scheme/comments/3b1ujw/performance_of_chicken_scheme_numbers_bignums/ martin ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users