> As for opinion pieces on bigint libraries in general, I also wrote a little note on the rather sad state of GMP vs “anything else” at <https://github.com/fsh/integers#why-gmp>
> ## Why GMP?
>
> Unfortunately there is just no beating GMP when it comes to performance. It is the unequivocal gold standard.
>
> All “home-made” or “from scratch” integer implementations by hobbyists, such as Nim’s official `bigints` (and even Python’s built-in `int`) are _laughably slow_ compared to GMP, often by orders of magnitude.
>
> Alternative projects such as `libtomath` come much closer, but are still not quite as optimized as GMP.
>
> The reason projects shy away from GMP is two-fold:
>
> * The “serious” reason is that GMP is dual-licensed with LGPLv3 and GPLv2, and many consider even the LGPL too restrictive or convoluted (e.g. static linking can be problematic), so this makes it simply a no-go for many projects.
> * The less practical reason is that implementing one’s own arbitrary precision integers from scratch is (at the onset) a very attractive and exciting project for many programmers, so it is a common mind trap to fall into.
>
> The first reason is unavoidable and unfortunate. The second reason is very _understandable_ (I feel the same pull), although it is incredibly frustrating when I’m stuck with these cowboy implementations as a user.
>
> Since I am of the opinion that (big) integers are a _core_ data type that should be universally available, thus also implicitly demanding it should be _as efficient as humanly possible_, I’m always going to bite the bullet on the (L)GPL license.

Being as fast as GMP is a three-fold issue:

1. You need knowledge of your target architecture.
2. You need knowledge of big-integer algorithms.
3. You need time.

But it is doable; for example, for sizes suitable for cryptography my library is 3.8x faster than GMP on basic bigint multiplication:

* bench code: <https://gist.github.com/mratsim/9bf2801d06e9df86dc3c98efcc825666>
* library: <https://github.com/mratsim/constantine>

The ratio is probably 8x on modular multiplication.

How?

1. In my case I benefit from knowing the sizes at compile time, which removes a lot of the need for if/then/else (a minimal sketch of this idea follows the list below).
2. I have an assembler that allows me to use the bigint-specific instructions MULX, ADCX and ADOX (<https://github.com/mratsim/constantine/blob/928f5155/constantine/math/arithmetic/assembly/limbs_asm_mul_x86_adx_bmi2.nim#L94-L105>).
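To make the first point concrete, here is a minimal, hypothetical sketch (not Constantine’s actual types or API) of why compile-time sizes help: with the bit width as a `static` parameter the limb count is a constant, the loop bound is known to the compiler, and there is no runtime branching on sizes.

```nim
# Hypothetical sketch, not Constantine's actual API: `bits` is a static
# (compile-time) parameter, so the number of limbs is a constant and the
# addition loop can be fully unrolled with no runtime size checks.
type
  FixedInt[bits: static int] = object
    limbs: array[(bits + 63) div 64, uint64]

func add[bits: static int](r: var FixedInt[bits], a, b: FixedInt[bits]): bool =
  ## Limb-by-limb addition; returns the carry out of the most significant limb.
  var carry = 0'u64
  for i in 0 ..< r.limbs.len:                      # bound known at compile time
    let t = a.limbs[i] + carry
    let c1 = if t < carry: 1'u64 else: 0'u64       # overflow from adding the carry
    r.limbs[i] = t + b.limbs[i]
    let c2 = if r.limbs[i] < t: 1'u64 else: 0'u64  # overflow from adding b's limb
    carry = c1 or c2                               # the two overflows are exclusive
  result = carry != 0
```

A generic arbitrary-precision type cannot do this: its limb count is only known at run time, so every operation pays for loop counters, size comparisons and allocation decisions.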
Using pure C or pure Nim instead loses me up to 60% performance (<https://github.com/mratsim/constantine#inline-assembly>), especially when using GCC as a compiler, and despite using intrinsics: <https://gcc.godbolt.org/z/2h768y>

C code for add with carry on 256-bit:

```c
#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}
```

GCC 9.2 ASM:

```asm
add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret
```

Clang 9.0 ASM:

```asm
add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq
```

Now, porting my cryptographic optimizations to a generic arbitrary-precision library is something I started in <https://github.com/SciNim/theo>. However, you then get new time-consuming engineering challenges beyond just the algorithmic ones, for example for add (<https://github.com/SciNim/theo/blob/9172900/theo/op_addsub.nim#L27-L35>):

```nim
func add*(r {.noalias.}: var BigInt, a, b: BigInt) =
  ## BigInt addition
  # TODO:
  # - relax the aliasing constraint
  # - dispatch on fixed size add for add that fits in registers
  #   and build add recursively
  #   to avoid loop counters resetting carry chains.
  # - Support negative inputs
  # - Support compile-time
```

1. It is fairly common to reuse a buffer and do `r.add(a, b)` to limit expensive allocations. But to detect aliasing you need to check the pointers of `r` and `a`.
2. But then you might want your library to work at compile time as well, and there are no pointers at compile time, so suddenly you need two implementations, one for run time and one for compile time (a hedged sketch of this dispatch is at the end of this post).
3. And then you need to support negative integers, and for each implementation you need to handle the two cases: same sign and different sign.
4. And then you need to handle the case where the bigints have different numbers of limbs; to reduce codepaths, if b < a you swap the a and b pointers in your implementation (except at compile time).
5. Then reimplement it all in assembly to ensure the codegen is decent.
6. And then you need to come up with tests to ensure all of this is covered.
7. And then you need to fuzz.
8. Rinse and repeat for all operations.

For multiplication, you then need to implement, in increasing order of overhead and speed-up: product-scanning (Comba) multiplication, Karatsuba multiplication, Toom-Cook multiplication (that one only matters once you have thousands of digits, so it's a stretch goal), and FFT multiplication (a toy schoolbook baseline that these algorithms improve on is sketched at the end of this post).

For division, expect days of debugging; everyone gets it wrong, even Google in Go and JavaScript: <https://skanthak.homepage.t-online.de/division.html>

Now, if someone is interested in implementing a fast bigint library for Nim, I can guide them; a lot of the engineering challenges were already solved in Constantine.
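To make points 1 and 2 of the list above concrete, here is a hedged sketch, with a made-up `BigInt` type and `addNoAlias` helper rather than theo's or Constantine's actual code, of dispatching between a runtime pointer check and a copy-based compile-time fallback in Nim:

```nim
# Illustrative only: a made-up dynamic BigInt with alias-aware addition.
type BigInt = object
  limbs: seq[uint64]   # least significant limb first; sign handling omitted

func addNoAlias(r: var BigInt, a, b: BigInt) =
  ## Core addition, assuming r aliases neither a nor b,
  ## and (for brevity) that a and b have the same number of limbs.
  r.limbs.setLen(a.limbs.len + 1)
  var carry = 0'u64
  for i in 0 ..< a.limbs.len:
    let t = a.limbs[i] + carry
    carry = if t < carry: 1'u64 else: 0'u64
    r.limbs[i] = t + b.limbs[i]
    if r.limbs[i] < t: carry = 1
  r.limbs[a.limbs.len] = carry

func add(r: var BigInt, a, b: BigInt) =
  when nimvm:
    # Compile-time (VM) path: no addresses exist, so always work on copies.
    let ta = a
    let tb = b
    addNoAlias(r, ta, tb)
  else:
    # Runtime path: compare addresses to detect aliasing, copy only if needed.
    if unsafeAddr(a) == addr(r) or unsafeAddr(b) == addr(r):
      let ta = a
      let tb = b
      addNoAlias(r, ta, tb)
    else:
      addNoAlias(r, a, b)
```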
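And as a baseline for that multiplication ladder, here is a toy sketch of the O(n²) schoolbook (operand-scanning) multiplication that Comba, Karatsuba, Toom-Cook and FFT all improve upon; it uses hypothetical 32-bit limbs so the double-wide partial products fit in a `uint64`, which is not how Constantine or GMP lay out their limbs:

```nim
# Toy O(n^2) schoolbook multiplication over 32-bit limbs (least significant
# limb first). Real libraries use 64-bit limbs with 128-bit partial products
# and switch to Comba/Karatsuba/Toom-Cook/FFT as the operands grow.
proc mulSchoolbook(a, b: openArray[uint32]): seq[uint32] =
  result = newSeq[uint32](a.len + b.len)
  for i in 0 ..< a.len:
    var carry = 0'u64
    for j in 0 ..< b.len:
      # 32x32 product + previous limb + carry never overflows a uint64
      let t = uint64(result[i + j]) + uint64(a[i]) * uint64(b[j]) + carry
      result[i + j] = uint32(t and 0xFFFF_FFFF'u64)
      carry = t shr 32
    result[i + b.len] = uint32(carry)

when isMainModule:
  # (2^32 + 1) * (2^32 + 2) = 2^64 + 3*2^32 + 2
  echo mulSchoolbook([1'u32, 1'u32], [2'u32, 1'u32])  # @[2, 3, 1, 0]
```

Comba reorders exactly these partial products by output column to keep the carry chain in registers, while Karatsuba and the asymptotically faster algorithms reduce the number of partial products altogether.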