> As for opinion pieces on bigint libraries in general, I also wrote a little note on the rather sad state of GMP vs “anything else” at <https://github.com/fsh/integers#why-gmp>
> ## Why GMP?
>
> Unfortunately there is just no beating GMP when it comes to performance. It is the unequivocal gold standard.
>
> All “home-made” or “from scratch” integer implementations by hobbyists, such as Nim’s official `bigints` (and even Python’s built-in `int`) are _laughably slow_ compared to GMP, often by orders of magnitude.
>
> Alternative projects such as `libtomath` come much closer, but are still not quite as optimized as GMP.
>
> The reason projects shy away from GMP is two-fold:
>
> * The “serious” reason is that GMP is dual-licensed with LGPLv3 and GPLv2, and many consider even the LGPL too restrictive or convoluted (e.g. static linking can be problematic), so this makes it simply a no-go for many projects.
> * The less practical reason is that implementing one’s own arbitrary precision integers from scratch is (at the onset) a very attractive and exciting project for many programmers, so it is a common mind trap to fall into.
>
> The first reason is unavoidable and unfortunate. The second reason is very _understandable_ (I feel the same pull), although it is incredibly frustrating when I’m stuck with these cowboy implementations as a user.
>
> Since I am of the opinion that (big) integers are a _core_ data type that should be universally available, thus also implicitly demanding it should be _as efficient as humanly possible_, I’m always going to bite the bullet on the (L)GPL license.

Being as fast as GMP is a three-fold issue:

1. You need knowledge of your target architecture.
2. You need knowledge of big-integer algorithms.
3. You need time.

But it is doable; for example, for sizes suitable for cryptography my library is 3.8x faster than GMP on basic bigint multiplication:

* bench code: <https://gist.github.com/mratsim/9bf2801d06e9df86dc3c98efcc825666>
* library: <https://github.com/mratsim/constantine>

The ratio is probably 8x on modular multiplication.

How?

1. In my case I benefit from knowing the sizes at compile time, which removes a lot of the need for if/then/else (a minimal sketch of this idea follows the list below).
2. I have an assembler that allows me to use the bigint-specific instructions MULX, ADCX and ADOX (<https://github.com/mratsim/constantine/blob/928f5155/constantine/math/arithmetic/assembly/limbs_asm_mul_x86_adx_bmi2.nim#L94-L105>).
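To make the first point concrete, here is a minimal, hypothetical sketch (not Constantine’s actual types or API) of why compile-time sizes help: with the bit width as a `static` parameter the limb count is a constant, the loop bound is known to the compiler, and there is no runtime branching on sizes.

```nim
# Hypothetical sketch, not Constantine's actual API: `bits` is a static
# (compile-time) parameter, so the number of limbs is a constant and the
# addition loop can be fully unrolled with no runtime size checks.
type
  FixedInt[bits: static int] = object
    limbs: array[(bits + 63) div 64, uint64]

func add[bits: static int](r: var FixedInt[bits], a, b: FixedInt[bits]): bool =
  ## Limb-by-limb addition; returns the carry out of the most significant limb.
  var carry = 0'u64
  for i in 0 ..< r.limbs.len:                      # bound known at compile time
    let t = a.limbs[i] + carry
    let c1 = if t < carry: 1'u64 else: 0'u64       # overflow from adding the carry
    r.limbs[i] = t + b.limbs[i]
    let c2 = if r.limbs[i] < t: 1'u64 else: 0'u64  # overflow from adding b's limb
    carry = c1 or c2                               # the two overflows are exclusive
  result = carry != 0
```

A generic arbitrary-precision type cannot do this: its limb count is only known at run time, so every operation pays for loop counters, size comparisons and allocation decisions.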
Using pure C or pure Nim instead loses me up to 60% performance (<https://github.com/mratsim/constantine#inline-assembly>), especially when using GCC as a compiler, and despite using intrinsics: <https://gcc.godbolt.org/z/2h768y>

C code for add with carry on 256-bit:

```c
#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}
```

GCC 9.2 ASM:

```asm
add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret
```

Clang 9.0 ASM:

```asm
add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq
```

Now, porting my cryptographic optimizations to a generic arbitrary-precision library is something I started in <https://github.com/SciNim/theo>. However, you then get new time-consuming engineering challenges beyond just the algorithmic ones, for example for add (<https://github.com/SciNim/theo/blob/9172900/theo/op_addsub.nim#L27-L35>):

```nim
func add*(r {.noalias.}: var BigInt, a, b: BigInt) =
  ## BigInt addition
  # TODO:
  # - relax the aliasing constraint
  # - dispatch on fixed size add for add that fits in registers
  #   and build add recursively
  #   to avoid loop counters resetting carry chains.
  # - Support negative inputs
  # - Support compile-time
```

1. It is fairly common to reuse a buffer and do `r.add(a, b)` to limit expensive allocations. But to detect aliasing you need to check the pointers of `r` and `a`.
2. But then you might want your library to work at compile time as well, and there are no pointers at compile time, so suddenly you need two implementations, one for run time and one for compile time (a hedged sketch of this dispatch is at the end of this post).
3. And then you need to support negative integers, and for each implementation you need to handle the two cases: same sign and different sign.
4. And then you need to handle the case where the bigints have different numbers of limbs; to reduce codepaths, if b < a you swap the a and b pointers in your implementation (except at compile time).
5. Then reimplement it all in assembly to ensure the codegen is decent.
6. And then you need to come up with tests to ensure all of this is covered.
7. And then you need to fuzz.
8. Rinse and repeat for all operations.

For multiplication, you then need to implement, in increasing order of overhead and speed-up: product-scanning (Comba) multiplication, Karatsuba multiplication, Toom-Cook multiplication (that one only matters once you have thousands of digits, so it's a stretch goal), and FFT multiplication (a toy schoolbook baseline that these algorithms improve on is sketched at the end of this post).

For division, expect days of debugging; everyone gets it wrong, even Google in Go and JavaScript: <https://skanthak.homepage.t-online.de/division.html>

Now, if someone is interested in implementing a fast bigint library for Nim, I can guide them; a lot of the engineering challenges were already solved in Constantine.
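To make points 1 and 2 of the list above concrete, here is a hedged sketch, with a made-up `BigInt` type and `addNoAlias` helper rather than theo's or Constantine's actual code, of dispatching between a runtime pointer check and a copy-based compile-time fallback in Nim:

```nim
# Illustrative only: a made-up dynamic BigInt with alias-aware addition.
type BigInt = object
  limbs: seq[uint64]   # least significant limb first; sign handling omitted

func addNoAlias(r: var BigInt, a, b: BigInt) =
  ## Core addition, assuming r aliases neither a nor b,
  ## and (for brevity) that a and b have the same number of limbs.
  r.limbs.setLen(a.limbs.len + 1)
  var carry = 0'u64
  for i in 0 ..< a.limbs.len:
    let t = a.limbs[i] + carry
    carry = if t < carry: 1'u64 else: 0'u64
    r.limbs[i] = t + b.limbs[i]
    if r.limbs[i] < t: carry = 1
  r.limbs[a.limbs.len] = carry

func add(r: var BigInt, a, b: BigInt) =
  when nimvm:
    # Compile-time (VM) path: no addresses exist, so always work on copies.
    let ta = a
    let tb = b
    addNoAlias(r, ta, tb)
  else:
    # Runtime path: compare addresses to detect aliasing, copy only if needed.
    if unsafeAddr(a) == addr(r) or unsafeAddr(b) == addr(r):
      let ta = a
      let tb = b
      addNoAlias(r, ta, tb)
    else:
      addNoAlias(r, a, b)
```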
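And as a baseline for that multiplication ladder, here is a toy sketch of the O(n²) schoolbook (operand-scanning) multiplication that Comba, Karatsuba, Toom-Cook and FFT all improve upon; it uses hypothetical 32-bit limbs so the double-wide partial products fit in a `uint64`, which is not how Constantine or GMP lay out their limbs:

```nim
# Toy O(n^2) schoolbook multiplication over 32-bit limbs (least significant
# limb first). Real libraries use 64-bit limbs with 128-bit partial products
# and switch to Comba/Karatsuba/Toom-Cook/FFT as the operands grow.
proc mulSchoolbook(a, b: openArray[uint32]): seq[uint32] =
  result = newSeq[uint32](a.len + b.len)
  for i in 0 ..< a.len:
    var carry = 0'u64
    for j in 0 ..< b.len:
      # 32x32 product + previous limb + carry never overflows a uint64
      let t = uint64(result[i + j]) + uint64(a[i]) * uint64(b[j]) + carry
      result[i + j] = uint32(t and 0xFFFF_FFFF'u64)
      carry = t shr 32
    result[i + b.len] = uint32(carry)

when isMainModule:
  # (2^32 + 1) * (2^32 + 2) = 2^64 + 3*2^32 + 2
  echo mulSchoolbook([1'u32, 1'u32], [2'u32, 1'u32])  # @[2, 3, 1, 0]
```

Comba reorders exactly these partial products by output column to keep the carry chain in registers, while Karatsuba and the asymptotically faster algorithms reduce the number of partial products altogether.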