> In Nim (and most of programming languages I know), a mod b returns r such 
> that a = b * q + r (0 <= r < b).

Not completely true.
    
    
    -4 mod 3 # = -1
    
    
    Run

Is your goal to optimize stdlib's gcd? Then you should care about negative 
inputs as well.

Which algorithm do other languages implement in their stdlib? Always start by 
defining the state-of-the-art in actual libraries.

I do not understand your benchmark from your text description only. The 
settings mean nothing to me. What does each of these flags do?

I am lazy to read your whole code but why is std/bitops in the middle of the 
program?

GPU is unnecessary (and probably not the best suited hardware for this 
seemingly but actually not sequential task) but assembly, simd, and 
multi-threading?

If you want to understand the difference between those two different hardware, 
you probably have no choice than looking at the assembly instructions generated.

Timings without context are trash. What are the types of your inputs ? How does 
it fare with i16, i32, i64, u16, u32 and u64 output? Changing the values is not 
enough, you should obtain similar results for 1 and high(int64) if 1 is stored 
as a int64. Please change your program accordingly.

Reply via email to