> In Nim (and most of programming languages I know), a mod b returns r such > that a = b * q + r (0 <= r < b).
Not completely true. -4 mod 3 # = -1 Run Is your goal to optimize stdlib's gcd? Then you should care about negative inputs as well. Which algorithm do other languages implement in their stdlib? Always start by defining the state-of-the-art in actual libraries. I do not understand your benchmark from your text description only. The settings mean nothing to me. What does each of these flags do? I am lazy to read your whole code but why is std/bitops in the middle of the program? GPU is unnecessary (and probably not the best suited hardware for this seemingly but actually not sequential task) but assembly, simd, and multi-threading? If you want to understand the difference between those two different hardware, you probably have no choice than looking at the assembly instructions generated. Timings without context are trash. What are the types of your inputs ? How does it fare with i16, i32, i64, u16, u32 and u64 output? Changing the values is not enough, you should obtain similar results for 1 and high(int64) if 1 is stored as a int64. Please change your program accordingly.