Note that if you look at the implementation of `proc rand*(r: var Rand; max: Natural)`, you can make this faster in a loop where `max` does not change. The rejection threshold depends only on `max`, so `randMax mod Ui(max)` is loop-invariant and can be computed once before the loop instead of on every call.
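A minimal sketch of that hoisting, assuming the stdlib loop has the shape "draw `next(r)`, reject above `randMax - (randMax mod Ui(max))`, then reduce with `mod`". `BoundedRand`, `initBoundedRand` and `randBounded` are hypothetical names, not part of `std/random`, and the local `randMax` is assumed to match the private constant there.

```nim
import std/random

# Assumption: same value as the private `randMax` in std/random.
const randMax = 18_446_744_073_709_551_615'u64

type BoundedRand = object
  ## Hypothetical helper holding everything that depends only on `max`.
  modulus: uint64   # max + 1, used for the final range reduction
  limit: uint64     # largest raw value that is accepted

proc initBoundedRand(max: Positive): BoundedRand =
  # The expensive `mod` is done once here, not once per draw.
  BoundedRand(modulus: uint64(max) + 1'u64,
              limit: randMax - (randMax mod uint64(max)))

proc randBounded(r: var Rand; b: BoundedRand): int =
  # Same rejection loop, but the threshold is just a field load.
  while true:
    let x = r.next()
    if x <= b.limit:
      return int(x mod b.modulus)

when isMainModule:
  var r = initRand(123)
  let b = initBoundedRand(9)      # computed once, outside the loop
  var counts: array[10, int]
  for i in 0 ..< 100_000:
    inc counts[r.randBounded(b)]
  echo counts
```

`max` is typed as `Positive` here so the `max == 0` special case never reaches the `mod`.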
That still leaves the final range-reduction `mod` on every call. I am pretty sure that can be turned into a floating-point multiply (probably making the result even faster than the variant where gcc optimizes the `mod` for a compile-time-constant `max`). Of course, dirtying the FP registers/state can make all your context switches slower, but that probably will not matter in numerics-heavy workloads.
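One way the floating-point variant might look, as a rough sketch rather than a drop-in replacement: map the top 53 bits of the raw output to a float in [0, 1) and scale by `max + 1`. `randScaled` is a hypothetical name. Unlike the rejection loop plus `mod`, this scaling is not exactly uniform (the bias is tiny for small `max`), and the sketch assumes `max` fits in 31 bits.

```nim
import std/random

const inv53 = 1.0 / 9007199254740992.0   # 2^-53

proc randScaled(r: var Rand; max: Natural): int =
  ## Hypothetical sketch: range reduction by multiply instead of `mod`.
  assert max <= 0x7FFF_FFFF
  # Keep the top 53 bits so the value converts to float exactly.
  let u = float(int64(r.next() shr 11)) * inv53   # u is in [0, 1)
  # Scale to [0, max + 1) and truncate toward zero.
  int(u * float(max + 1))

when isMainModule:
  var r = initRand(42)
  var counts: array[6, int]
  for i in 0 ..< 60_000:
    inc counts[r.randScaled(5)]
  echo counts
```

Whether this actually beats the integer version would need measuring on the target CPU and compiler.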