[email protected] writes: Actually, I did not touch the inner loop, I just simplified the outer one, removing the unneeded rems[] array, and the unnecessary acc variable.
Right.

The time needed to initialize the computation, and the effect of cache misses, change a lot between different bases, even ones not far from one another. To use this strategy we not only have to write an efficient inner loop, we also have to think about how to handle "thresholds"...

Always a pain.

Does ARM have SIMD 64-bit addition with carry? Really? Interesting!

I am not aware of any add-with-carry SIMD insns. Arm has means of computing carry-out for all elements of a vector register (CMHI, CMHS). (I have not looked at the newer variable-length vector stuff (SVE?).)

IIRC, PowerPC has even more powerful instructions, even add with carry-in in a 3rd input vector register, and separate instructions for generating carry-out. There are machines which implement this in the gcc compiler farm.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
[email protected]
https://gmplib.org/mailman/listinfo/gmp-devel
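To illustrate the carry-out computation mentioned above: an unsigned compare per element, of the kind CMHI performs, is enough to detect which lanes of a vector addition wrapped. The following is only a minimal sketch in portable scalar C that emulates the lane-wise behaviour. It is not GMP code and uses no SIMD intrinsics; the names `add_lanes_carry_out` and `LANES`, and the choice of two 64-bit lanes, are assumptions made for the example.

```c
#include <stdint.h>

#define LANES 2  /* assumption: two 64-bit lanes, as in a 128-bit vector register */

/* Lane-wise addition with no carry propagation between lanes.
   carry[i] is set to 1 if lane i overflowed, else 0.  The test
   a[i] > sum[i] is the unsigned "higher" comparison that an
   instruction such as CMHI applies to each element (the hardware
   yields an all-ones mask; 0 or 1 is stored here for readability).  */
void
add_lanes_carry_out (uint64_t sum[LANES], uint64_t carry[LANES],
                     const uint64_t a[LANES], const uint64_t b[LANES])
{
  for (int i = 0; i < LANES; i++)
    {
      sum[i] = a[i] + b[i];      /* wraps modulo 2^64 */
      carry[i] = a[i] > sum[i];  /* 1 exactly when the addition wrapped */
    }
}
```

For example, with a = {2^64-1, 1} and b = {1, 2}, lane 0 wraps to 0 with carry 1, while lane 1 gives 3 with carry 0. Propagating those carries into the next-higher lanes is a separate step, which is where a true add-with-carry-in vector instruction would help.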
