[email protected] writes:

  Actually, I did not touch the inner loop; I just simplified the outer one,
  removing the unneeded rems[] array and the unnecessary acc variable.

Right.

  The time needed to initialize the computation and the effect of cache misses
  change a lot for different bases, even bases not far from one another. To use
  this strategy, not only do we have to write an efficient inner loop, but we
  also have to think about how to handle "thresholds"...

Always a pain.

  Does ARM have SIMD 64-bit addition with carry? Really? Interesting!

I am not aware of any add-with-carry SIMD insns.

Arm has a means of computing the carry-out for all elements of a vector
register (CMHI, CMHS).  (I have not looked at the newer variable-length
vector stuff (SVE?).)
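A minimal sketch of how the compare trick is typically used, written in portable C so it runs anywhere. Each "vector" is a hypothetical two-lane struct standing in for a 128-bit register; `v_add` models a lane-wise ADD and `v_cmhi` models CMHI's all-ones/zero result. Because a lane that wrapped ends up with sum < addend, comparing addend > sum gives an all-ones mask exactly where a carry occurred, and subtracting that mask (which is -1) counts the carry. The names `v2u64`, `v_add`, `v_sub`, `v_cmhi` and `v_accumulate` are made up for illustration; this is not GMP code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector register holding two
   64-bit unsigned lanes. */
typedef struct { uint64_t lane[2]; } v2u64;

/* Lane-wise wrapping addition (models a vector ADD on 64-bit lanes). */
static v2u64 v_add(v2u64 a, v2u64 b)
{
    v2u64 r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Lane-wise wrapping subtraction. */
static v2u64 v_sub(v2u64 a, v2u64 b)
{
    v2u64 r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

/* Models CMHI (unsigned compare "higher"): all-ones in lanes where
   a > b, zero elsewhere. */
static v2u64 v_cmhi(v2u64 a, v2u64 b)
{
    v2u64 r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = a.lane[i] > b.lane[i] ? UINT64_MAX : 0;
    return r;
}

/* Sums n vectors lane by lane.  *sum receives the low 64 bits of each
   lane's total and *carries the number of times that lane wrapped,
   so each lane's exact total is carries * 2^64 + sum. */
static void v_accumulate(const v2u64 *xs, int n, v2u64 *sum, v2u64 *carries)
{
    v2u64 s = {{0, 0}};
    v2u64 c = {{0, 0}};
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        s = v_add(s, xs[k]);
        /* A lane wrapped iff the new sum is below the addend, so the
           CMHI-style mask is all-ones (i.e. -1) exactly where a carry
           occurred; subtracting it increments that lane's carry count. */
        c = v_sub(c, v_cmhi(xs[k], s));
    }
    *sum = s;
    *carries = c;
}
```

For example, accumulating UINT64_MAX, UINT64_MAX and 2 in lane 0 leaves a low word of 0 and a carry count of 2, i.e. a total of 2 * 2^64. The carries stay inside each lane, which suits independent computations per lane; propagating a carry across lanes would need extra shuffling.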

IIRC, PowerPC has even more powerful instructions, including add with
carry-in from a 3rd input vector register, and separate instructions for
generating carry-out.
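To illustrate that style, where the carry-in comes from a third vector operand and the carry-out is produced by a separate instruction, here is a portable C emulation on 64-bit lanes. It is a sketch under assumptions: as far as I know the real PowerPC instructions work on whole 128-bit quadwords, and `v2u64`, `v_adde`, `v_addec` and `v_add_n` are hypothetical names, not ISA mnemonics or a GMP API. The point is that a multi-limb add then takes two operations per limb, one producing the sum and one producing the carry, both reading the same carry-in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-lane vector of 64-bit limbs. */
typedef struct { uint64_t lane[2]; } v2u64;

/* "Add extended": a + b + (cin & 1), modulo 2^64 in each lane. */
static v2u64 v_adde(v2u64 a, v2u64 b, v2u64 cin)
{
    v2u64 r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = a.lane[i] + b.lane[i] + (cin.lane[i] & 1);
    return r;
}

/* "Add extended, write carry": the carry-out (0 or 1) of the same sum. */
static v2u64 v_addec(v2u64 a, v2u64 b, v2u64 cin)
{
    v2u64 r;
    for (int i = 0; i < 2; i++) {
        uint64_t s = a.lane[i] + b.lane[i];
        uint64_t c1 = s < a.lane[i];              /* wrap from a + b */
        uint64_t s2 = s + (cin.lane[i] & 1);
        uint64_t c2 = s2 < s;                     /* wrap from + cin */
        r.lane[i] = c1 | c2;                      /* at most one wraps */
    }
    return r;
}

/* Adds ap and bp (n vectors each), treating every lane as an independent
   n-limb number, least significant limb first.  Returns the final
   per-lane carries. */
static v2u64 v_add_n(v2u64 *rp, const v2u64 *ap, const v2u64 *bp, int n)
{
    v2u64 c = {{0, 0}};
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        v2u64 sum = v_adde(ap[k], bp[k], c);
        c = v_addec(ap[k], bp[k], c);
        rp[k] = sum;
    }
    return c;
}
```

Letting `rp` alias `ap` or `bp` is safe here, since both results for limb k are computed before `rp[k]` is written.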

There are machines which implement this in the GCC Compile Farm.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
[email protected]
https://gmplib.org/mailman/listinfo/gmp-devel
