ni...@lysator.liu.se (Niels Möller) writes:

  Here's a div2 function that does that. Together with the deleted div1,
  it gives a 25% (!) speed up of gcd in the range 3-10 limbs, benchmarked
  on my broadwell machine. And should do well on all machines with decent
  small-quotient division.

Cool!

And broadwell (while not on my list) presumably have the same slowish
division as haswell and skylake, as it came between those chips.

Amd and ARM processors should get an even greater boost.

I am curious how these GCD implementations compare:

1. The gcd_33 attached for its only use case n = 3.  It runs on Haswell
   and newer and on Ryzen (but also on bd4 aka Excavator)

2. The present GNP code with your improved div1 and div2.

3. Marco's mpn_gcd_basecase.

It would be premature to decide if there is room for a mpn_gcd_basecase,
but it would be interesting to know in what ballpark performance is.

Only mpn_gcd_33 is close to optimal, perhaps a few percent could be
shaven off by swapping some instructions around.  A table based
mpn_gcd_33 could be perhaps 50% faster, but that remains to be tested.

Marco's mpn_gcd_basecase code could also probably be improved, but I
have no estimate here.  If nothing else, we could certainly use larger
tables therein.

If a fixed-size entity like mpn_gcd_33 outperforms anything else we can
come with, I see no reason not to keep going to mpn_gcd_44 etc.  Before
you ask, we should not provide mpn_mul_33 or mpn_mul_44; gcd is a
special case as it runs for such a long time, so saving on bookkeeping
has a much bigger impact than for other elementary operations.

Attachment: gcd_33.asm
Description: Binary data


-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to