bodr...@mail.dm.unipi.it writes: To compare them I wrote a quick and dirty specialization of the rootrem algorithm for the k==2 case (use sqr instead of mul, lshift instead of mul_1...)
Cool! We should probably use sqr whenever possible, perhaps it is worth having a condition for this in rootrem.c, (I believe mpn_pow_1 gets it right, and uses sqr whenever possible, since I looked into that as an explanation.) $ tune/speed -p 100000000 -s 1-1100000 -f16 mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem overhead 0.000000002 secs, precision 100000000 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem 1 0.000000518 0.000000011 0.000000516 #0.000000009 16 0.000001103 0.000000244 0.000001117 #0.000000244 256 0.000009316 #0.000008321 0.000014106 0.000008374 4096 #0.000633967 0.000659180 0.000894513 0.000654562 65536 #0.026071234 0.026756061 0.033005667 0.026762939 1048576 0.732938835 0.734521311 0.962774086 #0.731488979 With the attached specialised function, rootrem.2 gains a 10% in speed and is faster than sqrt when the reminder is not needed and size grows. $ tune/speed -p 100000000 -s 1-1100000 -f16 mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem overhead 0.000000002 secs, precision 100000000 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_root.2 mpn_sqrt mpn_rootrem.2 mpn_sqrtrem 1 0.000000376 #0.000000010 0.000000383 0.000000010 16 0.000000884 #0.000000241 0.000000894 0.000000243 256 #0.000007743 0.000008262 0.000012138 0.000008282 4096 #0.000563298 0.000657382 0.000816271 0.000650694 65536 #0.023296001 0.026694774 0.030225570 0.026717529 1048576 #0.654117324 0.736489574 0.832002002 0.733319729 Your timing leaps hide some of the actual timing characteristics... Using tune/speed -r -p10000000 -s16-10000000 -f1.438 mpn_sqrt mpn_root.2 I get about 0.97 on average. And tune/speed -r -p10000000 -s16-10000000 -f1.438 mpn_sqrtrem mpn_rootrem.2 gives around 1.25. (In both cases without your rootrem hack.) I added some TRACE (printf...) in the code to show the primitives with a super-linear cost used in computation. The last functions called by mpn_sqrtrem when computing the root of a 2000 limb operand are: mpn_divrem nn=500/dn=250 -> qn=250,rn=250 mpn_sqr an=250 ^ 2 -> rn=500 mpn_divrem nn=1000/dn=500 -> qn=500,rn=500 mpn_sqr an=500 ^ 2 -> rn=1000 Similarly, for mpn_rootrem: mpn_div_q nn=500/dn=251 -> qn=249 mpn_sqr an=501 ^ 2 -> rn=1002 mpn_div_q nn=1000/dn=501 -> qn=499 mpn_sqr an=1000 ^ 2 -> rn=2000 Interesting indeed. I wonder why they need different multiply sizes. It uses div_q instead of divrem, but squares used by rootrem are doubled in size... The operation are similar if compared to sqrt (the order changes), probably div_q is faster than divrem. mpn_sqrt wins only when the huge speed-up given by the specialised code for the first limb is relevant... I am not sure that div_q vs divrem matters a whole lot for this usage (i.e., 2n/n sizes). -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel