t...@gmplib.org (Torbjörn Granlund) writes:

> Ideally, one would compile hgcd2.c in all possible variants (presumably
> through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc.), and then call the
> selected hgcd2's entry point through a function pointer for further
> measuring.
Hmm. So the hgcd2-x-y.c files would also be there for speed, and hgcd2.c
would have something like

  #if TUNE_PROGRAM_BUILD
  int
  mpn_hgcd2 (...)
  {
    return (*hgcd2_method_pointer) (...);
  }
  #endif

> I'm pretty confused about the algorithms used for gcdext. Is it at all
> helped by our hgcd2 improvements?

If you look at mpn_gcdext_lehmer_n, that's a loop around hgcd2, applying
each matrix to the numbers being reduced using
mpn_matrix22_mul1_inverse_vector, and to the cofactors using
mpn_hgcd_mul_matrix1_vector. At the end, it calls mpn_gcdext_1; an
mpn_gcdext_22 is missing.

> What is suboptimal about gcdext for large operands?

Subquadratic gcd is a loop calling hgcd on the high part of the numbers
(iirc, of size n/3), then applying the matrix also to the low part, and
repeating until the numbers are in Lehmer range. The current subquadratic
gcdext does the same, but also builds up the cofactors along the way. So the
numbers get smaller while the cofactors grow.

For one thing, that makes the optimal splitting for the hgcd call
complicated: one should take into account not only the size of the numbers,
but also the size of the cofactors. That's done in an ad-hoc way, with a
different splitting for the very first iteration and a fixed size ratio
after that. More fundamentally, the multiplications used to build up the
cofactors get more and more unbalanced with each iteration. That doesn't
seem healthy for a divide-and-conquer algorithm.

I suspect a better way is to do it like this: Given A, B of size n, split
off the high part, say of size n/2 or so. Call hgcd, so we get reduced
numbers a, b with (A;B) = M (a;b), where M is of size n/4 and a, b are of
size 3n/4. Recursively compute gcdext(a,b), to get

  g = s a + t b

Then construct one or both cofactors for

  g = S A + T B

based on s, t and the matrix M. Then we reduce the numbers in size as we
recurse down, and build up the cofactors as we return back, without getting
increasingly unbalanced multiplications. But I haven't tried it out.

> > * Try out left-to-right binary hgcd2.
> I had not realised you meant left-to-right. I thought you talked about
> right-to-left. Interesting!

There are a couple of different cases, I'm afraid. Say we have a of a_size
bits and b of b_size bits.

If a_size == b_size, set

  {a, b} <-- {|a-b|, min(a,b)}

Like a binary step, but we don't divide out powers of two, and we eliminate
at least one bit at the high end.

When a_size > b_size, then to avoid underflow I'm afraid we need

  a <-- a - 2^{a_size - b_size - 1} b

with progress of 0.5 bits or so. Or we could go half way towards computing a
quotient by using

  a <-- a - 2^{a_size - b_size} b

in case that is non-negative, which would give us better progress. And
similarly for a_size < b_size. So there's a fair amount of logic that needs
to be branch-free. Updates of u0 and u1 are cheap, but they make any needed
conditional swap more costly.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel