> From 4x unrolling to 8x unrolling with pipelining we gain ~40%.

Wow, that's much more than I would have expected (or actually seen at my end).
Which brings up another question: which zarch pipelines does it make sense to optimise for? I thought I had z196 access, but that is probably not true, as I discovered that the machine supports the vector insns. It must be a z13 or later, I think.

I don't mind adding code for older CPUs at all. I want to avoid slowing down GMP for any CPU, old or new, so adding code which affects a CPU should not be done without benchmarking on that CPU.

However, on that machine, a 4x variant of my code runs addmul_1 at about 3.4 cycles/limb and an 8x variant runs at about 3.3 cycles/limb. That difference is far smaller than what you see, either because we are on different CPUs or because your 8x variant is far better than mine.

I tested an mlgr-only loop without any dependencies. It indicates an mlgr throughput of 0.5 insn/cycle. Since every limb needs one mlgr, we cannot hope for any mul_1-type function to run at less than 2 cycles/limb. If your cycles/limb numbers are not better than mine, there is still considerable room for improvement; 3.3 is still far from 2.0!

I haven't thought carefully about it, but an addmul_2 with vector accumulation might help. An addmul_2 loop uses half the number of loads and stores, and fewer vpdi, compared to addmul_1. And perhaps one could accumulate cleverly to avoid some of the carries. Or not.

By the way, have you tried using vlerg/vsterg to avoid vpdi? (I am having problems understanding which instructions are implemented in a particular CPU. The "Principles of Operation" manual seems to skip that aspect. I tried vlerg, but it was not even recognised by the assembler.)

> Be careful about the lead-in times too. With deep unrolling, one needs
> a table indexed by n % W, where W is the unrolling arity.
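For readers following along, all the cycles/limb figures above are measured against the fixed contract of mpn_addmul_1: accumulate up[] times a single limb v into rp[] and return the out-going carry limb. A minimal C model of that contract (function name mine; GMP's real versions are assembly, and each mlgr computes one of these 64x64->128 products) is:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference model of mpn_addmul_1: rp[0..n-1] += up[0..n-1] * v,
   returning the final carry limb.  Any unrolled or pipelined
   assembly variant must match this bit for bit. */
mp_limb_t ref_addmul_1(mp_limb_t *rp, const mp_limb_t *up, size_t n,
                       mp_limb_t v)
{
    mp_limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* one 64x64->128 multiply per limb (mlgr on zarch) */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + cy;
        rp[i] = (mp_limb_t)t;        /* low half stays in rp */
        cy = (mp_limb_t)(t >> 64);   /* high half becomes the carry */
    }
    return cy;
}
```

Since the loop cannot avoid one mlgr per limb, an mlgr throughput of 0.5 insn/cycle directly gives the 2 cycles/limb floor mentioned above.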
> Since I found code complexity (and debugging effort, to be honest)
> rises sharply when supporting flexible lead-ins, I am currently
> wrapping up a hybrid variant for addmul_1 and mul_1: use the much
> simpler 2x unrolled variant from my first patch when N is less than
> the pipelined lead-in + one iteration + lead-out; when N is above the
> threshold, use that loop for the n % W and then use the pipelined
> variant. As a plus, we need fewer registers below the threshold. That
> variant achieves ~2x speedup for N < 18 and gets to within 20% of
> peak multiplication throughput above that.

Interesting!

A mistake can be made here, and that is optimising addmul_k for too large limb counts. We've made that mistake a lot with GMP in the past. So why is that a mistake? Because the use of divide-and-conquer algorithms makes large trip counts exceedingly rare. I tested on the z13(?) and found that Karatsuba's algorithm is used already from 8 limbs for multiplication and 11 limbs for squaring. That's with the 4x vector accumulation addmul_1. I expect mul_basecase/sqr_basecase would increase these numbers a bit, as they cut O(n) of the execution time. I would still be surprised if 8x addmul_1 unrolling helps in practice, but I would love to be proven wrong.

For completeness of my reasoning, there is one rare case where addmul trip counts will be large: a very unbalanced multiplication where one operand is >> the other.

To make mul_basecase as low-overhead as possible, one needs to do the un mod k (k being the unrolling factor) before the k outer loops. Yes, outer loops, plural. The lead-in of the k inner loops can then be condition-less straight-line code. Yes, that means major code expansion, in particular for large k, but mul_basecase and sqr_basecase are super important.

The structure of sqr_basecase will become different, as the inner loop trip counts decrease by one with each outer loop iteration. Still, one should do no n mod k computation except in the function header code.
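The hybrid dispatch described in the quoted text above can be sketched in C (the real code is zarch assembly; W, THRESHOLD, and the function name here are illustrative, with THRESHOLD matching the N < 18 figure). The key point is that the simple loop does double duty: it handles small N entirely, and for large N it absorbs the N mod W stray limbs, so the deeply unrolled loop always sees an exact multiple of W and needs no lead-in table:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;

#define W 8           /* unrolling arity of the pipelined loop (assumed) */
#define THRESHOLD 18  /* switch-over point, per the ~2x speedup for N < 18 */

/* Hypothetical C sketch of the hybrid addmul_1 structure. */
mp_limb_t addmul_1_hybrid(mp_limb_t *rp, const mp_limb_t *up, size_t n,
                          mp_limb_t v)
{
    mp_limb_t cy = 0;
    size_t head = (n < THRESHOLD) ? n : n % W;
    size_t i = 0;

    /* Simple loop: the whole job when n < THRESHOLD, else just the
       n % W head limbs.  Stands in for the 2x unrolled variant. */
    for (; i < head; i++) {
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + cy;
        rp[i] = (mp_limb_t)t;
        cy = (mp_limb_t)(t >> 64);
    }

    /* W-way loop: the remaining trip count (n - head) is an exact
       multiple of W, so no flexible lead-in is needed.  Stands in for
       the pipelined variant. */
    for (; i < n; i += W)
        for (size_t j = 0; j < W; j++) {
            unsigned __int128 t =
                (unsigned __int128)up[i + j] * v + rp[i + j] + cy;
            rp[i + j] = (mp_limb_t)t;
            cy = (mp_limb_t)(t >> 64);
        }
    return cy;
}
```

Note how this also illustrates the register-pressure point: below the threshold, only the simple loop's state is live.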
Instead, one needs to branch into k different places inside the one outer loop. Tricky, yes. Neat, not at all.

I am all for fat support for zarch.

-- 
Torbjörn
Please encrypt, key id 0xC8601622

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel