Richard Henderson <r...@twiddle.net> writes: Yeah, but I use up the entire multiplication portion doing the integer bookkeeping for the round. I want to get that done asap so that the load instructions for the next round are issued as early as possible. And as far as I know, no ARM pipeline does more than dual issue. Isn't A15 3-issue?
> The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c. I see. And any amount of adding under the toom limit is reasonable? Sorry, I don't understand. I've started work on a mul_basecase.asm. I'm calling out to the normal routines for mul_[12] and addmul_2. I've got internal primitives for k=14, 8, and 6, which all share some code. This gives: 31: 1+14+14+2 30: 2+14+14 29: 1+14+14 ... 14: 2+6+6 13: 1+6+6 12: 2+8+2 11: 1+8+1 10: 2+8 9: 1+8 8: 2+6 7: 1+6 6: 2+2+2 5: 1+2+2 ... It will loop on the 14, so it can still be invoked above the toom limit, but I also like the table at the end to handle the last 12 so that we get 6+6 instead of 8+2+2. An addmul_14? Zany. :-) There is a technique which is extremely important in mul_basecase, in particular when basing it on deeply sw pipelined addmuls. I call it overlapping software pipelining, or OSP... Let's start by identifying some building blocks. IL = Inner loop FI = Feed-in code for IL WD = Wind-down code for IL Grapically we could do something like this: |\ | \ | \ |FI \ |____\ | | | | | IL | | | |_____| \ | \WD | \ | \ | \| The IL block will use execution resourses well. FI will start with loads, then after some cycles perform multiplies, then several cycles after that start accumulating things. The WD block will make no loads, but in teh beginning perform a mix of old accumulations and new multiplies, and later just perform accumulations, and finally just stores. For mul_basecase we can overlap the previous iteration WD with the new iteration FI, which will save many, many cycles. |\ | \ | \ |FI \ |____\ | | | | | IL | | | |_____| |\ | | \WD | | \ | |FI \ | |____\| | | | | | IL | | | |_____| |\ | | \WD | | \ | |FI \ | |____\| . . . _____ \ | \WD | \ | \ | \| OK, so perhaps there in one way of beating your code. Let's consider vertical summation for a while! If we form product sums like S_i = u_{k-1} * v_{0} + u_{k-2} * v{1} + ... + u_{0} * v{k-1} akin to a convolution, we will (as long as k is kept < 2^b for a b-bit word size) will form a 3-word sum S_i. The word with (relative) significance 1 is to be written "here", the word with significance 2^b is to be sent as carry-in to the next sum, S_{i+1}, and the word with significance 2^{2b} is to be sent as carry in the sum S_{i+2}. -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel