Good evening, As a hobby project I've been building an out-of-order superscalar CPU for use in FPGAs. Everything has been going pretty smoothly, but I'd like to be able to support gmplib efficiently on it. To keep the issue stage small+fast, instructions on this machine can only accept two register arguments at once. No carry flags. I was wondering if there is a clever way to implement bigint addition with less instructions/word, by adding a custom instruction or two.
So I can already do the generic gmp version: do { ul = *up++; vl = *vp++; sl = ul + vl; cy1 = sl < ul; rl = sl + cy; cy2 = rl < sl; cy = cy1 | cy2; *rp++ = rl; } while (--n != 0); In my assembly language this would best be written as: loop: iload ul, up, i iload vl, vp, i add sl, ul, vl addhi cy1, ul, vl ; cuts dependency on sl => gives carry immediately add rl, sl, cy addhi cy2, sl, cy or cy, cy1, cy2 istore rl, rp, i add i, i, 1 bne i, n, loop ... while that looks awful, it's actually not so bad. Between the loads and store are only two instructions (2* add), the minimum possible. Between loop iterations, the critical path path is cy, which goes through an addhi and an or. I can't yet reorder loads before stores, but in the future once I can, this means each loop iteration has latency of 2 cycles (assuming the reorder buffer is large enough). Since the loop has 10 instructions and I can only issue 4 at once, I am actually bottlenecked by the poor code density and L1-cache interface. Hmm. Now I wonder why I am writing this email and perhaps there is nothing that needs to be done! For comparison, if I had addc: loop: iload ul, up, i iload vl, vp, i addc sl, ul, vl ; can not implement this! (3-inputs: CF, ul, vl) istore sl, rp, i add i, i, 1 bne i, n, loop ... that is only 6 instructions and the latency is just 1 cycle. However, I'm still going to be bottlenecked by the L1 and issue-width. So, now that I've more carefully analyzed the loop, I am actually not as concerned as I was initially! However, I think the question still stands: aside from an "addhi" instruction, are there any other custom instructions I should be adding for bigint arithmetic? I also already have "umulhi" which should help with multiplication. I am guessing that when optimizing gmp, there were times when you guys were irritated by the lack of some specific instruction that would have made things a lot quicker. This is the sort of information I'd love to have. Thanks for your time! _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel