From: Torbjorn Granlund <t...@gmplib.org> Date: Wed, 27 Mar 2013 22:07:47 +0100
> David Miller <da...@davemloft.net> writes: > > As an aside I think we can get it down to 2.5 cycles per limb on > T4 with 4-way unrolling, and 3.0 cycles per limb with 2-way > unrolling. > > The idea is to decrease the bookkeeping instructions by only > maintaining base pointers which do not change, and then we have an > offset which operates as the loop index. > > So we'd instead have an 'n_off' instead of 'n', and then in some local > registers we'd hold: > > l3: up - 8 > l4: up - 16 > l5: rp - 8 > l6: rp - 16 > > A clever trick! But you will probably get 2.75 c/l for 4-way, not 2.5 > c/l. We'll need infinite unrolling for 2.5... Hmmm, not sure I understand. But anyways, it turns out we don't need to use the fused compare-and-branch, here is what the 2-way loop looks like: 1: sllx u0, cnt, %l2 or %l4, %l3, r0 ldx [n + u_off0], u0 srlx u1, tcnt, %l5 stx r0, [n + r_off0] sllx u1, cnt, %l3 or %l2, %l5, r1 ldx [n + u_off1], u1 subcc n, (2 * 8), n srlx u0, tcnt, %l4 bge,pt %xcc, 1b stx r1, [n + r_off0] Which is 6 cycles per loop, 3 cycles per limb. On UltraSparc-1 et al. we could shuffle things around and get it to schedule like this: 1: sllx u0, cnt, %l2 or %l4, %l3, r0 ldx [n + u_off0], u0 srlx u1, tcnt, %l5 stx r0, [n + r_off0] subcc n, (2 * 8), n sllx u1, cnt, %l3 or %l2, %l5, r1 ldx [n + u_off1], u1 srlx u0, tcnt, %l4 bge,pt %xcc, 1b stx r1, [n + r_off0] The usual rules about these older chips applies, one shift per group, when there are multiple integer instructions in a group the shift must appear first. Loads have a 2 cycle latency, and the above loop places the read dependency 3 cycles after the load. Unless I've made a mistake that's 4 cycles per loop, 2 cycles per limb. That's what the current 4-way unrolling code obtains. In fact I think that a 4-way unrolling using the above bookkeeping simplification wouldn't execute any faster, because we'll have a troublesome odd instruction at the end of the loop. Note that the {u,r}_off{0,1} values need to be adjusted depending upon where exactly the "subcc n, (2 * 8), n" instruction is placed. _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel