Eric Richter <eric...@linux.ibm.com> writes:

> I suspect with four li instructions, those are issued 4x in parallel,
> and then the subsequent (slower) lxvw4x instructions are queued 2x. By
> removing the other three li instructions, that li is queued with the
> first lxvw4x, but not the second -- causing a stall as the second lxv
> has to wait for the parallel queue of the li + lxv before, as it
> depends on the li completing first.

I don't know the details of how powerpc instruction issue and
pipelining work, but some dependence on alignment seems likely. So
it's great that you found that; it would seem rather odd to get a
performance regression for this fix.

Since .align 4 means 16-byte alignment, and instructions are 4 bytes,
that's enough to group instructions 4-by-4. Is that what you want, or
is it overkill? I'm also a bit surprised that an align at this point,
outside the loop, makes a significant difference. Maybe it's the
alignment of the code in the loop that matters, which is changed
indirectly by this .align? Maybe it would make more sense to add the
align directive just before the loop entry, and/or before the blocks
of instructions in the loop that should be aligned?

Nettle uses aligned loop entry points in many places for several
architectures, although I'm not sure how much of that makes a
measurable difference in performance, and how much was just done out
of habit.

> Additional note: I did also try rearranging the LOAD macros with the
> shifts, as well as moving around the requisite byte-swap vperms, but
> did not receive any performance benefits. It appears doing the load,
> vperm, shift, addi in that order appears to be the fastest order.

To what degree do the powerpc processors do out-of-order execution?
If you have the time to experiment more, I'd be curious to see what
the results would be, e.g., from either doing all the loads back to
back,

  lxvd2x A
  lxvd2x B
  lxvd2x C
  lxvd2x D
  vperm A
  vperm B
  vperm C
  vperm D
  ...shifts...

or alternatively from scheduling each load a few instructions before
its value is used.

> -define(`TC4', `r11')
> -define(`TC8', `r12')
> -define(`TC12', `r14')
> -define(`TC16', `r15')
> +define(`TC16', `r11')

One nice thing is that you can now eliminate the save and restore of
r14 and r15. Please do that.

>  C State registers
>  define(`VSA', `v0')
> @@ -187,24 +184,24 @@ define(`LOAD', `
>  define(`DOLOADS', `
>  	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
>  	LOAD(0, TC0)
> -	LOAD(1, TC4)
> -	LOAD(2, TC8)
> -	LOAD(3, TC12)
> +	vsldoi IV(1), IV(0), IV(0), 4
> +	vsldoi IV(2), IV(0), IV(0), 8
> +	vsldoi IV(3), IV(0), IV(0), 12
>  	addi INPUT, INPUT, 16
>  	LOAD(4, TC0)

You can eliminate 2 of the 4 addi instructions by using LOAD(4, TC16)
here, and similarly for LOAD(12, TC16).
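Untested, and extrapolating from this hunk (I'm assuming LOAD(i, r)
reads 16 bytes at INPUT + r, and that the two groups not shown follow
the same pattern), I'd expect DOLOADS to then end up roughly like:

  define(`DOLOADS', `
	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
	LOAD(0, TC0)
	vsldoi IV(1), IV(0), IV(0), 4
	vsldoi IV(2), IV(0), IV(0), 8
	vsldoi IV(3), IV(0), IV(0), 12
	LOAD(4, TC16)		C loads at INPUT + 16, no addi needed
	vsldoi IV(5), IV(4), IV(4), 4
	vsldoi IV(6), IV(4), IV(4), 8
	vsldoi IV(7), IV(4), IV(4), 12
	addi INPUT, INPUT, 32	C advance past both 16-byte groups at once
	LOAD(8, TC0)
	vsldoi IV(9), IV(8), IV(8), 4
	vsldoi IV(10), IV(8), IV(8), 8
	vsldoi IV(11), IV(8), IV(8), 12
	LOAD(12, TC16)
	vsldoi IV(13), IV(12), IV(12), 4
	vsldoi IV(14), IV(12), IV(12), 8
	vsldoi IV(15), IV(12), IV(12), 12
	addi INPUT, INPUT, 32
  ')

That way TC16 (holding the constant 16) serves as the index register
for every second load, and INPUT advances by 32 once per pair instead
of by 16 four times.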
> +	.align 4
>
>  C Load state values
>  lxvw4x VSR(VSA), 0, STATE	C VSA contains A,B,C,D

Please add a brief comment on the .align, saying that it appears to
enable more efficient issue of the lxvw4x instructions (or your own
wording explaining why it's needed).

(For the .align directive in general, there's also an ALIGN macro
which takes a non-logarithmic alignment regardless of architecture
and assembler, but it's not used consistently in the nettle assembly
files.)

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
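PS: To make the loop-alignment suggestion concrete, the placement I
have in mind is something like the sketch below, where .Loop and bdnz
stand in for whatever label and branch the file actually uses:

	.align 4	C 16-byte align the loop entry, so each
			C iteration starts a fresh 4-instruction group
  .Loop:
	DOLOADS
	C ... rounds ...
	bdnz	.Loop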