Eric Richter <eric...@linux.ibm.com> writes:

> I suspect with four li instructions, those are issued 4x in parallel, and
> then the subsequent (slower) lxvw4x instructions are queued 2x. By removing
> the other three li instructions, that li is queued with the first lxvw4x,
> but not the second -- causing a stall as the second lxv has to wait for the
> parallel queue of the li + lxv before, as it depends on the li completing
> first.

I don't know the details of how powerpc instruction issue and pipelining
work. But some dependence on alignment seems likely. So it's great that
you found that; it would have been rather odd to take a performance
regression for this fix.

Since .align 4 means 16-byte alignment, and instructions are 4 bytes,
that's enough to group instructions four at a time. Is that what you
want, or is it overkill?
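For reference, a sketch of the two obvious choices (assuming GNU as on
powerpc, where the .align operand is a power-of-two exponent, as above):

	.align 2	C 2^2 = 4 bytes: any instruction boundary
	.align 4	C 2^4 = 16 bytes: four consecutive instructions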

I'm also a bit surprised that an align at this point, outside the loop,
makes a significant difference. Maybe it's the alignment of the code in
the loop that matters, and that is changed indirectly by this .align? If
so, it might make more sense to add the align directive just before the
loop entry, and/or before the blocks of instructions in the loop that
should be aligned. Nettle uses aligned loop entry points in many places,
for several architectures, although I'm not sure how much of that makes
a measurable difference in performance, and how much was just done out
of habit.

> Additional note: I did also try rearranging the LOAD macros with the
> shifts, as well as moving around the requisite byte-swap vperms, but did
> not receive any performance benefits. It appears doing the load, vperm,
> shift, addi in that order appears to be the fastest order.

To what degree do the powerpc processors do out-of-order execution? If
you have the time to experiment more, I'd be curious to see what the
results would be, e.g., if doing all the loads back to back,

  lxvd2x A
  lxvd2x B
  lxvd2x C
  lxvd2x D
  vperm A
  vperm B
  vperm C
  vperm D
  ...shifts...

or alternatively, scheduling each load a few instructions before its
value is used.
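A sketch of that second variant, in the same loose notation as above
(the exact interleaving is purely illustrative):

  lxvd2x A
  lxvd2x B
  vperm  A        C swap A while C and D are still in flight
  lxvd2x C
  vperm  B
  lxvd2x D
  vperm  C
  ...shifts on A, B...
  vperm  D
  ...shifts on C, D...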

> -define(`TC4', `r11')
> -define(`TC8', `r12')
> -define(`TC12', `r14')
> -define(`TC16', `r15')
> +define(`TC16', `r11')

One nice thing is that you can now eliminate the save and restore of r14
and r15. Please do that.
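Concretely, something like the following should be able to go from the
prologue and epilogue (a sketch; the actual stack offsets are whatever
the current file uses):

	C prologue: no longer needed once r14/r15 are unused
	std	r14, ...
	std	r15, ...
	...
	C epilogue
	ld	r14, ...
	ld	r15, ...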

>  C State registers
>  define(`VSA', `v0')
> @@ -187,24 +184,24 @@ define(`LOAD', `
>  define(`DOLOADS', `
>       IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
>       LOAD(0, TC0)
> -     LOAD(1, TC4)
> -     LOAD(2, TC8)
> -     LOAD(3, TC12)
> +     vsldoi  IV(1), IV(0), IV(0), 4
> +     vsldoi  IV(2), IV(0), IV(0), 8
> +     vsldoi  IV(3), IV(0), IV(0), 12
>       addi    INPUT, INPUT, 16
>       LOAD(4, TC0)

You can eliminate 2 of the 4 addi instructions by using

        LOAD(4, TC16)

here and similarly for LOAD(12, TC16). 
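I.e., a sketch of the resulting load sequence (assuming TC16 holds the
constant 16, as the old register definitions suggest):

	LOAD(0, TC0)
	C vsldoi expansion of blocks 1-3, as in your patch
	LOAD(4, TC16)			C reads INPUT+16, no addi needed first
	C vsldoi expansion of blocks 5-7
	addi	INPUT, INPUT, 32	C advance past both blocks at once
	LOAD(8, TC0)
	C vsldoi expansion of blocks 9-11
	LOAD(12, TC16)
	C vsldoi expansion of blocks 13-15
	addi	INPUT, INPUT, 32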

> +     .align 4
>  
>       C Load state values
>       lxvw4x  VSR(VSA), 0, STATE      C VSA contains A,B,C,D

Please add a brief comment on the .align, saying that it appears to
enable more efficient issue of the lxvw4x instructions (or your
own wording explaining why it's needed).

(For the .align directive in general, there's also an ALIGN macro, which
takes a non-logarithmic alignment (so ALIGN(16) would correspond to
.align 4 here) regardless of architecture and assembler, but it's not
used consistently in the nettle assembly files.)

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
