[PATCH v2] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
From: Eric Biggers 

Optimize ChaCha20 NEON performance by:

- Implementing the 8-bit rotations using the 'vtbl.8' instruction.
- Streamlining the part that adds the original state and XORs the data.
- Making some other small tweaks.

On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
about 12.08 cycles per byte to about 11.37 -- a 5.9% improvement.

There is a tradeoff involved with the 'vtbl.8' rotation method since
there is at least one CPU (Cortex-A53) where it's not fastest.  But it
seems to be a better default; see the added comment.  Overall, this
patch reduces Cortex-A53 performance by less than 0.5%.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 277 ++-
 1 file changed, 143 insertions(+), 134 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
index 451a849ad5186..50e7b98968189 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -18,6 +18,34 @@
  * (at your option) any later version.
  */
 
+ /*
+  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
+  *
+  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
+  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
+  * (c)  vrev32.16 (16-bit rotations only)
+  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
+  * needs index vector)
+  *
+  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
+  * rotations, the only choices are (a) and (b).  We use (a) since it takes
+  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
+  *
+  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
+  * and doesn't need a temporary register.
+  *
+  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
+  * is twice as fast as (a), even when doing (a) on multiple registers
+  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
+  * parallelizes better when temporary registers are scarce.
+  *
+  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
+  * (a), so the need to load the rotation table actually makes the vtbl method
+  * slightly slower overall on that CPU (~1.3% slower ChaCha20).  Still, it
+  * seems to be a good compromise to get a more significant speed boost on some
+  * CPUs, e.g. ~4.8% faster ChaCha20 on Cortex-A7.
+  */
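
To make the comparison above concrete, here is a minimal C sketch of methods (a) and (d) using GCC/Clang NEON intrinsics. It is not part of the patch, the helper names are invented, and it assumes a little-endian ARMv7 build compiled with -mfpu=neon:

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

/* Method (a): shift left by 8, then insert the bits shifted right by 24. */
static uint32x4_t rotl8_shl_sri(uint32x4_t x)
{
	return vsriq_n_u32(vshlq_n_u32(x, 8), x, 24);
}

/* Method (d): permute bytes with vtbl.8, using the same index vector as
 * .Lrol8_table: within each 32-bit word, output byte i is input byte
 * (i + 3) % 4, which is a left-rotate by 8 on a little-endian word. */
static uint32x4_t rotl8_vtbl(uint32x4_t x)
{
	static const uint8_t tbl[8] = { 3, 0, 1, 2, 7, 4, 5, 6 };
	uint8x8_t idx = vld1_u8(tbl);
	uint8x16_t b = vreinterpretq_u8_u32(x);
	uint8x8_t lo = vtbl1_u8(vget_low_u8(b), idx);
	uint8x8_t hi = vtbl1_u8(vget_high_u8(b), idx);

	return vreinterpretq_u32_u8(vcombine_u8(lo, hi));
}

int main(void)
{
	uint32_t in[4] = { 0x01234567, 0x89abcdef, 0xdeadbeef, 0x00000080 };
	uint32_t a[4], d[4];
	uint32x4_t v = vld1q_u32(in);

	vst1q_u32(a, rotl8_shl_sri(v));
	vst1q_u32(d, rotl8_vtbl(v));
	for (int i = 0; i < 4; i++)	/* all three columns should match */
		printf("%08x %08x %08x\n", (in[i] << 8) | (in[i] >> 24), a[i], d[i]);
	return 0;
}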
+
 #include 
 
.text
@@ -46,7 +74,9 @@ ENTRY(chacha20_block_xor_neon)
	vmov	q10, q2
	vmov	q11, q3
 
+   adr ip, .Lrol8_table
mov r3, #10
+   vld1.8  {d10}, [ip, :64]
 
 .Ldoubleround:
// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
@@ -62,9 +92,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
	vadd.i32	q0, q0, q1
-   veor	q4, q3, q0
-   vshl.u32	q3, q4, #8
-   vsri.u32	q3, q4, #24
+   veor	q3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
	vadd.i32	q2, q2, q3
@@ -92,9 +122,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
	vadd.i32	q0, q0, q1
-   veor	q4, q3, q0
-   vshl.u32	q3, q4, #8
-   vsri.u32	q3, q4, #24
+   veor	q3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
	vadd.i32	q2, q2, q3
@@ -139,13 +169,17 @@ ENTRY(chacha20_block_xor_neon)
bx  lr
 ENDPROC(chacha20_block_xor_neon)
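
For reference when reading the "x0 += x1, x3 = rotl32(x3 ^ x0, 16)" comments above: this is the standard ChaCha quarter round (RFC 7539) written in plain C. The NEON code applies the same sequence to whole rows of the state at once. A reference sketch, not part of the patch:

#include <stdint.h>

static inline uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round: 16-, 12-, 8-, and 7-bit rotations. */
static void chacha_quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}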
 
+   .align  4
+.Lctrinc:  .word   0, 1, 2, 3
+.Lrol8_table:  .byte   3, 0, 1, 2, 7, 4, 5, 6
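
The rol8 table is explained by the comment at the top of the file. The {0, 1, 2, 3} vector is, presumably, the per-block counter increment for the four parallel blocks (assuming the RFC 7539 layout with the block counter in state word 12), so each state copy gets its own counter with one vector add. A scalar sketch of that idea, with invented names and not part of the patch:

#include <stdint.h>

/* Scalar equivalent of adding a {0,1,2,3} increment vector to the
 * counter lane: block i of the four parallel blocks uses base + i. */
static void set_block_counters(uint32_t counters[4], uint32_t base_counter)
{
	static const uint32_t ctrinc[4] = { 0, 1, 2, 3 };

	for (int i = 0; i < 4; i++)
		counters[i] = base_counter + ctrinc[i];
}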
+
.align  5
 ENTRY(chacha20_4block_xor_neon)
-   push	{r4-r6, lr}
-   mov ip, sp  // preserve the stack pointer
-   sub r3, sp, #0x20   // allocate a 32 byte buffer
-   bic r3, r3, #0x1f   // aligned to 32 bytes
-   mov sp, r3
+   push	{r4-r5}
+   mov r4, sp  // preserve the stack pointer
+   sub ip, sp, #0x20   // allocate a 32 byte buffer
+   bic ip, ip, #0x1f   // aligned to 32 bytes
+   mov sp, ip
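
A hedged C sketch of what the new prologue does (invented names, not part of the patch): reserve 32 bytes below the current stack pointer and round the result down to a 32-byte boundary, which is exactly the effect of the sub/bic pair on sp:

#include <stdint.h>

/* Mirrors: sub ip, sp, #0x20 ; bic ip, ip, #0x1f ; mov sp, ip */
static void *align_down_32(void *sp)
{
	uintptr_t p = (uintptr_t)sp - 0x20;	/* make room for a 32-byte buffer  */

	return (void *)(p & ~(uintptr_t)0x1f);	/* clear low 5 bits: 32-byte align */
}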
 
// r0: Input state matrix, s
// r1: 4 data blocks output, o
@@ -155,25 +189,24 @@ ENTRY(chacha20_4block_xor_neon)
// This function encrypts four consecutive ChaCha20 blocks by loading
// the state matrix in NEON registers four times. The algorithm performs
// each operation on the corresponding word of each state matrix, hence
-   // requires no word 

Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
On Fri, Aug 31, 2018 at 06:51:34PM +0200, Ard Biesheuvel wrote:
> >>
> >> +   adr ip, .Lrol8_table
> >> mov r3, #10
> >>
> >>  .Ldoubleround4:
> >> @@ -238,24 +268,25 @@ ENTRY(chacha20_4block_xor_neon)
> >> // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
> >> // x2 += x6, x14 = rotl32(x14 ^ x2, 8)
> >> // x3 += x7, x15 = rotl32(x15 ^ x3, 8)
> >> +   vld1.8  {d16}, [ip, :64]
> 
> Also, would it perhaps be more efficient to keep the rotation vector
> in a pair of GPRs, and use something like
> 
> vmov d16, r4, r5
> 
> here?
> 

I tried that, but it doesn't help on either Cortex-A7 or Cortex-A53.
In fact it's very slightly worse.

- Eric


Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
Hi Ard,

On Fri, Aug 31, 2018 at 05:56:24PM +0200, Ard Biesheuvel wrote:
> Hi Eric,
> 
> On 31 August 2018 at 10:01, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > Optimize ChaCha20 NEON performance by:
> >
> > - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> > - Streamlining the part that adds the original state and XORs the data.
> > - Making some other small tweaks.
> >
> > On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
> > about 11.9 cycles per byte to 11.3.
> >
> > There is a tradeoff involved with the 'vtbl.8' rotation method since
> > there is at least one CPU where it's not fastest.  But it seems to be a
> > better default; see the added comment.
> >
> > Signed-off-by: Eric Biggers 
> > ---
> >  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
> >  1 file changed, 147 insertions(+), 142 deletions(-)
> >
> > diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> > index 3fecb2124c35a..d381cebaba31d 100644
> > --- a/arch/arm/crypto/chacha20-neon-core.S
> > +++ b/arch/arm/crypto/chacha20-neon-core.S
> > @@ -18,6 +18,33 @@
> >   * (at your option) any later version.
> >   */
> >
> > + /*
> > +  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
> > +  *
> > +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
> > +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
> > +  * (c)  vrev32.16 (16-bit rotations only)
> > +  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
> > +  * needs index vector)
> > +  *
> > +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
> > +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
> > +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
> > +  *
> > +  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
> > +  * and doesn't need a temporary register.
> > +  *
> > +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
> > +  * is twice as fast as (a), even when doing (a) on multiple registers
> > +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
> > +  * parallelizes better when temporary registers are scarce.
> > +  *
> > +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
> > +  * (a), so the need to load the rotation table actually makes the vtbl method
> > +  * slightly slower overall on that CPU.  Still, it seems to be a good
> > +  * compromise to get a significant speed boost on some CPUs.
> > +  */
> > +
> 
> Thanks for sharing these results. I have been working on 32-bit ARM
> code under the assumption that the A53 pipeline more or less resembles
> the A7 one, but this is obviously not the case looking at your
> results. My contributions to arch/arm/crypto mainly involved Crypto
> Extensions code, which the A7 does not support in the first place, so
> it does not really matter, but I will keep this in mind going forward.
> 
> >  #include 
> >
> > .text
> > @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
> > 	vmov	q10, q2
> > 	vmov	q11, q3
> >
> > +   ldr ip, =.Lrol8_table
> > +   vld1.8  {d10}, [ip, :64]
> > +
> 
> I usually try to avoid the =literal ldr notation, because it involves
> an additional load via the D-cache. Could you use a 64-bit literal
> instead of a byte array and use vldr instead? Or switch to adr? (and
> move the literal in range, I suppose)

'adr' works if I move rol8_table to between chacha20_block_xor_neon() and
chacha20_4block_xor_neon().

> >  ENTRY(chacha20_4block_xor_neon)
> > -   push	{r4-r6, lr}
> > -   mov ip, sp  // preserve the stack pointer
> > -   sub r3, sp, #0x20   // allocate a 32 byte buffer
> > -   bic r3, r3, #0x1f   // aligned to 32 bytes
> > -   mov sp, r3
> > +   push	{r4}
> 
> The ARM EABI mandates 8 byte stack alignment, and if you take an
> interrupt right at this point, you will enter the interrupt handler
> with a misaligned stack. Whether this could actually cause any
> problems is a different question, but it is better to keep it 8-byte
> aligned to be sure.
> 
> I'd suggest you just push r4 and lr, so you can return from the
> function by popping r4 and pc, as is customary on ARM.

Are you sure that's an actual constraint?  That would imply you could never
push/pop an odd number of ARM registers to/from the stack... but if I
disassemble an ARM kernel, such pushes/pops occur frequently in generated code.
So 8-byte alignment must only be required when a subroutine is called.

If I understand the interrupt handling code correctly, the kernel stack is being
aligned in the svc_entry