Re: [PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-08-03 Thread Herbert Xu
On Tue, Jul 24, 2018 at 06:29:07PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
> 
> Signed-off-by: Eric Biggers 

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 03:29, Eric Biggers  wrote:
> From: Eric Biggers 
>
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
>
> Signed-off-by: Eric Biggers 

Acked-by: Ard Biesheuvel 

> ---
>  arch/arm/crypto/chacha20-neon-core.S | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35..451a849ad518 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
>  .Ldoubleround:
>  	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
>  	vadd.i32	q0, q0, q1
> -	veor		q4, q3, q0
> -	vshl.u32	q3, q4, #16
> -	vsri.u32	q3, q4, #16
> +	veor		q3, q3, q0
> +	vrev32.16	q3, q3
>
>  	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
>  	vadd.i32	q2, q2, q3
> @@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
>
>  	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
>  	vadd.i32	q0, q0, q1
> -	veor		q4, q3, q0
> -	vshl.u32	q3, q4, #16
> -	vsri.u32	q3, q4, #16
> +	veor		q3, q3, q0
> +	vrev32.16	q3, q3
>
>  	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
>  	vadd.i32	q2, q2, q3
> --
> 2.18.0
>


[PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-07-24 Thread Eric Biggers
From: Eric Biggers 

The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
but the one-way code (used on remainder blocks) implements it with
vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
index 3fecb2124c35..451a849ad518 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
 .Ldoubleround:
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	vadd.i32	q0, q0, q1
-	veor		q4, q3, q0
-	vshl.u32	q3, q4, #16
-	vsri.u32	q3, q4, #16
+	veor		q3, q3, q0
+	vrev32.16	q3, q3
 
 	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
 	vadd.i32	q2, q2, q3
@@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
 
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	vadd.i32	q0, q0, q1
-	veor		q4, q3, q0
-	vshl.u32	q3, q4, #16
-	vsri.u32	q3, q4, #16
+	veor		q3, q3, q0
+	vrev32.16	q3, q3
 
 	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
 	vadd.i32	q2, q2, q3
-- 
2.18.0
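
For readers who want to check that the two instruction sequences in the patch compute the same thing, here is a rough scalar sketch in C. It models a single 32-bit lane only; the function names (rot16_shift_insert, rot16_rev, check_rot16_models) and the sample values are made up for illustration and do not appear in the kernel source. It says nothing about speed, only about equivalence of results.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits (n in 1..31), matching rotl32() in the patch comments. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Hypothetical one-lane model of the old sequence: vshl.u32 #16 then vsri.u32 #16. */
static uint32_t rot16_shift_insert(uint32_t x)
{
	uint32_t d = x << 16;			/* vshl.u32 d, x, #16 */

	/* vsri.u32 d, x, #16: keep the top 16 bits of d, insert x >> 16 below them */
	d = (d & 0xffff0000u) | (x >> 16);
	return d;
}

/* Hypothetical one-lane model of vrev32.16: swap the two 16-bit halves of the word. */
static uint32_t rot16_rev(uint32_t x)
{
	uint32_t lo = x & 0xffffu;
	uint32_t hi = x >> 16;

	return (lo << 16) | hi;
}

/* Compare both models against rotl32(x, 16) on a few arbitrary sample values. */
static void check_rot16_models(void)
{
	static const uint32_t samples[] = {
		0x00000000u, 0xffffffffu, 0x12345678u, 0x80000001u, 0x0000ffffu,
	};
	unsigned i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		uint32_t want = rotl32(samples[i], 16);

		assert(rot16_shift_insert(samples[i]) == want);
		assert(rot16_rev(samples[i]) == want);
	}
}
```

Because a rotate by exactly half the word width is the same as swapping the two halves, the shift-and-insert pair and the halfword reverse agree on every input, and the reverse can write its result in place without the temporary register (q4) that the old sequence needed.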