Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-24 Thread Herbert Xu
Daniel Axtens  wrote:
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
> 
> Cc: Anton Blanchard 
> Signed-off-by: Daniel Axtens 

All patches applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Daniel Axtens
> So although this sits in arch/powerpc, it's heavy on the crypto which is
> not my area of expertise (to say the least!), so I think it should
> probably go via Herbert and the crypto tree?

That was my thought as well. Sorry - probably should have put that in
the comments somewhere.

Regards,
Daniel


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Anton Blanchard
Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like be convinced that anything does CRC over long enough buffers
> to make it a gain at all.

btrfs data checksumming is one area.

> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.

A branch on POWER8 is a three cycle redirect. The vpmsum instructions
are 6 cycles.

> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

It went through quite a lot of simulation to reach peak performance.
The loop is quite delicate, we have to pace it just right to avoid
some pipeline reject conditions.

Note also that we already modulo schedule the loop across three
iterations, required to hide the latency of the vpmsum instructions.

Anton


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Michael Ellerman
Daniel Axtens  writes:

> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
>
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
>
> Cc: Anton Blanchard 
> Signed-off-by: Daniel Axtens 
>
> --
>
> It's possible at this point to argue that the address
> of the constant tables should be passed in to the function,
> rather than doing this somewhat unconventional #include.
>
> However, we're about to add further #ifdef's back into the core
> that will be provided by the encapsulaing code, and which couldn't
> be done as a variable without performance loss.
> ---
>  arch/powerpc/crypto/crc32-vpmsum_core.S | 726 
> 
>  arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +--
>  2 files changed, 729 insertions(+), 711 deletions(-)
>  create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S

So although this sits in arch/powerpc, it's heavy on the crypto which is
not my area of expertise (to say the least!), so I think it should
probably go via Herbert and the crypto tree?

cheers


RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread David Laight
From: Daniel Axtens
> Sent: 15 March 2017 22:30
> Hi David,
> 
> > While not part of this change, the unrolled loops look as though
> > they just destroy the cpu cache.
> > I'd like be convinced that anything does CRC over long enough buffers
> > to make it a gain at all.
> >
> > With modern (not that modern now) superscalar cpus you can often
> > get the loop instructions 'for free'.
> > Sometimes pipelining the loop is needed to get full throughput.
> > Unlike the IP checksum, you don't even have to 'loop carry' the
> > cpu carry flag.
> 
> Internal testing on a NVMe device with T10DIF enabled on 4k blocks
> shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
> uses over 60% of CPU time - with these patches CRC drops to single
> digits.
> 
> I should probably have lead with that, sorry.

I'm not doubting that using the cpu instruction for crcs gives a
massive performance boost.

Just that the heavily unrolled loop is unlikely to help overall.
Some 'cold cache' tests on shorter buffers might be illuminating.
 
> FWIW, the original patch showed a 3.7x gain on btrfs as well -
> 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> 
> When Anton wrote the original code he had access to IBM's internal
> tooling for looking at how instructions flow through the various stages
> of the CPU, so I trust it's pretty much optimal from that point of view.

Doesn't mean he used it :-)

David




RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread Daniel Axtens
Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like be convinced that anything does CRC over long enough buffers
> to make it a gain at all.
>
> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.
> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

Internal testing on a NVMe device with T10DIF enabled on 4k blocks
shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
uses over 60% of CPU time - with these patches CRC drops to single
digits.

I should probably have lead with that, sorry.

FWIW, the original patch showed a 3.7x gain on btrfs as well -
6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")

When Anton wrote the original code he had access to IBM's internal
tooling for looking at how instructions flow through the various stages
of the CPU, so I trust it's pretty much optimal from that point of view.

Regards,
Daniel


RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread David Laight
From: Linuxppc-dev Daniel Axtens
> Sent: 15 March 2017 12:38
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
...

While not part of this change, the unrolled loops look as though
they just destroy the cpu cache.
I'd like be convinced that anything does CRC over long enough buffers
to make it a gain at all.

With modern (not that modern now) superscalar cpus you can often
get the loop instructions 'for free'.
Sometimes pipelining the loop is needed to get full throughput.
Unlike the IP checksum, you don't even have to 'loop carry' the
cpu carry flag.

David



[PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread Daniel Axtens
The core nuts and bolts of the crc32c vpmsum algorithm will
also work for a number of other CRC algorithms with different
polynomials. Factor out the function into a new asm file.

To handle multiple users of the function, a user simply
provides constants, defines the name of their CRC function,
and then #includes the core algorithm file.

Cc: Anton Blanchard 
Signed-off-by: Daniel Axtens 

--

It's possible at this point to argue that the address
of the constant tables should be passed in to the function,
rather than doing this somewhat unconventional #include.

However, we're about to add further #ifdef's back into the core
that will be provided by the encapsulaing code, and which couldn't
be done as a variable without performance loss.
---
 arch/powerpc/crypto/crc32-vpmsum_core.S | 726 
 arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +--
 2 files changed, 729 insertions(+), 711 deletions(-)
 create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S

diff --git a/arch/powerpc/crypto/crc32-vpmsum_core.S 
b/arch/powerpc/crypto/crc32-vpmsum_core.S
new file mode 100644
index ..629244ef170e
--- /dev/null
+++ b/arch/powerpc/crypto/crc32-vpmsum_core.S
@@ -0,0 +1,726 @@
+/*
+ * Core of the accelerated CRC algorithm.
+ * In your file, define the constants and CRC_FUNCTION_NAME
+ * Then include this file.
+ *
+ * Calculate the checksum of data that is 16 byte aligned and a multiple of
+ * 16 bytes.
+ *
+ * The first step is to reduce it to 1024 bits. We do this in 8 parallel
+ * chunks in order to mask the latency of the vpmsum instructions. If we
+ * have more than 32 kB of data to checksum we repeat this step multiple
+ * times, passing in the previous 1024 bits.
+ *
+ * The next step is to reduce the 1024 bits to 64 bits. This step adds
+ * 32 bits of 0s to the end - this matches what a CRC does. We just
+ * calculate constants that land the data in this 32 bits.
+ *
+ * We then use fixed point Barrett reduction to compute a mod n over GF(2)
+ * for n = CRC using POWER8 instructions. We use x = 32.
+ *
+ * http://en.wikipedia.org/wiki/Barrett_reduction
+ *
+ * Copyright (C) 2015 Anton Blanchard , IBM
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+*/
+   
+#include 
+#include 
+
+#define MAX_SIZE   32768
+
+   .text
+
+#if defined(__BIG_ENDIAN__)
+#define BYTESWAP_DATA
+#else
+#undef BYTESWAP_DATA
+#endif
+
+#define off16  r25
+#define off32  r26
+#define off48  r27
+#define off64  r28
+#define off80  r29
+#define off96  r30
+#define off112 r31
+
+#define const1 v24
+#define const2 v25
+
+#define byteswap   v26
+#definemask_32bit  v27
+#definemask_64bit  v28
+#define zeroes v29
+
+#ifdef BYTESWAP_DATA
+#define VPERM(A, B, C, D) vpermA, B, C, D
+#else
+#define VPERM(A, B, C, D)
+#endif
+
+/* unsigned int CRC_FUNCTION_NAME(unsigned int crc, void *p, unsigned long 
len) */
+FUNC_START(CRC_FUNCTION_NAME)
+   std r31,-8(r1)
+   std r30,-16(r1)
+   std r29,-24(r1)
+   std r28,-32(r1)
+   std r27,-40(r1)
+   std r26,-48(r1)
+   std r25,-56(r1)
+
+   li  off16,16
+   li  off32,32
+   li  off48,48
+   li  off64,64
+   li  off80,80
+   li  off96,96
+   li  off112,112
+   li  r0,0
+
+   /* Enough room for saving 10 non volatile VMX registers */
+   subir6,r1,56+10*16
+   subir7,r1,56+2*16
+
+   stvxv20,0,r6
+   stvxv21,off16,r6
+   stvxv22,off32,r6
+   stvxv23,off48,r6
+   stvxv24,off64,r6
+   stvxv25,off80,r6
+   stvxv26,off96,r6
+   stvxv27,off112,r6
+   stvxv28,0,r7
+   stvxv29,off16,r7
+
+   mr  r10,r3
+
+   vxorzeroes,zeroes,zeroes
+   vspltisw v0,-1
+
+   vsldoi  mask_32bit,zeroes,v0,4
+   vsldoi  mask_64bit,zeroes,v0,8
+
+   /* Get the initial value into v8 */
+   vxorv8,v8,v8
+   MTVRD(v8, R3)
+   vsldoi  v8,zeroes,v8,8  /* shift into bottom 32 bits */
+
+#ifdef BYTESWAP_DATA
+   addis   r3,r2,.byteswap_constant@toc@ha
+   addir3,r3,.byteswap_constant@toc@l
+
+   lvx byteswap,0,r3
+   addir3,r3,16
+#endif
+
+   cmpdi   r5,256
+   blt .Lshort
+
+   rldicr  r6,r5,0,56
+
+   /* Checksum in blocks of MAX_SIZE */
+1: lis r7,MAX_SIZE@h
+   ori r7,r7,MAX_SIZE@l
+   mr  r9,r7
+   cmpdr6,r7
+   bgt 2f
+   mr  r7,r6
+2: subfr6,r7,r6
+
+   /* our main loop does 128 bytes at a time */
+   srdi