Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-24 Thread Herbert Xu
Daniel Axtens  wrote:
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
> 
> Cc: Anton Blanchard 
> Signed-off-by: Daniel Axtens 

All patches applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Daniel Axtens
> So although this sits in arch/powerpc, it's heavy on the crypto which is
> not my area of expertise (to say the least!), so I think it should
> probably go via Herbert and the crypto tree?

That was my thought as well. Sorry - probably should have put that in
the comments somewhere.

Regards,
Daniel


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Anton Blanchard
Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like be convinced that anything does CRC over long enough buffers
> to make it a gain at all.

btrfs data checksumming is one area.

> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.

A branch on POWER8 is a three cycle redirect. The vpmsum instructions
are 6 cycles.

> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

It went through quite a lot of simulation to reach peak performance.
The loop is quite delicate, we have to pace it just right to avoid
some pipeline reject conditions.

Note also that we already modulo schedule the loop across three
iterations, required to hide the latency of the vpmsum instructions.

Anton


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Michael Ellerman
Daniel Axtens  writes:

> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
>
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
>
> Cc: Anton Blanchard 
> Signed-off-by: Daniel Axtens 
>
> --
>
> It's possible at this point to argue that the address
> of the constant tables should be passed in to the function,
> rather than doing this somewhat unconventional #include.
>
> However, we're about to add further #ifdef's back into the core
> that will be provided by the encapsulaing code, and which couldn't
> be done as a variable without performance loss.
> ---
>  arch/powerpc/crypto/crc32-vpmsum_core.S | 726 
> 
>  arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +--
>  2 files changed, 729 insertions(+), 711 deletions(-)
>  create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S

So although this sits in arch/powerpc, it's heavy on the crypto which is
not my area of expertise (to say the least!), so I think it should
probably go via Herbert and the crypto tree?

cheers


RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread David Laight
From: Daniel Axtens
> Sent: 15 March 2017 22:30
> Hi David,
> 
> > While not part of this change, the unrolled loops look as though
> > they just destroy the cpu cache.
> > I'd like be convinced that anything does CRC over long enough buffers
> > to make it a gain at all.
> >
> > With modern (not that modern now) superscalar cpus you can often
> > get the loop instructions 'for free'.
> > Sometimes pipelining the loop is needed to get full throughput.
> > Unlike the IP checksum, you don't even have to 'loop carry' the
> > cpu carry flag.
> 
> Internal testing on a NVMe device with T10DIF enabled on 4k blocks
> shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
> uses over 60% of CPU time - with these patches CRC drops to single
> digits.
> 
> I should probably have lead with that, sorry.

I'm not doubting that using the cpu instruction for crcs gives a
massive performance boost.

Just that the heavily unrolled loop is unlikely to help overall.
Some 'cold cache' tests on shorter buffers might be illuminating.
 
> FWIW, the original patch showed a 3.7x gain on btrfs as well -
> 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> 
> When Anton wrote the original code he had access to IBM's internal
> tooling for looking at how instructions flow through the various stages
> of the CPU, so I trust it's pretty much optimal from that point of view.

Doesn't mean he used it :-)

David




RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread Daniel Axtens
Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like be convinced that anything does CRC over long enough buffers
> to make it a gain at all.
>
> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.
> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

Internal testing on a NVMe device with T10DIF enabled on 4k blocks
shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
uses over 60% of CPU time - with these patches CRC drops to single
digits.

I should probably have lead with that, sorry.

FWIW, the original patch showed a 3.7x gain on btrfs as well -
6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")

When Anton wrote the original code he had access to IBM's internal
tooling for looking at how instructions flow through the various stages
of the CPU, so I trust it's pretty much optimal from that point of view.

Regards,
Daniel


RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread David Laight
From: Linuxppc-dev Daniel Axtens
> Sent: 15 March 2017 12:38
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
...

While not part of this change, the unrolled loops look as though
they just destroy the cpu cache.
I'd like be convinced that anything does CRC over long enough buffers
to make it a gain at all.

With modern (not that modern now) superscalar cpus you can often
get the loop instructions 'for free'.
Sometimes pipelining the loop is needed to get full throughput.
Unlike the IP checksum, you don't even have to 'loop carry' the
cpu carry flag.

David