Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
Daniel Axtens wrote: > The core nuts and bolts of the crc32c vpmsum algorithm will > also work for a number of other CRC algorithms with different > polynomials. Factor out the function into a new asm file. > > To handle multiple users of the function, a user simply > provides constants, defines the name of their CRC function, > and then #includes the core algorithm file. > > Cc: Anton Blanchard > Signed-off-by: Daniel Axtens All patches applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
> So although this sits in arch/powerpc, it's heavy on the crypto which is > not my area of expertise (to say the least!), so I think it should > probably go via Herbert and the crypto tree? That was my thought as well. Sorry - probably should have put that in the comments somewhere. Regards, Daniel
Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
Hi David, > While not part of this change, the unrolled loops look as though > they just destroy the cpu cache. > I'd like be convinced that anything does CRC over long enough buffers > to make it a gain at all. btrfs data checksumming is one area. > With modern (not that modern now) superscalar cpus you can often > get the loop instructions 'for free'. A branch on POWER8 is a three cycle redirect. The vpmsum instructions are 6 cycles. > Sometimes pipelining the loop is needed to get full throughput. > Unlike the IP checksum, you don't even have to 'loop carry' the > cpu carry flag. It went through quite a lot of simulation to reach peak performance. The loop is quite delicate, we have to pace it just right to avoid some pipeline reject conditions. Note also that we already modulo schedule the loop across three iterations, required to hide the latency of the vpmsum instructions. Anton
Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
Daniel Axtens writes: > The core nuts and bolts of the crc32c vpmsum algorithm will > also work for a number of other CRC algorithms with different > polynomials. Factor out the function into a new asm file. > > To handle multiple users of the function, a user simply > provides constants, defines the name of their CRC function, > and then #includes the core algorithm file. > > Cc: Anton Blanchard > Signed-off-by: Daniel Axtens > > -- > > It's possible at this point to argue that the address > of the constant tables should be passed in to the function, > rather than doing this somewhat unconventional #include. > > However, we're about to add further #ifdef's back into the core > that will be provided by the encapsulaing code, and which couldn't > be done as a variable without performance loss. > --- > arch/powerpc/crypto/crc32-vpmsum_core.S | 726 > > arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +-- > 2 files changed, 729 insertions(+), 711 deletions(-) > create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S So although this sits in arch/powerpc, it's heavy on the crypto which is not my area of expertise (to say the least!), so I think it should probably go via Herbert and the crypto tree? cheers
RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
From: Daniel Axtens > Sent: 15 March 2017 22:30 > Hi David, > > > While not part of this change, the unrolled loops look as though > > they just destroy the cpu cache. > > I'd like be convinced that anything does CRC over long enough buffers > > to make it a gain at all. > > > > With modern (not that modern now) superscalar cpus you can often > > get the loop instructions 'for free'. > > Sometimes pipelining the loop is needed to get full throughput. > > Unlike the IP checksum, you don't even have to 'loop carry' the > > cpu carry flag. > > Internal testing on a NVMe device with T10DIF enabled on 4k blocks > shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic > uses over 60% of CPU time - with these patches CRC drops to single > digits. > > I should probably have lead with that, sorry. I'm not doubting that using the cpu instruction for crcs gives a massive performance boost. Just that the heavily unrolled loop is unlikely to help overall. Some 'cold cache' tests on shorter buffers might be illuminating. > FWIW, the original patch showed a 3.7x gain on btrfs as well - > 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c") > > When Anton wrote the original code he had access to IBM's internal > tooling for looking at how instructions flow through the various stages > of the CPU, so I trust it's pretty much optimal from that point of view. Doesn't mean he used it :-) David
RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
Hi David, > While not part of this change, the unrolled loops look as though > they just destroy the cpu cache. > I'd like be convinced that anything does CRC over long enough buffers > to make it a gain at all. > > With modern (not that modern now) superscalar cpus you can often > get the loop instructions 'for free'. > Sometimes pipelining the loop is needed to get full throughput. > Unlike the IP checksum, you don't even have to 'loop carry' the > cpu carry flag. Internal testing on a NVMe device with T10DIF enabled on 4k blocks shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic uses over 60% of CPU time - with these patches CRC drops to single digits. I should probably have lead with that, sorry. FWIW, the original patch showed a 3.7x gain on btrfs as well - 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c") When Anton wrote the original code he had access to IBM's internal tooling for looking at how instructions flow through the various stages of the CPU, so I trust it's pretty much optimal from that point of view. Regards, Daniel
RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
From: Linuxppc-dev Daniel Axtens > Sent: 15 March 2017 12:38 > The core nuts and bolts of the crc32c vpmsum algorithm will > also work for a number of other CRC algorithms with different > polynomials. Factor out the function into a new asm file. > > To handle multiple users of the function, a user simply > provides constants, defines the name of their CRC function, > and then #includes the core algorithm file. ... While not part of this change, the unrolled loops look as though they just destroy the cpu cache. I'd like be convinced that anything does CRC over long enough buffers to make it a gain at all. With modern (not that modern now) superscalar cpus you can often get the loop instructions 'for free'. Sometimes pipelining the loop is needed to get full throughput. Unlike the IP checksum, you don't even have to 'loop carry' the cpu carry flag. David