Re: Reply: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path
Hi GaoKui,

On Thu, Jul 26, 2018 at 02:44:30AM +0000, gaokui (A) wrote:
> Hi, Eric,
> Thanks for your reply.
>
> I have run your program on an original kernel and it reproduced the
> crash. And I also ran the program on a kernel with our patch, but there
> was no crash.
>
> I think the reason for the crash is that the parameter buffer is aligned
> with the page. So the address of the parameter buffer starts at the
> beginning of the page, which makes "walk->offset = 0" and generates the
> crash. I added some logs in "scatterwalk_pagedone()" to print the value
> of walk->offset, and the log before the crash shows that
> "walk->offset = 0".
>
> And I do not understand why "walk->offset = 0" means no data to be
> processed. In the structure "scatterlist", the member "offset" represents
> the offset of the buffer in the page, and the member "length" represents
> the length of the buffer. In function "af_alg_make_sg()", if a buffer
> occupies more than one page, the offset will also be set to 0 in the
> second and following pages. And in function "scatterwalk_done()",
> walk->offset = 0 will also allow calling "scatterwalk_pagedone()". So I
> think that when "walk->offset = 0" the page needs to be flushed as well.
>
> BRs
> GaoKui

Did you test my patches or just yours? Your patch fixes the crash, but I
don't agree that it's the best fix.

What you're missing is that walk->offset has already been increased by
scatterwalk_advance() to the offset of the *end* of the data region
processed. Hence, walk->offset = 0 implies that 0 bytes were processed
(as walk->offset must have been 0 initially, then had 0 added to it),
which I think isn't meant to be a valid case. And in particular it does
*not* make sense to flush any page when 0 bytes were processed.

Note that this could also be a problem for empty scatterlist elements,
but AFAICS the scatterlist walk code doesn't actually support those when
the total length isn't 0.
I think that needs improvement too, but AFAICS other changes would be
needed to properly fix that limitation, and you apparently cannot
generate empty scatterlist elements via AF_ALG anyway, so only in-kernel
users would be affected.

- Eric
Reply: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path
Hi, Eric,

Thanks for your reply.

I have run your program on an original kernel and it reproduced the
crash. And I also ran the program on a kernel with our patch, but there
was no crash.

I think the reason for the crash is that the parameter buffer is aligned
with the page. So the address of the parameter buffer starts at the
beginning of the page, which makes "walk->offset = 0" and generates the
crash. I added some logs in "scatterwalk_pagedone()" to print the value
of walk->offset, and the log before the crash shows that
"walk->offset = 0".

And I do not understand why "walk->offset = 0" means no data to be
processed. In the structure "scatterlist", the member "offset" represents
the offset of the buffer in the page, and the member "length" represents
the length of the buffer. In function "af_alg_make_sg()", if a buffer
occupies more than one page, the offset will also be set to 0 in the
second and following pages. And in function "scatterwalk_done()",
walk->offset = 0 will also allow calling "scatterwalk_pagedone()". So I
think that when "walk->offset = 0" the page needs to be flushed as well.

BRs
GaoKui

> -----Original Message-----
> From: linux-crypto-ow...@vger.kernel.org [mailto:linux-crypto-
> ow...@vger.kernel.org] On Behalf Of Eric Biggers
> Sent: July 24, 2018 1:55
> To: linux-crypto@vger.kernel.org; Herbert Xu
> Cc: Liuchao (H); 罗新强; gaokui (A); Eric Biggers
> Subject: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path
>
> From: Eric Biggers
>
> scatterwalk_done() is only meant to be called after a nonzero number of
> bytes have been processed, since scatterwalk_pagedone() will flush the
> dcache of the *previous* page. But in the error case of
> skcipher_walk_done(), e.g. if the input wasn't an integer number of
> blocks, scatterwalk_done() was actually called after advancing 0 bytes.
> This caused a crash ("BUG: unable to handle kernel paging request")
> during '!PageSlab(page)' on architectures like arm and arm64 that define
> ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, provided that the input was
> page-aligned, as in that case walk->offset == 0.
>
> Fix it by reorganizing skcipher_walk_done() to skip the
> scatterwalk_advance() and scatterwalk_done() if an error has occurred.
>
> This bug was found by syzkaller fuzzing.
>
> Reproducer, assuming ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE:
>
> 	#include <linux/if_alg.h>
> 	#include <sys/socket.h>
> 	#include <unistd.h>
>
> 	int main()
> 	{
> 		struct sockaddr_alg addr = {
> 			.salg_type = "skcipher",
> 			.salg_name = "cbc(aes-generic)",
> 		};
> 		char buffer[4096] __attribute__((aligned(4096))) = { 0 };
> 		int fd;
>
> 		fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
> 		bind(fd, (void *)&addr, sizeof(addr));
> 		setsockopt(fd, SOL_ALG, ALG_SET_KEY, buffer, 16);
> 		fd = accept(fd, NULL, NULL);
> 		write(fd, buffer, 15);
> 		read(fd, buffer, 15);
> 	}
>
> Reported-by: Liu Chao
> Fixes: b286d8b1a690 ("crypto: skcipher - Add skcipher walk interface")
> Cc: # v4.10+
> Signed-off-by: Eric Biggers
> ---
>  crypto/skcipher.c | 53 ++++++++++++++++++++++++---------------------------
>  1 file changed, 27 insertions(+), 26 deletions(-)
>
> diff --git a/crypto/skcipher.c b/crypto/skcipher.c
> index 7d6a49fe3047..5f7017b36d75 100644
> --- a/crypto/skcipher.c
> +++ b/crypto/skcipher.c
> @@ -95,7 +95,7 @@ static inline u8 *skcipher_get_spot(u8 *start, unsigned int len)
>  	return max(start, end_page);
>  }
>
> -static int skcipher_done_slow(struct skcipher_walk *walk, unsigned int bsize)
> +static void skcipher_done_slow(struct skcipher_walk *walk, unsigned int bsize)
>  {
>  	u8 *addr;
>
> @@ -103,23 +103,24 @@ static int skcipher_done_slow(struct skcipher_walk *walk, unsigned int bsize)
>  	addr = skcipher_get_spot(addr, bsize);
>  	scatterwalk_copychunks(addr, &walk->out, bsize,
>  			       (walk->flags & SKCIPHER_WALK_PHYS) ? 2 : 1);
> -	return 0;
>  }
>
>  int skcipher_walk_done(struct skcipher_walk *walk, int err)
>  {
> -	unsigned int n = walk->nbytes - err;
> -	unsigned int nbytes;
> -
> -	nbytes = walk->total - n;
> -
> -	if (unlikely(err < 0)) {
> -		nbytes = 0;
> -		n = 0;
> -	} else if (likely(!(walk->flags & (SKCIPHER_WALK_PHYS |
> -					   SKCIPHER_WALK_SLOW |
> -					   SKCIPHER_WALK_COPY |
> -					   SKCIPHER_WALK_DIFF)))) {
> +	unsigned int n; /* bytes processed */
> +	bool more;
> +
> +	if (unlikely(err < 0))
> +		goto finish;
> +
> +	n = walk->nbytes - err;
> +	walk->total -= n;
> +	more = (walk->total != 0);
> +
> +	if (likely(!(walk->flags & (SKCIPHER_WALK_PHYS |
> +				    SKCIPHER_WALK_SLOW |
> +
Re: [PATCH v6 10/18] x86/power/64: Remove VLA usage
On Wed, Jul 25, 2018 at 4:32 AM, Rafael J. Wysocki wrote:
> On Tue, Jul 24, 2018 at 6:49 PM, Kees Cook wrote:
>> In the quest to remove all stack VLA usage from the kernel[1], this
>> removes the discouraged use of AHASH_REQUEST_ON_STACK by switching to
>> shash directly and allocating the descriptor in heap memory (which should
>> be fine: the tfm has already been allocated there too).
>>
>> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>>
>> Signed-off-by: Kees Cook
>> Acked-by: Pavel Machek
>
> I think I can queue this up if there are no objections from others.
>
> Do you want me to do that?

Sure thing. It looks like the other stand-alone patches like this one
are getting taken into the non-crypto trees, so that's fine.

Thanks!

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks
On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
> Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
> 1000 cycle limit to the quantum of work performed with preemption
> disabled is unreasonably low, we can increase the yield block counts
> and approach the optimal numbers a bit closer. But with diminishing
> returns.

So I tested on a SoftIron Overdrive 1000, which has A57 cores. I added
this series and didn't notice any spikes: cyclictest reported a max
value of ~20us (which means the crypto code was not noticeable). I
played a little with it and ran tcrypt tests for aes/sha1, and again saw
no huge spikes. So at this point this looks fantastic. I also set up
cryptsetup / dm-crypt with the usual xts(aes) mode and saw no spikes.

At this point, on this hardware, if you want to raise the block count I
wouldn't mind. I remember that on x86 the SIMD accelerated ciphers led
to ~1ms+ spikes once dm-crypt started its jobs.

Sebastian
Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8
On Wed, 2018-07-25 at 15:12 +0200, Arnd Bergmann wrote:
> tools/perf/tests/.gitignore:
> 	LLVM byte-codes, uncompressed
> On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton wrote:
> > On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches wrote:
> > > On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
> > > > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann wrote:
> > > > > Almost all files in the kernel are either plain text or UTF-8
> > > > > encoded. A couple however are ISO_8859-1, usually just a few
> > > > > characters in C comments, for historic reasons.
> > > > > This converts them all to UTF-8 for consistency.
> > > []
> > > > Will we be getting a checkpatch rule to keep things this way?
> > >
> > > How would that be done?
> >
> > I'm using this, seems to work.
> >
> > 	if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
> > 	then
> > 		echo $p: weird charset
> > 	fi
>
> There are a couple of files that my version of 'file' incorrectly
> identified as something completely different, like:
>
> Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt:
> 	SemOne archive data
> Documentation/devicetree/bindings/rtc/epson,rtc7301.txt:
> 	Microsoft Document Imaging Format
> Documentation/filesystems/nfs/pnfs-block-server.txt:
> 	PPMN archive data
> arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi:
> 	Sendmail frozen configuration - version = "host";
> Documentation/networking/segmentation-offloads.txt:
> 	StuffIt Deluxe Segment (data) : gmentation Offloads in the Linux Networking Stack
> arch/sparc/include/asm/visasm.h:
> 	SAS 7+
> arch/xtensa/kernel/setup.c:
> 	, init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
> drivers/cpufreq/powernow-k8.c:
> 	TI-XX Graphing Calculator (FLASH)
> tools/testing/selftests/net/forwarding/tc_shblocks.sh:
> 	Minix filesystem, V2 (big endian)
> tools/perf/tests/.gitignore:
> 	LLVM byte-codes, uncompressed
>
> All of the above seem to be valid ASCII or UTF-8 files, so the check
> above will lead to false-positives, but it may be good enough as they
> are the exception, and may be bugs in 'file'.
>
> Not sure if we need to worry about 'file' not being installed.

checkpatch works on patches, so I think the test isn't really relevant.
A patch has to use the appropriate email header that sets the charset.

perhaps:
---
 scripts/checkpatch.pl | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 34e4683de7a3..57355fbd2d28 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2765,9 +2765,13 @@ sub process {
 # Check if there is UTF-8 in a commit log when a mail header has explicitly
 # declined it, i.e defined some charset where it is missing.
 		if ($in_header_lines &&
-		    $rawline =~ /^Content-Type:.+charset="(.+)".*$/ &&
-		    $1 !~ /utf-8/i) {
-			$non_utf8_charset = 1;
+		    $rawline =~ /^Content-Type:.+charset="?([^\s;"]+)/) {
+			my $charset = $1;
+			$non_utf8_charset = 1 if ($charset !~ /^utf-8$/i);
+			if ($charset !~ /^(?:us-ascii|utf-8|iso-8859-1)$/) {
+				WARN("PATCH_CHARSET",
+				     "Unpreferred email header charset '$charset'\n" . $herecurr);
+			}
 		}

 		if ($in_commit_log && $non_utf8_charset &&
 		    $realfile =~ /^$/ &&
Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8
tools/perf/tests/.gitignore:
	LLVM byte-codes, uncompressed

On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton wrote:
> On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches wrote:
>
>> On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
>> > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann wrote:
>> > > Almost all files in the kernel are either plain text or UTF-8
>> > > encoded. A couple however are ISO_8859-1, usually just a few
>> > > characters in C comments, for historic reasons.
>> > > This converts them all to UTF-8 for consistency.
>> []
>> > Will we be getting a checkpatch rule to keep things this way?
>>
>> How would that be done?
>
> I'm using this, seems to work.
>
> 	if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
> 	then
> 		echo $p: weird charset
> 	fi

There are a couple of files that my version of 'file' incorrectly
identified as something completely different, like:

Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt:
	SemOne archive data
Documentation/devicetree/bindings/rtc/epson,rtc7301.txt:
	Microsoft Document Imaging Format
Documentation/filesystems/nfs/pnfs-block-server.txt:
	PPMN archive data
arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi:
	Sendmail frozen configuration - version = "host";
Documentation/networking/segmentation-offloads.txt:
	StuffIt Deluxe Segment (data) : gmentation Offloads in the Linux Networking Stack
arch/sparc/include/asm/visasm.h:
	SAS 7+
arch/xtensa/kernel/setup.c:
	, init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
drivers/cpufreq/powernow-k8.c:
	TI-XX Graphing Calculator (FLASH)
tools/testing/selftests/net/forwarding/tc_shblocks.sh:
	Minix filesystem, V2 (big endian)
tools/perf/tests/.gitignore:
	LLVM byte-codes, uncompressed

All of the above seem to be valid ASCII or UTF-8 files, so the check
above will lead to false-positives, but it may be good enough as they
are the exception, and may be bugs in 'file'.

Not sure if we need to worry about 'file' not being installed.

       Arnd
Re: [PATCH v6 10/18] x86/power/64: Remove VLA usage
On Tue, Jul 24, 2018 at 6:49 PM, Kees Cook wrote:
> In the quest to remove all stack VLA usage from the kernel[1], this
> removes the discouraged use of AHASH_REQUEST_ON_STACK by switching to
> shash directly and allocating the descriptor in heap memory (which should
> be fine: the tfm has already been allocated there too).
>
> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>
> Signed-off-by: Kees Cook
> Acked-by: Pavel Machek

I think I can queue this up if there are no objections from others.

Do you want me to do that?

> ---
>  arch/x86/power/hibernate_64.c | 36 +++++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
> index 67ccf64c8bd8..f8e3b668d20b 100644
> --- a/arch/x86/power/hibernate_64.c
> +++ b/arch/x86/power/hibernate_64.c
> @@ -233,29 +233,35 @@ struct restore_data_record {
>   */
>  static int get_e820_md5(struct e820_table *table, void *buf)
>  {
> -	struct scatterlist sg;
> -	struct crypto_ahash *tfm;
> +	struct crypto_shash *tfm;
> +	struct shash_desc *desc;
>  	int size;
>  	int ret = 0;
>
> -	tfm = crypto_alloc_ahash("md5", 0, CRYPTO_ALG_ASYNC);
> +	tfm = crypto_alloc_shash("md5", 0, 0);
>  	if (IS_ERR(tfm))
>  		return -ENOMEM;
>
> -	{
> -		AHASH_REQUEST_ON_STACK(req, tfm);
> -		size = offsetof(struct e820_table, entries) + sizeof(struct e820_entry) * table->nr_entries;
> -		ahash_request_set_tfm(req, tfm);
> -		sg_init_one(&sg, (u8 *)table, size);
> -		ahash_request_set_callback(req, 0, NULL, NULL);
> -		ahash_request_set_crypt(req, &sg, buf, size);
> -
> -		if (crypto_ahash_digest(req))
> -			ret = -EINVAL;
> -		ahash_request_zero(req);
> +	desc = kmalloc(sizeof(struct shash_desc) + crypto_shash_descsize(tfm),
> +		       GFP_KERNEL);
> +	if (!desc) {
> +		ret = -ENOMEM;
> +		goto free_tfm;
>  	}
> -	crypto_free_ahash(tfm);
>
> +	desc->tfm = tfm;
> +	desc->flags = 0;
> +
> +	size = offsetof(struct e820_table, entries) +
> +		sizeof(struct e820_entry) * table->nr_entries;
> +
> +	if (crypto_shash_digest(desc, (u8 *)table, size, buf))
> +		ret = -EINVAL;
> +
> +	kzfree(desc);
> +
> +free_tfm:
> +	crypto_free_shash(tfm);
>  	return ret;
>  }
>
> --
> 2.17.1
Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks
On 25 July 2018 at 11:45, Dave Martin wrote: > On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote: >> On 25 July 2018 at 11:09, Dave Martin wrote: >> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote: >> >> Vakul reports a considerable performance hit when running the accelerated >> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that thay have >> >> been updated to take the TIF_NEED_RESCHED flag into account. >> >> >> >> The issue appears to be caused by the fact that Cortex-A53, the core in >> >> question, has a high end implementation of the Crypto Extensions, and >> >> has a shallow pipeline, which means even sequential algorithms that may be >> >> held back by pipeline stalls on high end out of order cores run at maximum >> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at >> >> a >> >> speed in the order of 2 to 4 cycles per byte, and are currently >> >> implemented >> >> to check the TIF_NEED_RESCHED after each iteration, which may process as >> >> little as 16 bytes (for GHASH). >> >> >> >> Obviously, every cycle of overhead hurts in this context, and given that >> >> the A53's load/store unit is not quite high end, any delays caused by >> >> memory accesses that occur in the inner loop of the algorithms are going >> >> to be quite significant, hence the performance regression. >> >> >> >> So reduce the frequency at which the NEON yield checks are performed, so >> >> that they occur roughly once every 1000 cycles, which is hopefully a >> >> reasonable tradeoff between throughput and worst case scheduling latency. >> > >> > Is there any way to tune this automatically, or it that likely to be more >> > trouble than it's worth? >> > >> >> Good question. I think A53 is a reasonable worst case, and these >> changes reduce the impact to the same ballpark as the impact of >> enabling CONFIG_PREEMPT in the first place. >> >> > Also, how did you come up with 1000 cycles? 
At what point does >> > preemption latency become more/less important than throughput? >> > >> >> Another good question. I was hoping Sebastian or the other -rt folks >> would be triggered by this. Given the above, I ended up with a ~1000 >> cycles quantum, and hopefully this is considered to be small enough. >> >> > Maybe someone already invented a similar framework somewhere else in the >> > kernel. I seem to remember some automatic selection of memcpy >> > implementation based on a boot-time benchmark, but I may have >> > misremembered. >> > >> >> We have crypto benchmarking code in the kernel, and at some point, I >> even did some work on selecting the best algo based on performance. >> >> But to be honest, I think this is a bit overkill. If you need those >> final 5% of throughput at any cost, you're better off running with >> CONFIG_PREEMPT=n anyway. > > Can't really argue with any of that -- I was just wondering whether > there was precedent. > > Hopefully the ~1000 cycles ballpark will satisfy most people. For > the rest, it's too bad: if somebody is relying on the last 1-2% of > performance, they probably have a broken use case. > Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a 1000 cycle limit to the quantum of work performed with preemption disabled is unreasonably low, we can increase the yield block counts and approach the optimal numbers a bit closer. But with diminishing returns. Also, if the cost of enabling CONFIG_PREEMPT by itself is significantly reduced, e.g., by moving the per-CPU offset into a GPR, we can always revisit this of course.
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On 25 July 2018 at 11:48, Dave Martin wrote:
> On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:05, Dave Martin wrote:
>> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> >> iteration of the GHASH and AES-GCM core routines is having a considerable
>> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> >> implemented.
>> >>
>> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> >> down by 16% for large block sizes and 128 bit keys. This appears to be
>> >> a result of the high performance of the crypto instructions on the one
>> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> >> the relatively poor load/store performance of this simple core.
>> >>
>> >> So let's reduce this performance impact by only doing the yield check
>> >> once every 32 blocks for GHASH (or 4 when using the version based on
>> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> >> This way, we recover most of the performance while still limiting the
>> >> duration of scheduling blackouts due to disabling preemption to ~1000
>> >> cycles.
>> >>
>> >> Cc: Vakul Garg
>> >> Signed-off-by: Ard Biesheuvel
>> >> ---
>> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
>> >>  1 file changed, 9 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
>> >> index dcffb9e77589..9c14beaabeee 100644
>> >> --- a/arch/arm64/crypto/ghash-ce-core.S
>> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> >> @@ -212,7 +212,7 @@
>> >>  	ushr		XL.2d, XL.2d, #1
>> >>  	.endm
>> >>
>> >> -	.macro		__pmull_ghash, pn
>> >> +	.macro		__pmull_ghash, pn, yield_count
>> >>  	frame_push	5
>> >>
>> >>  	mov		x19, x0
>> >> @@ -259,6 +259,9 @@ CPU_LE(	rev64	T1.16b, T1.16b	)
>> >>  	eor		T2.16b, T2.16b, XH.16b
>> >>  	eor		XL.16b, XL.16b, T2.16b
>> >>
>> >> +	tst		w19, #(\yield_count - 1)
>> >
>> > This should perhaps be (\yield_count) - 1.
>> >
>> > It would be a bit silly to pass a non-trivial expression for yield_count
>> > though.
>> >
>> >> +	b.ne		1b
>> >> +
>> >
>> > Is it worth having a build-time check that \yield_count is a power of two?
>> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0). We could have a
>> > generic macro for that.
>> >
>> > Otherwise this code may poll more often than expected. Not the end of
>> > the world, though.
>> >
>> Thanks for taking a look.
>>
>> Given that the macro in question is used in exactly two places in the
>> same file, and is unlikely to be reused unless the architecture gains
>> support for another optional instruction that can be used as a drop-in
>> replacement, I don't think any of this truly matters tbh.
>
> Fair enough. A comment on the macro definition might help, but beyond
> that it is probably overkill.
>
Sure. I'll add that in v2.
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:05, Dave Martin wrote:
> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> >> iteration of the GHASH and AES-GCM core routines is having a considerable
> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> >> implemented.
> >>
> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> >> down by 16% for large block sizes and 128 bit keys. This appears to be
> >> a result of the high performance of the crypto instructions on the one
> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> >> the relatively poor load/store performance of this simple core.
> >>
> >> So let's reduce this performance impact by only doing the yield check
> >> once every 32 blocks for GHASH (or 4 when using the version based on
> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> >> This way, we recover most of the performance while still limiting the
> >> duration of scheduling blackouts due to disabling preemption to ~1000
> >> cycles.
> >>
> >> Cc: Vakul Garg
> >> Signed-off-by: Ard Biesheuvel
> >> ---
> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
> >>  1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> >> index dcffb9e77589..9c14beaabeee 100644
> >> --- a/arch/arm64/crypto/ghash-ce-core.S
> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
> >> @@ -212,7 +212,7 @@
> >>  	ushr		XL.2d, XL.2d, #1
> >>  	.endm
> >>
> >> -	.macro		__pmull_ghash, pn
> >> +	.macro		__pmull_ghash, pn, yield_count
> >>  	frame_push	5
> >>
> >>  	mov		x19, x0
> >> @@ -259,6 +259,9 @@ CPU_LE(	rev64	T1.16b, T1.16b	)
> >>  	eor		T2.16b, T2.16b, XH.16b
> >>  	eor		XL.16b, XL.16b, T2.16b
> >>
> >> +	tst		w19, #(\yield_count - 1)
> >
> > This should perhaps be (\yield_count) - 1.
> >
> > It would be a bit silly to pass a non-trivial expression for yield_count
> > though.
> >
> >> +	b.ne		1b
> >> +
> >
> > Is it worth having a build-time check that \yield_count is a power of two?
> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0). We could have a
> > generic macro for that.
> >
> > Otherwise this code may poll more often than expected. Not the end of
> > the world, though.
> >
> Thanks for taking a look.
>
> Given that the macro in question is used in exactly two places in the
> same file, and is unlikely to be reused unless the architecture gains
> support for another optional instruction that can be used as a drop-in
> replacement, I don't think any of this truly matters tbh.

Fair enough. A comment on the macro definition might help, but beyond
that it is probably overkill.

Cheers
---Dave
Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks
On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:09, Dave Martin wrote:
> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> >> Vakul reports a considerable performance hit when running the accelerated
> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> >> been updated to take the TIF_NEED_RESCHED flag into account.
> >>
> >> The issue appears to be caused by the fact that Cortex-A53, the core in
> >> question, has a high end implementation of the Crypto Extensions, and
> >> has a shallow pipeline, which means even sequential algorithms that may be
> >> held back by pipeline stalls on high end out of order cores run at maximum
> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
> >> speed in the order of 2 to 4 cycles per byte, and are currently implemented
> >> to check the TIF_NEED_RESCHED flag after each iteration, which may process
> >> as little as 16 bytes (for GHASH).
> >>
> >> Obviously, every cycle of overhead hurts in this context, and given that
> >> the A53's load/store unit is not quite high end, any delays caused by
> >> memory accesses that occur in the inner loop of the algorithms are going
> >> to be quite significant, hence the performance regression.
> >>
> >> So reduce the frequency at which the NEON yield checks are performed, so
> >> that they occur roughly once every 1000 cycles, which is hopefully a
> >> reasonable tradeoff between throughput and worst case scheduling latency.
> >
> > Is there any way to tune this automatically, or is that likely to be more
> > trouble than it's worth?
> >
> Good question. I think A53 is a reasonable worst case, and these
> changes reduce the impact to the same ballpark as the impact of
> enabling CONFIG_PREEMPT in the first place.
>
> > Also, how did you come up with 1000 cycles? At what point does
> > preemption latency become more/less important than throughput?
> >
> Another good question. I was hoping Sebastian or the other -rt folks
> would be triggered by this. Given the above, I ended up with a ~1000
> cycles quantum, and hopefully this is considered to be small enough.
>
> > Maybe someone already invented a similar framework somewhere else in the
> > kernel. I seem to remember some automatic selection of memcpy
> > implementation based on a boot-time benchmark, but I may have
> > misremembered.
> >
> We have crypto benchmarking code in the kernel, and at some point, I
> even did some work on selecting the best algo based on performance.
>
> But to be honest, I think this is a bit overkill. If you need those
> final 5% of throughput at any cost, you're better off running with
> CONFIG_PREEMPT=n anyway.

Can't really argue with any of that -- I was just wondering whether
there was precedent.

Hopefully the ~1000 cycles ballpark will satisfy most people. For
the rest, it's too bad: if somebody is relying on the last 1-2% of
performance, they probably have a broken use case.

Cheers
---Dave
Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks
On 25 July 2018 at 11:09, Dave Martin wrote:
> On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> Vakul reports a considerable performance hit when running the accelerated
>> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
>> been updated to take the TIF_NEED_RESCHED flag into account.
>>
>> The issue appears to be caused by the fact that Cortex-A53, the core in
>> question, has a high end implementation of the Crypto Extensions, and
>> has a shallow pipeline, which means even sequential algorithms that may be
>> held back by pipeline stalls on high end out of order cores run at maximum
>> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
>> speed in the order of 2 to 4 cycles per byte, and are currently implemented
>> to check the TIF_NEED_RESCHED flag after each iteration, which may process
>> as little as 16 bytes (for GHASH).
>>
>> Obviously, every cycle of overhead hurts in this context, and given that
>> the A53's load/store unit is not quite high end, any delays caused by
>> memory accesses that occur in the inner loop of the algorithms are going
>> to be quite significant, hence the performance regression.
>>
>> So reduce the frequency at which the NEON yield checks are performed, so
>> that they occur roughly once every 1000 cycles, which is hopefully a
>> reasonable tradeoff between throughput and worst case scheduling latency.
>
> Is there any way to tune this automatically, or is that likely to be more
> trouble than it's worth?
>
Good question. I think A53 is a reasonable worst case, and these
changes reduce the impact to the same ballpark as the impact of
enabling CONFIG_PREEMPT in the first place.

> Also, how did you come up with 1000 cycles? At what point does
> preemption latency become more/less important than throughput?
>
Another good question. I was hoping Sebastian or the other -rt folks
would be triggered by this. Given the above, I ended up with a ~1000
cycles quantum, and hopefully this is considered to be small enough.

> Maybe someone already invented a similar framework somewhere else in the
> kernel. I seem to remember some automatic selection of memcpy
> implementation based on a boot-time benchmark, but I may have
> misremembered.
>
We have crypto benchmarking code in the kernel, and at some point, I
even did some work on selecting the best algo based on performance.

But to be honest, I think this is a bit overkill. If you need those
final 5% of throughput at any cost, you're better off running with
CONFIG_PREEMPT=n anyway.
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On 25 July 2018 at 11:05, Dave Martin wrote:
> On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be
>> a result of the high performance of the crypto instructions on the one
>> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> the relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check
>> once every 32 blocks for GHASH (or 4 when using the version based on
>> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>>
>> Cc: Vakul Garg
>> Signed-off-by: Ard Biesheuvel
>> ---
>>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
>> index dcffb9e77589..9c14beaabeee 100644
>> --- a/arch/arm64/crypto/ghash-ce-core.S
>> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> @@ -212,7 +212,7 @@
>>  	ushr	XL.2d, XL.2d, #1
>>  	.endm
>>
>> -	.macro	__pmull_ghash, pn
>> +	.macro	__pmull_ghash, pn, yield_count
>>  	frame_push	5
>>
>>  	mov	x19, x0
>> @@ -259,6 +259,9 @@ CPU_LE(	rev64	T1.16b, T1.16b	)
>>  	eor	T2.16b, T2.16b, XH.16b
>>  	eor	XL.16b, XL.16b, T2.16b
>>
>> +	tst	w19, #(\yield_count - 1)
>
> This should perhaps be (\yield_count) - 1.
>
> It would be a bit silly to pass a non-trivial expression for yield_count
> though.
>
>> +	b.ne	1b
>> +
>
> Is it worth having a build-time check that \yield_count is a power of two?
> (i.e., (\yield_count) & ((\yield_count) - 1) != 0). We could have a
> generic macro for that.
>
> Otherwise this code may poll more often than expected. Not the end of
> the world, though.

Thanks for taking a look. Given that the macro in question is used in exactly
two places in the same file, and is unlikely to be reused unless the
architecture gains support for another optional instruction that can be used
as a drop-in replacement, I don't think any of this truly matters tbh.
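The "generic macro" Dave suggests is easy to express on the C side; a minimal sketch of the bit trick it would use (my macro names; the kernel's include/linux/bug.h provides a similar `BUILD_BUG_ON_NOT_POWER_OF_2()`, and a C11 `_Static_assert` is shown here for the compile-time form):

```c
#include <assert.h>

/* A value n > 0 is a power of two iff it has exactly one bit set,
 * i.e. clearing its lowest set bit (n & (n - 1)) yields zero. */
#define IS_POWER_OF_2(n)  ((n) != 0 && ((n) & ((n) - 1)) == 0)

/* Compile-time variant, analogous in spirit to the kernel's
 * BUILD_BUG_ON_NOT_POWER_OF_2(). */
#define ASSERT_POWER_OF_2(n) \
	_Static_assert(IS_POWER_OF_2(n), #n " must be a power of two")

ASSERT_POWER_OF_2(32);	/* GHASH p64 yield_count: fine   */
ASSERT_POWER_OF_2(16);	/* AES-GCM yield interval: fine  */
/* ASSERT_POWER_OF_2(24); would fail to compile */
```

An assembler-side equivalent would use `.if`/`.error` on the same expression.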
Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks
On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.
>
> The issue appears to be caused by the fact that Cortex-A53, the core in
> question, has a high end implementation of the Crypto Extensions, and
> has a shallow pipeline, which means even sequential algorithms that may be
> held back by pipeline stalls on high end out of order cores run at maximum
> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
> speed in the order of 2 to 4 cycles per byte, and are currently implemented
> to check the TIF_NEED_RESCHED flag after each iteration, which may process as
> little as 16 bytes (for GHASH).
>
> Obviously, every cycle of overhead hurts in this context, and given that
> the A53's load/store unit is not quite high end, any delays caused by
> memory accesses that occur in the inner loop of the algorithms are going
> to be quite significant, hence the performance regression.
>
> So reduce the frequency at which the NEON yield checks are performed, so
> that they occur roughly once every 1000 cycles, which is hopefully a
> reasonable tradeoff between throughput and worst case scheduling latency.

Is there any way to tune this automatically, or is that likely to be more
trouble than it's worth?

Also, how did you come up with 1000 cycles? At what point does preemption
latency become more/less important than throughput?

Maybe someone already invented a similar framework somewhere else in the
kernel. I seem to remember some automatic selection of a memcpy
implementation based on a boot-time benchmark, but I may have misremembered.

Cheers
---Dave
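For what it's worth, the ~1000-cycle figure is consistent with the throughput numbers quoted in the patch itself; a back-of-envelope check (my arithmetic, assuming 16-byte blocks and the stated 2.0/3.0 cycles per byte; cpb is scaled by 10 to keep the math integral):

```c
#include <assert.h>

/* Cycles spent between yield checks, given the number of blocks per
 * check, the block size in bytes, and the cycles-per-byte figure
 * (times ten, so 2.0 cpb is passed as 20). */
static int quantum_cycles(int blocks, int bytes_per_block, int cpb_x10)
{
	return blocks * bytes_per_block * cpb_x10 / 10;
}
```

With the values from patch 1/4: GHASH with p64 yields every 32 blocks, i.e. `quantum_cycles(32, 16, 20)` = 1024 cycles, and AES-GCM every 16 blocks, i.e. `quantum_cycles(16, 16, 30)` = 768 cycles, both roughly the stated ~1000. The p8 (8-bit polynomial) GHASH fallback is far slower per byte, which presumably is why it yields every 4 blocks instead.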
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> iteration of the GHASH and AES-GCM core routines is having a considerable
> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> implemented.
>
> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> down by 16% for large block sizes and 128 bit keys. This appears to be
> a result of the high performance of the crypto instructions on the one
> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> the relatively poor load/store performance of this simple core.
>
> So let's reduce this performance impact by only doing the yield check
> once every 32 blocks for GHASH (or 4 when using the version based on
> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> This way, we recover most of the performance while still limiting the
> duration of scheduling blackouts due to disabling preemption to ~1000
> cycles.
>
> Cc: Vakul Garg
> Signed-off-by: Ard Biesheuvel
> ---
>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> index dcffb9e77589..9c14beaabeee 100644
> --- a/arch/arm64/crypto/ghash-ce-core.S
> +++ b/arch/arm64/crypto/ghash-ce-core.S
> @@ -212,7 +212,7 @@
>  	ushr	XL.2d, XL.2d, #1
>  	.endm
>
> -	.macro	__pmull_ghash, pn
> +	.macro	__pmull_ghash, pn, yield_count
>  	frame_push	5
>
>  	mov	x19, x0
> @@ -259,6 +259,9 @@ CPU_LE(	rev64	T1.16b, T1.16b	)
>  	eor	T2.16b, T2.16b, XH.16b
>  	eor	XL.16b, XL.16b, T2.16b
>
> +	tst	w19, #(\yield_count - 1)

This should perhaps be (\yield_count) - 1.

It would be a bit silly to pass a non-trivial expression for yield_count
though.

> +	b.ne	1b
> +

Is it worth having a build-time check that \yield_count is a power of two?
(i.e., (\yield_count) & ((\yield_count) - 1) != 0). We could have a
generic macro for that.

Otherwise this code may poll more often than expected. Not the end of
the world, though.

[...]

Cheers
---Dave
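Dave's over-polling point can be made concrete with a C rendering of the `tst w19, #(\yield_count - 1); b.ne 1b` idiom, where w19 holds the remaining block count (a sketch; names are mine):

```c
#include <assert.h>

/* Count how many iterations fall through to the yield check when it
 * is gated on the low bits of the remaining block count being zero,
 * which is what `tst w19, #(yield_count - 1); b.ne 1b` implements. */
static int yield_checks(int total_blocks, int yield_count)
{
	int n, hits = 0;

	for (n = total_blocks; n > 0; n--)
		if ((n & (yield_count - 1)) == 0)	/* tst; b.ne 1b */
			hits++;
	return hits;
}
```

Over 1024 blocks, a power-of-two yield_count of 32 gives exactly 32 checks, as intended. A hypothetical yield_count of 24 gives mask 23 (0b10111), which is zeroed whenever the count is 0 or 8 mod 32, so the check fires 64 times, noticeably more often than "once every 24 blocks". Hence the value of the build-time power-of-two check.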
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On 25 July 2018 at 09:27, Ard Biesheuvel wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg wrote:
>>
>>> -----Original Message-----
>>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@vger.kernel.org
>>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>>> dave.mar...@arm.com; Vakul Garg; bige...@linutronix.de; Ard Biesheuvel
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>>> result of the high performance of the crypto instructions on the one hand
>>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>>> relatively poor load/store performance of this simple core.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped but didn't regain the performance to the
>> previous level. Are there more files remaining to be fixed? (In your
>> original patch series for adding the preemptibility check, there were a
>> lot more files changed than in this series with 4 files.)
>>
>> Instead of using a hardcoded 32 block/16 block limit, should it be
>> controlled using Kconfig? I believe that on different cores, these values
>> could be required to be different.
>>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, more like load/store, load, load/store, so 3 loads and 2 stores.

> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system without regard for worst case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.
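Ard's tally can be made concrete with an instrumented sketch. Everything below is schematic, not the kernel's actual this_cpu machinery: hypothetical `ld`/`st` helpers just count memory operations, modeling one plausible breakdown of "load/store, load, load/store" for a single per-CPU read on a CONFIG_PREEMPT kernel:

```c
#include <assert.h>

static int loads, stores;
static long preempt_count, percpu_offset, percpu_base[64];

static long ld(long *p)         { loads++;  return *p; }
static void st(long *p, long v) { stores++; *p = v; }

/* Schematically: preempt_disable() is a load + store of the preempt
 * count; the access itself needs a load of this CPU's offset and a
 * load of the variable; preempt_enable() stores the count back. */
static long read_percpu(void)
{
	long c, v;

	c = ld(&preempt_count);		/* preempt_disable(): load  */
	st(&preempt_count, c + 1);	/*                    store */
	v = ld(&percpu_base[ld(&percpu_offset)]); /* offset + data */
	st(&preempt_count, c);		/* preempt_enable():  store */
	return v;
}
```

That is five memory operations where a non-preemptible kernel needs only the offset and data loads, which is painful on a core whose load/store unit is the bottleneck.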
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
(+ Mark)

On 25 July 2018 at 08:57, Vakul Garg wrote:
>
>> -----Original Message-----
>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>> Sent: Tuesday, July 24, 2018 10:42 PM
>> To: linux-crypto@vger.kernel.org
>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>> dave.mar...@arm.com; Vakul Garg; bige...@linutronix.de; Ard Biesheuvel
>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>> NEON yield checks
>>
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>> result of the high performance of the crypto instructions on the one hand
>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>> relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check once
>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>
> I tested this patch. It helped but didn't regain the performance to the
> previous level. Are there more files remaining to be fixed? (In your
> original patch series for adding the preemptibility check, there were a
> lot more files changed than in this series with 4 files.)
>
> Instead of using a hardcoded 32 block/16 block limit, should it be
> controlled using Kconfig? I believe that on different cores, these values
> could be required to be different.

Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
my 24xA53 system, probably because each per-CPU variable access
involves disabling and re-enabling preemption, turning every per-CPU
load into 2 loads and a store, which hurts on this particular core.
Mark and I have played around a bit with using a GPR to record the
per-CPU offset, which would make this unnecessary, but this has its
own set of problems so that is not expected to land any time soon.

So if you care that much about squeezing the last drop of throughput
out of your system without regard for worst case scheduling latency,
disabling CONFIG_PREEMPT is a much better idea than playing around
with tunables to tweak the maximum quantum of work that is executed
with preemption disabled, especially since distro kernels will pick
the default anyway.
Re: [PATCH v6 13/18] wireless/lib80211: Convert from ahash to shash
On Tue, 2018-07-24 at 09:49 -0700, Kees Cook wrote: > In preparing to remove all stack VLA usage from the kernel[1], this > removes the discouraged use of AHASH_REQUEST_ON_STACK in favor of > the smaller SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash > to direct shash. By removing a layer of indirection this both improves > performance and reduces stack usage. The stack allocation will be made > a fixed size in a later patch to the crypto subsystem. > I think you sent this before - it should be in net-next now. johannes
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On 2018-07-25 07:04:55 [+], Vakul Garg wrote:
> > What about PREEMPT_NONE (server)?
>
> Why not have best of both the worlds :)

the NEON code gets interrupted because another task wants to schedule and
the scheduler allows it. With "low latency desktop" this gets done right
away; the lower preemption levels won't schedule so fast. So if you seek
performance, the lower levels should give you more. If you seek low
latency…

Sebastian
RE: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
> -----Original Message-----
> From: bige...@linutronix.de [mailto:bige...@linutronix.de]
> Sent: Wednesday, July 25, 2018 12:33 PM
> To: Vakul Garg
> Cc: Ard Biesheuvel; linux-crypto@vger.kernel.org;
> herb...@gondor.apana.org.au; will.dea...@arm.com; dave.mar...@arm.com
> Subject: Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact
> of NEON yield checks
>
> On 2018-07-25 06:57:42 [+], Vakul Garg wrote:
> > I tested this patch. It helped but didn't regain the performance to
> > previous level.
> > Are there more files remaining to be fixed? (In your original patch
> > series for adding preemptability check, there were lot more files
> > changed than this series with 4 files).
> >
> > Instead of using hardcoded 32 block/16 block limit, should it be
> > controlled using Kconfig?
> > I believe that on different cores, these values could be required to be
> > different.
>
> What about PREEMPT_NONE (server)?

Why not have best of both the worlds :)

> Sebastian
Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
On 2018-07-25 06:57:42 [+], Vakul Garg wrote:
> I tested this patch. It helped but didn't regain the performance to the
> previous level. Are there more files remaining to be fixed? (In your
> original patch series for adding the preemptibility check, there were a
> lot more files changed than in this series with 4 files.)
>
> Instead of using a hardcoded 32 block/16 block limit, should it be
> controlled using Kconfig? I believe that on different cores, these values
> could be required to be different.

What about PREEMPT_NONE (server)?

Sebastian
RE: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks
> -----Original Message-----
> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
> Sent: Tuesday, July 24, 2018 10:42 PM
> To: linux-crypto@vger.kernel.org
> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
> dave.mar...@arm.com; Vakul Garg; bige...@linutronix.de; Ard Biesheuvel
> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
> NEON yield checks
>
> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> iteration of the GHASH and AES-GCM core routines is having a considerable
> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> implemented.
>
> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> down by 16% for large block sizes and 128 bit keys. This appears to be a
> result of the high performance of the crypto instructions on the one hand
> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
> relatively poor load/store performance of this simple core.
>
> So let's reduce this performance impact by only doing the yield check once
> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
> polynomial multiplication), and once every 16 blocks for AES-GCM.
> This way, we recover most of the performance while still limiting the
> duration of scheduling blackouts due to disabling preemption to ~1000
> cycles.

I tested this patch. It helped but didn't regain the performance to the
previous level. Are there more files remaining to be fixed? (In your
original patch series for adding the preemptibility check, there were a
lot more files changed than in this series with 4 files.)

Instead of using a hardcoded 32 block/16 block limit, should it be
controlled using Kconfig? I believe that on different cores, these values
could be required to be different.

> Cc: Vakul Garg
> Signed-off-by: Ard Biesheuvel
> ---
>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> index dcffb9e77589..9c14beaabeee 100644
> --- a/arch/arm64/crypto/ghash-ce-core.S
> +++ b/arch/arm64/crypto/ghash-ce-core.S
> @@ -212,7 +212,7 @@
>  	ushr	XL.2d, XL.2d, #1
>  	.endm
>
> -	.macro	__pmull_ghash, pn
> +	.macro	__pmull_ghash, pn, yield_count
>  	frame_push	5
>
>  	mov	x19, x0
> @@ -259,6 +259,9 @@ CPU_LE(	rev64	T1.16b, T1.16b	)
>  	eor	T2.16b, T2.16b, XH.16b
>  	eor	XL.16b, XL.16b, T2.16b
>
> +	tst	w19, #(\yield_count - 1)
> +	b.ne	1b
> +
>  	cbz	w19, 3f
>
>  	if_will_cond_yield_neon
> @@ -279,11 +282,11 @@ CPU_LE(	rev64	T1.16b, T1.16b	)
>  	 * struct ghash_key const *k, const char *head)
>  	 */
> ENTRY(pmull_ghash_update_p64)
> -	__pmull_ghash	p64
> +	__pmull_ghash	p64, 32
> ENDPROC(pmull_ghash_update_p64)
>
> ENTRY(pmull_ghash_update_p8)
> -	__pmull_ghash	p8
> +	__pmull_ghash	p8, 4
> ENDPROC(pmull_ghash_update_p8)
>
> KS	.req	v8
> @@ -428,6 +431,9 @@ CPU_LE(	rev	x28, x28	)
>  	st1	{INP.16b}, [x21], #16
>  	.endif
>
> +	tst	w19, #0xf		// do yield check only
> +	b.ne	1b			// once every 16 blocks
> +
>  	cbz	w19, 3f
>
>  	if_will_cond_yield_neon
> --
> 2.11.0
Re: [PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates
On 25 July 2018 at 03:29, Eric Biggers wrote:
> From: Eric Biggers
>
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower. Switch the one-way code to vrev32.16 too.
>
> Signed-off-by: Eric Biggers

Acked-by: Ard Biesheuvel

> ---
>  arch/arm/crypto/chacha20-neon-core.S | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35..451a849ad518 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
>  .Ldoubleround:
>  	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
>  	vadd.i32	q0, q0, q1
> -	veor		q4, q3, q0
> -	vshl.u32	q3, q4, #16
> -	vsri.u32	q3, q4, #16
> +	veor		q3, q3, q0
> +	vrev32.16	q3, q3
>
>  	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
>  	vadd.i32	q2, q2, q3
> @@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
>
>  	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
>  	vadd.i32	q0, q0, q1
> -	veor		q4, q3, q0
> -	vshl.u32	q3, q4, #16
> -	vsri.u32	q3, q4, #16
> +	veor		q3, q3, q0
> +	vrev32.16	q3, q3
>
>  	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
>  	vadd.i32	q2, q2, q3
> --
> 2.18.0
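The equivalence the patch relies on is easy to sanity-check in C: rotating a 32-bit word left by 16 is the same as swapping its 16-bit halves, which is what vrev32.16 does within each 32-bit lane, and also what the vshl/vsri pair computes. A scalar sketch (function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* The old sequence: t = x << 16, then shift-right-and-insert x >> 16,
 * which ORs the high bits of x into the low half of the result. */
static uint32_t rot16_vshl_vsri(uint32_t x)
{
	uint32_t t = x << 16;		/* vshl.u32 q3, q4, #16 */
	return t | (x >> 16);		/* vsri.u32 q3, q4, #16 */
}

/* The new sequence: swap the 16-bit halfwords of the 32-bit word. */
static uint32_t rot16_vrev(uint32_t x)
{
	return (x >> 16) | (x << 16);	/* vrev32.16 q3, q3 */
}
```

Both compute rotl32(x, 16); vrev32.16 wins because it is a single instruction and, as the diff shows, needs no scratch register (the old sequence clobbered q4, so the veor could not write q3 in place).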