Re: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path

2018-07-25 Thread Eric Biggers
Hi GaoKui,

On Thu, Jul 26, 2018 at 02:44:30AM +, gaokui (A) wrote:
> Hi, Eric,
>   Thanks for your reply.
> 
>   I have run your program on an unpatched kernel and it reproduced the
> crash. I also ran the program on a kernel with our patch, and there was
> no crash.
> 
>   I think the reason for the crash is that the parameter buffer is
> page-aligned. The address of the buffer therefore starts at the beginning
> of a page, which makes "walk->offset = 0" and triggers the crash. I added
> some logging in "scatterwalk_pagedone()" to print the value of
> walk->offset, and the log entry just before the crash shows
> "walk->offset = 0".
> 
>   I also do not understand why "walk->offset = 0" means there is no data
> to be processed. In "struct scatterlist", the member "offset" is the
> offset of the buffer within the page, and the member "length" is the
> length of the buffer. In "af_alg_make_sg()", if a buffer occupies more
> than one page, the offset is also set to 0 for the second and following
> pages. And in "scatterwalk_done()", "walk->offset = 0" also allows
> "scatterwalk_pagedone()" to be called. So I think the page needs to be
> flushed when "walk->offset = 0" as well.
> 
> BRs
> GaoKui
> 

Did you test my patches or just yours?  Your patch fixes the crash, but I don't
agree that it's the best fix.  What you're missing is that walk->offset has
already been increased by scatterwalk_advance() to the offset of the *end* of
the data region processed.  Hence, walk->offset = 0 implies that 0 bytes were
processed (as walk->offset must have been 0 initially, then had 0 added to it),
which I think isn't meant to be a valid case.  And in particular it does *not*
make sense to flush any page when 0 bytes were processed.
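
A minimal userspace model of that invariant (illustrative types, not the real scatter_walk structures):

    #include <stdio.h>

    #define PAGE_SHIFT 12u

    struct walk { unsigned int offset; };

    static void advance(struct walk *w, unsigned int nbytes)
    {
        w->offset += nbytes;            /* same idea as scatterwalk_advance() */
    }

    static unsigned int page_index_to_flush(const struct walk *w)
    {
        return (w->offset - 1) >> PAGE_SHIFT;   /* underflows when offset == 0 */
    }

    int main(void)
    {
        struct walk w = { .offset = 0 };        /* page-aligned input buffer */

        advance(&w, 0);                         /* error path: 0 bytes processed */
        printf("page index to flush: %u\n", page_index_to_flush(&w));
        return 0;
    }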

Note that this could also be a problem for empty scatterlist elements, but
AFAICS the scatterlist walk code doesn't actually support those when the total
length isn't 0.  I think that needs improvement too, but AFAICS other changes
would be needed to properly fix that limitation, and you apparently cannot
generate empty scatterlist elements via AF_ALG anyway so only in-kernel users
would be affected.

- Eric


Re: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path

2018-07-25 Thread gaokui (A)
Hi, Eric,
Thanks for your reply.

I have run your program on an unpatched kernel and it reproduced the
crash. I also ran the program on a kernel with our patch, and there was no
crash.

I think the reason for the crash is that the parameter buffer is
page-aligned. The address of the buffer therefore starts at the beginning of
a page, which makes "walk->offset = 0" and triggers the crash. I added some
logging in "scatterwalk_pagedone()" to print the value of walk->offset, and
the log entry just before the crash shows "walk->offset = 0".

I also do not understand why "walk->offset = 0" means there is no data to be
processed. In "struct scatterlist", the member "offset" is the offset of the
buffer within the page, and the member "length" is the length of the buffer.
In "af_alg_make_sg()", if a buffer occupies more than one page, the offset is
also set to 0 for the second and following pages. And in "scatterwalk_done()",
"walk->offset = 0" also allows "scatterwalk_pagedone()" to be called. So I
think the page needs to be flushed when "walk->offset = 0" as well.

BRs
GaoKui
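
As a rough userspace model of the multi-page case described above (illustrative structures only, not the real scatterlist or af_alg_make_sg() code):

    #include <stdio.h>

    #define PAGE_SIZE 4096u

    /* Illustrative stand-in for one scatterlist entry. */
    struct sg_entry { unsigned int offset, length; };

    int main(void)
    {
        unsigned int start = 4000, len = 200;   /* buffer crossing a page boundary */
        unsigned int first = PAGE_SIZE - start; /* bytes left in the first page */
        struct sg_entry sg[2] = {
            { start, first },       /* first page: non-zero offset */
            { 0, len - first },     /* second page: offset is legitimately 0 */
        };

        for (int i = 0; i < 2; i++)
            printf("entry %d: offset=%u length=%u\n", i, sg[i].offset, sg[i].length);
        return 0;
    }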

> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org [mailto:linux-crypto-
> ow...@vger.kernel.org] On Behalf Of Eric Biggers
> Sent: July 24, 2018 1:55
> To: linux-crypto@vger.kernel.org; Herbert Xu
> 
> Cc: Liuchao (H) ; 罗新强
> ; gaokui (A) ; Eric
> Biggers 
> Subject: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path
> 
> From: Eric Biggers 
> 
> scatterwalk_done() is only meant to be called after a nonzero number of
> bytes have been processed, since scatterwalk_pagedone() will flush the
> dcache of the *previous* page.  But in the error case of
> skcipher_walk_done(), e.g. if the input wasn't an integer number of blocks,
> scatterwalk_done() was actually called after advancing 0 bytes.
> This caused a crash ("BUG: unable to handle kernel paging request") during
> '!PageSlab(page)' on architectures like arm and arm64 that define
> ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, provided that the input was
> page-aligned as in that case walk->offset == 0.
> 
> Fix it by reorganizing skcipher_walk_done() to skip the
> scatterwalk_advance() and scatterwalk_done() if an error has occurred.
> 
> This bug was found by syzkaller fuzzing.
> 
> Reproducer, assuming ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE:
> 
>   #include <linux/if_alg.h>
>   #include <sys/socket.h>
>   #include <unistd.h>
> 
>   int main()
>   {
>   struct sockaddr_alg addr = {
>   .salg_type = "skcipher",
>   .salg_name = "cbc(aes-generic)",
>   };
>   char buffer[4096] __attribute__((aligned(4096))) = { 0 };
>   int fd;
> 
>   fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
>   bind(fd, (void *)&addr, sizeof(addr));
>   setsockopt(fd, SOL_ALG, ALG_SET_KEY, buffer, 16);
>   fd = accept(fd, NULL, NULL);
>   write(fd, buffer, 15);
>   read(fd, buffer, 15);
>   }
> 
> Reported-by: Liu Chao 
> Fixes: b286d8b1a690 ("crypto: skcipher - Add skcipher walk interface")
> Cc:  # v4.10+
> Signed-off-by: Eric Biggers 
> ---
>  crypto/skcipher.c | 53 ---
>  1 file changed, 27 insertions(+), 26 deletions(-)
> 
> diff --git a/crypto/skcipher.c b/crypto/skcipher.c index
> 7d6a49fe3047..5f7017b36d75 100644
> --- a/crypto/skcipher.c
> +++ b/crypto/skcipher.c
> @@ -95,7 +95,7 @@ static inline u8 *skcipher_get_spot(u8 *start,
> unsigned int len)
>   return max(start, end_page);
>  }
> 
> -static int skcipher_done_slow(struct skcipher_walk *walk, unsigned int
> bsize)
> +static void skcipher_done_slow(struct skcipher_walk *walk, unsigned int
> +bsize)
>  {
>   u8 *addr;
> 
> @@ -103,23 +103,24 @@ static int skcipher_done_slow(struct
> skcipher_walk *walk, unsigned int bsize)
>   addr = skcipher_get_spot(addr, bsize);
>   scatterwalk_copychunks(addr, &walk->out, bsize,
>  (walk->flags & SKCIPHER_WALK_PHYS) ? 2 : 1);
> - return 0;
>  }
> 
>  int skcipher_walk_done(struct skcipher_walk *walk, int err)  {
> - unsigned int n = walk->nbytes - err;
> - unsigned int nbytes;
> -
> - nbytes = walk->total - n;
> -
> - if (unlikely(err < 0)) {
> - nbytes = 0;
> - n = 0;
> - } else if (likely(!(walk->flags & (SKCIPHER_WALK_PHYS |
> -SKCIPHER_WALK_SLOW |
> -SKCIPHER_WALK_COPY |
> -SKCIPHER_WALK_DIFF)))) {
> + unsigned int n; /* bytes processed */
> + bool more;
> +
> + if (unlikely(err < 0))
> + goto finish;
> +
> + n = walk->nbytes - err;
> + walk->total -= n;
> + more = (walk->total != 0);
> +
> + if (likely(!(walk->flags & (SKCIPHER_WALK_PHYS |
> + SKCIPHER_WALK_SLOW |
> +  

Re: [PATCH v6 10/18] x86/power/64: Remove VLA usage

2018-07-25 Thread Kees Cook
On Wed, Jul 25, 2018 at 4:32 AM, Rafael J. Wysocki  wrote:
> On Tue, Jul 24, 2018 at 6:49 PM, Kees Cook  wrote:
>> In the quest to remove all stack VLA usage from the kernel[1], this
>> removes the discouraged use of AHASH_REQUEST_ON_STACK by switching to
>> shash directly and allocating the descriptor in heap memory (which should
>> be fine: the tfm has already been allocated there too).
>>
>> [1] 
>> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>>
>> Signed-off-by: Kees Cook 
>> Acked-by: Pavel Machek 
>
> I think I can queue this up if there are no objections from others.
>
> Do you want me to do that?

Sure thing. It looks like the other stand-alone patches like this one
are getting taken into the non-crypto trees, so that's fine.

Thanks!

-Kees

-- 
Kees Cook
Pixel Security
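
For reference, the heap-allocated shash pattern discussed above looks roughly like this sketch (kernel-style code against the 2018-era crypto API; the function name, algorithm, and error handling are illustrative assumptions):

    #include <crypto/hash.h>
    #include <linux/err.h>
    #include <linux/slab.h>

    static int digest_buf(const u8 *data, unsigned int len, u8 *out)
    {
        struct crypto_shash *tfm;
        struct shash_desc *desc;
        int ret;

        tfm = crypto_alloc_shash("md5", 0, 0);
        if (IS_ERR(tfm))
            return PTR_ERR(tfm);

        /* One variable-sized heap allocation instead of a stack VLA. */
        desc = kmalloc(sizeof(*desc) + crypto_shash_descsize(tfm), GFP_KERNEL);
        if (!desc) {
            ret = -ENOMEM;
            goto free_tfm;
        }

        desc->tfm = tfm;
        desc->flags = 0;

        ret = crypto_shash_digest(desc, data, len, out);

        kzfree(desc);
    free_tfm:
        crypto_free_shash(tfm);
        return ret;
    }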


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread bige...@linutronix.de
On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
> Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
> 1000 cycle limit to the quantum of work performed with preemption
> disabled is unreasonably low, we can increase the yield block counts
> and approach the optimal numbers a bit closer. But with diminishing
> returns.

So I tested on a SoftIron Overdrive 1000, which has A57 cores. I applied
this series and didn't notice any spikes: cyclictest reported a maximum
value of about ~20us, which means the crypto code was not noticeable.
I also played with it a little and ran the tcrypt tests for aes/sha1, again
with no large spikes. I also set up cryptsetup / dm-crypt with the usual
xts(aes) mode and saw no spikes. So at this point this looks fantastic.
On this hardware, if you want to raise the block count, I wouldn't mind.

I remember that on x86 the SIMD-accelerated ciphers led to spikes of ~1ms+
once dm-crypt started its jobs.

Sebastian


Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8

2018-07-25 Thread Joe Perches
On Wed, 2018-07-25 at 15:12 +0200, Arnd Bergmann wrote:
> On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton
>  wrote:
> > On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches  wrote:
> > 
> > > On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
> > > > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann  wrote:
> > > > > Almost all files in the kernel are either plain text or UTF-8
> > > > > encoded. A couple however are ISO_8859-1, usually just a few
> > > > > characters in C comments, for historic reasons.
> > > > > This converts them all to UTF-8 for consistency.
> > > 
> > > []
> > > > Will we be getting a checkpatch rule to keep things this way?
> > > 
> > > How would that be done?
> > 
> > I'm using this, seems to work.
> > 
> > if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
> > then
> > echo $p: weird charset
> > fi
> 
> There are a couple of files that my version of 'file' incorrectly
> identified as something completely different, like:
> 
> Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt:
> SemOne archive data
> Documentation/devicetree/bindings/rtc/epson,rtc7301.txt:
> Microsoft Document Imaging Format
> Documentation/filesystems/nfs/pnfs-block-server.txt:
> PPMN archive data
> arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi:
> Sendmail frozen configuration  - version = "host";
> Documentation/networking/segmentation-offloads.txt:
> StuffIt Deluxe Segment (data) : gmentation Offloads in the
> Linux Networking Stack
> arch/sparc/include/asm/visasm.h:  SAS 7+
> arch/xtensa/kernel/setup.c: ,
> init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
> drivers/cpufreq/powernow-k8.c:
> TI-XX Graphing Calculator (FLASH)
> tools/testing/selftests/net/forwarding/tc_shblocks.sh:
> Minix filesystem, V2 (big endian)
> tools/perf/tests/.gitignore:
> LLVM byte-codes, uncompressed
> 
> All of the above seem to be valid ASCII or UTF-8 files, so the check
> above will lead to false-positives, but it may be good enough as they
> are the exception, and may be bugs in 'file'.
> 
> Not sure if we need to worry about 'file' not being installed.

checkpatch works on patches so I think the test isn't
really relevant.  It has to use the appropriate email
header that sets the charset.

perhaps:
---
 scripts/checkpatch.pl | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 34e4683de7a3..57355fbd2d28 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2765,9 +2765,13 @@ sub process {
 # Check if there is UTF-8 in a commit log when a mail header has explicitly
 # declined it, i.e defined some charset where it is missing.
if ($in_header_lines &&
-   $rawline =~ /^Content-Type:.+charset="(.+)".*$/ &&
-   $1 !~ /utf-8/i) {
-   $non_utf8_charset = 1;
+   $rawline =~ /^Content-Type:.+charset="?([^\s;"]+)/) {
+   my $charset = $1;
+   $non_utf8_charset = 1 if ($charset !~ /^utf-8$/i);
+   if ($charset !~ /^(?:us-ascii|utf-8|iso-8859-1)$/) {
+   WARN("PATCH_CHARSET",
+"Unpreferred email header charset '$charset'\n" . $herecurr);
+   }
}
 
if ($in_commit_log && $non_utf8_charset && $realfile =~ /^$/ &&





Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8

2018-07-25 Thread Arnd Bergmann
On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton
 wrote:
> On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches  wrote:
>
>> On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
>> > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann  wrote:
>> > > Almost all files in the kernel are either plain text or UTF-8
>> > > encoded. A couple however are ISO_8859-1, usually just a few
> >> > > characters in C comments, for historic reasons.
>> > > This converts them all to UTF-8 for consistency.
>> []
>> > Will we be getting a checkpatch rule to keep things this way?
>>
>> How would that be done?
>
> I'm using this, seems to work.
>
> if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
> then
> echo $p: weird charset
> fi

There are a couple of files that my version of 'file' incorrectly identified as
something completely different, like:

Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt:
SemOne archive data
Documentation/devicetree/bindings/rtc/epson,rtc7301.txt:
Microsoft Document Imaging Format
Documentation/filesystems/nfs/pnfs-block-server.txt:
PPMN archive data
arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi:
Sendmail frozen configuration  - version = "host";
Documentation/networking/segmentation-offloads.txt:
StuffIt Deluxe Segment (data) : gmentation Offloads in the
Linux Networking Stack
arch/sparc/include/asm/visasm.h:  SAS 7+
arch/xtensa/kernel/setup.c: ,
init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
drivers/cpufreq/powernow-k8.c:
TI-XX Graphing Calculator (FLASH)
tools/testing/selftests/net/forwarding/tc_shblocks.sh:
Minix filesystem, V2 (big endian)
tools/perf/tests/.gitignore:
LLVM byte-codes, uncompressed

All of the above seem to be valid ASCII or UTF-8 files, so the check above
will lead to false-positives, but it may be good enough as they are the
exception, and may be bugs in 'file'.

Not sure if we need to worry about 'file' not being installed.

   Arnd


Re: [PATCH v6 10/18] x86/power/64: Remove VLA usage

2018-07-25 Thread Rafael J. Wysocki
On Tue, Jul 24, 2018 at 6:49 PM, Kees Cook  wrote:
> In the quest to remove all stack VLA usage from the kernel[1], this
> removes the discouraged use of AHASH_REQUEST_ON_STACK by switching to
> shash directly and allocating the descriptor in heap memory (which should
> be fine: the tfm has already been allocated there too).
>
> [1] 
> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>
> Signed-off-by: Kees Cook 
> Acked-by: Pavel Machek 

I think I can queue this up if there are no objections from others.

Do you want me to do that?

> ---
>  arch/x86/power/hibernate_64.c | 36 ---
>  1 file changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
> index 67ccf64c8bd8..f8e3b668d20b 100644
> --- a/arch/x86/power/hibernate_64.c
> +++ b/arch/x86/power/hibernate_64.c
> @@ -233,29 +233,35 @@ struct restore_data_record {
>   */
>  static int get_e820_md5(struct e820_table *table, void *buf)
>  {
> -   struct scatterlist sg;
> -   struct crypto_ahash *tfm;
> +   struct crypto_shash *tfm;
> +   struct shash_desc *desc;
> int size;
> int ret = 0;
>
> -   tfm = crypto_alloc_ahash("md5", 0, CRYPTO_ALG_ASYNC);
> +   tfm = crypto_alloc_shash("md5", 0, 0);
> if (IS_ERR(tfm))
> return -ENOMEM;
>
> -   {
> -   AHASH_REQUEST_ON_STACK(req, tfm);
> -   size = offsetof(struct e820_table, entries) + sizeof(struct 
> e820_entry) * table->nr_entries;
> -   ahash_request_set_tfm(req, tfm);
> sg_init_one(&sg, (u8 *)table, size);
> -   ahash_request_set_callback(req, 0, NULL, NULL);
> ahash_request_set_crypt(req, &sg, buf, size);
> -
> -   if (crypto_ahash_digest(req))
> -   ret = -EINVAL;
> -   ahash_request_zero(req);
> +   desc = kmalloc(sizeof(struct shash_desc) + crypto_shash_descsize(tfm),
> +  GFP_KERNEL);
> +   if (!desc) {
> +   ret = -ENOMEM;
> +   goto free_tfm;
> }
> -   crypto_free_ahash(tfm);
>
> +   desc->tfm = tfm;
> +   desc->flags = 0;
> +
> +   size = offsetof(struct e820_table, entries) +
> +   sizeof(struct e820_entry) * table->nr_entries;
> +
> +   if (crypto_shash_digest(desc, (u8 *)table, size, buf))
> +   ret = -EINVAL;
> +
> +   kzfree(desc);
> +
> +free_tfm:
> +   crypto_free_shash(tfm);
> return ret;
>  }
>
> --
> 2.17.1
>


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:45, Dave Martin  wrote:
> On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:09, Dave Martin  wrote:
>> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> >> Vakul reports a considerable performance hit when running the accelerated
>> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
>> >> been updated to take the TIF_NEED_RESCHED flag into account.
>> >>
>> >> The issue appears to be caused by the fact that Cortex-A53, the core in
>> >> question, has a high end implementation of the Crypto Extensions, and
>> >> has a shallow pipeline, which means even sequential algorithms that may be
>> >> held back by pipeline stalls on high end out of order cores run at maximum
>> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at 
>> >> a
>> >> speed in the order of 2 to 4 cycles per byte, and are currently 
>> >> implemented
>> >> to check the TIF_NEED_RESCHED after each iteration, which may process as
>> >> little as 16 bytes (for GHASH).
>> >>
>> >> Obviously, every cycle of overhead hurts in this context, and given that
>> >> the A53's load/store unit is not quite high end, any delays caused by
>> >> memory accesses that occur in the inner loop of the algorithms are going
>> >> to be quite significant, hence the performance regression.
>> >>
>> >> So reduce the frequency at which the NEON yield checks are performed, so
>> >> that they occur roughly once every 1000 cycles, which is hopefully a
>> >> reasonable tradeoff between throughput and worst case scheduling latency.
>> >
>> > Is there any way to tune this automatically, or is that likely to be more
>> > trouble than it's worth?
>> >
>>
>> Good question. I think A53 is a reasonable worst case, and these
>> changes reduce the impact to the same ballpark as the impact of
>> enabling CONFIG_PREEMPT in the first place.
>>
>> > Also, how did you come up with 1000 cycles?  At what point does
>> > preemption latency become more/less important than throughput?
>> >
>>
>> Another good question. I was hoping Sebastian or the other -rt folks
>> would be triggered by this. Given the above, I ended up with a ~1000
>> cycles quantum, and hopefully this is considered to be small enough.
>>
>> > Maybe someone already invented a similar framework somewhere else in the
>> > kernel.  I seem to remember some automatic selection of memcpy
>> > implementation based on a boot-time benchmark, but I may have
>> > misremembered.
>> >
>>
>> We have crypto benchmarking code in the kernel, and at some point, I
>> even did some work on selecting the best algo based on performance.
>>
>> But to be honest, I think this is a bit overkill. If you need those
>> final 5% of throughput at any cost, you're better off running with
>> CONFIG_PREEMPT=n anyway.
>
> Can't really argue with any of that -- I was just wondering whether
> there was precedent.
>
> Hopefully the ~1000 cycles ballpark will satisfy most people.  For
> the rest, it's too bad: if somebody is relying on the last 1-2% of
> performance, they probably have a broken use case.
>

Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
1000 cycle limit to the quantum of work performed with preemption
disabled is unreasonably low, we can increase the yield block counts
and approach the optimal numbers a bit closer. But with diminishing
returns.

Also, if the cost of enabling CONFIG_PREEMPT by itself is
significantly reduced, e.g., by moving the per-CPU offset into a GPR,
we can always revisit this of course.


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:48, Dave Martin  wrote:
> On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:05, Dave Martin  wrote:
>> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> >> iteration of the GHASH and AES-GCM core routines is having a considerable
>> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> >> implemented.
>> >>
>> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> >> down by 16% for large block sizes and 128 bit keys. This appears to be
>> >> a result of the high performance of the crypto instructions on the one
>> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> >> the relatively poor load/store performance of this simple core.
>> >>
>> >> So let's reduce this performance impact by only doing the yield check
>> >> once every 32 blocks for GHASH (or 4 when using the version based on
>> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> >> This way, we recover most of the performance while still limiting the
>> >> duration of scheduling blackouts due to disabling preemption to ~1000
>> >> cycles.
>> >>
>> >> Cc: Vakul Garg 
>> >> Signed-off-by: Ard Biesheuvel 
>> >> ---
>> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>> >>  1 file changed, 9 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S 
>> >> b/arch/arm64/crypto/ghash-ce-core.S
>> >> index dcffb9e77589..9c14beaabeee 100644
>> >> --- a/arch/arm64/crypto/ghash-ce-core.S
>> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> >> @@ -212,7 +212,7 @@
>> >>   ushr    XL.2d, XL.2d, #1
>> >>   .endm
>> >>
>> >> - .macro  __pmull_ghash, pn
>> >> + .macro  __pmull_ghash, pn, yield_count
>> >>   frame_push  5
>> >>
>> >>   mov x19, x0
>> >> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
>> >>   eor T2.16b, T2.16b, XH.16b
>> >>   eor XL.16b, XL.16b, T2.16b
>> >>
>> >> + tst w19, #(\yield_count - 1)
>> >
>> > This should perhaps be (\yield_count) - 1.
>> >
>> > It would be a bit silly to pass a non-trivial expression for yield_count
>> > though.
>> >
>> >> + b.ne    1b
>> >> +
>> >
>> > Is it worth having a build-time check that \yield_count is a power of two?
>> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
>> > generic macro for that.
>> >
>> > Otherwise this code may poll more often than expected.  Not the end of
>> > the world, though.
>> >
>>
>> Thanks for taking a look.
>>
>> Given that the macro in question is used in exactly two places in the
>> same file, and is unlikely to be reused unless the architecture gains
>> support for another optional instruction that can be used as a drop-in
>> replacement, I don't think any of this truly matters tbh.
>
> Fair enough.  A comment on the macro definition might help, but beyond
> that it is probably overkill.
>

Sure. I'll add that in v2.
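
On the C side, the kernel already provides a helper for this kind of compile-time check; a sketch of what such a guard could look like (the names and placement here are assumptions, not part of the patch):

    #include <linux/build_bug.h>

    #define GHASH_YIELD_COUNT 32    /* hypothetical name for the yield block count */

    static inline void ghash_check_yield_count(void)
    {
        BUILD_BUG_ON_NOT_POWER_OF_2(GHASH_YIELD_COUNT);
    }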


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Dave Martin
On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:05, Dave Martin  wrote:
> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> >> iteration of the GHASH and AES-GCM core routines is having a considerable
> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> >> implemented.
> >>
> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> >> down by 16% for large block sizes and 128 bit keys. This appears to be
> >> a result of the high performance of the crypto instructions on the one
> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> >> the relatively poor load/store performance of this simple core.
> >>
> >> So let's reduce this performance impact by only doing the yield check
> >> once every 32 blocks for GHASH (or 4 when using the version based on
> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> >> This way, we recover most of the performance while still limiting the
> >> duration of scheduling blackouts due to disabling preemption to ~1000
> >> cycles.
> >>
> >> Cc: Vakul Garg 
> >> Signed-off-by: Ard Biesheuvel 
> >> ---
> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
> >>  1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S 
> >> b/arch/arm64/crypto/ghash-ce-core.S
> >> index dcffb9e77589..9c14beaabeee 100644
> >> --- a/arch/arm64/crypto/ghash-ce-core.S
> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
> >> @@ -212,7 +212,7 @@
> >>   ushr    XL.2d, XL.2d, #1
> >>   .endm
> >>
> >> - .macro  __pmull_ghash, pn
> >> + .macro  __pmull_ghash, pn, yield_count
> >>   frame_push  5
> >>
> >>   mov x19, x0
> >> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
> >>   eor T2.16b, T2.16b, XH.16b
> >>   eor XL.16b, XL.16b, T2.16b
> >>
> >> + tst w19, #(\yield_count - 1)
> >
> > This should perhaps be (\yield_count) - 1.
> >
> > It would be a bit silly to pass a non-trivial expression for yield_count
> > though.
> >
> >> + b.ne    1b
> >> +
> >
> > Is it worth having a build-time check that \yield_count is a power of two?
> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
> > generic macro for that.
> >
> > Otherwise this code may poll more often than expected.  Not the end of
> > the world, though.
> >
> 
> Thanks for taking a look.
> 
> Given that the macro in question is used in exactly two places in the
> same file, and is unlikely to be reused unless the architecture gains
> support for another optional instruction that can be used as a drop-in
> replacement, I don't think any of this truly matters tbh.

Fair enough.  A comment on the macro definition might help, but beyond
that it is probably overkill.

Cheers
---Dave


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread Dave Martin
On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:09, Dave Martin  wrote:
> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> >> Vakul reports a considerable performance hit when running the accelerated
> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> >> been updated to take the TIF_NEED_RESCHED flag into account.
> >>
> >> The issue appears to be caused by the fact that Cortex-A53, the core in
> >> question, has a high end implementation of the Crypto Extensions, and
> >> has a shallow pipeline, which means even sequential algorithms that may be
> >> held back by pipeline stalls on high end out of order cores run at maximum
> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
> >> speed in the order of 2 to 4 cycles per byte, and are currently implemented
> >> to check the TIF_NEED_RESCHED after each iteration, which may process as
> >> little as 16 bytes (for GHASH).
> >>
> >> Obviously, every cycle of overhead hurts in this context, and given that
> >> the A53's load/store unit is not quite high end, any delays caused by
> >> memory accesses that occur in the inner loop of the algorithms are going
> >> to be quite significant, hence the performance regression.
> >>
> >> So reduce the frequency at which the NEON yield checks are performed, so
> >> that they occur roughly once every 1000 cycles, which is hopefully a
> >> reasonable tradeoff between throughput and worst case scheduling latency.
> >
> > Is there any way to tune this automatically, or is that likely to be more
> > trouble than it's worth?
> >
> 
> Good question. I think A53 is a reasonable worst case, and these
> changes reduce the impact to the same ballpark as the impact of
> enabling CONFIG_PREEMPT in the first place.
> 
> > Also, how did you come up with 1000 cycles?  At what point does
> > preemption latency become more/less important than throughput?
> >
> 
> Another good question. I was hoping Sebastian or the other -rt folks
> would be triggered by this. Given the above, I ended up with a ~1000
> cycles quantum, and hopefully this is considered to be small enough.
> 
> > Maybe someone already invented a similar framework somewhere else in the
> > kernel.  I seem to remember some automatic selection of memcpy
> > implementation based on a boot-time benchmark, but I may have
> > misremembered.
> >
> 
> We have crypto benchmarking code in the kernel, and at some point, I
> even did some work on selecting the best algo based on performance.
> 
> But to be honest, I think this is a bit overkill. If you need those
> final 5% of throughput at any cost, you're better off running with
> CONFIG_PREEMPT=n anyway.

Can't really argue with any of that -- I was just wondering whether
there was precedent.

Hopefully the ~1000 cycles ballpark will satisfy most people.  For
the rest, it's too bad: if somebody is relying on the last 1-2% of
performance, they probably have a broken use case.

Cheers
---Dave


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:09, Dave Martin  wrote:
> On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> Vakul reports a considerable performance hit when running the accelerated
>> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
>> been updated to take the TIF_NEED_RESCHED flag into account.
>>
>> The issue appears to be caused by the fact that Cortex-A53, the core in
>> question, has a high end implementation of the Crypto Extensions, and
>> has a shallow pipeline, which means even sequential algorithms that may be
>> held back by pipeline stalls on high end out of order cores run at maximum
>> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
>> speed in the order of 2 to 4 cycles per byte, and are currently implemented
>> to check the TIF_NEED_RESCHED after each iteration, which may process as
>> little as 16 bytes (for GHASH).
>>
>> Obviously, every cycle of overhead hurts in this context, and given that
>> the A53's load/store unit is not quite high end, any delays caused by
>> memory accesses that occur in the inner loop of the algorithms are going
>> to be quite significant, hence the performance regression.
>>
>> So reduce the frequency at which the NEON yield checks are performed, so
>> that they occur roughly once every 1000 cycles, which is hopefully a
>> reasonable tradeoff between throughput and worst case scheduling latency.
>
> Is there any way to tune this automatically, or is that likely to be more
> trouble than it's worth?
>

Good question. I think A53 is a reasonable worst case, and these
changes reduce the impact to the same ballpark as the impact of
enabling CONFIG_PREEMPT in the first place.

> Also, how did you come up with 1000 cycles?  At what point does
> preemption latency become more/less important than throughput?
>

Another good question. I was hoping Sebastian or the other -rt folks
would be triggered by this. Given the above, I ended up with a ~1000
cycles quantum, and hopefully this is considered to be small enough.

> Maybe someone already invented a similar framework somewhere else in the
> kernel.  I seem to remember some automatic selection of memcpy
> implementation based on a boot-time benchmark, but I may have
> misremembered.
>

We have crypto benchmarking code in the kernel, and at some point, I
even did some work on selecting the best algo based on performance.

But to be honest, I think this is a bit overkill. If you need those
final 5% of throughput at any cost, you're better off running with
CONFIG_PREEMPT=n anyway.
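
For context, a rough C-level sketch of the yield pattern whose frequency is being tuned (helper names are assumptions; the real routines do this in assembly via if_will_cond_yield_neon):

    #include <asm/neon.h>
    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/types.h>

    #define YIELD_BLOCKS 32     /* only consider yielding once every N blocks */

    /* Stand-in for the real NEON inner loop (assumed helper). */
    void process_blocks(u8 *dst, const u8 *src, int blocks);

    static void ghash_update_sketch(u8 *dst, const u8 *src, int blocks)
    {
        kernel_neon_begin();
        while (blocks > 0) {
            int n = min(blocks, YIELD_BLOCKS);

            process_blocks(dst, src, n);
            src += n * 16;          /* 16-byte GHASH blocks */
            blocks -= n;

            /* Briefly re-enable preemption if a reschedule is pending. */
            if (blocks > 0 && need_resched()) {
                kernel_neon_end();
                kernel_neon_begin();
            }
        }
        kernel_neon_end();
    }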


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:05, Dave Martin  wrote:
> On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be
>> a result of the high performance of the crypto instructions on the one
>> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> the relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check
>> once every 32 blocks for GHASH (or 4 when using the version based on
>> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>>
>> Cc: Vakul Garg 
>> Signed-off-by: Ard Biesheuvel 
>> ---
>>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/ghash-ce-core.S 
>> b/arch/arm64/crypto/ghash-ce-core.S
>> index dcffb9e77589..9c14beaabeee 100644
>> --- a/arch/arm64/crypto/ghash-ce-core.S
>> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> @@ -212,7 +212,7 @@
>>   ushr    XL.2d, XL.2d, #1
>>   .endm
>>
>> - .macro  __pmull_ghash, pn
>> + .macro  __pmull_ghash, pn, yield_count
>>   frame_push  5
>>
>>   mov x19, x0
>> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
>>   eor T2.16b, T2.16b, XH.16b
>>   eor XL.16b, XL.16b, T2.16b
>>
>> + tst w19, #(\yield_count - 1)
>
> This should perhaps be (\yield_count) - 1.
>
> It would be a bit silly to pass a non-trivial expression for yield_count
> though.
>
>> + b.ne    1b
>> +
>
> Is it worth having a build-time check that \yield_count is a power of two?
> (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
> generic macro for that.
>
> Otherwise this code may poll more often than expected.  Not the end of
> the world, though.
>

Thanks for taking a look.

Given that the macro in question is used in exactly two places in the
same file, and is unlikely to be reused unless the architecture gains
support for another optional instruction that can be used as a drop-in
replacement, I don't think any of this truly matters tbh.


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread Dave Martin
On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.
> 
> The issue appears to be caused by the fact that Cortex-A53, the core in
> question, has a high end implementation of the Crypto Extensions, and
> has a shallow pipeline, which means even sequential algorithms that may be
> held back by pipeline stalls on high end out of order cores run at maximum
> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
> speed in the order of 2 to 4 cycles per byte, and are currently implemented
> to check the TIF_NEED_RESCHED after each iteration, which may process as
> little as 16 bytes (for GHASH).
> 
> Obviously, every cycle of overhead hurts in this context, and given that
> the A53's load/store unit is not quite high end, any delays caused by
> memory accesses that occur in the inner loop of the algorithms are going
> to be quite significant, hence the performance regression.
> 
> So reduce the frequency at which the NEON yield checks are performed, so
> that they occur roughly once every 1000 cycles, which is hopefully a
> reasonable tradeoff between throughput and worst case scheduling latency.

Is there any way to tune this automatically, or is that likely to be more
trouble than it's worth?

Also, how did you come up with 1000 cycles?  At what point does
preemption latency become more/less important than throughput?

Maybe someone already invented a similar framework somewhere else in the
kernel.  I seem to remember some automatic selection of memcpy
implementation based on a boot-time benchmark, but I may have
misremembered.

Cheers
---Dave


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Dave Martin
On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> iteration of the GHASH and AES-GCM core routines is having a considerable
> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> implemented.
> 
> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> down by 16% for large block sizes and 128 bit keys. This appears to be
> a result of the high performance of the crypto instructions on the one
> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> the relatively poor load/store performance of this simple core.
> 
> So let's reduce this performance impact by only doing the yield check
> once every 32 blocks for GHASH (or 4 when using the version based on
> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> This way, we recover most of the performance while still limiting the
> duration of scheduling blackouts due to disabling preemption to ~1000
> cycles.
> 
> Cc: Vakul Garg 
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/crypto/ghash-ce-core.S 
> b/arch/arm64/crypto/ghash-ce-core.S
> index dcffb9e77589..9c14beaabeee 100644
> --- a/arch/arm64/crypto/ghash-ce-core.S
> +++ b/arch/arm64/crypto/ghash-ce-core.S
> @@ -212,7 +212,7 @@
>   ushr    XL.2d, XL.2d, #1
>   .endm
>  
> - .macro  __pmull_ghash, pn
> + .macro  __pmull_ghash, pn, yield_count
>   frame_push  5
>  
>   mov x19, x0
> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
>   eor T2.16b, T2.16b, XH.16b
>   eor XL.16b, XL.16b, T2.16b
>  
> + tst w19, #(\yield_count - 1)

This should perhaps be (\yield_count) - 1.

It would be a bit silly to pass a non-trivial expression for yield_count
though.

> + b.ne    1b
> +

Is it worth having a build-time check that \yield_count is a power of two?
(i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
generic macro for that.

Otherwise this code may poll more often than expected.  Not the end of
the world, though.

[...]

Cheers
---Dave


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 09:27, Ard Biesheuvel  wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg  wrote:
>>
>>
>>> -Original Message-
>>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@vger.kernel.org
>>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>>> dave.mar...@arm.com; Vakul Garg ;
>>> bige...@linutronix.de; Ard Biesheuvel 
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>>> result of the high performance of the crypto instructions on the one hand
>>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>>> relatively poor load/store performance of this simple core.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped but didn't bring performance back to the
>> previous level. Are there more files remaining to be fixed? (In your
>> original patch series for adding the preemptibility check, there were a
>> lot more files changed than in this series with 4 files.)
>>
>> Instead of using a hardcoded 32-block/16-block limit, should it be
>> controlled via Kconfig? I believe that on different cores these values
>> may need to be different.
>>
>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, more like

load/store
load
load/store

so 3 loads and 2 stores.



> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system without regard for worst case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.
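
As a rough illustration of the per-CPU access pattern being counted above (a simplified sketch, not the actual arm64 code generation):

    #include <linux/percpu.h>
    #include <linux/preempt.h>
    #include <linux/types.h>

    static DEFINE_PER_CPU(u32, demo_counter);

    /*
     * With CONFIG_PREEMPT, an access like this brackets the actual load with a
     * read-modify-write of the preempt count on each side, which is where the
     * extra loads and stores come from.
     */
    static u32 demo_read(void)
    {
        u32 val;

        preempt_disable();
        val = __this_cpu_read(demo_counter);
        preempt_enable();

        return val;
    }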


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
(+ Mark)

On 25 July 2018 at 08:57, Vakul Garg  wrote:
>
>
>> -Original Message-
>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>> Sent: Tuesday, July 24, 2018 10:42 PM
>> To: linux-crypto@vger.kernel.org
>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>> dave.mar...@arm.com; Vakul Garg ;
>> bige...@linutronix.de; Ard Biesheuvel 
>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>> NEON yield checks
>>
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>> result of the high performance of the crypto instructions on the one hand
>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>> relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check once
>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>
> I tested this patch. It helped but didn't bring performance back to the
> previous level. Are there more files remaining to be fixed? (In your
> original patch series for adding the preemptibility check, there were a
> lot more files changed than in this series with 4 files.)
>
> Instead of using a hardcoded 32-block/16-block limit, should it be
> controlled via Kconfig? I believe that on different cores these values
> may need to be different.
>

Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
my 24xA53 system, probably because each per-CPU variable access
involves disabling and re-enabling preemption, turning every per-CPU
load into 2 loads and a store, which hurts on this particular core.
Mark and I have played around a bit with using a GPR to record the
per-CPU offset, which would make this unnecessary, but this has its
own set of problems so that is not expected to land any time soon.

So if you care that much about squeezing the last drop of throughput
out of your system without regard for worst case scheduling latency,
disabling CONFIG_PREEMPT is a much better idea than playing around
with tunables to tweak the maximum quantum of work that is executed
with preemption disabled, especially since distro kernels will pick
the default anyway.


Re: [PATCH v6 13/18] wireless/lib80211: Convert from ahash to shash

2018-07-25 Thread Johannes Berg
On Tue, 2018-07-24 at 09:49 -0700, Kees Cook wrote:
> In preparing to remove all stack VLA usage from the kernel[1], this
> removes the discouraged use of AHASH_REQUEST_ON_STACK in favor of
> the smaller SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash
> to direct shash. By removing a layer of indirection this both improves
> performance and reduces stack usage. The stack allocation will be made
> a fixed size in a later patch to the crypto subsystem.
> 
I think you sent this before - it should be in net-next now.

johannes
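
A minimal sketch of the direct-shash pattern Kees describes (function and variable names are illustrative, not the actual lib80211 code):

    #include <crypto/hash.h>

    static int michael_mic_sketch(struct crypto_shash *tfm,
                                  const u8 *data, unsigned int len, u8 *out)
    {
        SHASH_DESC_ON_STACK(desc, tfm);
        int err;

        desc->tfm = tfm;
        desc->flags = 0;

        err = crypto_shash_digest(desc, data, len, out);
        shash_desc_zero(desc);
        return err;
    }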


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread bige...@linutronix.de
On 2018-07-25 07:04:55 [+], Vakul Garg wrote:
> > 
> > What about PREEMPT_NONE (server)?
> 
> Why not have best of both the worlds :)

The NEON code gets interrupted because another task wants to schedule
and the scheduler allows it. With "low latency desktop" this happens
right away; the lower preemption levels won't reschedule as quickly. So
if you are after performance, a lower preemption level should give you
more. If you are after low latency…

Sebastian


RE: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Vakul Garg


> -Original Message-
> From: bige...@linutronix.de [mailto:bige...@linutronix.de]
> Sent: Wednesday, July 25, 2018 12:33 PM
> To: Vakul Garg 
> Cc: Ard Biesheuvel ; linux-
> cry...@vger.kernel.org; herb...@gondor.apana.org.au;
> will.dea...@arm.com; dave.mar...@arm.com
> Subject: Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact
> of NEON yield checks
> 
> On 2018-07-25 06:57:42 [+], Vakul Garg wrote:
> > I tested this patch. It helped but didn't bring performance back to the
> > previous level. Are there more files remaining to be fixed? (In your
> > original patch series for adding the preemptibility check, there were a
> > lot more files changed than in this series with 4 files.)
> >
> > Instead of using a hardcoded 32-block/16-block limit, should it be
> > controlled via Kconfig? I believe that on different cores these values
> > may need to be different.
> 
> What about PREEMPT_NONE (server)?

Why not have best of both the worlds :)

> 
> Sebastian


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread bige...@linutronix.de
On 2018-07-25 06:57:42 [+], Vakul Garg wrote:
> I tested this patch. It helped but didn't bring performance back to the
> previous level. Are there more files remaining to be fixed? (In your
> original patch series for adding the preemptibility check, there were a
> lot more files changed than in this series with 4 files.)
>
> Instead of using a hardcoded 32-block/16-block limit, should it be
> controlled via Kconfig? I believe that on different cores these values
> may need to be different.

What about PREEMPT_NONE (server)?

Sebastian


RE: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Vakul Garg



> -Original Message-
> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
> Sent: Tuesday, July 24, 2018 10:42 PM
> To: linux-crypto@vger.kernel.org
> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
> dave.mar...@arm.com; Vakul Garg ;
> bige...@linutronix.de; Ard Biesheuvel 
> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
> NEON yield checks
> 
> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> iteration of the GHASH and AES-GCM core routines is having a considerable
> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> implemented.
> 
> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> down by 16% for large block sizes and 128 bit keys. This appears to be a
> result of the high performance of the crypto instructions on the one hand
> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
> relatively poor load/store performance of this simple core.
> 
> So let's reduce this performance impact by only doing the yield check once
> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
> polynomial multiplication), and once every 16 blocks for AES-GCM.
> This way, we recover most of the performance while still limiting the
> duration of scheduling blackouts due to disabling preemption to ~1000
> cycles.

I tested this patch. It helped but didn't bring performance back to the
previous level. Are there more files remaining to be fixed? (In your original
patch series for adding the preemptibility check, there were a lot more files
changed than in this series with 4 files.)

Instead of using a hardcoded 32-block/16-block limit, should it be controlled
via Kconfig? I believe that on different cores these values may need to be
different.

> 
> Cc: Vakul Garg 
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-
> ce-core.S
> index dcffb9e77589..9c14beaabeee 100644
> --- a/arch/arm64/crypto/ghash-ce-core.S
> +++ b/arch/arm64/crypto/ghash-ce-core.S
> @@ -212,7 +212,7 @@
>   ushr    XL.2d, XL.2d, #1
>   .endm
> 
> - .macro  __pmull_ghash, pn
> + .macro  __pmull_ghash, pn, yield_count
>   frame_push  5
> 
>   mov x19, x0
> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
>   eor T2.16b, T2.16b, XH.16b
>   eor XL.16b, XL.16b, T2.16b
> 
> + tst w19, #(\yield_count - 1)
> + b.ne    1b
> +
>   cbz w19, 3f
> 
>   if_will_cond_yield_neon
> @@ -279,11 +282,11 @@ CPU_LE( rev64   T1.16b, T1.16b  )
>* struct ghash_key const *k, const char
> *head)
>*/
>  ENTRY(pmull_ghash_update_p64)
> - __pmull_ghash   p64
> + __pmull_ghash   p64, 32
>  ENDPROC(pmull_ghash_update_p64)
> 
>  ENTRY(pmull_ghash_update_p8)
> - __pmull_ghash   p8
> + __pmull_ghash   p8, 4
>  ENDPROC(pmull_ghash_update_p8)
> 
>   KS  .req    v8
> @@ -428,6 +431,9 @@ CPU_LE(   rev x28, x28)
>   st1 {INP.16b}, [x21], #16
>   .endif
> 
> + tst w19, #0xf   // do yield check only
> + b.ne    1b  // once every 16 blocks
> +
>   cbz w19, 3f
> 
>   if_will_cond_yield_neon
> --
> 2.11.0



Re: [PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 03:29, Eric Biggers  wrote:
> From: Eric Biggers 
>
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
>
> Signed-off-by: Eric Biggers 

Acked-by: Ard Biesheuvel 

> ---
>  arch/arm/crypto/chacha20-neon-core.S | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S 
> b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35..451a849ad518 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
>  .Ldoubleround:
> // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
> vadd.i32   q0, q0, q1
> -   veor   q4, q3, q0
> -   vshl.u32   q3, q4, #16
> -   vsri.u32   q3, q4, #16
> +   veor   q3, q3, q0
> +   vrev32.16   q3, q3
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
> vadd.i32   q2, q2, q3
> @@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
>
> // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
> vadd.i32   q0, q0, q1
> -   veor   q4, q3, q0
> -   vshl.u32   q3, q4, #16
> -   vsri.u32   q3, q4, #16
> +   veor   q3, q3, q0
> +   vrev32.16   q3, q3
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
> vadd.i32   q2, q2, q3
> --
> 2.18.0
>
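
As a quick userspace check of the equivalence the patch relies on (rotating a 32-bit word by 16 bits is the same as swapping its 16-bit halves, which is what vrev32.16 does per 32-bit element):

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t rotl32(uint32_t x, unsigned int n)
    {
        return (x << n) | (x >> (32 - n));
    }

    static uint32_t swap_halves(uint32_t x)
    {
        return (x >> 16) | (x << 16);
    }

    int main(void)
    {
        uint32_t x = 0x12345678u;

        assert(rotl32(x, 16) == swap_halves(x));
        printf("0x%08" PRIx32 " rotated by 16 is 0x%08" PRIx32 "\n", x, rotl32(x, 16));
        return 0;
    }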