On 25 July 2018 at 09:27, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg <vakul.g...@nxp.com> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@vger.kernel.org
>>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>>> dave.mar...@arm.com; Vakul Garg <vakul.g...@nxp.com>;
>>> bige...@linutronix.de; Ard Biesheuvel <ard.biesheu...@linaro.org>
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>>> result of the high performance of the crypto instructions on the one hand
>>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>>> relatively poor load/store performance of this simple core.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped, but it didn't restore performance to the
>> previous level.
>> Are there more files remaining to be fixed? (Your original patch series
>> adding the preemptability checks touched many more files than the four
>> in this series.)
>>
>> Instead of using a hardcoded 32-block/16-block limit, should it be
>> controlled via Kconfig?
>> I believe different cores could require different values here.
>>
>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, more like

load/store
load
load/store

so 3 loads and 2 stores.



> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system without regard for worst case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.
