Hi Sascha,

> On 2026-06-16 14:49, Johannes Schneider wrote:
> > barebox's PBL ships a generic-C sha256_transform() that runs roughly
> > 1.6 MB/s on a Cortex-A53. Callers that hash MB-scale blobs in the PBL
> > -- e.g. the fw-external SHA-256 verify on i.MX8M, ~720 KiB of BL32 --
> > spend hundreds of ms in the transform even with the D-cache warm.
> >
> > Wire the asm core in arch/arm/crypto/sha2-ce-core.S into the PBL link
> > and expose it through a new sha256_transform_blocks() entry point.
> > The asm has an internal multi-block loop; a single call amortises the
> > prologue (round-constant load, state load) over the whole input, which
> > makes the difference between ~200 ms (per-block calls) and ~5 ms
> > (batched) on the BL32 verify.
> >
> > Rewire sha256_update()'s bulk path to call sha256_transform_blocks()
> > with the remaining block count rather than looping over a single-block
> > transform. The generic-C path gets a trivial blocks-wrapping shim so
> > both code paths share the same caller-side API.
> >
> > The asm needs two link-time constants (sha256_ce_offsetof_count and
> > sha256_ce_offsetof_finalize) which we provide locally rather than
> > pulling in sha2-ce-glue.c -- the glue drags crypto-API and
> > kernel_neon_begin shims that the PBL has no use for.
> >
> > Measured on i.MX8MM and i.MX8MP, ~720 KiB SHA-256 verify with MMU on:
> >     ~300 ms (generic-C)
> >     -> 17 ms (crypto-ext, single block per call)
> >     -> 3-5 ms (crypto-ext, batched).
> > Both crypto-ext savings carry over with MMU off too, just shifted up
> > by the uncached-DRAM read cost.
> >
> > Assisted-by: Claude:claude-opus-4-7
> > Signed-off-by: Johannes Schneider <[email protected]>
> > ---
> >  arch/arm/crypto/Makefile |  3 ++
> >  crypto/Kconfig           | 12 ++++++++
> >  crypto/sha2.c            | 66 ++++++++++++++++++++++++++++++++++++----
> >  3 files changed, 75 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> > index 55b3ac0538..72d4bd77c0 100644
> > --- a/arch/arm/crypto/Makefile
> > +++ b/arch/arm/crypto/Makefile
> > @@ -15,6 +15,9 @@ sha1-ce-y := sha1-ce-glue.o sha1-ce-core.o
> >  obj-$(CONFIG_DIGEST_SHA256_ARM64_CE) += sha2-ce.o
> >  sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
> >
> > +# Reuse the asm core (glue is provided inline in crypto/sha2.c).
> > +pbl-$(CONFIG_PBL_DIGEST_SHA256_ARM64_CE) += sha2-ce-core.o
> > +
> >  quiet_cmd_perl = PERL $@
> >  cmd_perl = $(PERL) $(<) > $(@)
> >
> > diff --git a/crypto/Kconfig b/crypto/Kconfig
> > index 528e9a0d22..3dfb316b32 100644
> > --- a/crypto/Kconfig
> > +++ b/crypto/Kconfig
> > @@ -107,6 +107,18 @@ config DIGEST_SHA256_ARM64_CE
> >           Architecture: arm64 using:
> >           - ARMv8 Crypto Extensions
> >
> > +config PBL_DIGEST_SHA256_ARM64_CE
> > +       bool "SHA-256 in PBL via ARMv8 Crypto Extensions"
> > +       depends on CPU_V8 && PBL_IMAGE
> > +       help
> > +         Use ARMv8 Crypto Extensions (sha256h/sha256h2/sha256su0/sha256su1)
> > +         for the SHA-256 transform inside the PBL. Roughly 100x faster than
> > +         the generic-C transform; for callers that hash large blobs (e.g.
> > +         fw-external SHA-256 verifies) this is the difference between tens
> > +         of ms and hundreds. Requires Cortex-A53 or later with the optional
> > +         Crypto Extensions feature.
> > +
> > +
> >  endif
> >
> >  config CRYPTO_PBKDF2
> > diff --git a/crypto/sha2.c b/crypto/sha2.c
>
> Don't modify this file for adding a special purpose sha256
> implementation.
>
> Instead create a pbl/sha256.c where you have a
>
> pbl_sha256(void *buf, size_t size)
>
> In this pick the best available implementation.
>
> Sascha
>
OK, done as well, and merged into the v2 patch stack together with the
PIO+MMU / SDMA changes, since the SHA-256 hashing also needs the MMU-enable
commit. The follow-up version of this patch is here:
https://lists.infradead.org/pipermail/barebox/2026-July/056908.html

Regards,
Johannes
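
For readers following the thread, the shape Sascha asks for (one pbl_sha256() entry point that picks the best available block routine and hands it all whole blocks in a single batched call) could look roughly like the sketch below. This is not the code from the v2 series. The digest out-parameter, the sha256_blocks_fn type and the sha256_blocks hook are assumptions made for illustration, and only a portable C fallback is wired in; a Crypto Extensions build would presumably point the hook at its asm-backed routine instead.

```c
/* Sketch only: names and signatures below are illustrative assumptions,
 * not the barebox API. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*sha256_blocks_fn)(uint32_t state[8], const uint8_t *data,
				 size_t nblocks);

static const uint32_t K[64] = {
	0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
	0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
	0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
	0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
	0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
	0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
	0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
	0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

#define ROR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* Portable fallback: consumes nblocks 64-byte blocks in one call. */
static void sha256_blocks_generic(uint32_t state[8], const uint8_t *p,
				  size_t nblocks)
{
	uint32_t w[64], s[8];

	for (; nblocks; nblocks--, p += 64) {
		memcpy(s, state, sizeof(s));
		for (int i = 0; i < 16; i++)
			w[i] = (uint32_t)p[4 * i] << 24 | (uint32_t)p[4 * i + 1] << 16 |
			       (uint32_t)p[4 * i + 2] << 8 | p[4 * i + 3];
		for (int i = 16; i < 64; i++) {
			uint32_t s0 = ROR(w[i - 15], 7) ^ ROR(w[i - 15], 18) ^ (w[i - 15] >> 3);
			uint32_t s1 = ROR(w[i - 2], 17) ^ ROR(w[i - 2], 19) ^ (w[i - 2] >> 10);
			w[i] = w[i - 16] + s0 + w[i - 7] + s1;
		}
		for (int i = 0; i < 64; i++) {
			uint32_t t1 = s[7] + (ROR(s[4], 6) ^ ROR(s[4], 11) ^ ROR(s[4], 25)) +
				      ((s[4] & s[5]) ^ (~s[4] & s[6])) + K[i] + w[i];
			uint32_t t2 = (ROR(s[0], 2) ^ ROR(s[0], 13) ^ ROR(s[0], 22)) +
				      ((s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]));
			/* rotate working variables: b..h take the old a..g */
			memmove(&s[1], &s[0], 7 * sizeof(s[0]));
			s[4] += t1;		/* e = old d + t1 */
			s[0] = t1 + t2;		/* a = t1 + t2 */
		}
		for (int i = 0; i < 8; i++)
			state[i] += s[i];
	}
}

/* Hypothetical selection point: stays on the generic C routine here. */
static sha256_blocks_fn sha256_blocks = sha256_blocks_generic;

/* One-shot hash of buf[0..size) into digest (assumed out-parameter). */
void pbl_sha256(const void *buf, size_t size, uint8_t digest[32])
{
	uint32_t state[8] = { 0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
			      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19 };
	const uint8_t *p = buf;
	size_t full = size / 64, rem = size % 64;
	size_t tail_len = rem < 56 ? 64 : 128;
	uint64_t bits = (uint64_t)size * 8;
	uint8_t tail[128] = { 0 };

	/* All whole blocks go through a single batched call. */
	if (full)
		sha256_blocks(state, p, full);

	/* Padding: leftover bytes, 0x80, zeros, 64-bit big-endian bit length. */
	memcpy(tail, p + full * 64, rem);
	tail[rem] = 0x80;
	for (int i = 0; i < 8; i++)
		tail[tail_len - 1 - i] = (uint8_t)(bits >> (8 * i));
	sha256_blocks(state, tail, tail_len / 64);

	for (int i = 0; i < 32; i++)
		digest[i] = (uint8_t)(state[i / 4] >> (24 - 8 * (i % 4)));
}
```

A caller verifying a firmware blob would compute the digest once with pbl_sha256(blob, blob_size, digest) and memcmp() it against the expected value. Keeping the block routine behind a single pointer means the caller never needs to know which implementation ran.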
