barebox's PBL ships a generic-C sha256_transform() that runs at roughly
1.6 MB/s on a Cortex-A53. Callers that hash MB-scale blobs in the PBL --
e.g. the fw-external SHA-256 verify on i.MX8M, ~720 KiB of BL32 -- spend
hundreds of ms in the transform even with the D-cache warm.

Wire the asm core in arch/arm/crypto/sha2-ce-core.S into the PBL link
and expose it through a new sha256_transform_blocks() entry point. The
asm has an internal multi-block loop; a single call amortises the
prologue (round-constant load, state load) over the whole input. On the
BL32 verify that is the step from 17 ms (one block per call) to 3-5 ms
(batched), see the measurements below.

Rewire sha256_update()'s bulk path to call sha256_transform_blocks()
with the remaining block count rather than looping over a single-block
transform. The generic-C path gets a trivial blocks-wrapping shim so
both code paths share the same caller-side API.

The asm needs two link-time constants (sha256_ce_offsetof_count and
sha256_ce_offsetof_finalize), which we provide locally rather than
pulling in sha2-ce-glue.c -- the glue drags in crypto-API and
kernel_neon_begin shims that the PBL has no use for.

Measured on i.MX8MM and i.MX8MP, ~720 KiB SHA-256 verify with MMU on:

  ~300 ms (generic-C)
  ->   17 ms (crypto-ext, single block per call)
  ->  3-5 ms (crypto-ext, batched)

Both crypto-ext savings carry over with MMU off too, just shifted up by
the uncached-DRAM read cost.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Johannes Schneider <[email protected]>
---
 arch/arm/crypto/Makefile |  3 ++
 crypto/Kconfig           | 12 ++++++++
 crypto/sha2.c            | 66 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 55b3ac0538..72d4bd77c0 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -15,6 +15,9 @@ sha1-ce-y := sha1-ce-glue.o sha1-ce-core.o
 obj-$(CONFIG_DIGEST_SHA256_ARM64_CE) += sha2-ce.o
 sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 
+# Reuse the asm core (glue is provided inline in crypto/sha2.c).
+pbl-$(CONFIG_PBL_DIGEST_SHA256_ARM64_CE) += sha2-ce-core.o
+
 quiet_cmd_perl = PERL $@
       cmd_perl = $(PERL) $(<) > $(@)
 
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 528e9a0d22..3dfb316b32 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -107,6 +107,18 @@ config DIGEST_SHA256_ARM64_CE
 	  Architecture: arm64 using:
 	  - ARMv8 Crypto Extensions
 
+config PBL_DIGEST_SHA256_ARM64_CE
+	bool "SHA-256 in PBL via ARMv8 Crypto Extensions"
+	depends on CPU_V8 && PBL_IMAGE
+	help
+	  Use ARMv8 Crypto Extensions (sha256h/sha256h2/sha256su0/sha256su1)
+	  for the SHA-256 transform inside the PBL. Roughly 100x faster than
+	  the generic-C transform; for callers that hash large blobs (e.g.
+	  fw-external SHA-256 verifies) this is the difference between tens
+	  of ms and hundreds. Requires Cortex-A53 or later with the optional
+	  Crypto Extensions feature.
+
+
 endif
 
 config CRYPTO_PBKDF2
diff --git a/crypto/sha2.c b/crypto/sha2.c
index cac5095648..06af886867 100644
--- a/crypto/sha2.c
+++ b/crypto/sha2.c
@@ -29,6 +29,44 @@
 #include <crypto/internal.h>
 #include <crypto/pbl-sha.h>
 
+#if defined(__PBL__) && IS_ENABLED(CONFIG_PBL_DIGEST_SHA256_ARM64_CE)
+/*
+ * PBL multi-block sha256 dispatch through the asm core in
+ * arch/arm/crypto/sha2-ce-core.S. The asm expects a sha256_ce_state-
+ * compatible struct and reads its `count` / `finalize` fields at the
+ * offsets advertised by the two link-time constants below. With
+ * finalize == 0 the asm runs just the block transform and writes the
+ * new midstate back into state[]; count/buf are untouched.
+ *
+ * Avoiding sha2-ce-glue.c here keeps the PBL out of the crypto-API and
+ * kernel_neon_begin shims, which add bytes and unrelated dependencies.
+ */
+struct pbl_sha256_ce_state {
+	u32 state[8];
+	u64 count;
+	u8 buf[64];
+	u32 finalize;
+};
+
+const u32 sha256_ce_offsetof_count = offsetof(struct pbl_sha256_ce_state, count);
+const u32 sha256_ce_offsetof_finalize = offsetof(struct pbl_sha256_ce_state, finalize);
+
+extern int sha2_ce_transform(struct pbl_sha256_ce_state *sst,
+			     const u8 *src, int blocks);
+
+static void sha256_transform_blocks(u32 *state, const u8 *input,
+				    unsigned int blocks)
+{
+	struct pbl_sha256_ce_state sst;
+
+	memcpy(sst.state, state, sizeof(sst.state));
+	sst.finalize = 0;
+	sha2_ce_transform(&sst, input, blocks);
+	memcpy(state, sst.state, sizeof(sst.state));
+}
+
+#else /* generic C transform */
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
 	return z ^ (x & (y ^ z));
@@ -213,6 +251,18 @@ static void sha256_transform(u32 *state, const u8 *input)
 	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
 }
 
+static void sha256_transform_blocks(u32 *state, const u8 *input,
+				    unsigned int blocks)
+{
+	while (blocks--) {
+		sha256_transform(state, input);
+		input += 64;
+	}
+}
+
+#endif /* PBL crypto-ext vs generic */
+
+
 static int sha224_init(struct digest *desc)
 {
 	struct sha256_state *sctx = digest_ctx(desc);
@@ -258,18 +308,22 @@ int sha256_update(struct digest *desc, const void *data,
 	src = data;
 
 	if ((partial + len) > 63) {
+		unsigned int blocks;
+
 		if (partial) {
 			done = -partial;
 			memcpy(sctx->buf + partial, data, done + 64);
-			src = sctx->buf;
+			sha256_transform_blocks(sctx->state, sctx->buf, 1);
+			done += 64;
 		}
 
-		do {
-			sha256_transform(sctx->state, src);
-			done += 64;
-			src = data + done;
-		} while (done + 63 < len);
+		blocks = (len - done) / 64;
+		if (blocks) {
+			sha256_transform_blocks(sctx->state, data + done, blocks);
+			done += blocks * 64;
+		}
 
+		src = data + done;
 		partial = 0;
 	}
 	memcpy(sctx->buf + partial, src, len - done);
-- 
2.43.0
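
Reviewer note: the block split in the reworked sha256_update() can be checked in user space with a minimal sketch like the one below. Everything in it is a hypothetical stand-in, not barebox code: toy_ctx, toy_transform_blocks() and toy_update() are made-up names, and the "transform" is a plain byte sum rather than SHA-256. It also writes the head-block length as 64 - partial directly, where the patch reaches the same value via done = -partial followed by done += 64. The only thing it is meant to show is that an update turns into at most two transform calls: one for the carried-over partial block and one for all remaining whole blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK 64

/* Hypothetical context: only what the split logic needs, plus counters. */
struct toy_ctx {
	unsigned long long count;      /* total bytes fed so far */
	unsigned char buf[TOY_BLOCK];  /* carry-over of an incomplete block */
	size_t blocks_seen;            /* whole blocks handed to the transform */
	size_t calls;                  /* number of transform invocations */
	unsigned int sum;              /* toy "digest": sum of transformed bytes */
};

/* Toy multi-block transform: consumes `blocks` whole blocks in one call. */
static void toy_transform_blocks(struct toy_ctx *c, const unsigned char *in,
				 size_t blocks)
{
	size_t i;

	for (i = 0; i < blocks * TOY_BLOCK; i++)
		c->sum += in[i];
	c->blocks_seen += blocks;
	c->calls++;
}

/* Same split as the patched bulk path, written with size_t arithmetic. */
static void toy_update(struct toy_ctx *c, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t partial = (size_t)(c->count % TOY_BLOCK);
	size_t done = 0;

	c->count += len;

	if (partial + len >= TOY_BLOCK) {
		size_t blocks;

		if (partial) {
			/* Top up the buffered block and flush it on its own. */
			done = TOY_BLOCK - partial;
			memcpy(c->buf + partial, p, done);
			toy_transform_blocks(c, c->buf, 1);
		}

		/* Everything that is still block-aligned goes out in one call. */
		blocks = (len - done) / TOY_BLOCK;
		if (blocks) {
			toy_transform_blocks(c, p + done, blocks);
			done += blocks * TOY_BLOCK;
		}
		partial = 0;
	}

	/* Keep the tail (fewer than TOY_BLOCK bytes) for the next update. */
	assert(done <= len);
	memcpy(c->buf + partial, p + done, len - done);
}
```

Feeding 10 bytes and then 1000 bytes into a zeroed toy_ctx should produce two transform calls covering 15 blocks and leave 50 bytes buffered; the pre-patch loop would have made 15 calls for the same input.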
