On Thu, 29 May 2025 at 10:37, Eric Biggers <ebigg...@kernel.org> wrote: > > Long-term, I'd like to find a clean way to consolidate the library code for > each > algorithm into a single module.
No, while I think the current situation isn't great, I think the "make it one single module" is even worse. For most architectures - including s390 - you end up being in the situation that these kinds of hw accelerated crypto things depend on some CPU capability, and aren't necessarily statically always available. So these things end up having stupid extra overhead due to having some conditional. That extra overhead is then in turn minimized with tricks like static branches, but that's all just just piling more ugly hacks on top because it picked a bad choice to begin with. So what's the *right* thing to do? The right thing to do is to just link the right routine in the first place, and *not* have static branch hackery at all. Because you didn't need it. And we already do runtime linking at module loading time. So if it's a module, if the hardware acceleration doesn't exist, the module load should just fail, and the loader should go on to load the next option. Not any silly "one module to rule them all" hackery that only results in worse code. Just a simple "if this module loads successfully, you'll link the optimal hw acceleration". Now, the problem with this all is the *non*modular case. For modules, we already have the optimal solution in the form of init-module error handling and runtime linking. So I think the module case is "solved" (except the solution is not what we actually do). For the non-module case, the problem is that "I linked this unconditionally, and now it turns out I run on hardware that doesn't have the capability to run this". And that's when you need to do things like static_call_update() to basically do runtime re-linking of a static decision. And currently we very much do this wrong. See how s390 and x86-64 (and presumably others) basically have the *exact* same problems, but they then mix static branches and static calls (in the case of x86-64) and just have non-optimal code in general. What I think the generic code should do (for the built-in case) is just have DEFINE_STATIC_CALL(sha256_blocks_fn, sha256_blocks_generic); and do static_call(sha256_blocks_fn)(args..); and then architecture code can do the static_call_update() to set their optimal version. And yeah, we'd presumably need multiple versions, since there's the whole "is simd usable" thing. Although maybe that's going away? Linus