Background
==========

[PR#14759](https://github.com/openssl/openssl/pull/14759) (Set blinding=yes property on some algorithm implementations) is a fix for [Issue#14654](https://github.com/openssl/openssl/issues/14654), which itself is a spin-off of [Issue#14616](https://github.com/openssl/openssl/issues/14616).
The original issue is about _Deprecated low level key API's that have no replacements_, among which the following received special attention and were given a dedicated issue during an OTC meeting:

~~~c
// RSA functions on RSA_FLAG_*
void RSA_clear_flags(RSA *r, int flags);
int RSA_test_flags(const RSA *r, int flags);
void RSA_set_flags(RSA *r, int flags);

// RSA_FLAG_* about blinding
#define RSA_FLAG_BLINDING
#define RSA_FLAG_NO_BLINDING

// RSA functions directly on blinding
int RSA_blinding_on(RSA *rsa, BN_CTX *ctx);
void RSA_blinding_off(RSA *rsa);
BN_BLINDING *RSA_setup_blinding(RSA *rsa, BN_CTX *ctx);
~~~

The decision that sprang from Issue#14616 and led to PR#14759 was to use the propquery mechanism to let providers advertise algorithms as `blinding=yes`, so that secure implementations can be selected when insecure ones are present as well.

Similarly, a potential `consttime=yes` property was discussed that would work in the same way: if applied properly to our current implementations that are not fully constant time, it would allow a user/sysadmin/developer to prefer a third-party implementation of the same algorithm with better security guarantees. In some contexts a consttime implementation might carry a serious performance penalty, and in contexts where constant time is not required this would allow selecting accordingly.

Definition for the blinding property
------------------------------------

The current definition of the `blinding` property applies to provider-native algorithm implementations for the `asym_cipher` and `signature` operations:

```pod
=head2 Properties

The following property is set by some of the OpenSSL signature algorithms.

=over 4

=item "blinding"

This boolean property is set to "yes" if the implementation performs blinding
to prevent some side-channel attacks.
```

Rationale
=========

Property queries are our decision-making process for implementation selection, and have been part of the design for 3.0 since the Brisbane design meetings.

I am not opposing the use of the property query mechanism to select algorithms here, but rather questioning the semantics of the properties we decide to adopt (and thus endorse and encourage for use by 3rd parties as well). In particular, I see the following issues with choices like `blinding=yes` or `consttime=yes`.

Design promotes insecure by default
-----------------------------------

This design is a slippery slope into patterns that go against "secure by default" best practices. Users/administrators/developers should not have to opt in for the safer implementation, but the other way around: the opt-in should be for the less safe but more performant implementations, after evaluating the consequences of such a choice in specific contexts. We shouldn't have users querying for `consttime=yes` algorithms, but rather querying explicitly for `consttime=no` in specific conditions.

So, if this were the only issue with PR#14759, my recommendation would be to flag the non-blinding implementations as such, rather than the other way around as is currently done. The scenario in which 3rd party providers offer insecure algorithms not flagged as such with `consttime=no` or `blinding=no` would then, IMO, fall under the "know your providers" umbrella: if you are installing provider X you should be well aware that you are trusting X's authors "to do the right thing".
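To make the opt-in versus opt-out distinction concrete, here is a minimal sketch (mine, not part of the PR) of how the two query styles would look from application code, using the OpenSSL 3.0 fetch API; the `blinding` property values are just the ones under discussion, not an established part of any provider contract:

```c
#include <openssl/evp.h>

/* Minimal sketch of propquery-based selection; error handling omitted. */
static void propquery_selection_sketch(OSSL_LIB_CTX *libctx)
{
    /* PR#14759 style: the caller must opt in to get a blinding implementation. */
    EVP_SIGNATURE *optin = EVP_SIGNATURE_fetch(libctx, "RSA", "blinding=yes");

    /*
     * Inverted style argued for above: blinding implementations are the
     * default, and only a deliberate query selects one flagged as
     * non-blinding.
     */
    EVP_SIGNATURE *optout = EVP_SIGNATURE_fetch(libctx, "RSA", "blinding=no");

    EVP_SIGNATURE_free(optin);
    EVP_SIGNATURE_free(optout);
}
```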
The project history has shown us that the risk of insecure-by-default designs, and its consequences, is not just a hypothetical matter of debate: for example, `BN_FLG_CONSTTIME` has been with us since version 0.9.8f (and even before that under the name `BN_FLG_EXP_CONSTTIME`), well over 14 years during which the decision to keep the flag off by default and require manually enabling it when desired has been the cause of many vulnerabilities and fixes. It is still a concern today: we fixed yet another instance of forgetting to set it just a few weeks ago (PR#13889).

In this case, though, I expect that just flipping the default is not enough to make this PR acceptable: the problem with going for the negative versions of these flags is that such a design can't be made future-proof as we would wish: we can't know in advance all the possible properties that an algorithm released today as part of our providers should declare as `=no`, when new relevant properties might be defined in the future. E.g., what if, after 3.0 is released, an ECCKiila provider appeared and opted to tag its implementations as `formally_verified`? They could add `formally_verified=yes`, but then they would fall into the "insecure by default" pattern: users would need to opt in for the safer version instead of getting it by default, and there is no other option, given that our and other providers' offerings were not preemptively flagged with `formally_verified=no`.

Scope of propquery semantics
----------------------------

This follows directly from the previous example. The design of the propquery mechanism seems (IMO) able to provide __well-defined__ semantics for properties only within the scope of a single provider (and a single provider version at that).

If a given provider offered at runtime two or more implementations of the same algorithm, one flagged as `consttime=no` and one with consttime guarantees, then it could document for its user base how to set up default propqueries in the configuration file at installation time, or for application developers how to embed them in their code depending on the use case, to select the `consttime=no` offering when such a thing is perceived as desirable for whatever (ill-guided :P) reason.

But doing this across providers does not really work, because other provider authors might not have flagged their insecure implementations with `consttime=no`, or because there might be very different opinions on what qualifies as "fully consttime" (opinions that evolve over time with different threat models and attacker capabilities): provX's `consttime=yes` might be "more consttime" than provY's `consttime=yes`.

For example, provX is similar to our default provider and offers an EC algorithm built on top of a generic BIGNUM module, so its generic implementation supports any valid EC curve, named or custom. orgX made sure that the ECDH/ECDSA and ECmath layers of the implementation follow state-of-the-art practices regarding secret-dependent code execution and data access, so they flag their EC implementation as `consttime=yes`, even though the underlying BIGNUM module has limited consttime guarantees, and in certain circumstances a dynamic `realloc` or some other seemingly minor thing could leak a few secret bits to a motivated attacker.

provY is also similar to our default provider and provides an EC algorithm, but instead of a generic codepath they support only a subset of named curves (e.g., only P256 and P521), each with a dedicated implementation that removes the need for a generic BIGNUM layer and instead embeds its specific consttime field arithmetic in each curve implementation (e.g., they provide only our `ecp_nistz256.c` and `ecp_nistp521.c` implementations). provY's ECDH/ECDSA, ECmath and FFmath layers all follow best consttime practices, so they flag their EC offering as `consttime=yes`.

Now `consttime=yes` is not well-defined anymore across providers, making it quite useless for users and providers to rely on `consttime=yes` or `consttime=no` to pick among offerings from different providers.

The same would be true within the same organization and provider, across provider versions. Let's say tomorrow we were to merge a PR that made all of BIGNUM resistant against all currently known timing side-channel techniques. Coincidentally, we decided to mark all our pubkey implementations as `consttime=yes` after verifying that all the layers above BIGNUM are also timing-resistant against current techniques. Fast-forward a year: a new CPU design flaw is discovered, giving a remote attacker a side channel with unprecedented resolution and SNR. Our previous `consttime=yes` semantics would not stand the test of time, and the disruption of semantics could have ripple effects within our providers and throughout the stack of applications that embedded `consttime=yes` as a selection mechanism to pick among offerings from various provider versions, now with ill-defined qualifiers.

If we can agree, even just on some level or in some specific cases like these, that the definitions of properties in propqueries have their scope limited to a single provider version, what is the point of exposing `blinding=*` to our users, given that at runtime we don't offer alternative implementations of the same algorithm with different properties? (I mean, yes, we do have alternative choices at runtime in some cases, but not at the algorithm level in the provider API: they are buried well within each algorithm implementation that offers such choices.)

Too vague/coarse semantics
--------------------------

Elaborating further on the examples above, there is also a conceptual issue at the base of the OTC discussion that led to PR#14759, namely that "blinding is a property of an algorithm: either an implementation does blinding or it doesn't". But this is not entirely true: blinding is applied to individual operations at various levels.

RSA is a special case that has the luxury of resting its security on basically a single mathematical operation (modular exponentiation) over a single mathematical layer of abstraction (finite field math provided by BIGNUM): so it is enough to apply blinding to the modular exponentiation to say "this algorithm implementation does blinding". That is already not true for DSA signing (which relies on the same single finite field math abstraction provided by BIGNUM, but rests on two operations, exponentiation and modular inversion), and it goes one more layer deep with ECDH/ECDSA, where you have two operations (scalar multiplication and modular inversion), of which the first is composed of math operations coming from the EC arithmetic layer of abstraction, and each of those operations is itself built as the composition of many operations from the same FF arithmetic layer provided by BIGNUM. The number of levels of abstraction involved in an implementation keeps rising steeply when we move into the realm of upcoming post-quantum algorithms.
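To make the layering concrete, here is a schematic and deliberately simplified sketch (mine, not OpenSSL's actual ECDSA code) of an ECDSA-style signing flow written against the public BN/EC APIs; the inversion-blinding trick shown is just one possible choice, and retries for degenerate values are omitted. The point is that the scalar multiplication and the modular inversion are distinct operations, on distinct abstraction layers, and each can be blinded (or not) independently:

```c
#include <openssl/bn.h>
#include <openssl/ec.h>

/*
 * Schematic sketch of ECDSA-style signing, NOT OpenSSL's implementation.
 * `hash` is assumed to be the message digest already reduced mod the group
 * order; retries for k == 0 or r == 0 are omitted for brevity.
 * Returns 1 on success, 0 on error.
 */
static int sketch_ecdsa_sign(const EC_GROUP *group, const BIGNUM *priv,
                             const BIGNUM *hash, BIGNUM *r, BIGNUM *s,
                             BN_CTX *ctx)
{
    const BIGNUM *order = EC_GROUP_get0_order(group);
    EC_POINT *R = EC_POINT_new(group);
    BIGNUM *k = BN_new(), *kinv = BN_new(), *b = BN_new();
    BIGNUM *x = BN_new(), *tmp = BN_new();
    int ok = 0;

    if (R == NULL || k == NULL || kinv == NULL || b == NULL
            || x == NULL || tmp == NULL)
        goto err;

    /* Per-signature secret nonce k. */
    if (!BN_priv_rand_range(k, order))
        goto err;

    /*
     * Layer 1: scalar multiplication R = k*G.  Whether and how this is
     * blinded (e.g. via scalar randomization) is a choice made here, and
     * may dispatch to very different code paths depending on the curve.
     */
    if (!EC_POINT_mul(group, R, k, NULL, NULL, ctx)
            || !EC_POINT_get_affine_coordinates(group, R, x, NULL, ctx)
            || !BN_nnmod(r, x, order, ctx))
        goto err;

    /*
     * Layer 2: modular inversion of k.  Blinding it is a separate,
     * independent choice: here k^-1 is computed as (k*b)^-1 * b with a
     * fresh random b, so the inversion never operates on k directly.
     */
    if (!BN_priv_rand_range(b, order)
            || !BN_mod_mul(tmp, k, b, order, ctx)
            || BN_mod_inverse(kinv, tmp, order, ctx) == NULL
            || !BN_mod_mul(kinv, kinv, b, order, ctx))
        goto err;

    /* s = k^-1 * (hash + r*priv) mod order */
    if (!BN_mod_mul(tmp, r, priv, order, ctx)
            || !BN_mod_add(tmp, tmp, hash, order, ctx)
            || !BN_mod_mul(s, kinv, tmp, order, ctx))
        goto err;

    ok = 1;
 err:
    EC_POINT_free(R);
    BN_clear_free(k);
    BN_clear_free(kinv);
    BN_clear_free(b);
    BN_free(x);
    BN_free(tmp);
    return ok;
}
```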
When you have multiple operations from multiple layers, applying blinding efficiently is non-trivial: one might blind only the EC scalarmul but not the FF modinv, because the former takes the bulk of the computation time and the common consensus at some point in time is that there is little point in blinding the modinv if its SNR is deemed too low to be exploitable. But what happens when this does not hold anymore? What is the meaning of `blinding=yes` for an EC implementation that applied blinding only to the scalarmul and not to the modinv, because that was the best practice at the time it was written?

What if blinding is applied to the scalarmul, but the FF operations performing the blinding already leak a small amount because of BIGNUM? A small leakage might not be considered a problem with 2048+ bit RSA keys (the shared opinion might be that 1 or 3 bits of knowledge wouldn't give the attacker a noticeable boost compared to other tools in their arsenal), but the same few bits have been proven sufficient to break even P384 or P521 keys (which are above the security level of RSA 2048) with limited effort and cost: what is the meaning of `blinding=yes` when the blinding implementation itself reveals information about the secret it should protect?

What if the blinding operation does not leak a single bit by itself, but magnifies some effect from the underlying level of abstraction? E.g., blinding the scalarmul could make the nonce bigger than the prime-order subgroup size `n`, while the specific scalarmul implementation must guarantee that its scalar input is less than `n` to compute correctly; so it has a special branch to robustly handle the bigger scalars coming from blinding, creating a magnifying lens for an attacker when the branch is taken, as that says something about the probability of the high-order bits of the scalar being set.

This might seem to have nothing to do with the semantics of `blinding=yes`, but it does for providers like ours, where we apply blinding at the top abstraction layer (and, on certain paths, also at strategic points in the second layer) but then have a number of different sub-algorithm implementations that behave very differently: depending on the curve name (and the encoding format when parsing a key) you might end up in the generic prime/binary curve implementation, or in one of the specific ones like nistz256 or nistp521. Applying `blinding=yes` like this:

```c
{ "ECDSA", "provider=default,blinding=yes", ossl_ecdsa_signature_functions }
```

is already making the semantics of the `blinding` property ill-defined, and affecting its effectiveness and purpose as a decision mechanism to be exposed to users: we have different codepaths flagged as applying blinding under the same umbrella, but even within it that blinding has very different shades of "yes".

As @t8m said, "Security is not a boolean value", and my point here is that neither are `blinding`, `consttime` and similar properties: they come in shades (even `formally_verified`: maybe the EC layer is formally verified by tools, but the underlying FF module might come with different guarantees, etc.), and the threshold by which the state of the art deems the quality to be "yes" or "no" changes over time (e.g., it was fine to blind only the scalarmul yesterday, so implementation X was considered a blinding implementation, but today modinv blinding is also required, so the same implementation X is now deemed non-blinding).

Alternative
-----------

Support for the specific `RSA_FLAG_BLINDING` & co. functionality could instead just be dropped.
My understanding is that its usefulness is already limited to legacy compatibility with engines, which in any case would not play nicely with the new provider/core architecture. More generally, as a mechanism to select secure implementations when insecure ones are present as well, the problem can already be solved by applying the "know your providers" mantra in combination with the `provider=default`, `provider=provX` properties to pick among the offerings of multiple providers.

This wouldn't cover the use case (is this even supported/tested right now?) of a single provider registering the same algorithm twice with different alternative implementations and properties: in this case, though, it would be possible to use a property definition local to the specific provider version, with its documentation informing users how to use propqueries to select what, under which conditions, and with which caveats. Luckily this last case does not affect our implementations, as we don't offer competing algorithms at the provider level.

Although I wouldn't propose doing it, if we really, really wanted to retain the functionality of `RSA_FLAG_BLINDING` and have algorithms expose knobs capable of turning on/off (or 1000/500/0) whatever degree of blinding they support, we could achieve this with the existing ctrls/params, as we do for many other tunable knobs. Using ctrls/params rather than doing this at the propquery level has the advantage that the definition/semantics can be scoped with finer granularity than the provider level offers.

As a user/sysadmin, if I have evaluated the specific context and can responsibly decide to pick no-blinding, no-consttime, no-secure or similarly hazardous choices, I would expect to be in a situation anyway where I have so much control over my system, software and environment that patching the code to pass a param or call a ctrl function deep inside my application shouldn't be a blocker at all.

Summary
-------

I am opposed to merging PR#14759 for several reasons:

- it lends itself to insecure-by-default patterns, repeating mistakes from the past
- the validity scope of such properties is effectively limited to a single provider version, so their effectiveness for their intended purpose, i.e. a selection mechanism for users to pick among competing algorithms offered by different providers, is limited (and has the potential to come back to haunt us in the future)
- their definition and semantics are vague, and their applicability as algorithm properties is too coarse to be well-defined even inside our own providers
- `secure`, `blinding`, `consttime`, etc. are not boolean properties of algorithms (even if I really wish they could be)
- it's one more vague concept for everyone to understand, and one more item with not-too-clear logic for the project to handle and maintain for the foreseeable future
- we have already established alternatives in the form of per-implementation params or ctrls to tune security parameters that can alter the execution flow of our implementations

Also, from a more general usability perspective, and thinking of supporting those in our community who develop 3rd party providers, I think we should also

- [in the vote] take a position on what are good/bad ideas when devising provider-specific properties, and have some official guidelines about it in the documentation by the time the final 3.0.0 is released, and
- [not in the vote, could be a separate discussion] document what mechanisms we plan to put in place to coordinate efforts which will allow some carefully evaluated properties to have well-defined semantics (with versioning?) across providers/vendors (some form of registry of properties, with well-defined criteria for maintaining it?)

Proposed vote text
==================

Do not merge PR#14759, prevent declaring properties similar to `blinding=yes` or `consttime=yes` in our implementations, and discourage 3rd parties from adopting similar designs.

Please provide feedback on the vote text: I would like to open the vote on Monday and have some public debate about its points via the list, rather than compress it into the OTC meeting, which already has enough items on its agenda!

Thanks,
Nicola