Hello Sean,
I have cc'ed the precis mailing list because some of what I'll write
below is relevant for the discussion you have started there. This is
also the reason why I'm keeping most of the previous context.
On 2015/11/11 00:25, Sean Leonard wrote:
Hello Martin,
On Nov 10, 2015, at 1:45 AM, Martin J. Dürst <[email protected]> wrote:
Hello Sean,
I have a few questions re. your registration below.
On 2015/11/05 14:57, Sean Leonard wrote:
Hello:
To keep this moving, trying a different thing. Please review.
Sean
*****
Type name: application
Subtype name: pkcs8-encrypted
Required parameters: N/A
Optional parameters:
charset: When the private key encryption algorithm incorporates a “password” that is an
octet string, a mapping between user input and the octet string is desirable. PKCS #5
[RFC2898] Section 3 recommends "that applications follow some common text encoding
rules"; it then suggests, but does not recommend, ASCII and UTF-8. This parameter
specifies the charset that a recipient SHOULD attempt first when mapping user input to the
octet string. It has the same semantics as the charset parameter from text/plain, except that
it only applies to the user’s input of the password. There is no default value.
Why does it say "This parameter specifies the charset that a recipient SHOULD
attempt *first*" here? Can't that encoding just be specified as such?
At least for future, similar efforts, it would be extremely desirable to not
leave character encoding open like this, but just to nail it down to UTF-8.
There seems to be something of a “cultural disconnect” between the security
people and the I18N/UI/UX people.
The I18N/UI/UX people want well-defined interfaces that work with users “in
their own language”, whether that language is visual, aural, tactile, symbolic,
pictorial, etc. Invariably this involves Unicode and a large character
repertoire such as 💩 and 大便所.
In contrast, the security people find open-ended things like Unicode to be
anathema and would much rather restrict the range of inputs to a small and
preferably uniformly distributed set of values. And there are good reasons for
that, because when you introduce bias into cryptographic protocols, it turns
out that it is a lot easier to cryptanalyze the results.
The common security protocols that I have seen that take passwords hand-wave
about character sets and encodings and define the password to be an octet
string. This is great for universality but bad for human input. PBKDF2 (PKCS
#5, on which this PKCS #8 EncryptedPrivateKeyInfo registration is based) is a
leading example of the “octet string” approach. Ultimately, the algorithms
don’t care what encoding it’s in, as long as they get a blob of bits (octets).
My knowledge of implementations of PKCS #5/#8/#12 suggests that there are many
applications out there that give zero thought to the encoding issue, which
means that they will take user input “As-Is”, i.e., in the current code page.
Note that PKCS #12 defines the input to this structure as a UTF-16LE encoded
character string, *with* a terminating U+0000 NULL character (i.e., the octets
00 00). This is really “weird” except of course for the fact that Microsoft
invented it and then shipped it without too much thought, in which case, all
weirdness can be explained.
It is a design criterion that if you extract such an EncryptedPrivateKeyInfo
blob from a PKCS #12 file, you should be able to process it. If you
specify UTF-8 as the one, single, true encoding of the password for
application/pkcs8-encrypted, that can’t happen.
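To make the incompatibility concrete, here is a minimal Python sketch (not taken from any of the specifications) of how one typed password becomes different octet strings depending on the mapping. The helper name password_octets and the sample password are made up for illustration. The UTF-16LE-plus-terminator form follows the description in this message; the byte order should be checked against RFC 7292 Appendix B.1 before relying on it.

```python
def password_octets(password: str, charset: str, null_terminated: bool = False) -> bytes:
    """Map user input to an octet string under the given charset (illustrative helper)."""
    # Optionally append U+0000 before encoding, as in the PKCS #12 case described above.
    text = password + "\u0000" if null_terminated else password
    return text.encode(charset)


pw = "pässword"  # hypothetical sample input

as_utf8 = password_octets(pw, "utf-8")
as_latin1 = password_octets(pw, "latin-1")
as_utf16le_z = password_octets(pw, "utf-16-le", null_terminated=True)

for label, octets in [("utf-8", as_utf8), ("latin-1", as_latin1), ("utf-16-le+0", as_utf16le_z)]:
    print(label, octets.hex())
```

All three results differ, so a key derived from one of them cannot be reproduced by a recipient who guesses another mapping.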
That's just fine, in this specific case. I have explicitly prefaced my
remark above with "At least in the future".
But if we know that the password is encoded in UTF-16LE, then why
doesn't your registration just say "This parameter specifies the
charset" rather than the handwavy "This parameter specifies the charset
that a recipient SHOULD attempt *first*".
Furthermore, UTF-8 is not uniformly distributed across the octet range. If your
users are in US-English they are highly likely to have octets in 20-7E. Octets
in 00-1F will be pretty rare. And if you choose scalar values randomly in
Unicode (regardless of assignment), you will see a *lot* of F0-F4 but virtually
none in 00-7F. And in spite of all this, octets F5-FF will *never* appear in
UTF-8.
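The claims about octet frequencies can be checked with a short Python sketch that encodes every Unicode scalar value once (a uniform choice, regardless of assignment) and counts the resulting octets. The function name utf8_octet_histogram is an assumption for illustration only.

```python
from collections import Counter


def utf8_octet_histogram() -> Counter:
    """Count each octet value over the UTF-8 encodings of all Unicode scalar values."""
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and cannot be encoded
        counts.update(chr(cp).encode("utf-8"))
    return counts


hist = utf8_octet_histogram()
ascii_octets = sum(hist[b] for b in range(0x00, 0x80))
lead_f0_f4 = sum(hist[b] for b in range(0xF0, 0xF5))
never_used = [b for b in range(0x100) if hist[b] == 0]

print("octets 00-7F:", ascii_octets)
print("octets F0-F4:", lead_f0_f4)
print("octets that never occur:", [f"{b:02X}" for b in never_used])
```

Under this uniform choice, only 128 of the 1,112,064 scalar values produce an octet in 00-7F, while 1,048,576 of them start with an octet in F0-F4; octets F5-FF (and also C0 and C1) never occur.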
It turns out that we have a pretty good source of uniformity and universality:
characters in the US-ASCII range 20-7E. Many password input boxes will only
accept US-ASCII and so users’ non-US-English keyboards will switch to US-ASCII
mode for the purpose of providing input to such boxes. What matters is not so
much the specific characters as a reasonable selection of arbitrary
buttons that a user can push *across a wide range of devices*. This ends up
giving you 5-6 bits of entropy per user input. So the need for UTF-8 or any
particular encoding is actually not as great as some people perceive.
My comment was specifically trying to say: If you use something more
than US-ASCII, make it UTF-8. I think that's also the general policy of
the IETF. As for entropy, the entropy needs to be measured over the
whole string. It's clear that in UTF-8 bytes, a password in the ASCII
range is shorter than a similar-length (in terms of characters) password
in a non-Latin script. The entropy of each byte will be lower, but the
entropy of the overall password should be about the same.
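As a small illustration of the length point only (it does not measure entropy), the following Python sketch compares the UTF-8 byte length of four 8-character strings. The sample strings are arbitrary and chosen purely for illustration.

```python
# Four 8-character samples from different parts of Unicode (arbitrary choices).
samples = {
    "ascii": "a" * 8,
    "cyrillic": "я" * 8,
    "cjk": "大" * 8,
    "emoji": "💩" * 8,
}

# Same number of characters, different number of UTF-8 octets.
utf8_lengths = {name: len(text.encode("utf-8")) for name, text in samples.items()}

for name, length in utf8_lengths.items():
    print(f"{name}: 8 characters -> {length} octets")
```

The same number of characters takes 8, 16, 24, and 32 octets respectively, which is why per-octet statistics differ even when the password as a whole carries comparable information.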
Something that's very important for passwords is how easy they are to
remember for actual people. It should be obvious that it's easier for
somebody to remember a password in the language/script they use every
day than in some foreign gibberish.
Overall I think that a standard such as IEEE 802.11 strikes a reasonable
balance. (See 802.11-2012 Annex M.4, which is informative, but is pretty much
the worldwide de-facto standard practice.) In 802.11, the input to PBKDF2 is
between 8-63 ASCII-encoded characters in the range 20-7E, or 64 hexadecimal
characters that convert directly to 32 octets.
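The input rule as summarized in this paragraph could be sketched in Python as below. This follows only the wording of this message, not the text of 802.11 itself, and the function names classify_wpa_input and hex_key_octets are hypothetical.

```python
import string


def classify_wpa_input(text: str):
    """Return "passphrase", "hex-key", or None, per the rule as summarized above."""
    # 64 hexadecimal characters are treated as a directly supplied 32-octet value.
    if len(text) == 64 and all(c in string.hexdigits for c in text):
        return "hex-key"
    # Otherwise accept 8 to 63 characters, each in the printable ASCII range 20-7E.
    if 8 <= len(text) <= 63 and all(0x20 <= ord(c) <= 0x7E for c in text):
        return "passphrase"
    return None


def hex_key_octets(text: str) -> bytes:
    """Convert a 64-character hexadecimal string to its 32 octets."""
    return bytes.fromhex(text)


print(classify_wpa_input("correct horse"))
print(classify_wpa_input("ab" * 32), len(hex_key_octets("ab" * 32)))
print(classify_wpa_input("大便所大便所大便"))
```

Note that a non-ASCII input such as the last one is rejected outright under this rule; it can only be entered indirectly as hexadecimal octets.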
So it's up to 63 ASCII characters but only up to 32 octets that may e.g.
be used for UTF-8? That doesn't strike me as a reasonable balance; it
puts a much stronger length limitation on some scripts outside ASCII.
***
To answer your questions directly:
Why does it say "This parameter specifies the charset that a recipient SHOULD
attempt *first*" here?
Can't that encoding just be specified as such?
The parameter is not cryptographically protected so it is subject to tampering
or substitution. Furthermore, a good-faith but naïve sender may put some
encoding (e.g., UTF-8) but not have the means to verify that the encoding
actually works, because the user did not supply the password. Basically it’s a
good-faith first effort, but this parameter can’t meaningfully restrict what
the sender or receiver attempt to do.
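One way to read "SHOULD attempt first" is as an ordering over candidate mappings rather than a hard constraint. The Python sketch below illustrates that reading; the function name candidate_octet_strings and the fallback list are assumptions made for illustration, not part of the registration.

```python
def candidate_octet_strings(password, hinted_charset=None, fallbacks=("utf-8", "latin-1")):
    """Yield (charset, octets) pairs, hinted charset first, skipping failures and duplicates."""
    order = ([hinted_charset] if hinted_charset else []) + list(fallbacks)
    seen = set()
    for charset in order:
        try:
            octets = password.encode(charset)
        except (LookupError, UnicodeEncodeError):
            continue  # unknown charset label, or input not representable in it
        if octets not in seen:
            seen.add(octets)
            yield charset, octets


# The hint is tried first; a bogus or tampered hint simply falls through.
print(list(candidate_octet_strings("pässword", "utf-16-le")))
print(list(candidate_octet_strings("pässword", "no-such-charset")))
```

A recipient would feed each candidate to the key derivation in turn and stop at the first one that decrypts successfully.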
That essentially applies to any single parameter in any single media
type registration, and in much more of what the IETF does. Yet this is
virtually never called out, because otherwise, IETF documents would be
full of such stuff and very hard to read.
Also, I am not sure how to specify the NULL suffix in the PKCS #12-extracted
case.
That may suggest that you are going down the wrong path here.
I suppose it could just be “+0” or something.
ualg: When the charset is a Unicode-based encoding, this parameter is a space-delimited
list of Unicode algorithms that a recipient SHOULD first attempt to apply to the Unicode
user input in succession, in order to derive the octet string. The list of algorithm
keywords is defined by [UNICODE]. “Tailored operations” are operations that are sensitive
to language, which must be provided as an input parameter. If a tailored operation is
called for, the exclamation mark followed by the [BCP47] language tag specifies the
language. For example, "toNFD toNFKC_Casefold!tr" first applies Normalization
Form D, followed by Normalization Form KC with Case Folding in the Turkish language,
according to [UNICODE] and [UAX31]. The default value of this parameter is empty, and
leaves the matter of whether to normalize, case fold, or apply other transformations
unspecified.
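The syntax described for ualg (space-delimited keywords, with an optional exclamation mark and language tag) could be parsed as in the following Python sketch. The function name parse_ualg is hypothetical, and the sketch does not validate keywords or language tags.

```python
def parse_ualg(value: str):
    """Split a ualg value into (algorithm, language_tag_or_None) pairs, in order."""
    steps = []
    for token in value.split():
        # Everything after the first "!" is taken as the language tag, if present.
        algorithm, _, language = token.partition("!")
        steps.append((algorithm, language or None))
    return steps


print(parse_ualg("toNFD toNFKC_Casefold!tr"))
```

For the example in the registration text this yields two steps, the second carrying the language tag "tr"; an empty value yields an empty list, matching the stated default.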
"When the charset is": Is this the charset parameter, or the actual encoding of
the password?
Admittedly this was vague. First draft. I am not sure what it should be. Per PKCS #5, the
"Actual Encoding" is just an octet string of arbitrary length.
I would limit this to cases when the charset parameter is present and defined.
Makes it easier.
What is a "Unicode algorithm"?
Conformance Clause D17.
Well, this, via the term "Named Unicode Algorithm" points to table 3.1
(page 93 in Unicode V 8.0).
Reading on and looking at the examples, the intent becomes clearer, at least to somebody
who has seen things such as toNFD and toNFKC and Casefold, but I hope we can avoid
"specification by example" here.
In fairness, “toNFD” and “toNFKC” are not defined terms. However, NFD (D118)
and NFKC (D121) are.
Yes, but not as (Named) Unicode Algorithms.
I would rather not create Yet Another Registry of things.
I'd agree in principle.
The terms are in fact defined in [UNICODE] in the conformance clauses.
Yes, but there are many other things defined there, too.
My usability perception is that if people really want to use Unicode in their
passwords, canonicalization is a very useful property to preserve. Case
folding/case mapping are not so useful, as most systems like to have
case-sensitive passwords for greater entropy, but “most systems” is not “all
systems” so we shouldn’t preclude the use of case algorithms. As for other
algorithms such as line breaking, character segmentation, Hangul syllable name
generation, etc., the short answer is “I don’t know”. (These are all reasons
why people stick with ASCII passwords, by the way.)
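A short Python sketch of why canonicalization matters for passwords: the same visible character can arrive as different code point sequences depending on the input method, and these encode to different octets unless normalized first. The choice of NFC here is only for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# Without normalization the two inputs give different octet strings.
raw_match = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalizing both to NFC the octet strings agree.
nfc_match = (
    unicodedata.normalize("NFC", precomposed).encode("utf-8")
    == unicodedata.normalize("NFC", decomposed).encode("utf-8")
)

print("match without normalization:", raw_match)
print("match after NFC:", nfc_match)
```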
Line breaking, character segmentation, Hangul syllable name
generation,... are completely irrelevant for passwords and passphrases.
Also, many algorithms come with options or parameters.
Also, if there is indeed a list of algorithm identifiers in [UNICODE], then it
would be good to give a Section number. Is the intent that each and every
algorithm named somewhere in [UNICODE] is implemented? My rough guess would be
that the average password input implementation implements only the identity
transform. [I would of course be positively surprised if I were wrong.]
See above; main thing that worries me is Normalization Forms.
Also, references for [UNICODE], [BCP47], and [UAX31] should be given so that
this registration is self-contained.
Ok.
Another possibility is that this registration goes back to “rev 1”, i.e., no
optional parameters about the character encoding at all. I think that is
perfectly defensible. But it is not particularly i18n-friendly.
I'm not sufficiently familiar with the format and the actual use cases,
but my suggestion would be to check what's actually out there in the
field (such as the Microsoft UTF-16LE including final NULL), and select
or create a list of parameters/algorithms (with a registry if it turns
out to be needed). To that, add a way to reference PRECIS, even if it's
not currently used, because that includes the expertise/recommendations
of experts.
The current proposal is essentially just saying: Unicode may define some of
the pieces you may want to use here, and may have labels for them, so
just give it a try. I'm not at all sure this will help interoperability,
except by similar accidents like the Microsoft one that you described above.
Regards, Martin.
Regards,
Sean
Regards, Martin.
Encoding considerations: binary
Security considerations:
Carries a cryptographic private key. See Section 6 of RFC 5958.
EncryptedPrivateKeyInfo PKCS #8 data contains exactly one private key. Poor
password choices, weak algorithms, or improper parameter selections (e.g.,
insufficient salting rounds) will make the confidential payloads much easier to
compromise.
Interoperability considerations:
PKCS #8 is a widely recognized format for private key information on all modern
cryptographic stacks. The encrypted variation in this registration,
EncryptedPrivateKeyInfo (Section 3, Encrypted Private Key Info, of RFC 5958, and Section
6 of PKCS #8), is less widely used for exchange than PKCS #12, but it is much simpler to
implement. The contents are exactly one private key (with optional attributes), so the
possibility for hidden "easter eggs" in the payload such as unexpected
certificates or miscellaneous secrets is drastically reduced.
Published specification:
PKCS #8 v1.2, November 1993 (republished as RFC 5208, May 2008); RFC 5958,
August 2010
Applications that use this media type:
Machines, applications, browsers, Internet kiosks, and so on, that support this
standard allow a user to import, export, and exercise a single private key.
Fragment identifier considerations: N/A
Additional information:
Deprecated alias names for this type: N/A
Magic number(s): None.
File extension(s): .p8e
Macintosh file type code(s): N/A
Person & email address to contact for further information:
Sean Leonard <dev+ietf&seantek.com>
Intended usage: COMMON
Restrictions on usage: None.
Author:
RSA, EMC, IETF
Change controller: The IETF
Provisional registration? (standards tree only): No