At the request of the chairs, I have reviewed three PRECIS documents. Here
is the review of draft-ietf-precis-framework-20.
The review focuses on the use of Unicode and the relationship with IDNA2008.
My comments are below.
Best, Patrik Fältström
draft-ietf-precis-framework-20
Abstract
Application protocols using Unicode characters in protocol strings
need to properly handle such strings in order to enforce
internationalization rules for strings placed in various protocol
slots (such as addresses and identifiers) and to perform valid
comparison operations (e.g., for purposes of authentication or
authorization). This document defines a framework enabling
application protocols to perform the preparation, enforcement, and
comparison of internationalized strings ("PRECIS") in a way that
depends on the properties of Unicode characters and thus is agile
with respect to versions of Unicode. As a result, this framework
provides a more sustainable approach to the handling of
internationalized strings than the previous framework, known as
Stringprep (RFC 3454). This document obsoletes RFC 3454.
Explanation of what my review concentrates on:
When using a character set like Unicode, where operations such as transformation and
comparison are needed, the actual transformation can happen in multiple locations in the
architecture. Applications therefore need to understand what format received strings are
in and what format strings are expected to be in when they are sent. Of course the
principle of "be liberal in what you accept, and conservative in what you send"
is very important.
A simplified sketch of the architecture is as follows:
1. [A] sends a string to [B] for storage
2. [C] sends a string to [B] for lookup, which implies that a matching algorithm is
applied
3. If there is a match, data is sent back to [C]
In this very simple way of looking at the issues the questions include:
A. What Unicode code points should [A] accept as input?
B. What transformation is [A] expected to do?
C. What transformation is [A] allowed to do?
D. What Unicode code points is [A] allowed to send to [B]?
E. What Unicode code points can [B] expect from [A]?
F. What transformation is [B] expected to do on data sent from [A] before data
is stored in the database?
:
:
I.e. it must be as clear as possible for each one of the parties A, B and C
what they are expected to do, what they must (and must not) do.
And of course the ultimate difficulty is that the Unicode Character Set is
designed in such a way that there are many equivalences that humans, in various
contexts, expect to be treated as "the same". What counts as the same of course
differs between contexts.
Now to the review...
4. Enable application protocols to define profiles of the PRECIS
string classes if necessary (addressing matters such as width
mapping, case mapping, Unicode normalization, and directionality)
but strongly discourage the multiplication of profiles beyond
Saint-Andre & Blanchet Expires May 25, 2015 [Page 4]
Internet-Draft PRECIS Framework November 2014
necessity in order to avoid violations of the Principle of Least
User Astonishment.
It must also be clear who has the responsibility to do whatever transformations
are needed.
It is expected that this framework will yield the following benefits:
o Application protocols will be agile with regard to Unicode
versions.
o Implementers will be able to share code point tables and software
code across application protocols, most likely by means of
software libraries.
o End users will be able to acquire more accurate expectations about
the characters that are acceptable in various contexts. Given
this more uniform set of string classes, it is also expected that
copy/paste operations between software implementing different
application protocols will be more predictable and coherent.
It must also be clear to everyone involved what the normative, authoritative
source is for what is allowed and what is not.
For IDNA2008 (for example) it is the _algorithm_ that is normative, not any
tables derived by applying the algorithm to a specific version
of the Unicode Character Set.
When an application applies a profile of a PRECIS string class, it
can achieve the following objectives:
a. Determine if a given string conforms to the profile, thus
enabling enforcement of the rules (e.g., to determine if a string
is allowed for use in the relevant protocol slot specified by an
application protocol).
b. Determine if any two given strings are equivalent, thus enabling
comparison (e.g., to make an access decision for purposes of
authentication or authorization as further described in
[RFC6943]).
And of course a transformation can be applied to a received string before it is
passed on to the next step of whatever process the application is participating
in. In this case, the string is in one form before the transformation and
another form after it. It must also be clear to everyone
involved that the transformations applied are very seldom (if at all) reversible.
This is specifically the case for case folding transformations.
3. Preparation, Enforcement, and Comparison
This document distinguishes between three different actions that an
entity can take with regard to a string:
o Enforcement entails applying all of the rules specified for a
particular string class or profile thereof to an individual
string, for the purpose of determining if the string can be used
in a given protocol slot.
o Comparison entails applying all of the rules specified for a
particular string class or profile thereof to two separate
strings, for the purpose of determining if the two strings are
equivalent.
In fact I think "comparison" entails three steps:
1. Apply all transformation and enforcement on string A.
2. Apply all transformation and enforcement on string B.
3. Compare the strings A and B Unicode character by character. Only if all
characters are the same is the result a positive match.
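A minimal sketch of these three steps, not taken from the draft. It assumes a
hypothetical profile whose transformation is case folding followed by NFC and
whose enforcement simply rejects unassigned code points. Those choices, and the
names transform, enforce and equivalent, are illustrative assumptions only.

```python
import unicodedata

def transform(s: str) -> str:
    # Assumed transformation: case fold, then NFC. A real profile defines its own.
    return unicodedata.normalize("NFC", s.casefold())

def enforce(s: str) -> str:
    # Assumed enforcement: reject unassigned code points (General_Category Cn).
    for ch in s:
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return s

def equivalent(a: str, b: str) -> bool:
    # Steps 1 and 2: transform and enforce each string independently.
    a2 = enforce(transform(a))
    b2 = enforce(transform(b))
    # Step 3: compare the results code point by code point.
    return a2 == b2

print(equivalent("Alice", "alice"))  # True
print(equivalent("alice", "bob"))    # False
```

Note that a string failing enforcement raises an error here instead of
producing "no match"; which of the two a protocol wants is a separate decision.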
o Preparation entails only ensuring that the characters in an
individual string are allowed by the underlying PRECIS string
class.
I think the idea with "preparation" is to apply certain transformations and then,
after transformation, ensure that all characters are allowed in the context in which
they appear, so that the final string after the preparation step is a valid PRECIS string?
I would recommend explicitly mentioning the fact that (destructive) transformation
might occur in this step.
In most cases, authoritative entities such as servers are responsible
for enforcement, whereas subsidiary entities such as clients are
responsible only for preparation. The rationale for this distinction
is that clients might not have the facilities (in terms of device
memory and processing power) to enforce all the rules regarding
internationalized strings (such as width mapping and Unicode
normalization), although they can more easily limit the repertoire of
characters they offer to an end user. By contrast, it is assumed
that a server would have more capacity to enforce the rules, and in
any case acts as an authority regarding allowable strings in protocol
slots such as addresses and endpoint identifiers. In addition, a
client cannot necessarily be trusted to properly generate such
strings, especially for security-sensitive contexts such as
authentication and authorization.
This paragraph is very vague. I think the protocol needs a much stricter
specification of who is expected to do what. This is because the protocol itself
(for example, between client and server) must be robust enough to carry
whatever code points the client is using.
Valid: Defines which code points and character categories are
treated as valid input to the string.
The term "input" is not clear to me, given that transformation might occur.
Disallowed: Defines which code points and character categories need
to be excluded from the string.
It is a bit confusing to talk about both categories and code points at the same
time. I would recommend that the document at this point talk only about which
code points are disallowed. The reason is that a category might be
disallowed while a code point of that category is allowed
(based on other rules, like exceptions). To make it crystal clear what is
disallowed, I recommend using that term only for code points.
4.2.1. Valid
o Code points traditionally used as letters and numbers in writing
systems, i.e., the LetterDigits ("A") category first defined in
[RFC5892] and listed here under Section 8.1.
o Code points in the range U+0021 through U+007E, i.e., the
(printable) ASCII7 ("K") rule defined under Section 8.11. These
code points are "grandfathered" into PRECIS and thus are valid
even if they would otherwise be disallowed according to the
property-based rules specified in the next section.
Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
category from IDNA2008, the range of characters allowed in the
IdentifierClass is wider than the range of characters allowed in
IDNA2008. The main reason is that IDNA2008 applies the Unstable
category before the LetterDigits category, thus disallowing
uppercase characters, whereas the IdentifierClass does not apply
the Unstable category.
You must remove the code points of category "C" (IgnorableProperties) in RFC5892.
Or to state things differently: if one looks at the Unicode tables, the following
combinations of categories exist for code points that match category "A" and at
least one more category, for Unicode 7.0.0:
AB
ABC
ABF
AC
ACI
AD
AE
AF
AI
Several of these combinations are, given this definition, valid,
although I would not recommend them for use in identifiers.
Further, regarding not applying the Unstable category: this implies it is allowed
to use code points in PRECIS that are not stable under normalization and/or case
folding. The normalization and/or case folding must still be done somewhere
before matching happens.
Let's for example say that "A" and "a" are both valid (which they would be). The question is then
whether case mapping is applied before comparison or not, and if it is, it must be ensured that the
two identities "A" and "a" cannot both be created in the same namespace.
I know this is exactly what you have been discussing, but it
must be absolutely crystal clear that everyone understands what this implies,
specifically when case folding (to lower case) is replaced with NFC or some other
normalization algorithm.
More about this later.
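To illustrate the point about the Unstable category: RFC 5892 defines Unstable
("B") as toNFKC(toCaseFold(toNFKC(cp))) != cp. The sketch below approximates
that test with Python's unicodedata.normalize and str.casefold (an assumption;
results depend on the Unicode version bundled with the interpreter). The
Registry class is hypothetical and only shows one way to ensure "A" and "a"
cannot both be created when matching is case-insensitive.

```python
import unicodedata

def is_unstable(cp: int) -> bool:
    # Approximation of RFC 5892 category B:
    # toNFKC(toCaseFold(toNFKC(cp))) != cp
    s = chr(cp)
    nfkc = unicodedata.normalize("NFKC", s)
    return unicodedata.normalize("NFKC", nfkc.casefold()) != s

class Registry:
    """Hypothetical namespace that stores identifiers under a comparison key."""

    def __init__(self):
        self._by_key = {}

    def create(self, identifier: str) -> None:
        # Assumed comparison key: case fold, then NFC.
        key = unicodedata.normalize("NFC", identifier.casefold())
        if key in self._by_key:
            raise ValueError(
                f"{identifier!r} collides with {self._by_key[key]!r}")
        self._by_key[key] = identifier

print(is_unstable(ord("A")), is_unstable(ord("a")))  # True False
```

With this registry, creating "A" and then "a" raises ValueError, because both
map to the same key.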
4.2.3. Disallowed
See above.
Some application technologies need strings that can be used in a
free-form way, e.g., as a password in an authentication exchange (see
[I-D.ietf-precis-saslprepbis]) or a nickname in a chatroom (see
[I-D.ietf-precis-nickname]). We group such things into a class
called "FreeformClass" having the following features.
Security Warning: As mentioned, the FreeformClass prioritizes
expressiveness over safety; Section 11.3 describes some of the
security hazards involved with using or profiling the
FreeformClass.
Security Warning: Consult Section 11.6 for relevant security
considerations when strings conforming to the FreeformClass, or a
profile thereof, are used as passwords.
There are very dangerous issues when using this class for any kind of comparison, specifically
in the case of passwords and user names (or file names), where it is unclear what kind of
normalization might happen between "the keyboard" and "the application". I.e.
the user might firmly believe they are entering a certain code point, but in reality what the
application sees is either NFC(string) or NFD(string), and which one may vary with the operating
system (or file system) in use. This is specifically a problem when the normalization is left undefined.
I am all in favor of leaving this undefined for this class, but then it might
be best not to do any kind of matching (including searching), unless some
kind of transformation is made for the matching/searching.
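A small illustration of the NFC/NFD problem, using Python's unicodedata module
(the helper name same_after_nfc is made up for this example). The same visible
character "é" can arrive as one code point or as two, and a plain comparison
then fails unless both sides are normalized first.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (what NFC yields)
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT (what NFD yields)

def same_after_nfc(a: str, b: str) -> bool:
    # Normalize both sides to NFC before comparing code points.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(same_after_nfc(composed, decomposed))                   # True
```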
I would recommend the following general rules:
- IdentifierClass is used wherever it is important that the namespace includes only
globally unique strings, like identifiers for user names etc.
- IdentifierClass is also used for passwords and whenever a comparison is
used, but the transformation should not be destructive.
- FreeformClass is used for storage of various things.
- Protocols must be stable for FreeformClass strings in transport.
4.3.1. Valid
See comments above regarding combination of A with other categories.
5.1. Profiles Must Not Be Multiplied Beyond Necessity
The risk of profile proliferation is significant because having too
many profiles will result in different behavior across various
applications, thus violating what is known in user interface design
as the Principle of Least Astonishment.
Indeed, we already have too many profiles. Ideally we would have at
most two or three profiles. Unfortunately, numerous application
protocols exist with their own quirks regarding protocol strings.
Domain names, email addresses, instant messaging addresses, chatroom
nicknames, filenames, authentication identifiers, passwords, and
other strings are already out there in the wild and need to be
supported in existing application protocols such as DNS, SMTP, XMPP,
IRC, NFS, iSCSI, EAP, and SASL among others.
Nevertheless, profiles must not be multiplied beyond necessity.
To help prevent profile proliferation, this document recommends
sensible defaults for the various options offered to profile creators
(such as width mapping and Unicode normalization). In addition, the
guidelines for designated experts provided under Section 9 are meant
to encourage a high level of due diligence regarding new profiles.
What are the requirements for creating a new profile?
Either there are requirements or there are not. The text above does not help much if
there is a conflict in the future regarding a request for registration of a new
profile. Are you really happy with what is above?
The WG must honestly say "yes" to this. If they do, I am happy! :-)
5.2.1. Width Mapping Rule
The width mapping rule of a profile specifies whether width mapping
is performed on fullwidth and halfwidth characters, and how the
mapping is done. Typically such mapping consists of mapping
fullwidth and halfwidth characters, i.e., code points with a
Decomposition Type of Wide or Narrow, to their decomposition
mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
mapped to DIGIT ZERO (U+0030).
The normalization form specified by a profile (see below) has an
impact on the need for width mapping. Because width mapping is
performed as a part of compatibility decomposition, a profile
employing either normalization form KD (NFKD) or normalization form
KC (NFKC) does not need to specify width mapping. However, if
Unicode normalization form C (NFC) is used (as is recommended) then
the profile needs to specify whether to apply width mapping; in this
case, width mapping is in general RECOMMENDED because allowing
fullwidth and halfwidth characters to remain unmapped to their
compatibility variants would violate the Principle of Least
Astonishment. For more information about the concept of width in
East Asian scripts within Unicode, see Unicode Standard Annex #11
[UAX11].
Doing this mapping is not easy. I strongly recommend that an explicit algorithm
be presented, which a profile then either uses or does not use.
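As one possible shape for such an algorithm, here is a sketch that follows the
draft's own description: code points whose Decomposition Type is Wide or Narrow
are replaced by their decomposition mapping, everything else is left alone. It
relies on Python's unicodedata.decomposition, which returns strings such as
"<wide> 0030". The function name width_map is an assumption, not something the
draft defines.

```python
import unicodedata

def width_map(s: str) -> str:
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)   # e.g. "<wide> 0030" or ""
        if decomp.startswith(("<wide>", "<narrow>")):
            # Replace the code point with its decomposition mapping.
            mapped = decomp.split()[1:]
            out.append("".join(chr(int(h, 16)) for h in mapped))
        else:
            # No width decomposition: keep the code point unchanged.
            out.append(ch)
    return "".join(out)

print(width_map("\uff10"))           # FULLWIDTH DIGIT ZERO -> 0
print(width_map("\uff21bc\u00e9"))   # Abcé
```

Unlike NFKC, this leaves other compatibility characters and canonical
decompositions (such as U+00E9) untouched.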
5.2.3. Case Mapping Rule
The case mapping rule of a profile specifies whether case mapping is
performed (instead of case preservation) on uppercase and titlecase
characters, and how the mapping is done (e.g., mapping uppercase and
titlecase characters to their lowercase equivalents).
You either apply mapping or not, to _all_ code points. The above makes it
look as if case mapping is sometimes not to be performed.
If case mapping is desired (instead of case preservation), it is
RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
3 of the Unicode Standard [Unicode7.0].
Note: Unicode Default Case Folding is not designed to handle
various localization issues (such as so-called "dotless i" in
several Turkic languages). The PRECIS mappings document
[I-D.ietf-precis-mappings] describes these issues in greater
detail and defines a "local case mapping" method that handles some
locale-dependent and context-dependent mappings.
In order to maximize entropy and minimize the potential for false
positives, it is NOT RECOMMENDED for application protocols to map
uppercase and titlecase code points to their lowercase equivalents
when strings conforming to the FreeformClass, or a profile thereof,
are used in passwords; instead, it is RECOMMENDED to preserve the
case of all code points contained in such strings and then perform
case-sensitive comparison. See also the related discussion in
[I-D.ietf-precis-saslprepbis].
The above is too complicated.
The only realistic way of handling case is to:
1. Decide whether case-insensitive matching is to be done or not.
2. If it is, case fold to lower case before the matching.
3. If a string is transformed, case fold to lower case as part of
the transformation.
4. Do not forget the issues that arise when normalization and case folding are
both applied to the same string.
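A sketch of the four points above, under assumptions of my own choosing: NFC is
used as the normalization form, str.casefold stands in for Unicode Default Case
Folding, and normalization is applied both before and after folding because
case folding does not by itself preserve normalization forms. The names
match_key and matches are illustrative.

```python
import unicodedata

def match_key(s: str) -> str:
    # Normalize, case fold, then normalize again (point 4).
    nfc = unicodedata.normalize("NFC", s)
    return unicodedata.normalize("NFC", nfc.casefold())

def matches(a: str, b: str, case_insensitive: bool) -> bool:
    if case_insensitive:                      # points 1 and 2
        return match_key(a) == match_key(b)
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Case folding alone does not unify composed and decomposed forms:
print("\u00c9".casefold() == "E\u0301".casefold())   # False
print(matches("\u00c9", "e\u0301", True))            # True
print(matches("\u00c9", "\u00e9", False))            # False
```

For point 3, a destructive transformation would store match_key(s) instead of
s, after which the original case can no longer be recovered.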
5.2.5. Directionality Rule
The directionality rule of a profile specifies how to treat strings
containing left-to-right (LTR) and right-to-left (RTL) characters
(see Unicode Standard Annex #9 [UAX9]). A profile usually specifies
a directionality rule that restricts strings to be entirely LTR
strings or entirely RTL strings and defines the allowable sequences
of characters in LTR and RTL strings. Possible rules include, but
are not limited to, (a) considering any string that contains a right-
to-left code point to be a right-to-left string, or (b) applying the
"Bidi Rule" from [RFC5893].
One cannot restrict strings to only LTR or RTL, as some code points are neutral
regarding directionality.
See RFC5893. This was one of the mistakes in IDNA2003.
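To make the point concrete, the sketch below lists the Bidi_Class values found
in a few strings, using Python's unicodedata.bidirectional (the helper name
bidi_classes is made up here). Even a plain ASCII string contains code points
that are neither L nor R.

```python
import unicodedata

def bidi_classes(s: str) -> set:
    # Collect the Bidi_Class of every code point in the string.
    return {unicodedata.bidirectional(ch) for ch in s}

print(sorted(bidi_classes("abc")))            # ['L']
print(sorted(bidi_classes("abc 123!")))       # ['EN', 'L', 'ON', 'WS']
print(sorted(bidi_classes("\u05d0\u05d1")))   # ['R']  (Hebrew alef, bet)
```

Here EN (European number), WS (whitespace) and ON (other neutral) are not
strongly LTR or RTL, so a rule phrased as "entirely LTR or entirely RTL" has to
say how such code points are treated.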
Mixed-direction strings are not directly supported by the PRECIS
framework itself, since there is currently no widely accepted and
implemented solution for the safe display of mixed-direction strings.
Define "mixed-direction strings", or else the text will be confusing.
username = userpart *(1*SP userpart)
userpart = 1*(idbyte)
;
; an "idbyte" is a byte used to represent a
; UTF-8 encoded Unicode code point that can be
; contained in a string that conforms to the
; PRECIS "IdentifierClass"
;
Do not talk about "byte" here but instead about "character". That is, in the grammar, talk
about Unicode code points. How the Unicode string is then encoded (for example as UTF-8) is a
different issue.
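The difference is easy to see in a short example (the sample string is
arbitrary): the code points are a property of the string, while the byte count
depends on the encoding chosen afterwards.

```python
s = "J\u00f6rg"   # "Jörg"

# The abstract string: a sequence of Unicode code points.
code_points = [f"U+{ord(ch):04X}" for ch in s]
print(code_points)          # ['U+004A', 'U+00F6', 'U+0072', 'U+0067']

# One possible encoding of that string.
utf8 = s.encode("utf-8")
print(len(s), len(utf8))    # 4 5
```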
6. Order of Operations
To ensure proper comparison, the rules specified for a particular
string class or profile MUST be applied in the following order:
It must, as I said at the start, be clear when these operations take
place. Is it a (destructive) transformation done somewhere (application, client
side, server side) or just part of a matching algorithm?
8.3. IgnorableProperties (C)
This category is defined in Section 2.3 of [RFC5892] but is not used
in PRECIS.
Note: See the "PrecisIgnorableProperties (M)" category below for a
more inclusive category used in PRECIS identifiers.
See comments above.
8.7. BackwardCompatible (G)
This category is defined in Section 2.7 of [RFC5892] and is included
by reference for use in PRECIS.
Note: Because of how the PRECIS string classes are defined, only
changes that would result in code points being added to or removed
from the LetterDigits ("A") category would result in backward-
incompatible modifications to code point assignments.
Are you sure? I am not.
Therefore,
management of this category is handled via the processes specified in
[RFC5892]. At the time of this writing (and also at the time that
RFC 5892 was published), this category consisted of the empty set;
however, that is subject to change as described in RFC 5892.
This, on the other hand, is true.
8.13. PrecisIgnorableProperties (M)
This PRECIS-specific category is used to group code points that are
discouraged from use in PRECIS string classes.
M: Default_Ignorable_Code_Point(cp) = True or
Noncharacter_Code_Point(cp) = True
The definition for Default_Ignorable_Code_Point can be found in the
DerivedCoreProperties.txt [2] file, and at the time of Unicode 7.0 is
as follows:
Other_Default_Ignorable_Code_Point
+ Cf (Format characters)
+ Variation_Selector
- White_Space
- FFF9..FFFB (Annotation Characters)
- 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters
that should be visible)
I would be very nervous about having explicit code points in a generic rule like
this. If the code points are to be listed explicitly, add them to an exception
rule. Otherwise *this* rule has to be changed whenever future changes
require further exceptions to be added.
8.14. Spaces (N)
This PRECIS-specific category is used to group code points that are
space characters.
N: General_Category(cp) is in {Zs}
I am still thinking about how "spaces" are handled here. I will check and think more
when I look at the mapping documents, specifically with regard to Arabic and other
similar scripts. Just doing a destructive transformation to U+0020 does not always work, I
think.
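For reference, this is roughly what the category and a blanket mapping to
U+0020 look like in code. The function names are assumptions, and the sketch
only shows the mechanics; whether such a mapping is appropriate for every
script is exactly the open question above.

```python
import unicodedata

def is_space_n(ch: str) -> bool:
    # Category N as quoted above: General_Category(cp) is in {Zs}
    return unicodedata.category(ch) == "Zs"

def map_spaces(s: str) -> str:
    # Destructive: the original space code point cannot be recovered.
    return "".join(" " if is_space_n(ch) else ch for ch in s)

print(map_spaces("a\u00a0b\u3000c"))     # a b c
print(is_space_n("\u200c"))              # False: ZERO WIDTH NON-JOINER is Cf
print(map_spaces("a\u200cb") == "a\u200cb")  # True: left untouched
```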
8.17. HasCompat (Q)
This PRECIS-specific category is used to group code points that have
compatibility equivalents as explained in Chapter 2 and Chapter 3 of
the Unicode Standard [Unicode7.0].
Q: toNFKC(cp) != cp
The toNFKC() operation returns the code point in normalization form
KC. For more information, see Section 5 of Unicode Standard Annex
#15 [UAX15].
Think about the implications of this together with case folding (or not).
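A small sketch of the HasCompat test as quoted (toNFKC(cp) != cp), written with
Python's unicodedata.normalize; the function name has_compat is an assumption.
The last lines show one interaction with case folding: the NFKC result of a
code point can itself still be changed by case folding.

```python
import unicodedata

def has_compat(cp: int) -> bool:
    # Category Q as quoted above: toNFKC(cp) != cp
    s = chr(cp)
    return unicodedata.normalize("NFKC", s) != s

print(has_compat(0xFF21))     # FULLWIDTH LATIN CAPITAL LETTER A -> True
print(has_compat(ord("A")))   # False

nfkc = unicodedata.normalize("NFKC", "\uff21")
print(nfkc, nfkc.casefold())  # A a
```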