[precis] Review of current document draft-ietf-precis-framework-20

Patrik Fältström Sun, 23 Nov 2014 06:59:11 -0800

At the request of the chairs I have reviewed three PRECIS documents. Here
is the review of draft-ietf-precis-framework-20.


The review has focused on the use of Unicode and the relationship with IDNA2008.

You see my comments below.

   Best, Patrik Fältström

> draft-ietf-precis-framework-20
> 
> Abstract
> 
>    Application protocols using Unicode characters in protocol strings
>    need to properly handle such strings in order to enforce
>    internationalization rules for strings placed in various protocol
>    slots (such as addresses and identifiers) and to perform valid
>    comparison operations (e.g., for purposes of authentication or
>    authorization).  This document defines a framework enabling
>    application protocols to perform the preparation, enforcement, and
>    comparison of internationalized strings ("PRECIS") in a way that
>    depends on the properties of Unicode characters and thus is agile
>    with respect to versions of Unicode.  As a result, this framework
>    provides a more sustainable approach to the handling of
>    internationalized strings than the previous framework, known as
>    Stringprep (RFC 3454).  This document obsoletes RFC 3454.

Explanation on what my review is concentrating on:

When using a character set like Unicode, where strings are subject to
transformations, comparisons etc., the actual transformation can happen in
multiple locations in the architecture. What is needed is for applications to
understand what format received strings are in and what format strings are
expected to be in when they are sent. Of course the principle of "be liberal in
what you accept, and conservative in what you send" is very important.

A simplified sketch of the architecture is as follows:

1. [A] sends a string to [B] for storage

2. [C] sends a string to [B] for lookup, which implies that a matching
algorithm is applied

3. If there is a match, data is sent back to [C]

In this very simple way of looking at the issues the questions include:

A. What Unicode code points should [A] accept as input?

B. What transformation is [A] expected to do?

C. What transformation is [A] allowed to do?

D. What Unicode code points is [A] allowed to send to [B]?

E. What Unicode code points can [B] expect from [A]?

F. What transformation is [B] expected to do on data sent from [A] before data 
is stored in the database?

:
:

I.e. it must be as clear as possible for each one of the parties A, B and C 
what they are expected to do, what they must (and must not) do.

And of course ultimately the trouble is that the Unicode Character Set is
designed in such a way that there are many equivalences that humans in various
contexts expect to be treated as "the same". Which equivalences those are of
course differs between contexts.

Now to the review...

>    4.  Enable application protocols to define profiles of the PRECIS
>        string classes if necessary (addressing matters such as width
>        mapping, case mapping, Unicode normalization, and directionality)
>        but strongly discourage the multiplication of profiles beyond
> 
> 
> 
> 
> Saint-Andre & Blanchet    Expires May 25, 2015                  [Page 4]
> 
> Internet-Draft              PRECIS Framework               November 2014
> 
> 
> 
>        necessity in order to avoid violations of the Principle of Least
>        User Astonishment.

It must also be clear who has the responsibility to do whatever transformations
are needed.

>    It is expected that this framework will yield the following benefits:
> 
>    o  Application protocols will be agile with regard to Unicode
>       versions.
> 
>    o  Implementers will be able to share code point tables and software
>       code across application protocols, most likely by means of
>       software libraries.
> 
>    o  End users will be able to acquire more accurate expectations about
>       the characters that are acceptable in various contexts.  Given
>       this more uniform set of string classes, it is also expected that
>       copy/paste operations between software implementing different
>       application protocols will be more predictable and coherent.

It must also be clear to everyone involved what the normative, authoritative
source is for what is allowed and what is not.

For IDNA2008 (for example) it is the _algorithm_ that is normative, not any
tables derived by applying the algorithm to a specific version of the Unicode
Character Set.
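
As an illustration of the difference between a normative algorithm and a
table derived from it, here is a minimal Python sketch. It borrows the rule the
draft gives for its Spaces (N) category, General_Category(cp) in {Zs}, as the
example algorithm. The function names `is_space` and `derive_table` are
hypothetical, and the derived table reflects whatever Unicode version the local
`unicodedata` module ships with, whereas the rule itself stays the same.

```python
import unicodedata


def is_space(cp: int) -> bool:
    """The rule (the normative part): General_Category(cp) is in {Zs}."""
    return unicodedata.category(chr(cp)) == "Zs"


def derive_table() -> list[int]:
    """A derived, non-normative table: the rule applied to every code point."""
    return [cp for cp in range(0x110000) if is_space(cp)]


table = derive_table()
# The table is only valid for the Unicode version it was derived from.
print("Unicode", unicodedata.unidata_version)
print([f"U+{cp:04X}" for cp in table])
```

Rerunning the derivation against a newer Unicode version may give a different
table without any change to the rule.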

>    When an application applies a profile of a PRECIS string class, it
>    can achieve the following objectives:
> 
>    a.  Determine if a given string conforms to the profile, thus
>        enabling enforcement of the rules (e.g., to determine if a string
>        is allowed for use in the relevant protocol slot specified by an
>        application protocol).
> 
>    b.  Determine if any two given strings are equivalent, thus enabling
>        comparision (e.g., to make an access decision for purposes of
>        authentication or authorization as further described in
>        [RFC6943]).

And of course a further objective is applying a transformation to a received
string before it is passed on to the next step of whatever process the
application is participating in. In this case, the string is in one form before
the transformation and another form after it. It must also be clear to everyone
involved that the transformations applied are very seldom (if at all)
reversible. This is specifically the case for case folding transformations.
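
A small Python sketch of why case folding cannot be reversed, using the
built-in `str.casefold()` (full Unicode case folding). The example strings are
only illustrations: two different inputs fold to the same output, so the
original cannot be recovered from the folded form.

```python
original_a = "Stra\u00dfe"   # "Straße", with U+00DF LATIN SMALL LETTER SHARP S
original_b = "STRASSE"

folded_a = original_a.casefold()
folded_b = original_b.casefold()

# Both inputs collapse to the same folded string, so the mapping has no inverse.
print(folded_a, folded_b, folded_a == folded_b)
```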

> 3.  Preparation, Enforcement, and Comparison
> 
> 
>    This document distinguishes between three different actions that an
>    entity can take with regard to a string:
> 
>    o  Enforcement entails applying all of the rules specified for a
>       particular string class or profile thereof to an individual
>       string, for the purpose of determining if the string can be used
>       in a given protocol slot.
> 
>    o  Comparison entails applying all of the rules specified for a
>       particular string class or profile thereof to two separate
>       strings, for the purpose of determining if the two strings are
>       equivalent.

In fact I think "comparison" entails three steps:

1. Apply all transformation and enforcement on string A.

2. Apply all transformation and enforcement on string B.

3. Compare strings A and B Unicode code point by code point. Only if all
code points are the same is the result a positive match.
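
The three steps above can be sketched in Python as follows. This is only a
sketch under stated assumptions: `enforce` is a hypothetical stand-in for
whatever a real profile specifies, here assumed to be NFC normalization plus a
made-up rule that rejects code points whose General_Category starts with "C".
It is not the PRECIS rule set.

```python
import unicodedata


def enforce(s: str) -> str:
    """Hypothetical transformation and enforcement for one string."""
    s = unicodedata.normalize("NFC", s)  # assumed transformation
    for ch in s:
        # Assumed enforcement rule: reject control, format, unassigned, etc.
        if unicodedata.category(ch).startswith("C"):
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return s


def equivalent(a: str, b: str) -> bool:
    a_enforced = enforce(a)  # step 1
    b_enforced = enforce(b)  # step 2
    # Step 3: Python compares str values code point by code point.
    return a_enforced == b_enforced


print(equivalent("\u00e9", "e\u0301"))  # precomposed vs. decomposed "é"
print(equivalent("a", "A"))             # no case mapping in this sketch
```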

>    o  Preparation entails only ensuring that the characters in an
>       individual string are allowed by the underlying PRECIS string
>       class.

I think the idea with "preparation" is to apply certain transformations and
then, after transformation, ensure that all characters are allowed in the
context in which they appear, so that the final string after the preparation
step is a valid PRECIS string?

I would recommend explicitly mentioning the fact that (destructive)
transformation might occur in this step.

>    In most cases, authoritative entities such as servers are responsible
>    for enforcement, whereas subsidiary entities such as clients are
>    responsible only for preparation.  The rationale for this distinction
>    is that clients might not have the facilities (in terms of device
>    memory and processing power) to enforce all the rules regarding
>    internationalized strings (such as width mapping and Unicode
>    normalization), although they can more easily limit the repertoire of
>    characters they offer to an end user.  By contrast, it is assumed
>    that a server would have more capacity to enforce the rules, and in
>    any case acts as an authority regarding allowable strings in protocol
>    slots such as addresses and endpoint identifiers.  In addition, a
>    client cannot necessarily be trusted to properly generate such
>    strings, especially for security-sensitive contexts such as
>    authentication and authorization.

This paragraph is very vague. I think the protocol needs a much stricter
specification of who is expected to do what. This is because the protocol
itself (for example between client and server) must be robust enough to carry
whatever code points the client is using.

>    Valid:  Defines which code points and character categories are
>       treated as valid input to the string.

The term "input" is not clear to me, given that transformation might occur.

>    Disallowed:  Defines which code points and character categories need
>       to be excluded from the string.

It is a bit confusing to talk about both categories and code points at the same
time. I would recommend that at this point the document talk only about which
code points are disallowed. The reason is that a category might be disallowed
while a code point of that category is allowed (based on other rules, like
exceptions). To make it crystal clear what is disallowed, I recommend using
that term only for code points.

> 4.2.1.  Valid
> 
> 
>    o  Code points traditionally used as letters and numbers in writing
>       systems, i.e., the LetterDigits ("A") category first defined in
>       [RFC5892] and listed here under Section 8.1.
> 
>    o  Code points in the range U+0021 through U+007E, i.e., the
>       (printable) ASCII7 ("K") rule defined under Section 8.11.  These
>       code points are "grandfathered" into PRECIS and thus are valid
>       even if they would otherwise be disallowed according to the
>       property-based rules specified in the next section.
> 
>       Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
>       category from IDNA2008, the range of characters allowed in the
>       IdentifierClass is wider than the range of characters allowed in
>       IDNA2008.  The main reason is that IDNA2008 applies the Unstable
>       category before the LetterDigits category, thus disallowing
>       uppercase characters, whereas the IdentifierClass does not apply
>       the Unstable category.

You must remove the code points of class ("C") in RFC5892.

Or to state things differently: if one looks at the Unicode tables, the
following combinations exist for code points that match category "A" and at
least one more category, for Unicode 7.0.0:

AB
ABC
ABF
AC
ACI
AD
AE
AF
AI

Several of these combinations are valid under this definition, yet I would not
say they are recommended for use in identifiers.

Further, regarding not applying the Unstable category: this implies it is
allowed to use code points in PRECIS that are not stable under normalization
and/or case folding. The normalization and/or case folding must still be done
somewhere before matching happens.

Let's say, for example, that "A" and "a" are both valid (which they would be).
The question is then whether there is case mapping before comparison or not, and
if there is, it must be ensured that the two identities "A" and "a" are not both
created in some namespace.
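
To make the "A" versus "a" concern concrete, here is a minimal Python sketch of
a registry that applies the same case folding at registration time as at
lookup time. The `Namespace` class and its methods are hypothetical, and
`str.casefold()` is only an assumed stand-in for the profile's case mapping
rule.

```python
from typing import Optional


class Namespace:
    """Hypothetical identifier registry with case-insensitive comparison."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def register(self, identifier: str) -> None:
        key = identifier.casefold()  # same mapping as used for comparison
        if key in self._by_key:
            raise ValueError(
                f"{identifier!r} collides with {self._by_key[key]!r}"
            )
        self._by_key[key] = identifier

    def lookup(self, identifier: str) -> Optional[str]:
        return self._by_key.get(identifier.casefold())


ns = Namespace()
ns.register("A")
print(ns.lookup("a"))  # finds the identity registered as "A"
```

If registration skipped the mapping but lookup applied it, both "A" and "a"
could be created and a lookup would be ambiguous.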

I know this is exactly what you have been talking about and discussing, but it
must be absolutely crystal clear that everyone understands what this implies,
specifically when case folding (lower case) is replaced with NFC or some other
normalization algorithm.

More about this later.

> 4.2.3.  Disallowed

See above.

>    Some application technologies need strings that can be used in a
>    free-form way, e.g., as a password in an authentication exchange (see
>    [I-D.ietf-precis-saslprepbis]) or a nickname in a chatroom (see
>    [I-D.ietf-precis-nickname]).  We group such things into a class
>    called "FreeformClass" having the following features.
> 
>       Security Warning: As mentioned, the FreeformClass prioritizes
>       expressiveness over safety; Section 11.3 describes some of the
>       security hazards involved with using or profiling the
>       FreeformClass.
> 
>       Security Warning: Consult Section 11.6 for relevant security
>       considerations when strings conforming to the FreeformClass, or a
>       profile thereof, are used as passwords.

There are very dangerous issues here when using this class for any kind of
comparison, specifically in the case of passwords and user names (or file names)
where it is unclear what kind of normalization might happen between "the
keyboard" and "the application". I.e. the user might really, really think they
entered a certain code point, but in reality what the application sees is either
NFC(string) or NFD(string), and which one may vary with the operating system (or
file system) in use. This is specifically a problem when this is left undefined.

I am all in favor of leaving this undefined for this class, but then it might
not be best to do any kind of matching (including searching), unless some
kind of transformation is made for the matching/searching.
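
A short Python sketch of the hazard, using the standard `unicodedata` module.
The example string is only an illustration: the same visible text in NFC and
NFD is a different sequence of code points (and bytes), so a comparison without
a defined normalization step fails.

```python
import unicodedata

typed = "F\u00e4ltstr\u00f6m"  # "Fältström" as the user believes it was typed

nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

print(nfc == nfd)                                   # naive comparison
print(len(nfc), len(nfd))                           # code point counts differ
print(unicodedata.normalize("NFC", nfd) == nfc)     # equal once normalized
```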

I would recommend the following general rules:

- IdentifierClass is used wherever it is important that the namespace include
only globally unique strings, like identifiers for user names etc

- IdentifierClass is also used for passwords and whenever a comparison is
used, but the transformation should not be destructive.

- FreeformClass is used for storage of various things

- Protocols must be stable for FreeformClass in the transport

> 4.3.1.  Valid

See comments above regarding combination of A with other categories.

> 5.1.  Profiles Must Not Be Multiplied Beyond Necessity
> 
> 
>    The risk of profile proliferation is significant because having too
>    many profiles will result in different behavior across various
>    applications, thus violating what is known in user interface design
>    as the Principle of Least Astonishment.
> 
>    Indeed, we already have too many profiles.  Ideally we would have at
>    most two or three profiles.  Unfortunately, numerous application
>    protocols exist with their own quirks regarding protocol strings.
>    Domain names, email addresses, instant messaging addresses, chatroom
>    nicknames, filenames, authentication identifiers, passwords, and
>    other strings are already out there in the wild and need to be
>    supported in existing application protocols such as DNS, SMTP, XMPP,
>    IRC, NFS, iSCSI, EAP, and SASL among others.
> 
>    Nevertheless, profiles must not be multiplied beyond necessity.
> 
>    To help prevent profile proliferation, this document recommends
>    sensible defaults for the various options offered to profile creators
>    (such as width mapping and Unicode normalization).  In addition, the
>    guidelines for designated experts provided under Section 9 are meant
>    to encourage a high level of due diligence regarding new profiles.

What are the requirements to create a new Profile?

Either there are requirements or there are not. The text above does not help
much if there is a conflict in the future regarding a request for registration
of a new Profile. Are you really happy with what is above?

The WG must honestly say "yes" to this. If they do, I am happy! :-)

> 5.2.1.  Width Mapping Rule
> 
> 
>    The width mapping rule of a profile specifies whether width mapping
>    is performed on fullwidth and halfwidth characters, and how the
>    mapping is done.  Typically such mapping consists of mapping
>    fullwidth and halfwidth characters, i.e., code points with a
>    Decomposition Type of Wide or Narrow, to their decomposition
>    mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
>    mapped to DIGIT ZERO (U+0030).
> 
>    The normalization form specified by a profile (see below) has an
>    impact on the need for width mapping.  Because width mapping is
>    performed as a part of compatibility decomposition, a profile
>    employing either normalization form KD (NFKD) or normalization form
>    KC (NFKC) does not need to specify width mapping.  However, if
>    Unicode normalization form C (NFC) is used (as is recommended) then
>    the profile needs to specify whether to apply width mapping; in this
>    case, width mapping is in general RECOMMENDED because allowing
>    fullwidth and halfwidth characters to remain unmapped to their
>    compatibility variants would violate the Principle of Least
>    Astonishment.  For more information about the concept of width in
>    East Asian scripts within Unicode, see Unicode Standard Annex #11
>    [UAX11].

Doing this mapping is not easy. I strongly recommend that an algorithm be
presented, which a profile then either uses or does not use.
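
As one possible shape for such an algorithm, here is a minimal Python sketch
based on the draft's own description: code points with a Decomposition Type of
Wide or Narrow are replaced by their decomposition mappings. The function name
`width_map` is hypothetical, and the sketch relies on
`unicodedata.decomposition()`, which returns strings such as `<wide> 0030`.

```python
import unicodedata


def width_map(s: str) -> str:
    """Map fullwidth/halfwidth code points to their decomposition mappings."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)  # "" if there is no mapping
        if decomp.startswith(("<wide>", "<narrow>")):
            # Drop the "<wide>"/"<narrow>" tag and decode the hex code points.
            mapped = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
            out.append(mapped)
        else:
            out.append(ch)  # everything else is left untouched
    return "".join(out)


print(width_map("\uff10"))  # FULLWIDTH DIGIT ZERO
```

Unlike NFKC, this touches only wide and narrow variants and leaves other
compatibility characters (ligatures, for example) as they are.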

> 5.2.3.  Case Mapping Rule
> 
> 
>    The case mapping rule of a profile specifies whether case mapping is
>    performed (instead of case preservation) on uppercase and titlecase
>    characters, and how the mapping is done (e.g., mapping uppercase and
>    titlecase characters to their lowercase equivalents).

You either apply mapping or not, to _all_ code points. The above makes it
look as if case mapping is sometimes not to be performed.

>    If case mapping is desired (instead of case preservation), it is
>    RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
>    3 of the Unicode Standard [Unicode7.0].
> 
>       Note: Unicode Default Case Folding is not designed to handle
>       various localization issues (such as so-called "dotless i" in
>       several Turkic languages).  The PRECIS mappings document
>       [I-D.ietf-precis-mappings] describes these issues in greater
>       detail and defines a "local case mapping" method that handles some
>       locale-dependent and context-dependent mappings.
> 
>    In order to maximize entropy and minimize the potential for false
>    positives, it is NOT RECOMMENDED for application protocols to map
>    uppercase and titlecase code points to their lowercase equivalents
>    when strings conforming to the FreeformClass, or a profile thereof,
>    are used in passwords; instead, it is RECOMMENDED to preserve the
>    case of all code points contained in such strings and then perform
>    case-sensitive comparison.  See also the related discussion in
>    [I-D.ietf-precis-saslprepbis].

The above is too complicated.

The only realistic way of handling casing is to:

1. Decide whether there is case insensitive matching to be done or not

2. If it is, case fold to lower case before the matching

3. If transformation of a string is made, case fold to lower case as part of 
the transformation

4. Do not forget the issues that arise when normalization and case folding are
both applied to the same string
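
The decision in steps 1 to 4 can be sketched in Python as below. The names
`match_key` and `matches` are hypothetical. To deal with step 4, the sketch
follows the shape of the Unicode Standard's canonical caseless matching,
NFD(toCasefold(NFD(X))), which is a swapped-in technique for illustration and
not something the draft specifies.

```python
import unicodedata


def match_key(s: str, case_insensitive: bool) -> str:
    """Build the string that is actually compared."""
    s = unicodedata.normalize("NFD", s)
    if case_insensitive:
        # Case folding can produce non-normalized output, so normalize again.
        s = unicodedata.normalize("NFD", s.casefold())
    return s


def matches(a: str, b: str, case_insensitive: bool) -> bool:
    return match_key(a, case_insensitive) == match_key(b, case_insensitive)


print(matches("Stra\u00dfe", "STRASSE", case_insensitive=True))
print(matches("\u00c5", "a\u030a", case_insensitive=True))
print(matches("\u00c5", "a\u030a", case_insensitive=False))
```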

> 5.2.5.  Directionality Rule
> 
> 
>    The directionality rule of a profile specifies how to treat strings
>    containing left-to-right (LTR) and right-to-left (RTL) characters
>    (see Unicode Standard Annex #9 [UAX9]).  A profile usually specifies
>    a directionality rule that restricts strings to be entirely LTR
>    strings or entirely RTL strings and defines the allowable sequences
>    of characters in LTR and RTL strings.  Possible rules include, but
>    are not limited to, (a) considering any string that contains a right-
>    to-left code point to be a right-to-left string, or (b) applying the
>    "Bidi Rule" from [RFC5893].

One cannot restrict strings to be entirely LTR or entirely RTL, as some code
points are neutral with regard to directionality.

See RFC5893. This was one of the mistakes in IDNA2003.
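
A quick way to see this is to list the Bidi_Class of each code point with
Python's `unicodedata.bidirectional()`. The helper name `bidi_classes` and the
sample string are only illustrative.

```python
import unicodedata


def bidi_classes(s: str) -> list[str]:
    """Return the Bidi_Class of every code point in the string."""
    return [unicodedata.bidirectional(ch) for ch in s]


# Latin letter, digit, hyphen-minus, Hebrew letter ALEF, space
sample = "a1-\u05d0 "
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```

Only the letters come out as strong L or R here. The digit, the hyphen and the
space have other classes, so a rule phrased as "entirely LTR or entirely RTL"
still has to say what happens to them.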

>    Mixed-direction strings are not directly supported by the PRECIS
>    framework itself, since there is currently no widely accepted and
>    implemented solution for the safe display of mixed-direction strings.

Define Mixed-Direction strings or else the text will be confusing.

>       username   = userpart *(1*SP userpart)
>       userpart   = 1*(idbyte)
>                    ;
>                    ; an "idbyte" is a byte used to represent a
>                    ; UTF-8 encoded Unicode code point that can be
>                    ; contained in a string that conforms to the
>                    ; PRECIS "IdentifierClass"
>                    ;

Do not talk about "byte" here but instead "character". So, in the grammar, talk
about Unicode code points. How the Unicode string is then encoded (for example
UTF-8) is a different issue.
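
The distinction is easy to show in Python, where a `str` is a sequence of code
points and the encoding is a separate step. The sample character is only an
illustration.

```python
s = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, one code point

code_points = [f"U+{ord(ch):04X}" for ch in s]
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-be")

print(code_points)          # what a grammar over code points sees
print(len(utf8), utf8)      # UTF-8 representation of the same string
print(len(utf16), utf16)    # UTF-16 representation of the same string
```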

> 6.  Order of Operations
> 
> 
>    To ensure proper comparison, the rules specified for a particular
>    string class or profile MUST be applied in the following order:

As I said at the start, it must be clear when these operations take place. Is
it a (destructive) transformation done somewhere (application, client side,
server side) or just something that is part of a matching algorithm?

> 8.3.  IgnorableProperties (C)
> 
> 
>    This category is defined in Secton 2.3 of [RFC5892] but is not used
>    in PRECIS.
> 
>    Note: See the "PrecisIgnorableProperties (M)" category below for a
>    more inclusive category used in PRECIS identifiers.

See comments above.

> 8.7.  BackwardCompatible (G)
> 
> 
>    This category is defined in Secton 2.7 of [RFC5892] and is included
>    by reference for use in PRECIS.
> 
>    Note: Because of how the PRECIS string classes are defined, only
>    changes that would result in code points being added to or removed
>    from the LetterDigits ("A") category would result in backward-
>    incompatible modifications to code point assignments.

Are you sure? I am not.

> Therefore,
>    management of this category is handled via the processes specified in
>    [RFC5892].  At the time of this writing (and also at the time that
>    RFC 5892 was published), this category consisted of the empty set;
>    however, that is subject to change as described in RFC 5892.

This, on the other hand, is true.

> 8.13.  PrecisIgnorableProperties (M)
> 
> 
>    This PRECIS-specific category is used to group code points that are
>    discouraged from use in PRECIS string classes.
> 
>    M: Default_Ignorable_Code_Point(cp) = True or
>       Noncharacter_Code_Point(cp) = True
> 
>    The definition for Default_Ignorable_Code_Point can be found in the
>    DerivedCoreProperties.txt [2] file, and at the time of Unicode 7.0 is
>    as follows:
> 
>      Other_Default_Ignorable_Code_Point
>    + Cf (Format characters)
>    + Variation_Selector
>    - White_Space
>    - FFF9..FFFB (Annotation Characters)
>    - 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters
>                                     that should be visible)

I would be very nervous about having explicit code points in a generic rule like
this. If the code points are to be listed explicitly, add them to an exception
rule. Otherwise *this* rule has to be changed whenever future changes require
further exceptions to be added.

> 8.14.  Spaces (N)
> 
> 
>    This PRECIS-specific category is used to group code points that are
>    space characters.
> 
>    N: General_Category(cp) is in {Zs}

I am still thinking about how "spaces" are handled here. I will check and think
more when I look at the mapping documents, specifically with regard to Arabic
and other similar scripts. Just doing a destructive transformation to U+0020
does not always work, I think.

> 8.17.  HasCompat (Q)
> 
> 
>    This PRECIS-specific category is used to group code points that have
>    compatibility equivalents as explained in Chapter 2 and Chapter 3 of
>    the Unicode Standard [Unicode7.0].
> 
>    Q: toNFKC(cp) != cp
> 
>    The toNFKC() operation returns the code point in normalization form
>    KC.  For more information, see Section 5 of Unicode Standard Annex
>    #15 [UAX15].

Think about implications of this together with case folding (or not).
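
One such implication can be shown with a small Python sketch. `has_compat` is
a hypothetical helper that implements the quoted rule Q with
`unicodedata.normalize`, and the fullwidth letter is only an example: case
folding a HasCompat code point can yield another HasCompat code point, so it
matters at which stage each operation is applied.

```python
import unicodedata


def has_compat(ch: str) -> bool:
    """Q: toNFKC(cp) != cp"""
    return unicodedata.normalize("NFKC", ch) != ch


ch = "\uff21"            # FULLWIDTH LATIN CAPITAL LETTER A
folded = ch.casefold()   # FULLWIDTH LATIN SMALL LETTER A

print(has_compat(ch), has_compat(folded))
# Case folding alone does not remove the compatibility distinction.
print(unicodedata.normalize("NFKC", folded))
```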
