Thanks, Patrik. I shall consider your feedback in detail over the next few days.

Peter

On 11/23/14, 7:58 AM, Patrik Fältström wrote:
I have, on request from the chairs, done a review of three PRECIS documents. Here 
is the review of draft-ietf-precis-framework-20.

The review has focused on the use of Unicode and the relationship with IDNA2008.

You will find my comments below.

    Best, Patrik Fältström

draft-ietf-precis-framework-20

Abstract

    Application protocols using Unicode characters in protocol strings
    need to properly handle such strings in order to enforce
    internationalization rules for strings placed in various protocol
    slots (such as addresses and identifiers) and to perform valid
    comparison operations (e.g., for purposes of authentication or
    authorization).  This document defines a framework enabling
    application protocols to perform the preparation, enforcement, and
    comparison of internationalized strings ("PRECIS") in a way that
    depends on the properties of Unicode characters and thus is agile
    with respect to versions of Unicode.  As a result, this framework
    provides a more sustainable approach to the handling of
    internationalized strings than the previous framework, known as
    Stringprep (RFC 3454).  This document obsoletes RFC 3454.

Explanation on what my review is concentrating on:

When using a character set like Unicode, where things like transformation and comparison 
are needed, the actual transformation can happen in multiple locations in the architecture. 
What is needed is for applications to understand what format received strings are in, and 
what format strings are expected to be in when they are sent. Of course the principle of 
"be liberal in what you accept, and conservative in what you send" is very important.

A simplified sketch of the architecture is as follows:

1. [A] sends a string to [B] for storage

2. [C] sends a string to [B] for lookup, which implies that a matching algorithm 
is applied

3. If there is a match, data is sent back to [C]

In this very simple way of looking at the issues the questions include:

A. What Unicode code points should [A] accept as input?

B. What transformation is [A] expected to do?

C. What transformation is [A] allowed to do?

D. What Unicode code points is [A] allowed to send to [B]?

E. What Unicode code points can [B] expect from [A]?

F. What transformation is [B] expected to do on data sent from [A] before data 
is stored in the database?

:
:

I.e., it must be as clear as possible, for each one of the parties A, B, and C, 
what they are expected to do and what they must (and must not) do.
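
As a minimal sketch (Python, standard library only) of why these questions matter, 
assume [B] keys its database on some canonicalize() function; the function name and 
the choice of where it runs are my assumptions, not anything the draft specifies:

    # Hypothetical sketch only: the point is to ask *where* canonicalize() runs.
    import unicodedata

    def canonicalize(s):
        # One possible transformation; the profile decides what actually applies.
        return unicodedata.normalize("NFC", s.casefold())

    database = {}                      # [B]'s storage

    def store(raw):                    # step 1: [A] sends a string to [B]
        database[canonicalize(raw)] = raw

    def lookup(query):                 # step 2: [C] sends a string for lookup
        return database.get(canonicalize(query))   # step 3: match -> data back to [C]

    store("Fältström")
    print(lookup("FÄLTSTRÖM"))         # matches only if both sides canonicalize alike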

And of course, ultimately the issue/trouble is that the Unicode Character Set is created 
and designed in such a way that there are many equivalences that humans in various 
contexts expect to be treated as "the same". Which equivalences apply differs, of 
course, between contexts.

Now to the review...

    4.  Enable application protocols to define profiles of the PRECIS
        string classes if necessary (addressing matters such as width
        mapping, case mapping, Unicode normalization, and directionality)
        but strongly discourage the multiplication of profiles beyond
        necessity in order to avoid violations of the Principle of Least
        User Astonishment.

It must also be clear who has the responsibility to do whatever transformations 
are needed.

    It is expected that this framework will yield the following benefits:

    o  Application protocols will be agile with regard to Unicode
       versions.

    o  Implementers will be able to share code point tables and software
       code across application protocols, most likely by means of
       software libraries.

    o  End users will be able to acquire more accurate expectations about
       the characters that are acceptable in various contexts.  Given
       this more uniform set of string classes, it is also expected that
       copy/paste operations between software implementing different
       application protocols will be more predictable and coherent.

It must also be clear to everyone involved what the normative, authoritative 
source is for what is allowed and what is not.

For IDNA2008 (for example) it is the _algorithm_ that is normative, not any 
tables derived from applying the algorithm to a specific version of the 
Unicode Character Set.

    When an application applies a profile of a PRECIS string class, it
    can achieve the following objectives:

    a.  Determine if a given string conforms to the profile, thus
        enabling enforcement of the rules (e.g., to determine if a string
        is allowed for use in the relevant protocol slot specified by an
        application protocol).

    b.  Determine if any two given strings are equivalent, thus enabling
        comparison (e.g., to make an access decision for purposes of
        authentication or authorization as further described in
        [RFC6943]).

And of course there is the case of applying a transformation to a received string 
before it is passed on to the next step of whatever process the application is 
participating in. In this case, the string is in one form before the transformation 
and another form after the transformation. It must also be clear to everyone 
involved that the transformations applied are very seldom (if ever) reversible. 
Specifically, this is the case for case folding transformations.
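
A small illustration of the non-reversibility, using Python's built-in case folding 
(nothing PRECIS-specific is assumed here):

    import unicodedata

    for s in ["Fältström", "Straße", "STRASSE", "İstanbul"]:
        print(s, "->", unicodedata.normalize("NFC", s.casefold()))
    # "Straße" and "STRASSE" both fold to "strasse"; given only the folded
    # form there is no way to recover which original produced it.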

3.  Preparation, Enforcement, and Comparison


    This document distinguishes between three different actions that an
    entity can take with regard to a string:

    o  Enforcement entails applying all of the rules specified for a
       particular string class or profile thereof to an individual
       string, for the purpose of determining if the string can be used
       in a given protocol slot.

    o  Comparison entails applying all of the rules specified for a
       particular string class or profile thereof to two separate
       strings, for the purpose of determining if the two strings are
       equivalent.

In fact I think the "comparison" entails three steps:

1. Apply all transformation and enforcement on string A.

2. Apply all transformation and enforcement on string B.

3. Compare the strings A and B, Unicode character by Unicode character. Only if all 
characters are the same is the result a positive match.
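
A sketch of those three steps (Python; apply_rules() is a placeholder for whatever 
the string class or profile actually specifies):

    import unicodedata

    def apply_rules(s):
        # Placeholder for the full rule set of the class or profile
        # (width mapping, case mapping, normalization, prohibited code points, ...).
        s = unicodedata.normalize("NFC", s.casefold())
        if any(unicodedata.category(ch).startswith("C") for ch in s):
            raise ValueError("disallowed code point")
        return s

    def compare(a, b):
        a2 = apply_rules(a)            # step 1
        b2 = apply_rules(b)            # step 2
        # step 3: code point by code point; only full equality is a match
        return len(a2) == len(b2) and all(x == y for x, y in zip(a2, b2))

    print(compare("Fältström", "FÄLTSTRÖM"))   # True under this placeholder rule set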

    o  Preparation entails only ensuring that the characters in an
       individual string are allowed by the underlying PRECIS string
       class.

I think the idea with "preparation" is to apply certain transformations and, after 
transformation, to ensure that all characters are allowed in the context in which 
they exist, so that the final string after the preparation step is a valid PRECIS 
string?

I would recommend explicitly mentioning the fact that a (destructive) transformation 
might occur in this step.
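
For contrast, preparation as defined in the quoted bullet would be only the membership 
check, roughly as below (Python sketch; is_allowed() stands in for the class's derived 
property rules, and whether a transformation may also happen here is exactly the open 
question):

    import unicodedata

    def is_allowed(ch):
        # Stand-in for the class's code point rules ("Valid" vs "Disallowed").
        return not unicodedata.category(ch).startswith("C")

    def prepare(s):
        # Only check that every code point is allowed; no transformation
        # is applied in this sketch.
        if not all(is_allowed(ch) for ch in s):
            raise ValueError("string contains a disallowed code point")
        return s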

    In most cases, authoritative entities such as servers are responsible
    for enforcement, whereas subsidiary entities such as clients are
    responsible only for preparation.  The rationale for this distinction
    is that clients might not have the facilities (in terms of device
    memory and processing power) to enforce all the rules regarding
    internationalized strings (such as width mapping and Unicode
    normalization), although they can more easily limit the repertoire of
    characters they offer to an end user.  By contrast, it is assumed
    that a server would have more capacity to enforce the rules, and in
    any case acts as an authority regarding allowable strings in protocol
    slots such as addresses and endpoint identifiers.  In addition, a
    client cannot necessarily be trusted to properly generate such
    strings, especially for security-sensitive contexts such as
    authentication and authorization.

This paragraph is very vague. I think the protocol needs a much stricter 
specification of who is expected to do what. This is because the protocol itself 
(that is, for example, between client and server) must be robust enough to carry 
whatever code points the client is using.

    Valid:  Defines which code points and character categories are
       treated as valid input to the string.

The term "input" is not clear to me, given that transformation might occur.

    Disallowed:  Defines which code points and character categories need
       to be excluded from the string.

It is a bit confusing to talk about both categories and code points at the same 
time. I would recommend that, at this point in the document, you talk about which 
code points are disallowed. The reason for this is that you might have a category 
that is disallowed while a code point of that category is allowed (based on other 
rules, like exceptions). To make it crystal clear what is disallowed, I recommend 
using that term only for code points.

4.2.1.  Valid


    o  Code points traditionally used as letters and numbers in writing
       systems, i.e., the LetterDigits ("A") category first defined in
       [RFC5892] and listed here under Section 8.1.

    o  Code points in the range U+0021 through U+007E, i.e., the
       (printable) ASCII7 ("K") rule defined under Section 8.11.  These
       code points are "grandfathered" into PRECIS and thus are valid
       even if they would otherwise be disallowed according to the
       property-based rules specified in the next section.

       Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
       category from IDNA2008, the range of characters allowed in the
       IdentifierClass is wider than the range of characters allowed in
       IDNA2008.  The main reason is that IDNA2008 applies the Unstable
       category before the LetterDigits category, thus disallowing
       uppercase characters, whereas the IdentifierClass does not apply
       the Unstable category.

You must remove the code points of class ("C") in RFC5892.

Or, to state things differently: if one looks at the Unicode tables, the following 
combinations of matches exist, for Unicode 7.0.0, for code points that match 
category "A" and at least one more category:

AB
ABC
ABF
AC
ACI
AD
AE
AF
AI

Several of these combinations are, given this definition, valid, which I would 
not say is recommended for use in identifiers.

Further, regarding not including the Unstable category: this implies that it is 
allowed to use code points in PRECIS that are not stable under normalization and/or 
case folding. The normalization and/or case folding must still be done somewhere 
before matching happens.

Let's for example say that "A" and "a" are both valid (which they would be). The question is then 
whether or not there is case mapping before comparison, and if there is, it must be ensured that the two 
identities "A" and "a" are not both created in some name space.

I know this is exactly what you have been talking about and discussing, but it 
must be absolutely crystal clear that everyone understands what this implies. 
Specifically when case folding (to lower case) is replaced with NFC or some other 
normalization algorithm.
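
One concrete case (Python) of a code point that is valid here but not stable under 
case folding:

    import unicodedata

    s = "\u0130"              # LATIN CAPITAL LETTER I WITH DOT ABOVE
    folded = s.casefold()     # -> U+0069 followed by U+0307 (two code points)
    print([hex(ord(c)) for c in folded])
    # If U+0130 is valid in the class but case folding is applied only in
    # some places, "İstanbul" and "i\u0307stanbul" can end up as two
    # different identities in the same name space.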

More about this later.

4.2.3.  Disallowed

See above.

    Some application technologies need strings that can be used in a
    free-form way, e.g., as a password in an authentication exchange (see
    [I-D.ietf-precis-saslprepbis]) or a nickname in a chatroom (see
    [I-D.ietf-precis-nickname]).  We group such things into a class
    called "FreeformClass" having the following features.

       Security Warning: As mentioned, the FreeformClass prioritizes
       expressiveness over safety; Section 11.3 describes some of the
       security hazards involved with using or profiling the
       FreeformClass.

       Security Warning: Consult Section 11.6 for relevant security
       considerations when strings conforming to the FreeformClass, or a
       profile thereof, are used as passwords.

There are very dangerous issues here when using this class for any kind of comparison. Specifically 
in the case of passwords and user names (or file names), where it is unclear what kind of 
normalization might happen between "the keyboard" and "the application". I.e., the user might 
really think they entered a certain code point, but in reality what the application sees is either 
NFC(string) or NFD(string), and which one may vary with the operating system (or file system) in 
use. Specifically when this is left undefined.

I am all in favor of leaving this undefined for this class, but then it might not 
be the best idea to do any kind of matching (including searching), unless some 
kind of transformation is made for the matching/searching.
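
For example (Python), the same password as seen by the user can reach the application 
in either of two forms:

    import unicodedata

    typed_nfc = "caf\u00e9"       # "café" as one system delivers it
    typed_nfd = "cafe\u0301"      # the same visible string from another system
    print(typed_nfc == typed_nfd)                    # False: raw comparison fails
    print(unicodedata.normalize("NFC", typed_nfc) ==
          unicodedata.normalize("NFC", typed_nfd))   # True only if someone normalizes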

I would recommend the following general rules:

- IdentifierClass is used wherever it is important that the namespace include only 
globally unique strings, like identifiers for user names etc.

- IdentifierClass is also used for passwords and whenever a comparison is 
done, but the transformation should not be destructive.

- FreeformClass is used for storage of various things.

- Protocols must be stable for FreeformClass in the transport.

4.3.1.  Valid

See the comments above regarding combinations of A with other categories.

5.1.  Profiles Must Not Be Multiplied Beyond Necessity


    The risk of profile proliferation is significant because having too
    many profiles will result in different behavior across various
    applications, thus violating what is known in user interface design
    as the Principle of Least Astonishment.

    Indeed, we already have too many profiles.  Ideally we would have at
    most two or three profiles.  Unfortunately, numerous application
    protocols exist with their own quirks regarding protocol strings.

    Domain names, email addresses, instant messaging addresses, chatroom
    nicknames, filenames, authentication identifiers, passwords, and
    other strings are already out there in the wild and need to be
    supported in existing application protocols such as DNS, SMTP, XMPP,
    IRC, NFS, iSCSI, EAP, and SASL among others.

    Nevertheless, profiles must not be multiplied beyond necessity.

    To help prevent profile proliferation, this document recommends
    sensible defaults for the various options offered to profile creators
    (such as width mapping and Unicode normalization).  In addition, the
    guidelines for designated experts provided under Section 9 are meant
    to encourage a high level of due diligence regarding new profiles.

What are the requirements to create a new Profile?

Either there are requirements or there are not. The text above does not help much 
if there is a conflict in the future regarding a request for registration of a new 
Profile. Are you really happy with what is above?

The WG must honestly say "yes" to this. If they do, I am happy! :-)

5.2.1.  Width Mapping Rule


    The width mapping rule of a profile specifies whether width mapping
    is performed on fullwidth and halfwidth characters, and how the
    mapping is done.  Typically such mapping consists of mapping
    fullwidth and halfwidth characters, i.e., code points with a
    Decomposition Type of Wide or Narrow, to their decomposition
    mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
    mapped to DIGIT ZERO (U+0030).

    The normalization form specified by a profile (see below) has an
    impact on the need for width mapping.  Because width mapping is
    performed as a part of compatibility decomposition, a profile
    employing either normalization form KD (NFKD) or normalization form
    KC (NFKC) does not need to specify width mapping.  However, if
    Unicode normalization form C (NFC) is used (as is recommended) then
    the profile needs to specify whether to apply width mapping; in this
    case, width mapping is in general RECOMMENDED because allowing
    fullwidth and halfwidth characters to remain unmapped to their
    compatibility variants would violate the Principle of Least
    Astonishment.  For more information about the concept of width in
    East Asian scripts within Unicode, see Unicode Standard Annex #11
    [UAX11].

Doing this mapping is not easy. I strongly recommend that an algorithm be 
presented which a profile then declares that it either uses or does not use.
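
One way such an algorithm could be expressed (a Python sketch; the helper name is 
mine, and the rule "map exactly the code points whose compatibility decomposition is 
<wide> or <narrow>" is my reading of the paragraph above):

    import unicodedata

    def width_map(s):
        out = []
        for ch in s:
            decomp = unicodedata.decomposition(ch)
            # Map only code points whose compatibility decomposition is
            # of type <wide> or <narrow>; leave everything else alone.
            if decomp.startswith("<wide>") or decomp.startswith("<narrow>"):
                out.append(unicodedata.normalize("NFKC", ch))
            else:
                out.append(ch)
        return "".join(out)

    print(width_map("\uff10\uff21abc"))   # FULLWIDTH "0" and "A" become "0Aabc"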

5.2.3.  Case Mapping Rule


    The case mapping rule of a profile specifies whether case mapping is
    performed (instead of case preservation) on uppercase and titlecase
    characters, and how the mapping is done (e.g., mapping uppercase and
    titlecase characters to their lowercase equivalents).

You either apply case mapping or you do not, to _all_ code points. The above makes 
it sort of look as if case mapping is sometimes not to be performed.

    If case mapping is desired (instead of case preservation), it is
    RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
    3 of the Unicode Standard [Unicode7.0].

       Note: Unicode Default Case Folding is not designed to handle
       various localization issues (such as so-called "dotless i" in
       several Turkic languages).  The PRECIS mappings document
       [I-D.ietf-precis-mappings] describes these issues in greater
       detail and defines a "local case mapping" method that handles some
       locale-dependent and context-dependent mappings.

    In order to maximize entropy and minimize the potential for false
    positives, it is NOT RECOMMENDED for application protocols to map
    uppercase and titlecase code points to their lowercase equivalents
    when strings conforming to the FreeformClass, or a profile thereof,
    are used in passwords; instead, it is RECOMMENDED to preserve the
    case of all code points contained in such strings and then perform
    case-sensitive comparison.  See also the related discussion in
    [I-D.ietf-precis-saslprepbis].

The above is too complicated.

The only realistic way of handling casing is to:

1. Decide whether there is case insensitive matching to be done or not

2. If there is, case fold to lower case before the matching

3. If transformation of a string is made, case fold to lower case as part of 
the transformation

4. Do not forget the issues that arise when normalization and case folding are both 
applied to the same string
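
A minimal sketch of those steps (Python; NFC is used here only because it is the form 
recommended elsewhere in the draft, and the function names are mine):

    import unicodedata

    def transform(s, case_insensitive):
        if case_insensitive:
            s = s.casefold()                      # points 2 and 3
        return unicodedata.normalize("NFC", s)    # point 4: normalization as well

    def match(a, b, case_insensitive=True):       # point 1 is this flag
        return transform(a, case_insensitive) == transform(b, case_insensitive)

    print(match("Fältström", "FÄLTSTRÖM"))        # True with case-insensitive matching
    print(match("Fältström", "FÄLTSTRÖM", False)) # False when case is preserved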

5.2.5.  Directionality Rule


    The directionality rule of a profile specifies how to treat strings
    containing left-to-right (LTR) and right-to-left (RTL) characters
    (see Unicode Standard Annex #9 [UAX9]).  A profile usually specifies
    a directionality rule that restricts strings to be entirely LTR
    strings or entirely RTL strings and defines the allowable sequences
    of characters in LTR and RTL strings.  Possible rules include, but
    are not limited to, (a) considering any string that contains a right-
    to-left code point to be a right-to-left string, or (b) applying the
    "Bidi Rule" from [RFC5893].

One cannot restrict a string to only LTR or only RTL code points, as some code 
points are neutral regarding directionality.

See RFC5893. This was one of the mistakes in IDNA2003.
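
For example (Python), several common code points have no strong direction at all:

    import unicodedata

    for ch in ["a", "\u05d0", "1", "-", " "]:
        print(repr(ch), unicodedata.bidirectional(ch))
    # 'a' -> L (strong LTR), U+05D0 HEBREW ALEF -> R (strong RTL),
    # '1' -> EN, '-' -> ES, ' ' -> WS: the last three are neither LTR nor RTL.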

    Mixed-direction strings are not directly supported by the PRECIS
    framework itself, since there is currently no widely accepted and
    implemented solution for the safe display of mixed-direction strings.

Define "mixed-direction strings", or else the text will be confusing.

       username   = userpart *(1*SP userpart)
       userpart   = 1*(idbyte)
                    ;
                    ; an "idbyte" is a byte used to represent a
                    ; UTF-8 encoded Unicode code point that can be
                    ; contained in a string that conforms to the
                    ; PRECIS "IdentifierClass"
                    ;

Do not talk about "byte" here but instead about "character". So, in the grammar, talk 
about Unicode code points. How the Unicode string is then encoded (for example in 
UTF-8) is a different issue.
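
A two-line illustration (Python) of why counting bytes and counting code points are 
different things:

    s = "\u00f6"                      # one code point, LATIN SMALL LETTER O WITH DIAERESIS
    print(len(s))                     # 1 code point
    print(len(s.encode("utf-8")))     # 2 bytes once UTF-8 encoding is applied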

6.  Order of Operations


    To ensure proper comparison, the rules specified for a particular
    string class or profile MUST be applied in the following order:

It must, as I said at the start, be clear when these operations take place. Is it a 
(destructive) transformation done somewhere (application, client side, server side), 
or just something that is part of a matching algorithm?

8.3.  IgnorableProperties (C)


    This category is defined in Section 2.3 of [RFC5892] but is not used
    in PRECIS.

    Note: See the "PrecisIgnorableProperties (M)" category below for a
    more inclusive category used in PRECIS identifiers.

See comments above.

8.7.  BackwardCompatible (G)


    This category is defined in Section 2.7 of [RFC5892] and is included
    by reference for use in PRECIS.

    Note: Because of how the PRECIS string classes are defined, only
    changes that would result in code points being added to or removed
    from the LetterDigits ("A") category would result in backward-
    incompatible modifications to code point assignments.

Are you sure? I am not.

Therefore,
    management of this category is handled via the processes specified in
    [RFC5892].  At the time of this writing (and also at the time that
    RFC 5892 was published), this category consisted of the empty set;
    however, that is subject to change as described in RFC 5892.

This, on the other hand, is true.

8.13.  PrecisIgnorableProperties (M)


    This PRECIS-specific category is used to group code points that are
    discouraged from use in PRECIS string classes.

    M: Default_Ignorable_Code_Point(cp) = True or
       Noncharacter_Code_Point(cp) = True

    The definition for Default_Ignorable_Code_Point can be found in the
    DerivedCoreProperties.txt [2] file, and at the time of Unicode 7.0 is
    as follows:

      Other_Default_Ignorable_Code_Point
    + Cf (Format characters)
    + Variation_Selector
    - White_Space
    - FFF9..FFFB (Annotation Characters)
    - 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters
                                     that should be visible)

I would be very nervous about having explicit code points in a generic rule like 
this. If code points are to be listed explicitly, add them to an exception rule. 
Otherwise *this* rule has to be changed whenever future changes would otherwise 
require exceptions to be added.

8.14.  Spaces (N)


    This PRECIS-specific category is used to group code points that are
    space characters.

    N: General_Category(cp) is in {Zs}

I am still thinking about how "spaces" are handled here. I will check and think more 
when I look at the mappings document, specifically when you also look at Arabic and 
other similar scripts. Just doing a destructive transformation to U+0020 does not 
always work, I think.
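
For reference, the Zs category is small and easy to enumerate (Python); the question 
is whether destructively mapping all of these to U+0020 is always the right thing:

    import sys, unicodedata

    zs = [cp for cp in range(sys.maxunicode + 1)
          if unicodedata.category(chr(cp)) == "Zs"]
    print([hex(cp) for cp in zs])
    # U+0020, U+00A0 NO-BREAK SPACE, U+2000..U+200A, U+3000 IDEOGRAPHIC SPACE, ...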

8.17.  HasCompat (Q)


    This PRECIS-specific category is used to group code points that have
    compatibility equivalents as explained in Chapter 2 and Chapter 3 of
    the Unicode Standard [Unicode7.0].

    Q: toNFKC(cp) != cp

    The toNFKC() operation returns the code point in normalization form
    KC.  For more information, see Section 5 of Unicode Standard Annex
    #15 [UAX15].

Think about the implications of this together with case folding (or the lack of it).
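
A sketch (Python) of the Q rule and one example of where it overlaps with case folding:

    import unicodedata

    def has_compat(cp):
        # Q: toNFKC(cp) != cp
        return unicodedata.normalize("NFKC", chr(cp)) != chr(cp)

    print(has_compat(0xFB01))      # True: LATIN SMALL LIGATURE FI decomposes to "fi"
    print("\ufb01".casefold())     # case folding alone also expands it to "fi"
    # Whether the ligature is handled by Q or by case folding depends on
    # which operations the profile applies, and in what order.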






--
Peter Saint-Andre
https://andyet.com/
