On 11/23/14, 7:58 AM, Patrik Fältström wrote:
I have on request from the chairs done a review of three Precis
documents. Here is the review of draft-ietf-precis-framework-20.

Patrik, many thanks for the thorough review!

The review has focused on the use of Unicode and the relationship
with IDNA2008.

You see my comments below.

Best, Patrik Fältström

draft-ietf-precis-framework-20

Abstract

Application protocols using Unicode characters in protocol strings
need to properly handle such strings in order to enforce
internationalization rules for strings placed in various protocol
slots (such as addresses and identifiers) and to perform valid
comparison operations (e.g., for purposes of authentication or
authorization).  This document defines a framework enabling
application protocols to perform the preparation, enforcement, and
comparison of internationalized strings ("PRECIS") in a way that
depends on the properties of Unicode characters and thus is agile
with respect to versions of Unicode.  As a result, this framework
provides a more sustainable approach to the handling of
internationalized strings than the previous framework, known as
Stringprep (RFC 3454).  This document obsoletes RFC 3454.

Explanation on what my review is concentrating on:

When using a character set like Unicode, where things like
transformation, comparison, etc. come into play, the actual
transformation can happen in multiple locations in the architecture,
and what is needed is for applications to understand what format
received strings are in and what format strings are expected to be in
when being sent. Of course the principle of "be liberal in what you
accept, and conservative in what you send" is very important.

A simplified sketch of the architecture is as follows:

1. [A] sends a string to [B] for storage

2. [C] sends a string to [B] for lookup, which implies a matching
algorithm is applied

3. If there is a match, data is sent back to [C]

In this very simple way of looking at the issues the questions
include:

A. What Unicode code points should [A] accept as input

B. What transformation is [A] expected to do?

C. What transformation is [A] allowed to do?

D. What Unicode code points is [A] allowed to send to [B]?

E. What Unicode code points can [B] expect from [A]?

F. What transformation is [B] expected to do on data sent from [A]
before data is stored in the database?

: :

I.e., it must be as clear as possible for each one of the parties A,
B, and C what they are expected to do, and what they must (and must
not) do.

For the profiles and application usages that go along with the framework (draft-ietf-precis-saslprepbis, draft-ietf-precis-nickname, draft-ietf-xmpp-6122bis), we have tried to address the considerations you raise. However, I think that in several places we could do a better job, such as:

i. Explain not only what entities are expected to do, but also what they are allowed to do.

ii. Specify what various using applications need to do (e.g., I don't think we've made that fully clear in the saslprepbis and nickname documents, although I think it is very clear in 6122bis).

These are matters that we can focus on during WGLC for those specs, I think.

And of course, ultimately the issue/trouble is that the Unicode
character set is created and designed in such a way that there are
many equivalences that humans in various contexts expect to be
treated as "the same". Which equivalences, of course, differs between
contexts.

Indeed.

Now to the review...

4.  Enable application protocols to define profiles of the PRECIS
string classes if necessary (addressing matters such as width
mapping, case mapping, Unicode normalization, and directionality)
but strongly discourage the multiplication of profiles beyond
necessity in order to avoid violations of the Principle of Least
User Astonishment.

It must also be clear who has the responsibility to do whatever
transformations are needed.

Yes. See above and below in this message, other messages in this thread, and the forthcoming revised I-D.

It is expected that this framework will yield the following
benefits:

o  Application protocols will be agile with regard to Unicode
versions.

o  Implementers will be able to share code point tables and
software code across application protocols, most likely by means
of software libraries.

o  End users will be able to acquire more accurate expectations
about the characters that are acceptable in various contexts.
Given this more uniform set of string classes, it is also expected
that copy/paste operations between software implementing different
application protocols will be more predictable and coherent.

It must also be clear to everyone involved what the normative,
authoritative source is for what is allowed and what is not.

For IDNA2008 (for example) it is the _algorithm_ that is normative,
not any tables derived by applying the algorithm to a specific
version of the Unicode character set.

Agreed. That's why the introduction states:

   The character categories and calculation rules defined under
   Section 7 and Section 8 are normative and apply to all Unicode code
   points.  The code point table that results from applying the
   character categories and calculation rules to the latest version of
   Unicode can be found in an IANA registry.

When an application applies a profile of a PRECIS string class, it
can achieve the following objectives:

a.  Determine if a given string conforms to the profile, thus
enabling enforcement of the rules (e.g., to determine if a string
is allowed for use in the relevant protocol slot specified by an
application protocol).

b.  Determine if any two given strings are equivalent, thus
enabling comparison (e.g., to make an access decision for purposes
of authentication or authorization as further described in
[RFC6943]).

And of course there is applying a transformation on a received string
before it is passed on to the next step of whatever process the
application is participating in. In this case, the string is in one
form before the transformation and another form after it. It must
also be clear to everyone involved that the transformations applied
are very seldom (if ever) reversible. Specifically, this is the case
for case folding transformations.

Ack.

3.  Preparation, Enforcement, and Comparison


This document distinguishes between three different actions that
an entity can take with regard to a string:

o  Enforcement entails applying all of the rules specified for a
particular string class or profile thereof to an individual string,
for the purpose of determining if the string can be used in a given
protocol slot.

o  Comparison entails applying all of the rules specified for a
particular string class or profile thereof to two separate strings,
for the purpose of determining if the two strings are equivalent.

In fact I think "comparison" entails three steps:

1. Apply all transformation and enforcement on string A.

2. Apply all transformation and enforcement on string B.

3. Compare the strings A and B, Unicode code point by code point.
Only if all code points are the same is the result a positive match.

Agreed. I'll adjust the text along those lines.
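As a rough sketch of those three steps (in Python, with an invented
toy repertoire check that is not any PRECIS-defined profile):

```python
import unicodedata

def enforce(s: str) -> str:
    """Hypothetical enforcement for an imagined profile: normalize
    with NFKC, casefold, then verify the remaining code points
    against a toy letters-and-digits repertoire."""
    s = unicodedata.normalize("NFKC", s).casefold()
    for ch in s:
        if not ch.isalnum():
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return s

def compare(a: str, b: str) -> bool:
    # Steps 1 and 2: apply all transformations and enforcement
    # to each string independently.
    # Step 3: compare the results code point by code point
    # (which is what Python's == on str does).
    return enforce(a) == enforce(b)
```

With this sketch, compare("Fu\u00dfball", "FUSSBALL") is True, since
default case folding maps U+00DF to "ss".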

o  Preparation entails only ensuring that the characters in an
individual string are allowed by the underlying PRECIS string
class.

I think the idea with "preparation" is to apply certain
transformations and then, after transformation, ensure that all
characters are allowed in the context in which they exist, so that
the final string after the preparation step is a valid PRECIS string?

During discussion of draft-ietf-xmpp-6122bis in the XMPP WG, several commenters noted that they would prefer to assign more lightweight processing to entities that are not capable of full enforcement of all the rules for a profile (i.e., not heavy or advanced tasks like Unicode normalization). Out of that conversation emerged the concept of "preparation" as involving only limits on the character ranges.

Perhaps we could have chosen a less ambiguous term than "preparation"?

I would recommend explicitly mentioning the fact (destructive)
transformation might occur in this step.

Destructive transformation doesn't sound good. Do you have examples of what that means?

In most cases, authoritative entities such as servers are
responsible for enforcement, whereas subsidiary entities such as
clients are responsible only for preparation.  The rationale for
this distinction is that clients might not have the facilities (in
terms of device memory and processing power) to enforce all the
rules regarding internationalized strings (such as width mapping
and Unicode normalization), although they can more easily limit the
repertoire of characters they offer to an end user.  By contrast,
it is assumed that a server would have more capacity to enforce the
rules, and in any case acts as an authority regarding allowable
strings in protocol slots such as addresses and endpoint
identifiers.  In addition, a client cannot necessarily be trusted
to properly generate such strings, especially for
security-sensitive contexts such as authentication and
authorization.

This paragraph is very vague. I think the protocol needs a much
stricter specification of who is expected to do what, because the
protocol itself (for example between client and server) must be
robust enough to carry whatever code points the client is using.

Yes, and that's what each application usage or profile needs to specify very clearly. We can provide some guidelines for such text in the framework, but I don't think we can provide that text for all applications here.

Valid:  Defines which code points and character categories are
treated as valid input to the string.

The term "input" is not clear to me, given transformation might
occur.

Good point. Will fix.

Disallowed:  Defines which code points and character categories
need to be excluded from the string.

It is a bit confusing to talk about both categories and code points
at the same time. I would recommend that at this point in the
document you talk about what code points are disallowed. The reason
is that you might have a category that is disallowed while a code
point of that category is allowed (based on other rules, like
exceptions). To make it crystal clear what is disallowed, I recommend
using that term only for code points.

Yes, we'll change that.

4.2.1.  Valid


o  Code points traditionally used as letters and numbers in
writing systems, i.e., the LetterDigits ("A") category first
defined in [ RFC5892] and listed here under Section 8.1 .

o  Code points in the range U+0021 through U+007E, i.e., the
(printable) ASCII7 ("K") rule defined under Section 8.11 .  These
code points are "grandfathered" into PRECIS and thus are valid even
if they would otherwise be disallowed according to the
property-based rules specified in the next section.

Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
category from IDNA2008, the range of characters allowed in the
IdentifierClass is wider than the range of characters allowed in
IDNA2008.  The main reason is that IDNA2008 applies the Unstable
category before the LetterDigits category, thus disallowing
uppercase characters, whereas the IdentifierClass does not apply
the Unstable category.

You must remove the code points of class ("C") in RFC5892.

The only difference between class "C" (IDNA2008) and class "M" (PRECIS) is that in PRECIS we allow white space code points in the FreeformClass. Thus we use "C", not "M".

Or to state things differently. If one look at the Unicode tables,
the following combination of matches exists for code points that
matches category "A" and at least one more category, for Unicode
7.0.0:

AB ABC ABF AC ACI AD AE AF AI

Several of these combinations are, given this definition, valid,
which I would not say is recommended for use in identifiers.

OK, I will look at this more closely.

Further, regarding not including Stable: this implies it is allowed
to use code points in PRECIS that are not stable regarding
normalization and/or case folding. The normalization and/or case
folding still must be done somewhere before matching happens.

Let's for example say that "A" and "a" are both valid (which they
would be). The question is then whether there is case mapping before
comparison or not, and if there is, it must be ensured that the two
identities "A" and "a" are not both created in some namespace.

I know this is exactly what you have been talking about and
discussing, but it must be absolutely crystal clear that everyone
understands what this implies, specifically when case folding (lower
case) is replaced with NFC or some other normalization algorithm.
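For instance, one way to guarantee that "A" and "a" cannot both be
created in a namespace is to key stored identifiers by their
casefolded form. A toy sketch (my illustration, not anything PRECIS
defines):

```python
class Namespace:
    """Toy identifier registry: stores identifiers case-preserved,
    but keys them by their casefolded form so that "A" and "a"
    cannot both be registered."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def register(self, ident: str) -> None:
        key = ident.casefold()
        if key in self._by_key:
            raise ValueError(
                f"{ident!r} collides with existing {self._by_key[key]!r}")
        self._by_key[key] = ident

    def lookup(self, ident: str) -> str:
        return self._by_key[ident.casefold()]
```

After register("Alice"), a later register("alice") raises an error,
while lookup("ALICE") still returns the case-preserved "Alice".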

I think this is handled more directly by the UsernameCaseMapped and UsernameCasePreserved profiles of the IdentifierClass specified in draft-ietf-precis-saslprepbis.

More about this later.

There is always more to be said about i18n. :-)

4.2.3.  Disallowed

See above.

Some application technologies need strings that can be used in a
free-form way, e.g., as a password in an authentication exchange
(see [ I-D.ietf-precis-saslprepbis ]) or a nickname in a chatroom
(see [ I-D.ietf-precis-nickname ]).  We group such things into a
class called "FreeformClass" having the following features.

Security Warning: As mentioned, the FreeformClass prioritizes
expressiveness over safety; Section 11.3 describes some of the
security hazards involved with using or profiling the
FreeformClass.

Security Warning: Consult Section 11.6 for relevant security
considerations when strings conforming to the FreeformClass, or a
profile thereof, are used as passwords.

There are very dangerous issues here when using this class for any
kind of comparison, specifically in the case of passwords and user
names (or file names)

We actively discourage the use of the FreeformClass (or profiles thereof) for user names, file names, and the like.

See however draft-ietf-precis-nickname, where we use a profile of the FreeformClass for nicknames (e.g., in a chatroom).

where it is unclear what kind of normalization might happen between
"the keyboard" and "the application". I.e., the user might really
think they entered a certain code point, but in reality what the
application sees is either NFC(string) or NFD(string), and which one
it is might vary with the operating system (or file system) in use.
Specifically when leaving this undefined.

I am all in favor of leaving this undefined for this class, but then
it might not be best to do any kind of matching (including
searching), unless some kind of transformation is made for the
matching/searching.

We might need some more text about these matters in the profiles that use the FreeformClass: OpaqueString (in saslprepbis) and Nickname. Another topic to fix during WGLC.

I would recommend the following general rules:

- IdentifierClass is used wherever it is important that the namespace
include only globally unique strings, like identifiers for user
names, etc.

Agreed.

- IdentifierClass is also used for passwords

That might not be consistent with the need for entropy in passwords. See draft-ietf-precis-saslprepbis on this point.

and whenever a
comparison is used,

Ideally, passwords are not stored in the clear, in which case some kind of transformation will need to happen before any comparison is done. That's why draft-ietf-precis-saslprepbis states:

   In protocols that provide passwords as input to a cryptographic
   algorithm such as a hash function, the client will need to perform
   proper preparation of the password before applying the algorithm,
   since the password is not available to the server in plaintext form.

but the transformation should not be
destructive.

Here again I'm not sure exactly what that means.

- FreeformClass is used for storage of various things

- Protocols must be stable for FreeformClass in the transport

If you have time, could you expand upon those two points a bit?

4.3.1.  Valid

See comments above regarding combination of A with other categories.

5.1.  Profiles Must Not Be Multiplied Beyond Necessity


The risk of profile proliferation is significant because having
too many profiles will result in different behavior across various
applications, thus violating what is known in user interface
design as the Principle of Least Astonishment.

Indeed, we already have too many profiles.  Ideally we would have
at most two or three profiles.  Unfortunately, numerous
application protocols exist with their own quirks regarding
protocol strings.

Domain names, email addresses, instant messaging addresses,
chatroom nicknames, filenames, authentication identifiers,
passwords, and other strings are already out there in the wild and
need to be supported in existing application protocols such as DNS,
SMTP, XMPP, IRC, NFS, iSCSI, EAP, and SASL among others.

Nevertheless, profiles must not be multiplied beyond necessity.

To help prevent profile proliferation, this document recommends
sensible defaults for the various options offered to profile
creators (such as width mapping and Unicode normalization).  In
addition, the guidelines for designated experts provided under
Section 9 are meant to encourage a high level of due diligence
regarding new profiles.

What are the requirements to create a new Profile?

Rule #1: Don't create new profiles.

Rule #2: Don't ignore Rule #1.

Rule #3: If you push on through the first two rules, provide a strong justification for the new profile and answer all the questions under the IANA considerations in the framework. Even then, expect a lot of pushback from the designated experts.

Either there are requirements or there are not. The text above does
not add much help if there is a conflict in the future regarding a
request for registration of a new profile. Are you really happy with
what is above?

I'm happy with the totality of information and requirements in Sections 5.1, 9, and 10.3 of framework-20.

The WG must honestly say "yes" to this. If they do, I am happy! :-)

I can't speak for the WG. :-)

5.2.1.  Width Mapping Rule


The width mapping rule of a profile specifies whether width
mapping is performed on fullwidth and halfwidth characters, and how
the mapping is done.  Typically such mapping consists of mapping
fullwidth and halfwidth characters, i.e., code points with a
Decomposition Type of Wide or Narrow, to their decomposition
mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
mapped to DIGIT ZERO (U+0030).

The normalization form specified by a profile (see below) has an
impact on the need for width mapping.  Because width mapping is
performed as a part of compatibility decomposition, a profile
employing either normalization form KD (NFKD) or normalization
form KC (NFKC) does not need to specify width mapping.  However,
if Unicode normalization form C (NFC) is used (as is recommended)
then the profile needs to specify whether to apply width mapping;
in this case, width mapping is in general RECOMMENDED because
allowing fullwidth and halfwidth characters to remain unmapped to
their compatibility variants would violate the Principle of Least
Astonishment.  For more information about the concept of width in
East Asian scripts within Unicode, see Unicode Standard Annex #11
[ UAX11 ].
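The interaction described above can be observed directly with
Python's unicodedata (an illustration, not part of the draft):

```python
import unicodedata

# FULLWIDTH DIGIT ZERO (U+FF10) has a compatibility decomposition,
# so NFKC maps it to DIGIT ZERO (U+0030)...
assert unicodedata.normalize("NFKC", "\uFF10") == "0"

# ...whereas NFC leaves it untouched, which is why a profile using
# NFC needs to decide separately whether to apply width mapping.
assert unicodedata.normalize("NFC", "\uFF10") == "\uFF10"
```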

Doing this mapping is not easy. I strongly recommend that an
algorithm be presented, which each profile then either uses or does
not.

We've been trying not to define new algorithms. Is there one we can point to elsewhere?

5.2.3.  Case Mapping Rule


The case mapping rule of a profile specifies whether case mapping
is performed (instead of case preservation) on uppercase and
titlecase characters, and how the mapping is done (e.g., mapping
uppercase and titlecase characters to their lowercase
equivalents).

You either apply mapping or not, to _all_ code points. The above
makes it look as if case mapping is sometimes not to be performed.

Would we ever apply case mapping to characters other than uppercase and titlecase characters? Perhaps you mean only to point out that the rule is applied to all code points, but only uppercase and titlecase code points are transformed by the rule. If so, the text is easy to adjust.

If case mapping is desired (instead of case preservation), it is
RECOMMENDED to use Unicode Default Case Folding as defined in
Chapter 3 of the Unicode Standard [ Unicode7.0 ].

Note: Unicode Default Case Folding is not designed to handle
various localization issues (such as so-called "dotless i" in
several Turkic languages).  The PRECIS mappings document [
I-D.ietf-precis-mappings ] describes these issues in greater detail
and defines a "local case mapping" method that handles some
locale-dependent and context-dependent mappings.

In order to maximize entropy and minimize the potential for false
positives, it is NOT RECOMMENDED for application protocols to map
uppercase and titlecase code points to their lowercase equivalents
when strings conforming to the FreeformClass, or a profile
thereof, are used in passwords; instead, it is RECOMMENDED to
preserve the case of all code points contained in such strings and
then perform case-sensitive comparison.  See also the related
discussion in [ I-D.ietf-precis-saslprepbis ].

The above is too complicated.

It is long-winded, but is it more complicated than your terse version below?

The only realistic way of handling casing is to:

1. Decide whether case-insensitive matching is to be done or not

2. If it is, case fold to lower case before the matching

3. If transformation of a string is made, case fold to lower case as
part of the transformation

If I understand the difference between (2) and (3) correctly, (2) applies to case-insensitive matching (e.g., an application might preserve case on storing strings but compare two output strings in a case-insensitive way), whereas (3) applies to storage of strings in some canonical format.

4. Do not forget the issues that arise when normalization and case
folding are both applied to the same string

Yes, I will give more thought to that and see if we can add some appropriate text.
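One concrete illustration of point 4, using Python's unicodedata (the
specific code points are my example): case folding can leave a string
in a non-NFC form, so normalization may need to be reapplied after
folding.

```python
import unicodedata

precomposed = "\u01F0"   # LATIN SMALL LETTER J WITH CARON
decomposed = "J\u030C"   # capital J + COMBINING CARON (no precomposed form)

# NFC cannot compose the capital form, and casefolding it yields
# "j" + U+030C, which is not the NFC form U+01F0:
folded = unicodedata.normalize("NFC", decomposed).casefold()
assert folded != precomposed

# Reapplying NFC after case folding recovers the canonical form:
assert unicodedata.normalize("NFC", folded) == precomposed
```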

5.2.5.  Directionality Rule


The directionality rule of a profile specifies how to treat
strings containing left-to-right (LTR) and right-to-left (RTL)
characters (see Unicode Standard Annex #9 [ UAX9 ]).  A profile
usually specifies a directionality rule that restricts strings to
be entirely LTR
strings or entirely RTL strings and defines the allowable
sequences of characters in LTR and RTL strings.  Possible rules
include, but are not limited to, (a) considering any string that
contains a right- to-left code point to be a right-to-left string,
or (b) applying the "Bidi Rule" from [ RFC5893 ].

One cannot restrict to only LTR or RTL, as some code points are
neutral regarding directionality.

See RFC5893. This was one of the mistakes in IDNA2003.

I think you're right that the foregoing text is not quite correct.

And in any case, as John Klensin reiterated at the mic in a recent PRECIS WG session, if we define something other than the Bidi Rule then we'll likely get it wrong. I think that text predates John's comments and hasn't been updated.

Mixed-direction strings are not directly supported by the PRECIS
framework itself, since there is currently no widely accepted and
implemented solution for the safe display of mixed-direction
strings.

Define Mixed-Direction strings or else the text will be confusing.

Will do.

   username   = userpart *(1*SP userpart)
   userpart   = 1*(idbyte)
                ;
                ; an "idbyte" is a byte used to represent a
                ; UTF-8 encoded Unicode code point that can be
                ; contained in a string that conforms to the
                ; PRECIS "IdentifierClass"
                ;

Do not talk about "byte" here but instead about "character". So, in
the grammar, talk about Unicode code points. How the Unicode string
is then encoded (for example as UTF-8) is a different issue.

Hmm. All of the new profiles or application usages specify UTF-8. This issue of bytes/octets vs. characters caused us problems in XMPP (in part because we specify length limits). I'll need to look at it more closely before making a change.

6.  Order of Operations


To ensure proper comparison, the rules specified for a particular
string class or profile MUST be applied in the following order:

It must, as I said at the start, be clear when these operations take
place. Is it a (destructive) transformation done somewhere
(application, client side, server side) or just something that is
part of a matching algorithm?

Yes, we've tried to address that in profile and application specs.

8.3.  IgnorableProperties (C)


This category is defined in Section 2.3 of [ RFC5892 ] but is not
used in PRECIS.

Note: See the "PrecisIgnorableProperties (M)" category below for a
more inclusive category used in PRECIS identifiers.

See comments above.

8.7.  BackwardCompatible (G)


This category is defined in Section 2.7 of [ RFC5892 ] and is
included by reference for use in PRECIS.

Note: Because of how the PRECIS string classes are defined, only
changes that would result in code points being added to or removed
from the LetterDigits ("A") category would result in backward-
incompatible modifications to code point assignments.

Are you sure? I am not.

I think we just leave that up to RFC 5892 and remove that note.

Therefore, management of this category is handled via the processes
specified in [ RFC5892 ].  At the time of this writing (and also at
the time that

RFC 5892 was published), this category consisted of the empty set;
however, that is subject to change as described in RFC 5892 .

This is true on the other hand.

8.13.  PrecisIgnorableProperties (M)


This PRECIS-specific category is used to group code points that
are discouraged from use in PRECIS string classes.

M: Default_Ignorable_Code_Point(cp) = True or
Noncharacter_Code_Point(cp) = True

The definition for Default_Ignorable_Code_Point can be found in
the DerivedCoreProperties.txt [ 2 ] file, and at the time of
Unicode 7.0 is as follows:

Other_Default_Ignorable_Code_Point + Cf (Format characters) +
Variation_Selector - White_Space - FFF9..FFFB (Annotation
Characters) - 0600..0604, 06DD, 070F, 110BD (exceptional Cf
characters that should be visible)

I would be very nervous about having explicit code points in a
generic rule like this. If the code points are to be listed
explicitly, add them to an exception rule. Otherwise *this* rule has
to be changed whenever future changes would otherwise require
exceptions to be added.

That simply repeats what Unicode 7.0 says, but we can remove it to prevent developer confusion.

8.14.  Spaces (N)


This PRECIS-specific category is used to group code points that
are space characters.

N: General_Category(cp) is in {Zs}

I am still thinking about how "spaces" are handled here. I will check
and think more when I look at the mapping documents, specifically
when also looking at Arabic and other similar scripts. Just doing a
destructive transformation to U+0020 does not always work, I think.

We have had a lot of discussion about space characters, not really captured all that well in the specs. But see Section 5.5:

5.5.  A Note about Spaces

   With regard to the IdentiferClass, the consensus of the PRECIS
   Working Group was that spaces are problematic for many reasons,
   including:

   o  Many Unicode characters are confusable with ASCII space.

   o  Even if non-ASCII space characters are mapped to ASCII space
      (U+0020), space characters are often not rendered in user
      interfaces, leading to the possibility that a human user might
      consider a string containing spaces to be equivalent to the same
      string without spaces.

   o  In some locales, some devices are known to generate a character
      other than ASCII space (such as ZERO WIDTH JOINER, U+200D) when a
      user performs an action like hitting the space bar on a keyboard.

   One consequence of disallowing space characters in the
   IdentifierClass might be to effectively discourage their use within
   identifiers created in newer application protocols; given the
   challenges involved with properly handling space characters
   (especially non-ASCII space characters) in identifiers and other
   protocol strings, the PRECIS Working Group considered this to be a
   feature, not a bug.

   However, the FreeformClass does allow spaces, which enables
   application protocols to define profiles of the FreeformClass that
   are more flexible than any profiles of the IdentifierClass.  In
   addition, as explained in the previous section, application protocols
   can also define application-layer constructs containing spaces.
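As a side note, the Zs test from rule N in Section 8.14 can be
sketched like this; the map-to-U+0020 policy is my illustration, not
something the framework mandates:

```python
import unicodedata

def map_space_characters(s: str) -> str:
    """Map every code point with General_Category Zs to U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in s)

# NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000) are both Zs:
assert map_space_characters("a\u00A0b\u3000c") == "a b c"

# ZERO WIDTH JOINER (U+200D), mentioned above, is General_Category
# Cf rather than Zs, so a Zs-based rule would not catch it:
assert map_space_characters("x\u200Dy") == "x\u200Dy"
```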

8.17.  HasCompat (Q)


This PRECIS-specific category is used to group code points that
have compatibility equivalents as explained in Chapter 2 and
Chapter 3 of the Unicode Standard [ Unicode7.0 ].

Q: toNFKC(cp) != cp

The toNFKC() operation returns the code point in normalization
form KC.  For more information, see Section 5 of Unicode Standard
Annex #15 [ UAX15 ].
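The Q rule can be checked directly with Python's unicodedata (an
illustration of the rule, not normative text):

```python
import unicodedata

def has_compat(cp: int) -> bool:
    """Q: toNFKC(cp) != cp, i.e. the code point changes under
    compatibility normalization."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch

assert has_compat(0x00B2)      # SUPERSCRIPT TWO has a compat mapping to "2"
assert has_compat(0xFF10)      # FULLWIDTH DIGIT ZERO maps to "0"
assert not has_compat(0x0041)  # LATIN CAPITAL LETTER A is unchanged
```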

Think about implications of this together with case folding (or
not).

I shall. :-)

Thanks again for the review. I'll work to update the spec soon.

Peter

--
Peter Saint-Andre
CTO @ &yet
https://andyet.com/

_______________________________________________
precis mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/precis
