Hi Joe, thanks for the review and my apologies for taking a month to reply.
On 7/30/14, 3:25 PM, Joe Hildebrand (jhildebr) wrote:
The reason the precis group got a spate of questions from me today was that I
was prepping to do this review. There are a couple of issues that the
precis folk should pay more attention to.
> 1. Introduction
...
> Instead, this document builds upon the
> internationalization framework defined by the IETF's PRECIS Working
> Group [I-D.ietf-precis-framework], while attempting to ensure that
> the characters allowed in Jabber IDs under stringprep are still
> allowed and handled in the same way under PRECIS.
"the same way" means more backward-compatibility to me than I think we
intend here.
Yes, that is a bit vague, even though it does say "attempting". Here is
one possible approach...
OLD
Instead, this document builds upon the
internationalization framework defined by the IETF's PRECIS Working
Group [I-D.ietf-precis-framework], while attempting to ensure that
the characters allowed in Jabber IDs under stringprep are still
allowed and handled in the same way under PRECIS.
NEW
Instead, this document builds upon the
internationalization framework defined by the IETF's PRECIS Working
Group [I-D.ietf-precis-framework]. Although every attempt has been
made to ensure that the characters allowed in Jabber IDs under
stringprep are still allowed and handled in the same way under
PRECIS, there is no guarantee of strict backward compatibility
because of changes in Unicode and the fact that PRECIS handling is
based on Unicode properties, not a hardcoded table of characters.
> 3.1. Fundamentals
>
> jid = [ localpart "@" ] domainpart [ "/" resourcepart ]
> localpart = 1*1023(localpoint)
> ;
> ; a "localpoint" is a UTF-8 encoded
> ; Unicode code point that conforms to
> ; the "JIDlocalIdentifierClass" profile
> ; of the PRECIS IdentifierClass
> ;
This implies 1023 codepoints, not 1023 bytes to me. Same issue for ifqdn
and resourcepart. 6122 just had 1*; I think going back to that would be
fine since we have a rule below that captures the max size.
Your proposal seems fine to me, too. It's hard to capture these nuances
in ABNF, at times. Although, from later in the thread, "localbyte" would
work for me.
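To make the code point vs. octet distinction concrete, here is a minimal
Python sketch. The helper name and the sample string are assumptions made
for illustration; nothing here is defined in 6122bis.

```python
# Illustrative only: measure a JID part both ways to show why a rule such as
# "1*1023(localpoint)" (code points) differs from a 1023-octet limit.

MAX_OCTETS = 1023  # limit stated in the draft text for each JID part


def lengths(part: str) -> tuple:
    """Return (number of code points, number of UTF-8 octets)."""
    return len(part), len(part.encode("utf-8"))


if __name__ == "__main__":
    # 600 copies of U+00E9, which takes two octets in UTF-8.
    sample = "\u00e9" * 600
    points, octets = lengths(sample)
    # Under a code point limit this passes; under an octet limit it does not.
    print(points, octets, octets <= MAX_OCTETS)
```

A string can therefore satisfy a limit of 1023 code points while exceeding
1023 octets, which is the ambiguity noted above.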
> 3.2. Domainpart
>
> The domainpart of a JID is that portion after the '@' character (if
> any) and before the '/' character (if any); it is the primary
I think it's often surprising to people that foo/@bar is a valid JID with
"foo" as the domainpart and "@bar" as the resourcepart. The text above,
although pulled from 6122, might be better as:
The domainpart of a JID is that portion after the first '@' character (if
any) and before the first '/' character (if any);
That's acceptable to me.
and possibly adding the example.
Examples are good. I'll add a few to Table 1.
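As a rough illustration of the "first '/' then first '@'" reading, here is
a small Python sketch. The function name is hypothetical and the code does
no PRECIS or IDNA processing; it only shows how the three parts would be
split under that reading.

```python
def split_jid(jid: str):
    """Split a JID into (localpart, domainpart, resourcepart).

    Parts that are absent are returned as None. This is a sketch of the
    splitting step only; it performs no validation of the parts.
    """
    # Everything after the first '/' is the resourcepart.
    rest, slash, resource = jid.partition("/")
    # Within what precedes that '/', the first '@' ends the localpart.
    local, at, domain = rest.partition("@")
    if not at:
        # No '@' before the first '/': the remainder is the domainpart.
        local, domain = None, rest
    return local, domain, (resource if slash else None)


if __name__ == "__main__":
    for example in ("foo/@bar", "juliet@example.com/balcony", "example.com"):
        print(example, "->", split_jid(example))
```

With this reading, "foo/@bar" yields "foo" as the domainpart and "@bar" as
the resourcepart, matching the surprising case mentioned above.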
> In general, the content of a domainpart is an Internationalized
> Domain Name ("IDN") as described in the specifications for
> Internationalized Domain Names in Applications (commonly called
> "IDNA2008"), and a domainpart is an "IDNA-aware domain name slot" as
> defined in [RFC5890]. The following rules apply to a domainpart that
> consists of a fully-qualified domain name and MUST be applied in the
> following order:
When do these rules need to be applied? Only before comparison or routing?
That is a very good question.
This might be a difference between the "preparation" and "comparison" of
the PRECIS acronym.
You'll notice that the PRECIS nickname spec draws a sharper distinction
between preparation and comparison than the others:
http://www.ietf.org/archive/id/draft-ietf-precis-nickname-09.txt
Section 2 there says in part:
For preparation purposes (most commonly, when a chatroom client
generates a nickname from user input for inclusion as a protocol
element that represents a "nickname slot"), an application MUST at a
minimum ensure that the string conforms to the "FreeformClass" string
class defined in [I-D.ietf-precis-framework]; however, it MAY in
addition perform the normalization and mapping operations specified
below for comparison purposes.
For comparison purposes (e.g., when a chatroom server determines if
two nicknames are in conflict during the authorization process), an
application MUST treat a nickname as specified below (these rules
constitute the "NicknameFreeformClass" profile). The operations
specified MUST be completed in the order shown (in particular,
normalization MUST be performed after the other mapping steps and
before validity-checking against the definition of the PRECIS
"FreeformClass", consistent with [I-D.ietf-precis-framework]).
[various rules elided]
I wonder if we want to say, in general, that there is something of a
lower bar for preparation than for comparison. For example, for an XMPP
localpart we might say that an entity doing preparation just needs to
ensure that it doesn't include any characters outside of the PRECIS
IdentifierClass, whereas an entity doing comparison needs to apply the
normalization and mapping rules. The primary reason we might do this is
that it could ease the burden on XMPP clients or servers during certain
operations, whereas at those times when comparison is truly needed
(e.g., when user authentication or authorization are being made) the
full set of rules would be applied.
Although I'm not entirely comfortable with this approach, pragmatically
it might be more acceptable than saying that all entities must apply all
of the rules all of the time.
This is related to text in Section 4:
Enforcement of the XMPP address format rules is the responsibility of
XMPP servers. Although XMPP clients SHOULD prepare complete JIDs and
parts of JIDs in accordance with this document before including them
in protocol slots within XML streams (such that JIDs and parts of
JIDs are in conformance), XMPP servers MUST enforce the rules
wherever possible and reject stanzas and other XML elements that
violate the rules (for stanzas, by returning a <jid-malformed/> error
to the sender as described in Section 8.3.3.8 of [RFC6120]).
That text seems to imply the same principle: clients prepare and servers
enforce (by means of comparison?). But I think we could be clearer about
the whole matter by explicitly saying that enforcement includes
application of all the rules (just as comparison does - it's just that
comparison involves applying all of the rules to two strings in order to
determine if they are "equivalent", whereas enforcement involves applying
the rules to a single string).
> 1. The domainpart MUST contain only NR-LDH labels and U-labels as
> defined in [RFC5890] and MUST consist only of Unicode code points
> that conform to the rules specified in [RFC5892] (which includes
> Unicode normalization). This implies that the domainpart MUST
> NOT include A-labels as defined in [RFC5890]; each A-label MUST
> be converted to a U-label during preparation of a domainpart, and
> comparison MUST be performed using U-labels, not A-labels.
This seems like an always rule, including for dumb clients.
Things are a bit more clear-cut with regard to rules that are based on
PRECIS, not IDNA, because the models are slightly different. In PRECIS
we have base string classes (IdentifierClass and FreeformClass), so it
might make sense to say that preparation involves ensuring that the
preparing entity doesn't allow in any code points that are disallowed
for that base string class. We don't have base string classes in IDNA.
Although the foregoing rule is similar to the base string class idea, it
goes beyond by including normalization. I'd almost prefer that we figure
this out very clearly first for PRECIS-based identifiers (in XMPP, the
localpart and resourcepart) and then see how the resulting text can be
ported over to our use of IDNA-based identifiers (in XMPP, the domainpart).
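For the A-label to U-label conversion mentioned in rule 1 quoted above,
a minimal Python sketch might look like the following. It uses only the
standard library "punycode" codec to decode each label; it does not perform
IDNA2008 validation of the result, and the function name is an assumption
for illustration.

```python
def to_u_labels(domain: str) -> str:
    """Replace each A-label ("xn--...") in a domain with its decoded form.

    Sketch only: a real implementation would also check that the decoded
    label is a valid U-label under RFC 5890 through RFC 5892.
    """
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the ACE prefix and decode the Punycode payload.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


if __name__ == "__main__":
    print(to_u_labels("xn--mnchen-3ya.de"))
```

Labels that do not start with the ACE prefix are passed through untouched.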
> 2. All uppercase and titlecase code points within the domainpart
> MUST be mapped to their lowercase equivalents, preferably using
> Unicode Default Case Folding as defined in Chapter 3 of the
> Unicode Standard [UNICODE].
Dumb clients might get away with this and the system would still work.
> 3. Fullwidth and halfwidth characters within the domainpart MUST be
> mapped to their decomposition mappings.
Dumb clients have no shot at this one.
Right - in the emerging approach we're exploring here, the latter two
rules would be a matter of enforcement and comparison only, not of
preparation.
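To show why width mapping needs Unicode data that a simple client may not
have, here is a Python sketch that maps fullwidth and halfwidth characters
to their decomposition mappings using the standard unicodedata module. The
function name is hypothetical, and the results depend on the Unicode
version shipped with the interpreter.

```python
import unicodedata


def map_width(text: str) -> str:
    """Map characters with a <wide> or <narrow> decomposition to its target."""
    out = []
    for ch in text:
        # Returns e.g. "<wide> 0053" for U+FF33, or "" when there is none.
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            # Rebuild the character(s) listed after the tag.
            ch = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # FULLWIDTH LATIN CAPITAL LETTER S + FULLWIDTH LATIN SMALL LETTER T
    print(map_width("\uff33\uff54"))
```

Characters without a <wide> or <narrow> decomposition are left alone, so
this is narrower than full compatibility normalization.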
> Implementation Note: The foregoing order is different from the
> order for localparts and resourceparts as described below, to
> maintain consistency with the IDNA methods in both [RFC5892] and
> [RFC5895].
>
> After any and all normalization, conversion, and mapping of code
> points,
as well as conversion to UTF-8.
True, although we kind of assume that in the XMPP world because all data
sent over an XMPP stream is required to be UTF-8. Mentioning it seems
useful, though.
> a domainpart MUST NOT be zero octets in length and MUST NOT
> be more than 1023 octets in length. (Naturally, the length limits of
> [RFC1034] apply, and nothing in this document is to be interpreted as
> overriding those more fundamental limits.)
>
> 3.3. Localpart
>
> The localpart of a JID is an optional identifier placed before the
> domainpart and separated from the latter by the '@' character.
> Typically a localpart uniquely identifies the entity requesting and
> using network access provided by a server (i.e., a local account),
> although it can also represent other kinds of entities (e.g., a chat
> room associated with a multi-user chat service [XEP-0045]). The
> entity represented by an XMPP localpart is addressed within the
> context of a specific domain (i.e., <localpart@domainpart>).
>
> A localpart MUST NOT be zero octets in length and MUST NOT be more
> than 1023 octets in length. This rule is to be enforced after any
> normalization and mapping of code points.
and conversion to UTF-8.
As above.
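One reason the length rule is stated as applying after mapping is that
mapping can change the octet count. The Python sketch below uses
str.casefold() as a stand-in for the mapping step (an assumption; the
actual rules involve more than case folding) and then measures the UTF-8
length.

```python
MAX_OCTETS = 1023  # limit stated in the draft text


def length_ok_after_folding(part: str) -> bool:
    """Apply case folding, then check the UTF-8 length limits."""
    folded = part.casefold()
    size = len(folded.encode("utf-8"))
    return 0 < size <= MAX_OCTETS


if __name__ == "__main__":
    # U+0130 is two octets in UTF-8, but folds to "i" + U+0307 (three octets).
    part = "\u0130" * 500
    print(len(part.encode("utf-8")), length_ok_after_folding(part))
```

Here the input is 1000 octets before folding but 1500 after, so checking
the length too early would give the wrong answer.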
> A localpart MUST consist only of Unicode code points that conform to
> the "JIDlocalIdentifierClass" profile of the "IdentifierClass" base
> string class defined in [I-D.ietf-precis-framework]. The
> JIDlocalIdentifierClass profile includes all code points allowed by
> the IdentifierClass base class, with the exception of the following
> characters that are explicitly disallowed in XMPP localparts:
(special precis focus)
I would have expected this to be phrased more similarly to step 2 of
http://tools.ietf.org/html/draft-ietf-precis-framework-17#section-5, or
for section 5 to just have a step about codepoints forbidden in a given
usage of the selected precis class.
Good point - I agree that more internal harmony would be helpful here
between the framework and the various profiles.
> The normalization and mapping rules for the JIDlocalIdentifierClass
> are as follows, where the operations specified MUST be completed in
> the order shown:
Again, I think we need language about when these rules are applied. The
rest of the section is about what is allowed, not about how to compare.
As discussed above, I think we need to more clearly delineate what's
required for preparation, what's required for enforcement, and what's
required for comparison. And as mentioned, it seems to me right now that the
same rules are involved in enforcement and comparison, except that
applying those rules during enforcement is a way to determine if a
single string conforms, whereas applying those rules during comparison
is a way to determine if two strings are "equivalent". That said, your
use of the phrase "about what is allowed, not about how to compare"
might suggest that more is involved in comparison than in enforcement.
To choose a simple example, is the JID <StPeter@jabber.org> "allowed" if
the jabber.org server enforces all the rules for a localpart? It seems
to me not. We're saying that a client could send that (since both "S"
and "P" are allowed by the category "Lu - Uppercase_Letter" and thus
would pass the preparation test), but that a server which is enforcing
the rules would map "S" to "s" and "P" to "p". However, the rules for
comparison are the same as for enforcement: "StPeter" and "stpeter"
would compare as equivalent.
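The three roles discussed here can be sketched in Python as follows. This
is illustrative only: the category allowlist is a crude stand-in for the
PRECIS IdentifierClass (not its real derivation), str.casefold() stands in
for the case mapping rule, and all function names are hypothetical.

```python
import unicodedata

# ASSUMPTION: a rough approximation of "allowed" code points, used only so
# that the example runs. The real IdentifierClass is defined by the framework.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def prepare_ok(localpart: str) -> bool:
    """Preparation: only check that no disallowed code point is present."""
    return bool(localpart) and all(
        unicodedata.category(ch) in ALLOWED_CATEGORIES for ch in localpart
    )


def enforce(localpart: str) -> str:
    """Enforcement: check a single string and apply the mapping to it."""
    if not prepare_ok(localpart):
        raise ValueError("localpart contains a disallowed code point")
    return localpart.casefold()


def equivalent(a: str, b: str) -> bool:
    """Comparison: apply the enforcement rules to both strings, then compare."""
    return enforce(a) == enforce(b)


if __name__ == "__main__":
    print(prepare_ok("StPeter"), enforce("StPeter"), equivalent("StPeter", "stpeter"))
```

Under these assumptions "StPeter" passes preparation, is mapped to
"stpeter" by enforcement, and compares as equivalent to "stpeter".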
> 1. Fullwidth and halfwidth characters MUST be mapped to their
> decomposition mappings.
>
> 2. Uppercase and titlecase characters MUST be mapped to their
> lowercase equivalents, preferably using Unicode Default Case
> Folding as defined in Chapter 3 of the Unicode Standard
> [UNICODE].
Nothing about SpecialCasing?
That's a question for the WG. :-)
The PRECIS framework states:
If case mapping is desired (instead of case preservation), it is
RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
3 of the Unicode Standard [Unicode6.3].
Note: Unicode Default Case Folding is not designed to handle
various localization issues (such as so-called "dotless i" in
several Turkic languages). The PRECIS mappings document
[I-D.ietf-precis-mappings] describes these issues in greater
detail and defines a "local case mapping" method that handles some
locale-dependent and context-dependent mappings.
Given the discussions in recent PRECIS WG meetings, I would shy away
from applying locale-dependent and context-dependent mappings in XMPP
localparts. However, I'm open to argument.
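For reference, here is what locale-independent case folding does with the
"dotless i" family, using Python's str.casefold(), which follows the
Unicode default (full) case folding. The helper name is an assumption for
illustration.

```python
def fold(text: str) -> str:
    """Apply Unicode default case folding (no locale tailoring)."""
    return text.casefold()


if __name__ == "__main__":
    samples = {
        "LATIN CAPITAL LETTER I": "I",
        "LATIN SMALL LETTER DOTLESS I": "\u0131",
        "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
    }
    for name, ch in samples.items():
        folded = fold(ch)
        # Show the folded result as a list of code points.
        print(name, "->", [f"U+{ord(c):04X}" for c in folded])
```

Default folding maps "I" to "i", leaves U+0131 unchanged, and maps U+0130
to "i" followed by U+0307, so Turkic expectations are not met without a
locale-dependent mapping.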
> A resourcepart MUST NOT be zero octets in length and MUST NOT be more
> than 1023 octets in length. This rule is to be enforced after any
> normalization and mapping of code points.
>
> A resourcepart MUST consist only of Unicode code points that conform
> to the "JIDresourceFreeformClass" profile of the "FreeformClass" base
> string class defined in [I-D.ietf-precis-framework].
>
> The normalization and mapping rules for the resourcepart of a JID are
> as follows, where the operations specified MUST be completed in the
> order shown:
Again, when are the rules applied?
See above.
> 1. Fullwidth and halfwidth characters MAY be mapped to their
> decomposition mappings.
(precis)
I need a hint as to when to do this. "MAY" isn't nearly enough.
Do you mean "when" as "in what contexts is it smart to do width mapping
on resourceparts" or as something else (e.g., "when" could mean "by
which entities" such as clients, servers, and XMPP "components")?
Later in this thread, you and Florian Zeitz seem to think that MUST NOT
perform width mapping is the right approach.
However, resourceparts are used in multiple contexts (we could say that
there are multiple "resourcepart slots").
For the JIDs of connected resources (user@domain/foo), I tend to agree.
For the JIDs of chatroom participants, the precis-nickname spec says to
use NFKC, which handles width mapping as part of normalization (and thus
might be taken to violate the proposed MUST NOT approach).
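The interaction between NFKC and width mapping can be seen with a short
Python sketch using the standard unicodedata module (the helper name is
hypothetical):

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Fullwidth "St" becomes ASCII "St" as a side effect of NFKC.
    print(nfkc("\uff33\uff54"))
    # NFKC also folds other compatibility characters, e.g. the "fi" ligature.
    print(nfkc("\ufb01"))
```

In other words, any slot that applies NFKC gets width mapping whether or
not a separate width mapping rule exists, and gets further compatibility
mappings besides.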
I haven't yet taken the time to find and think about other resourcepart
slots in various XMPP extensions, but I hesitate to make a categorical
statement in 6122bis since the applicability of width mapping might
depend on the context in which a resourcepart is used.
> 2. Map any instances of non-ASCII space to ASCII space (U+0020).
(precis)
I was hoping either the framework doc or the mappings doc would tell me
more about which characters to map here. RFC 3454 had table C.1.2, but I
don't see any hints about what I'm supposed to do now.
Good catch.
Is the rule "has a
compatibility mapping to U+0020"?
BTW I count at least three kinds of compatibility mapping to 0020:
<compat> (as in U+0384 GREEK TONOS), <noBreak> (as in U+2007 FIGURE
SPACE), and <wide> (as in U+3000 IDEOGRAPHIC SPACE).
That doesn't hit U+200B which is in
C.1.2,
Right. I am not sure whether ZERO WIDTH SPACE really ought to be mapped
to U+0020. See Florian's comment later in this thread.
nor does "has category Zs".
IMHO that is insufficient.
My intuition is that by "non-ASCII space" we mean anything that has a
compatibility mapping of any kind to U+0020, since that seems safest (it
casts a wider net) and is something we can apply in a programmatic way.
However, my intuitions are not always correct and applying this rule
would result in a larger table than what we find in Appendix C.1.2
of RFC 3454.
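The candidate definitions can be compared programmatically. The Python
sketch below tests two of them against the standard unicodedata module:
"compatibility decomposition is exactly U+0020" and "general category Zs".
The function names are assumptions, and the results depend on the Unicode
version shipped with the interpreter.

```python
import unicodedata


def maps_to_space(ch: str) -> bool:
    """True if ch has a tagged decomposition consisting solely of U+0020."""
    parts = unicodedata.decomposition(ch).split()
    return len(parts) == 2 and parts[0].startswith("<") and parts[1] == "0020"


def is_zs(ch: str) -> bool:
    """True if ch has general category Zs (Space_Separator)."""
    return unicodedata.category(ch) == "Zs"


if __name__ == "__main__":
    for cp in (0x00A0, 0x0384, 0x1680, 0x2007, 0x200B, 0x3000):
        ch = chr(cp)
        print(f"U+{cp:04X}", repr(unicodedata.decomposition(ch)),
              maps_to_space(ch), is_zs(ch))
```

Note that the decomposition of U+0384 is U+0020 followed by U+0301, so
whether it counts depends on how the rule is worded; U+1680 is in Zs but
has no decomposition; and U+200B satisfies neither test.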
draft-ietf-precis-mappings says
"Therefore, the special mapping table should be based on a well-
defined mapping table for each protocol", which although I don't
particularly like, I can live with - but we need the table here.
Do you feel that we need the table in 6122bis or in the framework? As
you say, the mappings document implies that each specification that
defines a rule like "map non-ASCII space to ASCII space" needs to define
its own table, but that seems like a recipe for trouble. If, say, SASL
and XMPP and LDAP each defines a different table, authentication might
become confusing (especially since XMPP uses SASL and authentication
might be based on an LDAP lookup).
> 3. So-called additional mappings MAY be applied, such as mapping of
> characters that are similar to common delimiters (such as '@',
> ':', '/', '+', '-', and '.', e.g., mapping of IDEOGRAPHIC FULL
> STOP (U+3002) to FULL STOP (U+002E)) and special handling of
> certain characters or classes of characters (e.g., mapping of
> non-ASCII spaces to ASCII space); the PRECIS mappings document
> [I-D.ietf-precis-mappings] describes such mappings in more
> detail.
>
> 4. Uppercase and titlecase characters MAY be mapped to their
> lowercase equivalents, preferably using Unicode Default Case
> Folding as defined in Chapter 3 of the Unicode Standard
> [UNICODE].
Again, I need more about the MAY here.
> 6. IANA Considerations
>
> The following completed templates provide the information necessary
> for the IANA to add 'JIDlocalIdentifierClass' and
> 'JIDresourceFreeformClass' to the PRECIS Profiles Registry.
Should we also ask them to mark nodeprep and resourceprep as
deprecated in the stringprep profiles registry?
Yes.
> Appendix A. Differences from RFC 6122
>
> Based on consensus derived from working group discussion,
> implementation and deployment experience, and formal interoperability
> testing, the following substantive modifications were made from RFC
> 6122.
I think it might be nice to point out that this may have made
previously-valid JIDs no longer valid (or vice-versa), and that we suggest
careful testing before migrating user data.
+1 to at least that text. Ideally we'd perform the kind of analysis that
Takahiro Nemoto performed for SASLprep vs. SASLprepbis:
http://www.ietf.org/mail-archive/web/precis/current/msg00790.html
I haven't done that yet, though.
Thanks again to you and Florian for your careful reviews.
Peter