My apologies for the delayed reply. Comments inline.
On 10/1/15 7:50 AM, John C Klensin wrote:
--On Wednesday, September 30, 2015 15:16 -0600 Peter Saint-Andre
- &yet <[email protected]> wrote:
Hi Tom, thanks for the note.
My feeling is that we phrased things in a slightly wrong way,
because we assumed that case-mapping applies primarily or only
to uppercase and titlecase characters. I think this was more a
matter of communication (because people think of case mapping
as something needed only with respect to uppercase
characters), whereas it obviously applies more generally
(i.e., applying Unicode Default Case Folding will result in
mapping of the code points you mention here).
We could do something like this in the nickname spec...
...
NEW
3. Case Mapping Rule: Unicode Default Case Folding MUST be
applied, as defined in the Unicode Standard [Unicode] (at the
time of this writing, the algorithm is specified in Chapter 3
of [Unicode7.0]). The primary result of doing so is that
uppercase characters are mapped to lowercase characters. In
applications that prohibit conflicting nicknames, this rule
helps to reduce the possibility of confusion by ensuring that
nicknames differing only by case (e.g., "stpeter" vs.
"StPeter") would not be presented to a human user at the same
time.
...
Peter,
While your proposed text is an improvement,
Happy to hear it. All I intended was a slight clarification.
the desire of many
people for a magic "just tell me what to do" formula, one that
lets them avoid understanding the issues, may call for a little
more:
There is always a need for more when it comes to i18n.
(1) First, toCaseFold is _not_ toLowerCase. Saying "The primary
result of doing so is that uppercase characters are mapped to
lowercase characters" is true for toCaseFold,
By "primary" I meant two things: (1) lowercasing is what happens to the
preponderance of code points and (2) this is the result that most people
care about.
but it has other
effects that are information-losing and may be counterintuitive
in some locales and situations.
Indeed.
(2) Second, probably as a result of having IDNA in the lead,
we've gotten sloppy about language and operations and should
probably start untangling that before it gets people in trouble.
Where is the right place to do that untangling? (I doubt that it is the
precis-nickname document.)
The Unicode Standard, at least as I understand it, is fairly
clear that the most important (and really only safe) use of
toCaseFold is as part of a comparison operation.
Thanks for noting that. For example, Section 5.18 of Unicode 8.0.0 says:
Caseless matching is implemented using case folding, which is the
process of mapping characters of different case to a single form, so
that case differences in strings are erased. Case folding allows for
fast caseless matches in lookups because only binary comparison is
required. It is more than just conversion to lowercase.
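To make the distinction concrete, here is a minimal sketch in Python. It assumes that Python's built-in str.casefold() is an acceptable stand-in for toCaseFold (it implements full case folding for whatever Unicode version the interpreter ships, which may not be 7.0 or 8.0). The helper name caseless_equal is made up for illustration.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Caseless match: fold both sides, then do a plain binary comparison."""
    return a.casefold() == b.casefold()


# For Basic Latin, folding and lowercasing give the same result.
print(caseless_equal("stpeter", "StPeter"))        # True
print("StPeter".lower() == "StPeter".casefold())   # True

# Outside Basic Latin they can differ: U+00DF is already lowercase,
# so lower() leaves it alone, but case folding maps it to "ss".
print("ß".lower() == "ß")        # True
print("ß".casefold() == "ss")    # True
```

Note that both strings are folded only for the comparison; neither input is replaced by its folded form.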
Using your
example it is entirely reasonable to treat, "stpeter" and
"StPeter" as equivalent in a comparison operation, but accepting
one string and changing it to the other for display may not be a
really good idea. While that transformation may be acceptable
(although I would be surprised if there were no people who share
your surname who could consider "stpeter" or "Stpeter"
unacceptable and might even believe that "StPeter" is an
unacceptable substitute for "St. Peter"),
I do receive email at [email protected] intended for [email protected]
but that's a separate topic...
it also points out the
dangers of using Basic Latin script examples to illustrate
situations in which even more extended Latin script, much less
other scripts, may raise more complex issues. IDNA is
essentially a workaround, because changing the DNS comparison
rules was impractical for several reasons, so we ended up using
toCaseFold to map characters and strings into others in IDNA2003
but PRECIS implementations that do not have the same constraints
would, in general, be better off confining the use of
toCaseFold, or even toLowerCase, to comparison operations.
Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does it make
sense for this nickname specification to differ in this respect from the
published RFCs? Shall we file errata against those documents? (This
might apply only to RFC 7613, which says to apply case folding as part
of the enforcement process - when exactly to apply case folding is not
stipulated by RFC 7564.)
(3) Because toCaseFold loses information when used for more than
comparison (for comparison, it merely contributes to what some
people would consider false positives for matching), involves
some controversial decisions, and, because of stability
requirements, cannot be changed even if the controversies are
resolved in other ways, we end up with, e.g.,
toCaseFold ("Nuß") -> "nuss"
which is considered an acceptable transformation in some places
that identify themselves as speaking/using German and two
different unacceptable errors in others. Again, this will
almost always be much more serious if the transformation is used
to map and replace strings than if it is used to compare (fwiw,
that particular example is part of a continuing disagreement
between IDNA2008 and, among others, German domain registry
authorities on one side and UTC and UTR 46 on the other).
Agreed.
(4) If the motivation is really to avoid confusion, the correct
confusion-blocking rule for Latin script (but not others) and
many languages that use it (but certainly not all) involves
moving beyond toCaseFold and treating all "decorated" characters
(characters normally represented by glyphs consisting of a Basic
Latin character and one or more diacritical or equivalent
markings) as equal to their base characters, e.g., "á" not
only matches "Á" but also "a" and "A" and, as an unfortunate
side-effect, maybe "À" and "à" as well. This is bad news for
languages in which decorated Latin characters are used to
represent phonetically and conceptually different characters,
not just pronunciation variations. I am not qualified to
evaluate "how bad". In addition, extrapolations from this
principle about Latin script to unrelated scripts will almost
certainly lead to serious errors and/or additional confusion.
I would not be comfortable going that far...
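For readers who want to see what the rule in (4) would mean in practice, here is a minimal sketch. It is an assumption-laden illustration, not a proposal: it approximates "decorated" characters by decomposing to NFD and dropping combining marks, which only covers characters that have a canonical decomposition, and the function name base_key is made up.

```python
import unicodedata


def base_key(s: str) -> str:
    """Comparison key that ignores case and combining diacritical marks."""
    decomposed = unicodedata.normalize("NFD", s)
    # combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(base_key("á") == base_key("Á"))   # True
print(base_key("á") == base_key("A"))   # True
print(base_key("À") == base_key("á"))   # True (the side effect noted above)
print(base_key("ö") == base_key("o"))   # True, even where these are distinct letters
```

The last line shows the problem for languages that treat the decorated form as a separate letter: the key cannot tell them apart.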
More on this and Tom's question below...
On 9/29/15 3:28 PM, Tom Worster wrote:
Peter, Alexey,
I think there is an ambiguity in the specification of case
mapping in RFC 7613 and draft-ietf-precis-nickname-19.
...
But there are 55 code points in Unicode 7.0.0 that change
under default case folding that are neither uppercase nor
titlecase characters, 12 of which are Lowercase_Letter. I
suspect this stems from a confusion between Unicode case
mapping and case folding.
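A quick way to look for such code points is sketched below. The caveats matter: it uses the Unicode version bundled with the running Python interpreter rather than 7.0.0, and str.casefold() performs full case folding, so the totals it reports should not be expected to match the figures above. The function name is hypothetical.

```python
import sys
import unicodedata
from collections import Counter


def folded_but_not_upper_or_title() -> list[tuple[int, str]]:
    """List (code point, category) pairs that change under case folding
    even though their General_Category is neither Lu nor Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in ("Lu", "Lt"):
            continue
        if ch.casefold() != ch:
            hits.append((cp, category))
    return hits


hits = folded_but_not_upper_or_title()
print(unicodedata.unidata_version, len(hits))
print(Counter(category for _, category in hits))
```

The list includes U+03D0 (Ll), U+2160 (Nl) and U+24B6 (So), matching the kinds of entries quoted later in this message.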
Yes, I think so. See above, but, if I were making the rules, I
would say "never use toCaseFold where case mapping is intended
and, in particular, where one wants to substitute one string for
another rather than checking a pair of strings for equivalence
or perhaps telling users what would be considered equivalent".
That interpretation is, I believe, consistent with most of the
Unicode FAQ text you have quoted and other Unicode statements.
However I have lost that argument before and hope, given
decisions that have been made and deployed, that I was wrong.
But there is another issue...
...
The nickname profile can be corrected or the algorithm
clarified. I'm not sure what to do with a Proposed Standard
RFC. Errata? Can the case mapping rule be changed in IANA?
https://www.iana.org/assignments/precis-parameters/profiles/UsernameCaseMapped.txt
e.g. to "Apply Unicode default case folding"
Almost certainly not... an "update" revision of the spec would
be needed.
Yes, RFC 7613 would need to be updated via a separate spec or 7613bis.
At least a few of the characters you questioned raise another
issue:
...
Ll; 03D0; C; 03B2; # GREEK BETA SYMBOL
Ll; 03D1; C; 03B8; # GREEK THETA SYMBOL
Ll; 03D5; C; 03C6; # GREEK PHI SYMBOL
Ll; 03D6; C; 03C0; # GREEK PI SYMBOL
Ll; 03F0; C; 03BA; # GREEK KAPPA SYMBOL
Ll; 03F1; C; 03C1; # GREEK RHO SYMBOL
Ll; 03F5; C; 03B5; # GREEK LUNATE EPSILON SYMBOL
...
Ll; 1FBE; C; 03B9; # GREEK PROSGEGRAMMENI
Nl; 2160; C; 2170; # ROMAN NUMERAL ONE
(etc)
So; 24B6; C; 24D0; # CIRCLED LATIN CAPITAL LETTER A
So; 24B7; C; 24D1; # CIRCLED LATIN CAPITAL LETTER B
...
Those examples, and others, are, independent of their Unicode
categories, not characters used in writing "words" of normal
languages. Most of them are inherently confusable with the
similar-looking letters, e.g., U+2160 and U+2170 with upper and
lower-case "I" (and "i"), respectively, or U+03D0 and its
relationship to "β". The latter also raises the
now-purely-academic question of whether a "variant letterform",
such as U+03D0, violates the Unicode principle that different
code points are not assigned to different glyph forms of the
same letter, but those kinds of questions are another thing that
makes these discussions difficult, especially for those who
don't want to get involved with even script-specific or
locale-specific details. To the extent possible, we dealt
with such characters in IDNA2008 by identifying them as
DISALLOWED, but PRECIS permits enough additional flexibility to,
as you have noticed, allow people who don't understand what they
are doing (or who are trying to avoid that necessity) to get
themselves and their users into a lot of trouble.
This is certainly the case for FreeformClass in PRECIS. I would hope
that we took a safer path with the IdentifierClass (e.g., U+03D0 is
disallowed there).
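One way to spot-check this is the HasCompat test from RFC 7564, which flags a code point when NFKC normalization changes it. The sketch below checks only that one property against the Unicode data bundled with the running Python; it is not a full IdentifierClass derivation, and the function name has_compat is just for illustration.

```python
import unicodedata


def has_compat(ch: str) -> bool:
    """True if NFKC normalization changes the character (toNFKC(cp) != cp)."""
    return unicodedata.normalize("NFKC", ch) != ch


for cp in (0x03D0, 0x2160, 0x24B6, 0x03B2):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {has_compat(ch)}")
```

U+03D0, U+2160 and U+24B6 all report True here, while the ordinary letter U+03B2 reports False.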
Fewer easy answers here than one would like and would expect in
some alternate and easier reality.
Always. :(
After all that, I have 3 questions:
(1) Is my proposed text enough of a clarification that we should make
that change before the nickname I-D is published as an RFC?
(2) Should we modify draft-ietf-precis-nickname so that case folding is
applied only as part of comparison and not as part of enforcement? If
so, should we make that change before this document is published as an RFC?
(3) Should we update RFC 7613 so that case folding is applied only as
part of comparison and not as part of enforcement?
Peter