On 10/27/15 11:32 AM, John C Klensin wrote:
Response to Monday's note immediately below; response to today's
follows it. My apologies, but it is probably important to read
both. My further apologies for the length of this note, but I
think we are in deep trouble here,
Internationalization always seems to be a matter of how deep the trouble
is...
trouble that is aggravated by
precis-mappings and precis-nickname both being post-approval and
that, as far as I know, there are no future plans for PRECIS
work (having precis-nickname in AUTH48 just emphasizes that --
see comment at end).
We had not planned to work on PRECIS because we thought we were done for
a while. If that's not the case and we need to fix things, then so be it.
Whether there is sufficient and continued energy for such work is
another question. Personally I don't want us to have broken RFCs out
there.
--On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
&yet <[email protected]> wrote:
My apologies for the delayed reply. Comments inline.
A few remarks below... I can't tell whether we disagree or
whether at least one of us, probably me, is not being
adequately clear. (Material on which we fairly clearly agree
elided.)
On 10/1/15 7:50 AM, John C Klensin wrote:
--On Wednesday, September 30, 2015 15:16 -0600 Peter
Saint-Andre - &yet <[email protected]> wrote:
...
Peter,
While your proposed text is an improvement,
Happy to hear it. All I intended was a slight clarification.
But I'm not certain we are there yet...
Agreed. The text I proposed addressed only a very small part of the
problem.
the desire of many
people for a magic "just tell me what to do" formula, one that
lets them avoid understanding the issues, may call for a
little more:
There is always a need for more when it comes to i18n.
But I think it is a little more than that. I've heard several
times, including in PRECIS meetings, requests for "just tell me
what to do and make sure it isn't complicated" (or "I don't want
to have to think about, much less understand, the issues"). We
can debate whether giving in to those requests in the I18n case
is wise. I think it leads directly to conclusions equivalent to
"I understand my own script and writing system (or think I do)
and therefore, since all writing systems must be pretty much the
same, I understand all of the core issues in terms of my script
and understanding". That, in turn, leads directly to the "how
do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
be treated as equivalent" discussions that sounded like they
dominated a BOF at IETF 93.
Now I actually think it is reasonable for someone to ask for a
library that will do the job most of the time and that will
almost never cause their users or customers to get angry at
them. But, if we are going to call what we do "standards", they
should contain sufficient information that would-be library
authors can know what to do ... or understand that they are in
over their heads. And, for these particular cases, we may need
to explain, or help the library authors explain, why some cases
will fail and, indeed, get users mad at vendors.
(1) First, toCaseFold is _not_ toLowerCase. Saying "The
primary result of doing so is that uppercase characters are
mapped to lowercase characters" is true for toCaseFold,
By "primary" I meant two things: (1) lowercasing is what
happens to the preponderance of code points and (2) this is
the result that most people care about.
If I parse the above correctly, I think you are wrong. I think
what most people want, care about, and think they are getting,
is lower case conversion, i.e., an operation that preserves
lower case characters and converts upper case characters to the
equivalent lower case. toCaseFold isn't that operation. It is
a much more complex and subtle operation that, as well as
converting upper case characters to lower case, sometimes
converts lower case characters to different lower case
characters (or strings of them). It also requires a fairly good
understanding of Unicode (not just a relevant script) and
historical Unicode decisions to predict its behavior and to have
any hope of explaining that behavior to users. If one is
trying to compare (as distinct from converting), then toCaseFold
may be exactly what is wanted, but it is really hard to explain
or justify that in terms of "nicknames" or "aliases", which are
about conversion. And, if one hopes to explain what is going
on to users in terms of "lower casing", then toCaseFold is just
the wrong operation. That is what toLowerCase is for and the
two operations are just not equivalent.
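To make the difference concrete, here is a minimal Python sketch. It assumes that Python's str.lower() and str.casefold() are acceptable stand-ins for Unicode toLowerCase and toCaseFold (full mappings); the three sample characters are chosen for illustration and are not taken from the PRECIS documents. Each is already a lower case character, so lower casing leaves it alone, while case folding replaces it.

```python
# Sketch only: str.lower() and str.casefold() stand in for Unicode
# toLowerCase and toCaseFold. Results depend on the Unicode version
# bundled with the Python runtime.

SAMPLES = {
    "\u00b5": "MICRO SIGN",
    "\u017f": "LATIN SMALL LETTER LONG S",
    "\u0149": "LATIN SMALL LETTER N PRECEDED BY APOSTROPHE",
}


def compare_mappings(text):
    """Return the (lower-cased, case-folded) forms of a string."""
    return text.lower(), text.casefold()


for ch, name in SAMPLES.items():
    lowered, folded = compare_mappings(ch)
    changed = "changed" if folded != ch else "unchanged"
    print(f"U+{ord(ch):04X} {name}: lower={lowered!r} fold={folded!r} ({changed})")
```

All three are preserved by lower casing; case folding maps the first to GREEK SMALL LETTER MU, the second to "s", and the third to a two-character string.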
My recollection, quite possibly inaccurate or incomplete, from at least
one and I think several in-person meetings of the PRECIS WG was: just
use Unicode Default Case Folding because if you use anything else or try
to roll your own you will be fubar forever. I do not recall any
discussion of the issues you have raised in this thread (e.g., about the
inadvisability of using case folding for anything but comparison
operations) until the last few weeks. However, I freely admit that's
probably because, through my own faults and ignorance, I didn't
understand what you were saying.
FWIW and purely by coincidence wrt PRECIS and this document, I
had a conversation a few days ago with an expert on Arabic (and
Persian) calligraphy and writing systems (and good general
knowledge of writing systems) who is quite insistent that any
procedure we use for case-insensitive matching (e.g., case
folding) is discriminatory, inconsistent, and just
badly-thought-out if that same procedure doesn't treat isolated,
initial, and medial forms of the same character as equivalent.
He further strengthens his case (sic) by noting that Unicode
case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
unambiguously a lower-case character) to U+03C3 (GREEK SMALL
LETTER SIGMA), a relationship that depends entirely on
positional use and not case. He also believes the same
relationships should apply to all other scripts that make form
distinctions for some characters based on positions in a string
and for which Unicode has chosen to assign different code
points. Even if there were wide acceptance of his view, Unicode
stability principles would prevent changing toCaseFold (or
CaseFolding.txt), but this is more evidence that what toCaseFold
does and does not do is going to be hard to explain to either
casual users or to writing system experts whose primary
experience is not with the Greek-Latin-Cyrillic group.
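A small Python sketch of the asymmetry he is pointing at. The Arabic presentation-form code points (U+FE91, U+FE92) are my choice of illustration, not his, and str.casefold() is assumed to stand in for toCaseFold. Case folding unifies the two Greek sigmas but leaves the positional forms of BEH distinct; it takes a compatibility normalization (NFKC), a different operation entirely, to unify those.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"         # GREEK SMALL LETTER SIGMA
BEH_INITIAL = "\ufe91"   # ARABIC LETTER BEH INITIAL FORM (illustrative choice)
BEH_MEDIAL = "\ufe92"    # ARABIC LETTER BEH MEDIAL FORM (illustrative choice)


def fold_equal(a, b):
    """Compare two strings after case folding only."""
    return a.casefold() == b.casefold()


def nfkc_equal(a, b):
    """Compare two strings after NFKC normalization only."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print("sigma forms, case folding:", fold_equal(FINAL_SIGMA, SIGMA))
print("beh forms, case folding:  ", fold_equal(BEH_INITIAL, BEH_MEDIAL))
print("beh forms, NFKC:          ", nfkc_equal(BEH_INITIAL, BEH_MEDIAL))
```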
I don't think we want to say "these matching rules are somewhat
arbitrary and irrational, but, if you don't like it, blame
Unicode and not us", if only because it is our choice to use
those matching rules. More below.
...
(2) Second, probably as a result of having IDNA in the lead,
we've gotten sloppy about language and operations and should
probably start untangling that before it gets people in
trouble.
Where is the right place to do that untangling? (I doubt that
it is the precis-nickname document.)
I agree that precis-nickname isn't the ideal place. I also
believe that you and it are the innocent victims of the
situation. At the same time, I don't believe IETF should be
producing incomplete, ambiguous, erroneous, or misleading
standards because no one could get around to doing the right
foundational work.
Agreed. I too want to get this right, even though it's not a lot of fun
and it's certainly more work than I thought I was signing up for at the
NEWPREP BoF years ago.
The Unicode Standard, at least as I understand it, is fairly
clear that the most important (and really only safe) use of
toCaseFold is as part of a comparison operation.
Thanks for noting that. For example, Section 5.18 of Unicode
8.0.0 says:
Caseless matching is implemented using case folding, which is the
process of mapping characters of different case to a single form, so
that case differences in strings are erased. Case folding allows for
fast caseless matches in lookups because only binary comparison is
required. It is more than just conversion to lowercase.
Right. But, again, when its use is appropriate (a very
controversial topic in itself with our painful IDNA history with
Final Sigma, Eszett and the case-independent versus
position-independent controversy called out above as examples)
that is "matches in lookups" (what I've described elsewhere as
"comparison only"). Not creating or defining nicknames or
aliases. And that _is_ a problem for this document.
I'm not convinced that things are as bad as you think. If we say in
draft-ietf-precis-nickname that the case mapping rule is to be applied
only as part of comparison and not as part of enforcement - which I
think is really what we care about (e.g., to prevent spoofing of users
in chat rooms) - then I think we might be most of the way there.
Using your
example it is entirely reasonable to treat, "stpeter" and
"StPeter" as equivalent in a comparison operation, but
accepting one string and changing it to the other for display
may not be a really good idea. While that transformation may
be acceptable (although I would be surprised if there were no
people who share your surname who could consider "stpeter" or
"Stpeter" unacceptable and might even believe that "StPeter"
is an unacceptable substitute for "St. Peter"),
I do receive email at [email protected] intended for
[email protected] but that's a separate topic...
One that is relevant because it "works" as a side-effect of a
decision Google has made about mailbox name equivalence, a
decision that, IMO, will sooner or later get someone into a lot
of trouble and, more important, a decision and matching rule
that PRECIS, AFAICT, does not allow and that IDNA unambiguously
forbids.
it also points out the
dangers of using Basic Latin script examples to illustrate
situations in which even more extended Latin script, much less
other scripts, may raise more complex issues. Because IDNA
is essentially a workaround because changing the DNS
comparison rules was impractical for several reasons, we
ended up using toCaseFold to map characters and strings into
others in IDNA2003 but PRECIS implementations that do not
have the same constraints would, in general, be better off
confining the use of toCaseFold, or even toLowerCase, to
comparison operations.
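A rough sketch of what "confine case folding to comparison" could look like in an implementation. The NicknameRegistry class and its method names are hypothetical, str.casefold() is assumed to stand in for toCaseFold, and the other PRECIS rules (normalization, space handling, directionality) are deliberately left out, so this is not the nickname profile itself.

```python
class NicknameRegistry:
    """Hypothetical registry: stores nicknames as entered and folds
    case only to build a comparison key, never to rewrite the string."""

    def __init__(self):
        self._by_key = {}

    @staticmethod
    def _comparison_key(nickname):
        # Assumption: case folding alone; a real profile would also
        # apply its normalization and other mapping rules here.
        return nickname.casefold()

    def claim(self, nickname):
        """Register a nickname unless an equivalent one is already taken."""
        key = self._comparison_key(nickname)
        if key in self._by_key:
            return False
        self._by_key[key] = nickname  # keep the user's own spelling
        return True

    def display_form(self, nickname):
        """Return the stored spelling that matches, or None."""
        return self._by_key.get(self._comparison_key(nickname))


registry = NicknameRegistry()
print(registry.claim("StPeter"))         # first claim succeeds
print(registry.claim("stpeter"))         # equivalent under comparison, refused
print(registry.display_form("STPETER"))  # original spelling is preserved
```

The point of the design is that the folded string never leaves the comparison step, so nothing the user typed is replaced for display.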
Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
it make sense for this nickname specification to differ in
this respect from the published RFCs? Shall we file errata
against those documents? (This might apply only to RFC 7613,
which says to apply case folding as part of the enforcement
process - when exactly to apply case folding is not stipulated
by RFC 7564.)
To the extent to which this is a "botched that because the WG
didn't understand the issues well enough" conclusion, it would
be entirely reasonable to generate an updating RFC that repairs
7613 and/or 7564, even doing so in an addendum to
precis-nickname if that is the only way to do that
expeditiously. Per the above, we really don't want to give
library routine writers bad instructions. As I understand it,
the current position of the RFC Editor and IESG is that
technical specification errors discovered in retrospect or after
people start using a spec are not appropriate topics for errata.
If the WG is not willing to do any of those things, then I
suggest that precis-nickname at least needs to contain a very
clear warning notice about this situation (see my response to
your question 1 below).
I think we'll probably need to fix 7613 and 7564. I am hoping we can fix
nickname now so that it is less incorrect than the other two. That
doesn't necessarily mean we won't need to also further fix nickname
later on.
Granted, we were supposed to avoid this problem by working on all of the
PRECIS specs simultaneously. Clearly we have not avoided the problem, so
we need to solve it one way or another. If that means bis for them all,
we need to deal with it.
(3) Because toCaseFold loses information when used for more
than comparison (for comparison, it merely contributes to
what some people would consider false positives for matching),
involves some controversial decisions and, because of
stability requirements, cannot be changed even if the
controversies are resolved in other ways, we end up with,
e.g.,
toCaseFold ("Nuß") -> "nuss"
which is considered an acceptable transformation in some
places that identify themselves as speaking/using German and
two different unacceptable errors in others. Again, this will
almost always be much more serious if the transformation is
used to map and replace strings than if it is used to compare
(fwiw, that particular example is part of a continuing
disagreement between IDNA2008 and, among others, German
domain registry authorities on one side and UTC and UTR 46 on
the other).
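The information loss is easy to see in a few lines of Python, again assuming str.casefold() and str.lower() stand in for toCaseFold and toLowerCase. The helper name comparison_key is hypothetical.

```python
def comparison_key(text):
    """Case-fold a string for matching purposes only."""
    return text.casefold()


SPELLINGS = ["Nuß", "Nuss", "NUSS", "nuss"]

keys = {spelling: comparison_key(spelling) for spelling in SPELLINGS}
for spelling, key in keys.items():
    print(f"{spelling!r} -> {key!r}")

# Lower-casing, by contrast, leaves the sharp s alone.
print("Nuß".lower())
```

All four spellings collapse to the same key. Used for comparison, that is a (possibly false) match; used as a replacement string, there is no way to get back from "nuss" to what the user wrote.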
Agreed.
See "warning notice" comment above and question 1 response below.
(4) If the motivation is really to avoid confusion, the
correct confusion-blocking rule for Latin script (but not
others) and many languages that use it (but certainly not
all) involves moving beyond toCaseFold and having all
"decorated" characters (characters normally represented by
glyphs consisting of a Basic Latin character and one or more
diacritical or equivalent markings) compare equal to their
base characters, e.g., "á" not only matches "Á" but also
"a" and "A" and, as an unfortunate side-effect, maybe "À"
and "à" as well. This is bad news for languages in which
decorated Latin characters are used to represent phonetically
and conceptually different characters, not just pronunciation
variations. I am not qualified to evaluate "how bad". In
addition, extrapolations from this principle about Latin
script to unrelated scripts will almost certainly lead to
serious errors and/or additional confusion.
I would not be comfortable going that far...
In case it isn't clear, I would not be either. But it is where
getting sloppy about this stuff could easily take us. It is
worth noting that it also identifies one of the difficulties
with doing a global system to be applied to many types of
applications (like the PRECIS work) and then applying it in user
interface software that end users will expect to be localized to
their assumptions because it has been mapped or translated into
their language. If one normally speaks Upper Slobbovian but has
some familiarity with English, an application interface in
English will probably be expected to be "foreign", odd, and
maybe even inconsistent with whatever expectations exist. But,
if the interface is in Upper Slobbovian, the natural and
reasonable assumption will be that the matching should conform to
normal Upper Slobbovian conventions. FWIW, a matching rule
that says:
(i) Two instances of a base character with the same
diacritical mark(s) match.
(ii) Two instances of a base character with different
diacritical mark(s) do not match.
(iii) Two instances of a base character, one with
diacritical mark(s) and the other without any decoration
match.
is precisely correct and normal behavior for at least one
language that uses Latin script. It is also the normal practice
for at least one Latin script transcription system that is used
by a large fraction of a billion people (maybe more).
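Rules (i) through (iii) can be sketched per character as follows. This rests on an assumption of mine: that "base character with diacritical marks" can be approximated by canonical decomposition (NFD) into a base plus combining marks. That does not hold for every decorated Latin letter (some, such as "ø", do not decompose), and case is ignored here.

```python
import unicodedata


def base_and_marks(ch):
    """Split one character into (base, tuple of combining marks) via NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = tuple(c for c in decomposed[1:] if unicodedata.combining(c))
    return base, marks


def chars_match(a, b):
    """Apply rules (i)-(iii) to two single characters."""
    base_a, marks_a = base_and_marks(a)
    base_b, marks_b = base_and_marks(b)
    if base_a != base_b:
        return False
    if marks_a == marks_b:
        return True                      # rule (i), or both undecorated
    return not marks_a or not marks_b    # rule (iii); otherwise rule (ii)


print(chars_match("á", "á"))  # same mark
print(chars_match("á", "à"))  # different marks
print(chars_match("á", "a"))  # decorated vs. undecorated
print(chars_match("a", "à"))  # undecorated vs. decorated
```

Note that the rule is not transitive: "á" matches "a" and "a" matches "à", yet "á" does not match "à". So it cannot be implemented by mapping every string to a single canonical form and comparing, which is the model that case folding (and enforcement by mapping) assumes.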
That is indeed sobering.
More on this and Tom's question below...
On 9/29/15 3:28 PM, Tom Worster wrote:
Peter, Alexey,
I think there is an ambiguity in the specification of case
mapping in RFC 7613 and draft-ietf-precis-nickname-19.
...
But there are 55 code points in Unicode 7.0.0 that change
under default case folding that are neither uppercase nor
titlecase characters, 12 of which are Lowercase_Letter. I
suspect this stems from a confusion between Unicode case
mapping and case folding.
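A count of this kind can be reproduced with a short scan. This is a sketch under assumptions: it uses Python's str.casefold() (full case folding) and reads "neither uppercase nor titlecase" as General_Category not in {Lu, Lt}. I do not know which definitions Tom used, and the totals also depend on the Unicode version bundled with the Python runtime, so the numbers it prints need not be 55 and 12.

```python
import sys
import unicodedata


def folded_but_not_upper_or_title():
    """Code points changed by case folding whose category is not Lu or Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            hits.append(ch)
    return hits


hits = folded_but_not_upper_or_title()
lowercase_letters = [ch for ch in hits if unicodedata.category(ch) == "Ll"]

print("Unicode version:", unicodedata.unidata_version)
print("changed by folding, not Lu/Lt:", len(hits))
print("of which Lowercase_Letter:", len(lowercase_letters))
```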
In the context of the above, a different way to say the same
thing is that people are looking at toCaseFold and assuming (and
explaining things in terms of) toLowerCase. toCaseFold works
the way it is expected to and those 55 code points are, more or
less, collateral damage to get to a matching algorithm that
favors false positives over false negatives and various edge
cases (including in "edge cases" languages spoken by, and script
variations used by, millions of people).
Sadly I suspect that is an accurate description of the current state of
affairs (modulo my comment above about PRECIS WG discussions at one or
more IETF meetings).
...
After all that, I have 3 questions:
Personal opinions about answers...
(1) Is my proposed text enough of a clarification that we
should make that change before the nickname I-D is published
as an RFC?
I think the clarification is an improvement and is important
enough to incorporate (I know that is the answer to a slightly
different question).
However, I think it is inadequate without a serious warning
about the situation.
Yes.
That warning could appear in either this
document or RFC 7613 (or 7613bis) with a pointer from the other,
but, unless you want to revise 7613 now, this one is handy.
I suspect that we need to revise 7613. I suspect that we might also need
to revise 7564 (at least with respect to the order in which operations
are applied, since there has been some confusion among implementers).
Well, we always knew that we would need to revise them. Just not so soon.
Comment about possible text below.
(2) Should we modify draft-ietf-precis-nickname so that case
folding is applied only as part of comparison and not as part
of enforcement? If so, should we make that change before this
document is published as an RFC?
Yes. If something is used for "enforcement", it should be lower
casing or something else that can be explained to people who are
ordinarily familiar with one or more of the scripts that make
case distinctions.
However, viewed in the light of this discussion, the whole
"enforcement" concept becomes a little dicey, especially if, as
I believe but don't have time to verify, the transformations
performed by toLowerCase are not a proper subset of those
performed by toCaseFold.
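That belief can be checked mechanically. The sketch below, assuming Python's str.lower() and str.casefold() stand in for toLowerCase and toCaseFold, lists code points that lower casing changes but that case folding maps to something other than the lower-cased result. The function name is hypothetical and the exact list depends on the Unicode version in the runtime.

```python
import sys


def lower_and_fold_disagree():
    """Code points where lower() changes the character and
    casefold() yields a different result than lower()."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        lowered = ch.lower()
        if lowered != ch and ch.casefold() != lowered:
            found.append(ch)
    return found


disagreements = lower_and_fold_disagree()
print(len(disagreements), "code points disagree")
for ch in disagreements[:10]:
    print(f"U+{ord(ch):04X}: lower={ch.lower()!r} fold={ch.casefold()!r}")
```

One member of the list is U+1E9E (LATIN CAPITAL LETTER SHARP S), which lower-cases to "ß" but folds to "ss". So an enforcement step defined as lower casing and a comparison step defined as case folding really are different operations, not one nested inside the other.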
My initial thought is that case mapping doesn't belong in the nickname
enforcement operation at all - only in the comparison operation.
(3) Should we update RFC 7613 so that case folding is applied
only as part of comparison and not as part of enforcement?
I think that is necessary. Following up on the comment above, I
would prefer that the current Section 3.2.2 (3) of RFC 7613
either point to Unicode Lower Casing or contain a warning along
the lines of that below.
Unlike the nickname profile (which I think can be cleaned up by moving
the case mapping rule to the comparison operation and continuing to use
Unicode Default Case Folding), I think you are right that for the
UsernameCaseMapped profile we probably want Unicode Lower Casing. Thus
the likely need, sooner rather than later, for 7613bis.
----------
--On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
<[email protected]> wrote:
This issue has greater urgency now because
draft-ietf-precis-nickname is now in AUTH48...
On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
After all that, I have 3 questions:
(1) Is my proposed text enough of a clarification that we
should make that change before the nickname I-D is published
as an RFC?
I think so.
See above.
(2) Should we modify draft-ietf-precis-nickname so that case
folding is applied only as part of comparison and not as part
of enforcement? If so, should we make that change before this
document is published as an RFC?
Although it seems to be the case that Unicode case folding is
primarily designed for the purpose of matching (i.e.,
comparison),
"Seems" is a little weak. The Unicode Standard is really quite
specific about that.
I have a concern that applying the PRECIS case
mapping rule after applying the normalization and
directionality rules might have unintended consequences that
we haven't had a chance to consider yet. The PRECIS framework
expresses a preference (actually a hard requirement) for
applying the rules in a particular order. We made a late
change to the username profiles (RFC 7613), such that width
mapping is applied first (in order to accommodate fullwidth
and halfwidth characters in certain East Asian scripts).
Making a late change to the nickname profile also concerns me,
even though both of these late changes seem reasonable on the
face of it. I will try to find time to think about this
further in the next 24 hours.
First, a hint for the consideration process: there is a reason
why Unicode now supports a unified case folding and
normalization operation. My recollection is that it is not only
more efficient to perform both operations at once (rather than
looking in one table and then the other), but that there are
some order-dependent or priority-dependent cases.
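One kind of order dependence is easy to demonstrate; it may or may not be the kind I am recalling. The sketch assumes Python's str.casefold() for toCaseFold and uses NFC only to keep it short. I take the unified operation to be what the Unicode Standard calls toNFKC_Casefold, which this sketch does not implement. The two sample characters are chosen for illustration.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def normalize_then_fold(text):
    """Normalize first, case-fold second."""
    return nfc(text).casefold()


def fold_then_normalize(text):
    """Case-fold first, normalize second."""
    return nfc(text.casefold())


def code_points(text):
    return " ".join(f"U+{ord(c):04X}" for c in text)


# Lower case letters whose full case folding is a decomposed sequence.
for ch in ("\u01f0", "\u0390"):
    first = normalize_then_fold(ch)
    second = fold_then_normalize(ch)
    print(f"{code_points(ch)}: normalize-then-fold = {code_points(first)}; "
          f"fold-then-normalize = {code_points(second)}")
```

For both characters, folding after normalizing produces a decomposed sequence that is no longer in NFC, while folding first and normalizing afterwards recomposes it. Two implementations that apply the same two rules in a different order would therefore store different strings.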
The very fact that this issue exists (and is coming up again)
this late in the process (7613 published in August, WG winding
down and not, e.g., meeting next week) calls at least the PRECIS
quality of review and some fairly fundamental model issues into
question. I first raised that issue a rather long time ago but
have continued to hope that we have an approximation to "good
enough" without going back and rethinking everything.
The right solution, IMO, is that, if RFC 7613 is to rationalize
or explain the operation in terms of converting upper case
characters to lower case, then it should be using toLowerCase
because that is what the operation does. After a quick look at
7613, amending/updating it to simply convert to lower case would
be straightforward (and would not raise the ordering issue
called out above). It would presumably require another IETF
Last Call, however, and I'd hope we would see some serious
discussion within the WG (and with UTC) before making the change
and about how it is explained.
If we are not willing to make a change
I'm willing. It would, as you note, require some careful thinking and
review to make sure that we got it (more) right this time.
that significant and/or
if we conclude that the WG (and perhaps the IETF) have
completely run out of energy for dealing with i18n issues [1],
then I suggest that we introduce some additional text. I've
just spent a half-hour trying to find the AUTH48 copy of
precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
apparently changed naming conventions and the various queue
entry pages all point to the -19 I-D and not the current working
copy so I can't try to match text and insertion point to what is
there already.
http://www.rfc-editor.org/authors/rfc7700.txt
The suggestion is a patch (and a hack), not a
good fix but something like it is probably the least drastic
measure that would yield something that doesn't contain
unexplained known defects.
Rough version of suggested text (possibly to go after your
revised paragraph and following up my comments in my 1 October
note). Some of the terminology needs checking which I can do if
you want to go this route:
'Users of this specification should note that the
concept of "lower case conversion" is somewhat elusive
and more dependent on the conventions of different
languages and notation systems that use the same script
than may appear obvious at first glance, especially if
that glance is at Basic Latin characters (i.e., the
ASCII letter repertoire). Unicode provides two
different mapping procedures that produce lower-case
characters, but they have different effects and results
for many characters. The more conservative one,
typically appropriately applicable when lower case forms
are needed, is actual lower-casing (embodied in the
Unicode operation toLowerCase). A more radical
operation, normally suitable only for string matching in
situations in which it is better to consider uncertain
cases as matching than to treat them as distinct, is
called "Case Folding" (Unicode operation toCaseFold).
While the two operations will often produce the same
results, Case Folding maps some lower case characters
into others and performs other transformations that may
be intuitively reasonable and expected for some users
and quite astonishing (or just wrong) to others. There
may be no practical alternative, especially if the
operations are to be used for mapping or enforcement, to
developers of PRECIS-dependent applications understanding that the
cases in which the two yield different results require
careful understanding of the relevant user base and its
needs [2].'
Thanks.
I am not sure if we need something like that if we move case mapping
(here, case folding) to the comparison operation only - but something
like that might still be appropriate.
(3) Should we update RFC 7613 so that case folding is applied
only as part of comparison and not as part of enforcement?
That is less urgent so I suggest that we address the nickname
spec first.
Unless you (or someone else here) have a plausible plan to
continue and revitalize the WG and assign it that revision work
(and bring everyone actively participating up to the level
needed to easily understand this discussion thread and feel
embarrassed for not spotting the problems), I think we need to
assume that this is our last shot. Absent an active and
committed WG, "do this first" could easily be equivalent to
"don't get around to the other, ever".
As mentioned, I don't want to have broken RFCs out there.
I think that the particular set of issues that started this
thread is a known defect in the PRECIS specs, both nickname and
7613, and that we are obligated to either fix the problems or at
least explain them. The above warning text is an attempt to
explain and identify the problems even if it does not actually
provide a solution. If it were published as part of
precis-nickname, it could include a statement to the effect that
it should also be treated as an update to 7613 or, if the IESG
and RFC Editor would agree in advance to accept, rather than
bury, the thing, I suppose we could publish it in
precis-nickname and create an erratum to 7613 indicating that it
should have included some form of that statement. Neither
option implies a huge amount of work to update 7613. But I
think that making the changes of (2) without doing anything
about (3) makes the two documents inconsistent with each other
and that would be an additional known defect.
Procedural question: given that precis-nickname is in AUTH48 as
of yesterday and I don't see anything blocking publication next
week if you and Barry sign off on the revised text that the WG
hasn't seen,
There is no revised text yet. That's why we're having this discussion.
does someone need to file a pro forma objection/
appeal to block that until this is sorted out and the WG has a
chance to review proposed publication text?
I see no reason to invoke the specter of appeals quite yet. Everyone is
working in good faith to do the right thing and get this mess cleaned up.
[1] I believe our collective inability to deal with the
within-script character forms that do not normalize to each
other because of language-dependent or other usage factors can
be taken as evidence of having run out of energy,
Or in my case simple ignorance of some of the relevant issues and
examples. It's not easy to know about all of this.
but it is
probably in the interest of finishing the PRECIS work to try to
treat that as a separate issue.
Probably.
[2] Not unlike the reason to differentiate between NFC and NFKC
and understand the effects of each.
Another thing that's not easy to grok in fulness.
Peter