Re: [precis] Ambiguity in specification of case mapping in RFC 7613 and draft-ietf-precis-nickname

Tom Worster Tue, 03 Nov 2015 05:17:59 -0800

Peter,

This is much better and, as far as the normative language goes, seems
correct and unambiguous.


The informative sentence in 2.1 Rule 3:

    "The primary result of doing so is that uppercase characters are
    mapped to lowercase characters."

is good but I think it's worth spending a few more words to spell out that
"primary" implies exceptions.

    "While the primary result is that uppercase characters are mapped to
    lowercase characters, there are exceptions."

It might nudge a few more implementers to understand that toLowerCase()
isn't the right thing.
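
To make the exceptions concrete, here is a minimal Python sketch. It
assumes that Python's built-in str.lower() and str.casefold() are
reasonable stand-ins for Unicode toLowerCase and toCaseFold (full case
folding); the helper name is made up for this illustration.

```python
def fold_differs_from_lower(s: str) -> bool:
    """Return True when case folding and lowercasing disagree for s."""
    return s.casefold() != s.lower()


# Plain ASCII: both operations give "stpeter", so they agree.
ascii_disagrees = fold_differs_from_lower("StPeter")

# U+00DF LATIN SMALL LETTER SHARP S is already lowercase, so lower()
# leaves it alone, while case folding rewrites it to "ss".
sharp_s_disagrees = fold_differs_from_lower("\u00df")

# U+017F LATIN SMALL LETTER LONG S is lowercase too, yet it folds to "s".
long_s_disagrees = fold_differs_from_lower("\u017f")

print(ascii_disagrees, sharp_s_disagrees, long_s_disagrees)
```

An implementation that calls a lowercasing routine would leave the last
two inputs untouched, which is exactly the misunderstanding the extra
sentence is meant to head off.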

Tom


On 11/2/15, 11:42 PM, "Peter Saint-Andre" <[email protected]> wrote:

>For ease of reviewing only, and with no presumption that these proposed
>changes have been accepted by anyone, I have asked the RFC Editor to
>provisionally update the document in AUTH48 as outlined in the messages
>I have sent to the list. The resulting file is here:
>
>http://www.rfc-editor.org/authors/rfc7700.txt
>
>Despite those caveats, if at all possible I would prefer to find an
>acceptable solution for publishing this RFC now without undue further
>delays (in part because draft-ietf-simple-chat has been held on this
>document for almost 3 years!). That does not mean, as I said earlier in
>this thread, that I want to have broken RFCs out there, but I think we
>can fix this one acceptably now and then update it again in strict
>coherence with updates to RFC 7564 and RFC 7613. I am committed to
>getting things right, but I am also committed to not holding up other
>people's work for years and years on end.
>
>Peter
>
>On 10/28/15 3:52 PM, Peter Saint-Andre wrote:
>> Example 7 needs to be corrected, too, in accordance with
>> CaseFolding.txt.
>>
>> On 10/28/15 2:54 PM, Peter Saint-Andre wrote:
>>> And here is another correction in Section 3...
>>>
>>> OLD
>>>
>>>     Regarding examples 5, 6, and 7: applying Unicode Default Case
>>>     Folding to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK
>>>     SMALL LETTER SIGMA (U+03C3), and doing so during comparison would
>>>     result in matching the nicknames in examples 5 and 6; however,
>>>     because the PRECIS mapping rules do not account for the special
>>>     status of GREEK SMALL LETTER FINAL SIGMA (U+03C2), the nicknames
>>>     in examples 5 and 7 or examples 6 and 7 would not be matched.
>>>
>>> NEW
>>>
>>>     Regarding examples 5, 6, and 7: applying Unicode Default Case
>>>     Folding to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK
>>>     SMALL LETTER SIGMA (U+03C3), and the same is true of GREEK SMALL
>>>     LETTER FINAL SIGMA (U+03C2); therefore, the comparison operation
>>>     defined in Section 2.4 would result in matching of the nicknames
>>>     in examples 5, 6, and 7.
>>>
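
The corrected statement about the three sigma code points can be checked
with a short Python sketch. It assumes str.casefold() implements Unicode
Default Case Folding; the constant names are only for this example.

```python
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA

forms = [CAPITAL_SIGMA, SMALL_SIGMA, FINAL_SIGMA]

# Case folding sends all three code points to U+03C3.
folded_forms = {s.casefold() for s in forms}

# Lowercasing keeps final sigma distinct from small sigma.
lowered_forms = {s.lower() for s in forms}

print(len(folded_forms), len(lowered_forms))
```

Under case folding the set collapses to a single member, which is why all
three nicknames match; under lowercasing two distinct members remain.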
>>> On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
>>>> I propose the following text changes:
>>>>
>>>> ###
>>>>
>>>> OLD
>>>>
>>>>     3.  Case Mapping Rule: Uppercase and titlecase characters MUST be
>>>>         mapped to their lowercase equivalents using Unicode Default
>>>>         Case Folding as defined in the Unicode Standard [Unicode] (at
>>>>         the time of this writing, the algorithm is specified in
>>>>         Chapter 3 of [Unicode7.0]).  In applications that prohibit
>>>>         conflicting nicknames, this rule helps to reduce the
>>>>         possibility of confusion by ensuring that nicknames differing
>>>>         only by case (e.g., "stpeter" vs. "StPeter") would not be
>>>>         presented to a human user at the same time.
>>>>
>>>> NEW
>>>>
>>>>     3.  Case Mapping Rule: Unicode Default Case Folding MUST be
>>>>         applied, as defined in the Unicode Standard [Unicode] (at the
>>>>         time of this writing, the algorithm is specified in Chapter 3
>>>>         of [Unicode7.0]).  The primary result of doing so is that
>>>>         uppercase characters are mapped to lowercase characters.  In
>>>>         applications that prohibit conflicting nicknames, this rule
>>>>         helps to reduce the possibility of confusion by ensuring that
>>>>         nicknames differing only by case (e.g., "stpeter" vs.
>>>>         "StPeter") would not be presented to a human user at the same
>>>>         time.
>>>>
>>>> ###
>>>>
>>>> (The foregoing was previously sent to the list.)
>>>>
>>>> ###
>>>>
>>>> OLD
>>>>
>>>> 2.3.  Enforcement
>>>>
>>>>     An entity that performs enforcement according to this profile MUST
>>>>     prepare a string as described in Section 2.2 and MUST also apply
>>>>     the rules specified in Section 2.1.  The rules MUST be applied in
>>>>     the order shown.
>>>>
>>>>     After all of the foregoing rules have been enforced, the entity
>>>>     MUST ensure that the nickname is not zero bytes in length (this is
>>>>     done after enforcing the rules to prevent applications from
>>>>     mistakenly omitting a nickname entirely, because when
>>>>     internationalized characters are accepted, a non-empty sequence of
>>>>     characters can result in a zero-length nickname after
>>>>     canonicalization).
>>>>
>>>> 2.4.  Comparison
>>>>
>>>>     An entity that performs comparison of two strings according to
>>>>     this profile MUST prepare each string and enforce the rules as
>>>>     specified in Sections 2.2 and 2.3.  The two strings are to be
>>>>     considered equivalent if they are an exact octet-for-octet match
>>>>     (sometimes called "bit-string identity").
>>>>
>>>> NEW
>>>>
>>>> 2.3.  Enforcement
>>>>
>>>>     An entity that performs enforcement according to this profile MUST
>>>>     prepare a string as described in Section 2.2 and MUST also apply
>>>>     the following rules specified in Section 2.1 in the order shown:
>>>>
>>>>     1. Additional Mapping Rule
>>>>     2. Normalization Rule
>>>>     3. Directionality Rule
>>>>
>>>>     After all of the foregoing rules have been enforced, the entity
>>>>     MUST ensure that the nickname is not zero bytes in length (this is
>>>>     done after enforcing the rules to prevent applications from
>>>>     mistakenly omitting a nickname entirely, because when
>>>>     internationalized characters are accepted, a non-empty sequence of
>>>>     characters can result in a zero-length nickname after
>>>>     canonicalization).
>>>>
>>>> 2.4.  Comparison
>>>>
>>>>     An entity that performs comparison of two strings according to
>>>>     this profile MUST prepare each string as specified in Section 2.2
>>>>     and MUST apply the following rules specified in Section 2.1 in
>>>>     the order shown:
>>>>
>>>>     1. Additional Mapping Rule
>>>>     2. Case Mapping Rule
>>>>     3. Normalization Rule
>>>>     4. Directionality Rule
>>>>
>>>>     The two strings are to be considered equivalent if they are an
>>>>     exact octet-for-octet match (sometimes called "bit-string
>>>>     identity").
>>>>
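
The proposed comparison operation can be sketched roughly as follows.
This is a simplification under stated assumptions, not the profile
itself: the Section 2.2 preparation step and the Directionality Rule are
left out, the Additional Mapping Rule is assumed to mean "map
space-separator characters to U+0020, trim, and collapse runs of spaces",
the Normalization Rule is assumed to be NFKC, and UTF-8 is assumed as the
encoding for the octet comparison. All function names are hypothetical.

```python
import unicodedata


def additional_mapping(s: str) -> str:
    # Assumed rule: treat every space separator (category Zs) as U+0020,
    # then drop leading/trailing spaces and collapse interior runs.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in s
    )
    return " ".join(part for part in mapped.split(" ") if part)


def comparison_key(s: str) -> bytes:
    s = additional_mapping(s)              # 1. Additional Mapping Rule
    s = s.casefold()                       # 2. Case Mapping Rule
    s = unicodedata.normalize("NFKC", s)   # 3. Normalization Rule (assumed)
    return s.encode("utf-8")               # octets used for the final test


def nicknames_match(a: str, b: str) -> bool:
    # "Exact octet-for-octet match" of the two processed strings.
    return comparison_key(a) == comparison_key(b)


print(nicknames_match("StPeter", "stpeter"))
print(nicknames_match("\u03a3", "\u03c2"))
```

Note that case folding appears only in comparison_key(); an enforcement
routine built from the proposed Section 2.3 list would skip step 2 and so
would preserve the case the user typed.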
>>>> ###
>>>>
>>>> In addition, some variation on John's proposed text about toLowerCase
>>>> vs. toCaseFold might be appropriate at the end of Section 4; however,
>>>> I'm still not sure that is necessary if we move the case mapping rule
>>>> to the comparison operation.
>>>>
>>>> Peter
>>>>
>>>> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>>>>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>>>>> Response to Monday's note immediately below; response to today's
>>>>>> follows it.  My apologies, but it is probably important to read
>>>>>> both.  My further apologies for the length of this note, but I
>>>>>> think we are in deep trouble here,
>>>>>
>>>>> Internationalization always seems to be a matter of how deep the
>>>>> trouble
>>>>> is...
>>>>>
>>>>>> trouble that is aggravated by
>>>>>> precis-mappings and precis-nickname both being post-approval and
>>>>>> that, as far as I know, there are no future plans for PRECIS
>>>>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>>>>> see comment at end).
>>>>>
>>>>> We had not planned to work on PRECIS because we thought we were done
>>>>> for
>>>>> awhile. If that's not the case and we need to fix things, then so be
>>>>> it.
>>>>> Whether there is sufficient and continued energy for such work is
>>>>> another question. Personally I don't want us to have broken RFCs out
>>>>> there.
>>>>>
>>>>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>>>>> &yet <[email protected]> wrote:
>>>>>>
>>>>>>> My apologies for the delayed reply. Comments inline.
>>>>>>
>>>>>> A few remarks below... I can't tell whether we disagree or
>>>>>> whether at least one of us, probably me, is not being
>>>>>> adequately clear.  (Material on which we fairly clearly agree
>>>>>> elided.)
>>>>>>
>>>>>>
>>>>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>>>>> Saint-Andre - &yet <[email protected]> wrote:
>>>>>>> ...
>>>>>>>> Peter,
>>>>>>>>
>>>>>>>> While your proposed text is an improvement,
>>>>>>>
>>>>>>> Happy to hear it. All I intended was a slight clarification.
>>>>>>
>>>>>> But I'm not certain we are there yet...
>>>>>
>>>>> Agreed. The text I proposed addressed only a very small part of the
>>>>> problem.
>>>>>
>>>>>>>> the desire of many
>>>>>>>> people for a magic "just tell me what to do" formula, one that
>>>>>>>> lets them avoid understanding the issues, may call for a
>>>>>>>> little more:
>>>>>>>
>>>>>>> There is always a need for more when it comes to i18n.
>>>>>>
>>>>>> But I think it is a little more than that.  I've heard several
>>>>>> times, including in PRECIS meetings, requests for "just tell me
>>>>>> what to do and make sure it isn't complicated" (or "I don't want
>>>>>> to have to think about, much less understand, the issues").  We
>>>>>> can debate whether giving in to those requests in the I18n case
>>>>>> is wise.  I think it leads directly to conclusions equivalent to
>>>>>> "I understand my own script and writing system (or think I do)
>>>>>> and therefore, since all writing systems must be pretty much the
>>>>>> same, I understand all of the core issues in terms of my script
>>>>>> and understanding".   That, in turn, leads directly to the "how
>>>>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>>>>> be treated as equivalent" discussions that sounded like they
>>>>>> dominated a BOF at IETF 93.
>>>>>>
>>>>>> Now I actually think it is reasonable for someone to ask for a
>>>>>> library that will do the job most of the time and that will
>>>>>> almost never cause their users or customers to get angry at
>>>>>> them.  But, if we are going to call what we do "standards", they
>>>>>> should contain sufficient information that would-be library
>>>>>> authors can know what to do ... or understand that they are in
>>>>>> over their heads.  And, for these particular cases, we may need
>>>>>> to explain, or help the library authors explain, why some cases
>>>>>> will fail and, indeed, get users mad at vendors.
>>>>>>
>>>>>>
>>>>>>>> (1) First, toCaseFold is _not_ toLowerCase.  Saying "The
>>>>>>>> primary result of doing so is that uppercase characters are
>>>>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>>>>
>>>>>>> By "primary" I meant two things: (1) lowercasing is what
>>>>>>> happens to the preponderance of code points and (2) this is
>>>>>>> the result that most people care about.
>>>>>>
>>>>>> If I parse the above correctly, I think you are wrong.   I think
>>>>>> what most people want, care about, and think they are getting,
>>>>>> is lower case conversion, i.e., an operation that preserves
>>>>>> lower case characters and converts upper case characters to the
>>>>>> equivalent lower case.  toCaseFold isn't that operation.  It is
>>>>>> a much more complex and subtle operation that, as well as
>>>>>> converting upper case characters to lower case, sometimes
>>>>>> converts lower case characters to different lower case
>>>>>> characters (or strings of them).  It also requires a fairly good
>>>>>> understanding of Unicode (not just a relevant script) and
>>>>>> historical Unicode decisions to predict its behavior and to have
>>>>>> any hope of explaining that behavior to users.   If one is
>>>>>> trying to compare (as distinct from converting), then toCaseFold
>>>>>> may be exactly what is wanted, but it is really hard to explain
>>>>>> or justify that in terms of "nicknames" or "aliases", which are
>>>>>> about conversion.   And, if one hopes to explain what is going
>>>>>> on to users in terms of "lower casing", then toCaseFold is just
>>>>>> the wrong operation.  That is what toLowerCase is for and the
>>>>>> two operations are just not equivalent.
>>>>>
>>>>> My recollection, quite possibly inaccurate or incomplete, from at
>>>>>least
>>>>> one and I think several in-person meetings of the PRECIS WG was: just
>>>>> use Unicode Default Case Folding because if you use anything else or
>>>>> try
>>>>> to roll your own you will be fubar forever. I do not recall any
>>>>> discussion of the issues you have raised in this thread (e.g., about
>>>>> the
>>>>> inadvisability of using case folding for anything but comparison
>>>>> operations) until the last few weeks. However, I freely admit that's
>>>>> probably because, through my own faults and ignorance, I didn't
>>>>> understand what you were saying.
>>>>>
>>>>>> FWIW and purely by coincidence wrt PRECIS and this document, I
>>>>>> had a conversation a few days ago with an expert on Arabic (and
>>>>>> Persian) calligraphy and writing systems (and good general
>>>>>> knowledge of writing systems) who is quite insistent that any
>>>>>> procedure we use for case-insensitive matching (e.g., case
>>>>>> folding) is discriminatory, inconsistent, and just
>>>>>> badly-thought-out if that same procedure doesn't treat isolated,
>>>>>> initial, and medial forms of the same character as equivalent.
>>>>>> He further strengthens his case (sic) by noting that Unicode
>>>>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>>>>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>>>>>> LETTER SIGMA), a relationship that depends entirely on
>>>>>> positional use and not case.  He also believes the same
>>>>>> relationships should apply to all other scripts that make form
>>>>>> distinctions for some characters based on positions in a string
>>>>>> and for which Unicode has chosen to assign different code
>>>>>> points.  Even if there were wide acceptance of his view, Unicode
>>>>>> stability principles would prevent changing toCaseFold (or
>>>>>> CaseFolding.txt), but this is more evidence that what toCaseFold
>>>>>> does and does not do is going to be hard to explain to either
>>>>>> casual users or to writing system experts whose primary
>>>>>> experience is not with the Greek-Latin-Cyrillic group.
>>>>>>
>>>>>> I don't think we want to say "these matching rules are somewhat
>>>>>> arbitrary and irrational, but, if you don't like it, blame
>>>>>> Unicode and not us", if only because it is our choice to use
>>>>>> those matching rules.  More below.
>>>>>>
>>>>>>
>>>>>>> ...
>>>>>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>>>>>> we've gotten sloppy about language and operations and should
>>>>>>>> probably start untangling that before it gets people in
>>>>>>>> trouble.
>>>>>>>
>>>>>>> Where is the right place to do that untangling? (I doubt that
>>>>>>> it is the precis-nickname document.)
>>>>>>
>>>>>> I agree that precis-nickname isn't the ideal place.  I also
>>>>>> believe that you and it are the innocent victims of the
>>>>>> situation.  At the same time, I don't believe IETF should be
>>>>>> producing incomplete, ambiguous, erroneous, or misleading
>>>>>> standards because no one could get around to doing the right
>>>>>> foundational work.
>>>>>
>>>>> Agreed. I too want to get this right, even though it's not a lot of
>>>>>fun
>>>>> and it's certainly more work than I thought I was signing up for at
>>>>>the
>>>>> NEWPREP BoF years ago.
>>>>>
>>>>>>>> The Unicode Standard, at least as I understand it, is fairly
>>>>>>>> clear that the most important (and really only safe) use of
>>>>>>>> toCaseFold is as part of a comparison operation.
>>>>>>>
>>>>>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>>>>>> 8.0.0 says:
>>>>>>>
>>>>>>>      Caseless matching is implemented using case folding, which
>>>>>>> is the
>>>>>>>      process of mapping characters of different case to a
>>>>>>> single form, so
>>>>>>>      that case differences in strings are erased. Case folding
>>>>>>> allows for
>>>>>>>      fast caseless matches in lookups because only binary
>>>>>>> comparison is
>>>>>>>      required. It is more than just conversion to lowercase.
>>>>>>
>>>>>> Right.  But, again, when its use is appropriate (a very
>>>>>> controversial topic in itself with our painful IDNA history with
>>>>>> Final Sigma, Eszett and the case-independent versus
>>>>>> position-independent controversy called out above as examples)
>>>>>> that is "matches in lookups" (what I've described elsewhere as
>>>>>> "comparison only").  Not creating or defining nicknames or
>>>>>> aliases.  And that _is_ a problem for this document.
>>>>>
>>>>> I'm not convinced that things are as bad as you think. If we say in
>>>>> draft-ietf-precis-nickname that the case mapping rule is to be
>>>>>applied
>>>>> only as part of comparison and not as part of enforcement - which I
>>>>> think is really what we care about (e.g., to prevent spoofing of
>>>>>users
>>>>> in chat rooms) - then I think we might be most of the way there.
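
The split between enforcement and comparison suggested here might look
like the following Python sketch. It is only an illustration: the class
and method names are invented, and str.casefold() alone stands in for the
full set of comparison rules.

```python
from typing import Dict, Optional


class NicknameRegistry:
    """Keeps each nickname as entered; uses a folded key only to compare."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def register(self, nickname: str) -> bool:
        key = nickname.casefold()      # comparison form, never displayed
        if key in self._by_key:
            return False               # conflicts with an existing nickname
        self._by_key[key] = nickname   # stored form keeps the original case
        return True

    def display_name(self, nickname: str) -> Optional[str]:
        # Look up by comparison key, return the form the owner chose.
        return self._by_key.get(nickname.casefold())


room = NicknameRegistry()
print(room.register("StPeter"))
print(room.register("stpeter"))
print(room.display_name("STPETER"))
```

The second registration is refused because the folded keys collide, which
is the spoofing protection, while the displayed nickname is never
rewritten.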
>>>>>
>>>>>>>> Using your
>>>>>>>> example it is entirely reasonable to treat, "stpeter" and
>>>>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>>>>> accepting one string and changing it to the other for display
>>>>>>>> may not be a really good idea.  While that transformation may
>>>>>>>> be acceptable (although I would be surprised if there were no
>>>>>>>> people who share your surname who could consider "stpeter" or
>>>>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>>>>> is an unacceptable substitute for "St. Peter"),
>>>>>>>
>>>>>>> I do receive email at [email protected] intended for
>>>>>>> [email protected] but that's a separate topic...
>>>>>>
>>>>>> One that is relevant because it "works" as a side-effect of a
>>>>>> decision Google has made about mailbox name equivalence, a
>>>>>> decision that, IMO, will sooner or later get someone into a lot
>>>>>> of trouble and,  more important, a decision and matching rule
>>>>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>>>>> forbids.
>>>>>>
>>>>>>>> it also points out the
>>>>>>>> dangers of using Basic Latin script examples to illustrate
>>>>>>>> situations in which even more extended Latin script, much less
>>>>>>>> other scripts, may raise more complex issues.    Because IDNA
>>>>>>>> is essentially a workaround because changing the DNS
>>>>>>>> comparison rules was impractical for several reasons, we
>>>>>>>> ended up using toCaseFold to map characters and strings into
>>>>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>>>>> have the same constraints would, in general, be better off
>>>>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>>>>> comparison operations.
>>>>>>>
>>>>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>>>>> it make sense for this nickname specification to differ in
>>>>>>> this respect from the published RFCs? Shall we file errata
>>>>>>> against those documents? (This might apply only to RFC 7613,
>>>>>>> which says to apply case folding as part of the enforcement
>>>>>>> process - when exactly to apply case folding is not stipulated
>>>>>>> by RFC 7564.)
>>>>>>
>>>>>> To the extent to which this is a "botched that because the WG
>>>>>> didn't understand the issues well enough" conclusion, it would
>>>>>> be entirely reasonable to generate an updating RFC that repairs
>>>>>> 7613 and/or 7564, even doing so in an addendum to
>>>>>> precis-nickname if that is the only way to do that
>>>>>> expeditiously.  Per the above, we really don't want to give
>>>>>> library routine writers bad instructions.  As I understand it,
>>>>>> the current position of the RFC Editor and IESG is that
>>>>>> technical specification errors discovered in retrospect or after
>>>>>> people start using a spec are not appropriate topics for errata.
>>>>>> If the WG is not willing to do any of those things, then I
>>>>>> suggest that precis-nickname at least needs to contain a very
>>>>>> clear warning notice about this situation (see my response to
>>>>>> your question 1 below).
>>>>>
>>>>> I think we'll probably need to fix 7613 and 7564. I am hoping we can
>>>>> fix
>>>>> nickname now so that it is less incorrect than the other two. That
>>>>> doesn't necessarily mean we won't need to also further fix nickname
>>>>> later on.
>>>>>
>>>>> Granted, we were supposed to avoid this problem by working on all of
>>>>> the
>>>>> PRECIS specs simultaneously. Clearly we have not avoided the
>>>>> problem, so
>>>>> we need to solve it one way or another. If that means bis for them
>>>>>all,
>>>>> we need to deal with it.
>>>>>
>>>>>>>> (3) Because toCaseFold loses information when used for more
>>>>>>>> than comparison (for comparison, it merely contributes to
>>>>>>>> what some people would consider false positives for matching),
>>>>>>>> involves some controversial decisions, and, because of
>>>>>>>> stability requirements, cannot be changed even if the
>>>>>>>> controversies are resolved in other ways, we end up with,
>>>>>>>> e.g.,
>>>>>>>>       toCaseFold ("Nuß") -> "nuss"
>>>>>>>> which is considered an acceptable transformation in some
>>>>>>>> places that identify themselves as speaking/using German and
>>>>>>>> two different unacceptable errors in others.  Again, this will
>>>>>>>> almost always be much more serious if the transformation is
>>>>>>>> used to map and replace strings than if it is used to compare
>>>>>>>> (fwiw, that particular example is part of a continuing
>>>>>>>> disagreement between IDNA2008 and, among others, German
>>>>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>>>>> the other).
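
The information loss described in point (3) can be seen by grouping a few
spellings under their folded form. This is a small Python sketch that
assumes str.casefold() matches toCaseFold; the function name is made up.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_fold(strings: List[str]) -> Dict[str, List[str]]:
    """Group the inputs by their case-folded form."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for s in strings:
        groups[s.casefold()].append(s)
    return dict(groups)


spellings = ["Nu\u00df", "NUSS", "nuss", "Nuss"]  # "Nu\u00df" is Nuß
groups = group_by_fold(spellings)
print(groups)
```

All four spellings land under the single key "nuss". Used as a comparison
key that is a (possibly unwanted) match; used as a replacement string it
means the original spelling cannot be recovered.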
>>>>>>>
>>>>>>> Agreed.
>>>>>>
>>>>>> See "warning notice" comment above and question 1 response below.
>>>>>>
>>>>>>>> (4) If the motivation is really to avoid confusion, the
>>>>>>>> correct confusion-blocking rule for Latin script (but not
>>>>>>>> others) and many languages that use it (but certainly not
>>>>>>>> all) involves moving beyond toCaseFold and treating all
>>>>>>>> "decorated" characters (characters normally represented by
>>>>>>>> glyphs consisting of a Basic Latin character and one or more
>>>>>>>> diacritical or equivalent markings) as equal to their
>>>>>>>> base characters, e.g., "á" not only matches "Á" but also
>>>>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>>>>>> and "à" as well.  This is bad news for languages in which
>>>>>>>> decorated Latin characters are used to represent phonetically
>>>>>>>> and conceptually different characters, not just pronunciation
>>>>>>>> variations.  I am not qualified to evaluate "how bad".   In
>>>>>>>> addition, extrapolations from this principle about Latin
>>>>>>>> script to unrelated scripts will almost certainly lead to
>>>>>>>> serious errors and/or additional confusion.
>>>>>>>
>>>>>>> I would not be comfortable going that far...
>>>>>>
>>>>>> In case it isn't clear, I would not be either.  But it is where
>>>>>> getting sloppy about this stuff could easily take us.  It is
>>>>>> worth noting that it also identifies one of the difficulties
>>>>>> with doing a global system to be applied to many types of
>>>>>> applications (like the PRECIS work) and then applying it in user
>>>>>> interface software that end users will expect to be localized to
>>>>>> their assumptions because it has been mapped or translated into
>>>>>> their language (if one normally speaks Upper Slobbovian but has
>>>>>> some familiarity with English, an application interface in
>>>>>> English will probably be expected to be "foreign", odd, and
>>>>>> maybe even inconsistent with whatever expectations exist.  But,
>>>>>> if the interface is in Upper Slobbovian, the natural and
>>>>>> reasonable assumption will be that the matching should conform to
>>>>>> normal Upper Slobbovian conventions).  FWIW, a matching rule
>>>>>> that says:
>>>>>>
>>>>>>   (i) Two instances of a base character with the same
>>>>>>     diacritical mark(s) match.
>>>>>>   (ii) Two instances of a base character with different
>>>>>>     diacritical mark(s) do not match.
>>>>>>   (iii) Two instances of a base character, one with
>>>>>>     diacritical mark(s) and the other without any decoration
>>>>>>     match.
>>>>>>
>>>>>> is precisely correct and normal behavior for at least one
>>>>>> language that uses Latin script.  It is also the normal practice
>>>>>> for at least one Latin script transcription system that is used
>>>>>> by a large fraction of a billion people (maybe more).
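
Rules (i) to (iii) above can be sketched for single characters as
follows. The sketch assumes that "diacritical marks" can be approximated
by the combining marks exposed by canonical decomposition (NFD); the
function names are invented for the example.

```python
import unicodedata
from typing import Tuple


def split_marks(ch: str) -> Tuple[str, str]:
    """Split a character into (base, combining marks) via NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks


def chars_match(a: str, b: str) -> bool:
    base_a, marks_a = split_marks(a)
    base_b, marks_b = split_marks(b)
    if base_a != base_b:
        return False                 # different base characters never match
    if marks_a and marks_b:
        return marks_a == marks_b    # rules (i) and (ii)
    return True                      # rule (iii), or neither is decorated


print(chars_match("\u00e1", "\u00e1"))  # a-acute vs a-acute
print(chars_match("\u00e1", "\u00e0"))  # a-acute vs a-grave
print(chars_match("\u00e1", "a"))       # a-acute vs plain a
```

One consequence worth noting: the relation is not transitive (a-acute
matches "a", "a" matches a-grave, but a-acute does not match a-grave), so
it cannot be implemented by mapping each string to a canonical key and
comparing octets.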
>>>>>
>>>>> That is indeed sobering.
>>>>>
>>>>>>>> More on this and Tom's question below...
>>>>>>>>
>>>>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>>>>>> Peter, Alexey,
>>>>>>>>>>
>>>>>>>>>> I think there is an ambiguity in the specification of case
>>>>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>>>>>> ...
>>>>>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>>>>>> under default case folding that are neither uppercase nor
>>>>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>>>>>> suspect this stems from a confusion between Unicode case
>>>>>>>>>> mapping and case folding.
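
A count of this kind can be reproduced approximately with the Python
sketch below. The assumptions matter: it uses the Unicode version bundled
with the running interpreter (not necessarily 7.0.0), it treats
"uppercase or titlecase" as General_Category Lu or Lt, and str.casefold()
performs full case folding, so the totals may not equal the figures
quoted above. The function name is hypothetical.

```python
import sys
import unicodedata
from typing import List


def folds_without_being_upper_or_title() -> List[int]:
    """Code points changed by case folding that are neither Lu nor Lt."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            hits.append(cp)
    return hits


hits = folds_without_being_upper_or_title()
lowercase_hits = [cp for cp in hits if unicodedata.category(chr(cp)) == "Ll"]
print(len(hits), len(lowercase_hits), unicodedata.unidata_version)
```

Whatever the exact totals on a given interpreter, the list includes
lowercase letters such as U+00DF and U+03C2, which lowercasing would leave
untouched.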
>>>>>>
>>>>>> In the context of the above, a different way to say the same
>>>>>> thing is that people are looking at toCaseFold and assuming (and
>>>>>> explaining things in terms of) toLowerCase.  toCaseFold works
>>>>>> the way it is expected to and those 55 code points are, more or
>>>>>> less, collateral damage to get to a matching algorithm that
>>>>>> favors false positives over false negatives and various edge
>>>>>> cases (including in "edge cases" languages spoken by, and script
>>>>>> variations used by, millions of people).
>>>>>
>>>>> Sadly I suspect that is an accurate description of the current state
>>>>>of
>>>>> affairs (modulo my comment above about PRECIS WG discussions at one
>>>>>or
>>>>> more IETF meetings).
>>>>>
>>>>>>> ...
>>>>>>> After all that, I have 3 questions:
>>>>>>
>>>>>> Personal opinions about answers...
>>>>>>
>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>> should make that change before the nickname I-D is published
>>>>>>> as an RFC?
>>>>>>
>>>>>> I think the clarification is an improvement and is important
>>>>>> enough to incorporate (I know that is the answer to a slightly
>>>>>> different question).
>>>>>>
>>>>>> However, I think it is inadequate without a serious warning
>>>>>> about the situation.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>  That warning could appear in either this
>>>>>> document or RFC 7613 (or 7613bis) with a pointer from the other,
>>>>>> but, unless you want to revise 7613 now, this one is handy.
>>>>>
>>>>> I suspect that we need to revise 7613. I suspect that we might also
>>>>> need
>>>>> to revise 7564 (at least with respect to the order in which
>>>>>operations
>>>>> are applied, since there has been some confusion among implementers).
>>>>>
>>>>> Well, we always knew that we would need to revise them. Just not so
>>>>> soon.
>>>>>
>>>>>> Comment about possible text below.
>>>>>>
>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>> document is published as an RFC?
>>>>>>
>>>>>> Yes.  If something is used for "enforcement", it should be lower
>>>>>> casing or something else that can be explained to people who are
>>>>>> ordinarily familiar with one or more of the scripts that make
>>>>>> case distinctions.
>>>>>>
>>>>>> However, viewed in the light of this discussion, the whole
>>>>>> "enforcement" concept becomes a little dicey, especially if, as
>>>>>> I believe but don't have time to verify, the transformations
>>>>>> performed by toLowerCase are not a proper subset of those
>>>>>> performed by toCaseFold.
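
The belief stated above (that toLowerCase is not simply a subset of
toCaseFold) can be probed with a scan like the one below. It is a sketch
that assumes Python's str.lower() and str.casefold() track the Unicode
lowercase mapping and full case folding for the interpreter's Unicode
version; the function name is invented.

```python
import sys
from typing import List


def lowercase_mappings_not_matched_by_fold() -> List[int]:
    """Code points that lower() changes to something casefold() does not."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        lowered = ch.lower()
        if lowered != ch and lowered != ch.casefold():
            hits.append(cp)
    return hits


hits = lowercase_mappings_not_matched_by_fold()
print(len(hits))

# One instance: U+1E9E LATIN CAPITAL LETTER SHARP S lowercases to U+00DF
# but folds to "ss".
print("\u1e9e".lower() == "\u00df", "\u1e9e".casefold() == "ss")
```

A non-empty result means that an enforcement step based on lowercasing
and a comparison step based on case folding can produce different strings
for the same input, which is the concern raised here.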
>>>>>
>>>>> My initial thought is that case mapping doesn't belong in the
>>>>>nickname
>>>>> enforcement operation at all - only in the comparison operation.
>>>>>
>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>
>>>>>> I think that is necessary.  Following up on the comment above, I
>>>>>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>>>>>> either point to Unicode Lower Casing or contain a warning along
>>>>>> the lines of that below.
>>>>>
>>>>> Unlike the nickname profile (which I think can be cleaned up by
>>>>>moving
>>>>> the case mapping rule to the comparison operation and continuing to
>>>>>use
>>>>> Unicode Default Case Folding), I think you are right that for the
>>>>> UsernameCaseMapped profile we probably want Unicode Lower Casing.
>>>>>Thus
>>>>> the likely need, sooner rather than later, for 7613bis.
>>>>>
>>>>>>
>>>>>>     ----------
>>>>>>
>>>>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> This issue has greater urgency now because
>>>>>>> draft-ietf-precis-nickname is now in AUTH48...
>>>>>>>
>>>>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>>>>
>>>>>>>> After all that, I have 3 questions:
>>>>>>>>
>>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>>> should make that change before the nickname I-D is published
>>>>>>>> as an RFC?
>>>>>>>
>>>>>>> I think so.
>>>>>>
>>>>>> See above.
>>>>>>
>>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>>> document is published as an RFC?
>>>>>>>
>>>>>>> Although it seems to be the case that Unicode case folding is
>>>>>>> primarily designed for the purpose of matching (i.e.,
>>>>>>> comparison),
>>>>>>
>>>>>> "Seems" is a little weak.  The Unicode Standard is really quite
>>>>>> specific about that.
>>>>>>
>>>>>>> I have a concern that applying the PRECIS case
>>>>>>> mapping rule after applying the normalization and
>>>>>>> directionality rules might have unintended consequences that
>>>>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>>>>> expresses a preference (actually a hard requirement) for
>>>>>>> applying the rules in a particular order. We made a late
>>>>>>> change to the username profiles (RFC 7613), such that width
>>>>>>> mapping is applied first (in order to accommodate fullwidth
>>>>>>> and halfwidth characters in certain East Asian scripts).
>>>>>>> Making a late change to the nickname profile also concerns me,
>>>>>>> even though both of these late changes seem reasonable on the
>>>>>>> face of it. I will try to find time to think about this
>>>>>>> further in the next 24 hours.
>>>>>>
>>>>>> First, a hint for the consideration process: there is a reason
>>>>>> why Unicode now supports a unified case folding and
>>>>>> normalization operation.  My recollection is that it is not only
>>>>>> more efficient to perform both operations at once (rather than
>>>>>> looking in one table and then the other), but that there are
>>>>>> some order-dependent or priority-dependent cases.
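
Order dependence between case folding and normalization is easy to
demonstrate, although the case below is only an illustration chosen for
this sketch and not necessarily one of the cases recalled above. It
assumes NFKC as the normalization form and Python's str.casefold() as the
folding operation.

```python
import unicodedata


def fold_then_normalize(s: str) -> str:
    return unicodedata.normalize("NFKC", s.casefold())


def normalize_then_fold(s: str) -> str:
    return unicodedata.normalize("NFKC", s).casefold()


# U+1D400 MATHEMATICAL BOLD CAPITAL A has no case folding of its own,
# but its compatibility decomposition is the ASCII letter "A".
sample = "\U0001d400"

print(fold_then_normalize(sample))   # folding is a no-op, NFKC yields "A"
print(normalize_then_fold(sample))   # NFKC yields "A", folding yields "a"
```

With folding applied first, an uppercase letter survives the whole
pipeline, so a string containing U+1D400 would not match its plain "a"
counterpart; with normalization first it would.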
>>>>>>
>>>>>> The very fact that this issue exists (and is coming up again)
>>>>>> this late in the process (7613 published in August, WG winding
>>>>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>>>>> quality of review and some fairly fundamental model issues into
>>>>>> question.  I first raised that issue a rather long time ago but
>>>>>> have continued to hope that we have an approximation to "good
>>>>>> enough" without going back and rethinking everything.
>>>>>>
>>>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>>>> or explain the operation in terms of converting upper case
>>>>>> characters to lower case, then it should be using toLowerCase
>>>>>> because that is what the operation does.  After a quick look at
>>>>>> 7613, amending/updating it to simply convert to lower case would
>>>>>> be straightforward (and would not raise the ordering issue
>>>>>> called out above).  It would presumably require another IETF
>>>>>> Last Call, however, and I'd hope we would see some serious
>>>>>> discussion within the WG (and with UTC) before making the change
>>>>>> and about how it is explained.
>>>>>>
>>>>>> If we are not willing to make a change
>>>>>
>>>>> I'm willing. It would, as you note, require some careful thinking and
>>>>> review to make sure that we got it (more) right this time.
>>>>>
>>>>>> that significant and/or
>>>>>> if we conclude that the WG (and perhaps the IETF) have
>>>>>> completely run out of energy for dealing with i18n issues [1],
>>>>>> then I suggest that we introduce some additional text.  I've
>>>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>>>> apparently changed naming conventions and the various queue
>>>>>> entry pages all point to the -19 I-D and not the current working
>>>>>> copy so I can't try to match text and insertion point to what is
>>>>>> there already.
>>>>>
>>>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>>>
>>>>>>  The suggestion is a patch (and a hack), not a
>>>>>> good fix but something like it is probably the least drastic
>>>>>> measure that would yield something that doesn't contain
>>>>>> unexplained known defects.
>>>>>>
>>>>>> Rough version of suggested text (possibly to go after your
>>>>>> revised paragraph and following up my comments in my 1 October
>>>>>> note).  Some of the terminology needs checking which I can do if
>>>>>> you want to go this route:
>>>>>>
>>>>>>     'Users of this specification should note that the
>>>>>>     concept of "lower case conversion" is somewhat elusive
>>>>>>     and more dependent on the conventions of different
>>>>>>     languages and notation systems that use the same script
>>>>>>     than may appear obvious at first glance, especially if
>>>>>>     that glance is at Basic Latin characters (i.e., the
>>>>>>     ASCII letter repertoire).  Unicode provides two
>>>>>>     different mapping procedures that produce lower-case
>>>>>>     characters, but they have different effects and results
>>>>>>     for many characters.  The more conservative one,
>>>>>>     typically appropriately applicable when lower case forms
>>>>>>     are needed, is actual lower-casing (embodied in the
>>>>>>     Unicode operation toLowerCase).  A more radical
>>>>>>     operation, normally suitable only for string matching in
>>>>>>     situations in which it is better to consider uncertain
>>>>>>     cases as matching than to treat them as distinct, is
>>>>>>     called "Case Folding" (Unicode operation toCaseFold).
>>>>>>     While the two operations will often produce the same
>>>>>>     results, Case Folding maps some lower case characters
>>>>>>     into others and performs other transformations that may
>>>>>>     be intuitively reasonable and expected for some users
>>>>>>     and quite astonishing (or just wrong) to others.  There
>>>>>>     may be no practical alternative, especially if the
>>>>>>     operations are to be used for mapping or enforcement, to
>>>>>>     developers of PRECIS-dependent applications understanding
>>>>>>     that the cases in which the two yield different results
>>>>>>     require careful understanding of the relevant user base
>>>>>>     and its needs [2].'
>>>>>
>>>>> Thanks.
>>>>>
>>>>> I am not sure if we need something like that if we move case mapping
>>>>> (here, case folding) to the comparison operation only - but something
>>>>> like that might still be appropriate.
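The difference the suggested text describes between lower-casing and case folding can be made concrete with a minimal Python sketch. It assumes that Python's built-in str.lower() and str.casefold() are acceptable stand-ins for the Unicode toLowerCase and toCaseFold operations. The helper name compare_case_ops is hypothetical and does not come from any PRECIS document.

```python
def compare_case_ops(text):
    """Return the (lower-cased, case-folded) forms of text as a tuple."""
    return text.lower(), text.casefold()


# Each sample is paired with a note on why it is interesting.
samples = {
    "ABC": "Basic Latin: both operations agree",
    "\u00df": "LATIN SMALL LETTER SHARP S: already lower case, folds to 'ss'",
    "\u03c2": "GREEK SMALL LETTER FINAL SIGMA: folds to non-final sigma",
    "\ufb01": "LATIN SMALL LIGATURE FI: folds to the two letters 'fi'",
}

for text, note in samples.items():
    lowered, folded = compare_case_ops(text)
    verdict = "same" if lowered == folded else "DIFFERENT"
    print(f"{text!r}: lower={lowered!r} casefold={folded!r} [{verdict}] {note}")
```

The last three samples are already lower case, so lower-casing leaves them alone while case folding rewrites them. This is the "maps some lower case characters into others" behaviour that can surprise users when folding is used for enforcement rather than only for comparison.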
>>>>>
>>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>>
>>>>>>> That is less urgent so I suggest that we address the nickname
>>>>>>> spec first.
>>>>>>
>>>>>> Unless you (or someone else here) have a plausible plan to
>>>>>> continue and revitalize the WG and assign it that revision work
>>>>>> (and bring everyone actively participating up to the level
>>>>>> needed to easily understand this discussion thread and feel
>>>>>> embarrassed for not spotting the problems), I think we need to
>>>>>> assume that this is our last shot.  Absent an active and
>>>>>> committed WG, "do this first" could easily be equivalent to
>>>>>> "don't get around to the other, ever".
>>>>>
>>>>> As mentioned, I don't want to have broken RFCs out there.
>>>>>
>>>>>> I think that the particular set of issues that started this
>>>>>> thread is a known defect in the PRECIS specs, both nickname and
>>>>>> 7613, and that we are obligated to either fix the problems or at
>>>>>> least explain them.  The above warning text is an attempt to
>>>>>> explain and identify the problems even if it does not actually
>>>>>> provide a solution.  If it were published as part of
>>>>>> precis-nickname, it could include a statement to the effect that
>>>>>> it should also be treated as an update to 7613 or, if the IESG
>>>>>> and RFC Editor would agree in advance to accept, rather than
>>>>>> bury, the thing, I suppose we could publish it in
>>>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>>>> should have included some form of that statement.  Neither
>>>>>> option implies a huge amount of work to update 7613.  But I
>>>>>> think that making the changes of (2) without doing anything
>>>>>> about (3) makes the two documents inconsistent with each other
>>>>>> and that would be an additional known defect.
>>>>>>
>>>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>>>> of yesterday and I don't see anything blocking publication next
>>>>>> week if you and Barry sign off on the revised text that the WG
>>>>>> hasn't seen,
>>>>>
>>>>> There is no revised text yet. That's why we're having this
>>>>> discussion.
>>>>>
>>>>>> does someone need to file a pro forma objection/
>>>>>> appeal to block that until this is sorted out and the WG has a
>>>>>> chance to review proposed publication text?
>>>>>
>>>>> I see no reason to invoke the specter of appeals quite yet. Everyone
>>>>> is working in good faith to do the right thing and get this mess
>>>>> cleaned up.
>>>>>
>>>>>> [1] I believe our collective inability to deal with the
>>>>>> within-script character forms that do not normalize to each
>>>>>> other because of language-dependent or other usage factors can
>>>>>> be taken as evidence of having run out of energy,
>>>>>
>>>>> Or in my case simple ignorance of some of the relevant issues and
>>>>> examples. It's not easy to know about all of this.
>>>>>
>>>>>> but it is
>>>>>> probably in the interest of finishing the PRECIS work to try to
>>>>>> treat that as a separate issue.
>>>>>
>>>>> Probably.
>>>>>
>>>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>>>> and understand the effects of each.
>>>>>
>>>>> Another thing that's not easy to grok in fulness.
>>>>>
>>>>> Peter



_______________________________________________
precis mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/precis
