Peter,
This is much better and, as far as the normative language goes, seems
correct and unambiguous.
The informative sentence in 2.1 Rule 3:
"The primary result of doing so is that uppercase characters are
mapped to lowercase characters."
is good but I think it's worth spending a few more words to spell out that
"primary" implies exceptions.
"While the primary result is that uppercase characters are mapped to
lowercase characters, there are exceptions."
It might nudge a few more implementers to understand that toLowerCase()
isn't the right thing.
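
To make the distinction concrete, here is a minimal sketch in Python (chosen only for illustration; nothing in the draft mandates a language). It assumes that str.lower() is a fair stand-in for toLowerCase and that str.casefold() applies Unicode full case folding. The sample strings are illustrative choices.

```python
# str.lower() approximates toLowerCase; str.casefold() applies Unicode
# full case folding. They agree on plain ASCII but not everywhere.
SAMPLES = ["StPeter", "Nu\u00df", "\u03c2"]  # "Nuß" and GREEK SMALL LETTER FINAL SIGMA


def lower_vs_fold(text):
    """Return the (lowercased, case-folded) forms of text."""
    return text.lower(), text.casefold()


for sample in SAMPLES:
    lowered, folded = lower_vs_fold(sample)
    verdict = "same" if lowered == folded else "DIFFERENT"
    print(repr(sample), "lower:", repr(lowered), "casefold:", repr(folded), verdict)
```

The second and third samples are already lowercase, yet case folding still changes them, which is the kind of exception the word "primary" is hinting at.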
Tom
On 11/2/15, 11:42 PM, "Peter Saint-Andre" <[email protected]> wrote:
>For ease of reviewing only, and with no presumption that these proposed
>changes have been accepted by anyone, I have asked the RFC Editor to
>provisionally update the document in AUTH48 as outlined in the messages
>I have sent to the list. The resulting file is here:
>
>http://www.rfc-editor.org/authors/rfc7700.txt
>
>Despite those caveats, if at all possible I would prefer to find an
>acceptable solution for publishing this RFC now without undue further
>delays (in part because draft-ietf-simple-chat has been held on this
>document for almost 3 years!). That does not mean, as I said earlier in
>this thread, that I want to have broken RFCs out there, but I think we
>can fix this one acceptably now and then update it again in strict
>coherence with updates to RFC 7564 and RFC 7613. I am committed to
>getting things right, but I am also committed to not holding up other
>people's work for years and years on end.
>
>Peter
>
>On 10/28/15 3:52 PM, Peter Saint-Andre wrote:
>> Example 7 needs to be corrected, too, in accordance with
>> CaseFolding.txt.
>>
>> On 10/28/15 2:54 PM, Peter Saint-Andre wrote:
>>> And here is another correction in Section 3...
>>>
>>> OLD
>>>
>>> Regarding examples 5, 6, and 7: applying Unicode Default Case
>>> Folding to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK
>>> SMALL LETTER SIGMA (U+03C3), and doing so during comparison would
>>> result in matching the nicknames in examples 5 and 6; however,
>>> because the PRECIS mapping rules do not account for the special
>>> status of GREEK SMALL LETTER FINAL SIGMA (U+03C2), the nicknames in
>>> examples 5 and 7 or examples 6 and 7 would not be matched.
>>>
>>> NEW
>>>
>>> Regarding examples 5, 6, and 7: applying Unicode Default Case
>>> Folding to GREEK CAPITAL LETTER SIGMA (U+03A3) results in GREEK
>>> SMALL LETTER SIGMA (U+03C3), and the same is true of GREEK SMALL
>>> LETTER FINAL SIGMA (U+03C2); therefore, the comparison operation
>>> defined in Section 2.4 would result in matching of the nicknames
>>> in examples 5, 6, and 7.
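
The claim in the NEW text can be checked with a short Python sketch. It assumes that str.casefold() tracks Unicode full case folding as listed in CaseFolding.txt; the variable names are illustrative.

```python
# The three sigma forms discussed for examples 5, 6, and 7.
CAPITAL_SIGMA = "\u03a3"      # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
SMALL_FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

# Fold each form and collect the distinct results.
folded_forms = {ch.casefold() for ch in (CAPITAL_SIGMA, SMALL_SIGMA, SMALL_FINAL_SIGMA)}

# A single element means all three forms compare equal after folding.
print(folded_forms == {SMALL_SIGMA})
```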
>>>
>>> On 10/28/15 2:06 PM, Peter Saint-Andre wrote:
>>>> I propose the following text changes:
>>>>
>>>> ###
>>>>
>>>> OLD
>>>>
>>>> 3. Case Mapping Rule: Uppercase and titlecase characters MUST be
>>>> mapped to their lowercase equivalents using Unicode Default Case
>>>> Folding as defined in the Unicode Standard [Unicode] (at the time
>>>> of this writing, the algorithm is specified in Chapter 3 of
>>>> [Unicode7.0]). In applications that prohibit conflicting
>>>> nicknames, this rule helps to reduce the possibility of confusion
>>>> by ensuring that nicknames differing only by case (e.g.,
>>>> "stpeter" vs. "StPeter") would not be presented to a human user
>>>> at the same time.
>>>>
>>>> NEW
>>>>
>>>> 3. Case Mapping Rule: Unicode Default Case Folding MUST be
>>>> applied, as defined in the Unicode Standard [Unicode] (at the time
>>>> of this writing, the algorithm is specified in Chapter 3 of
>>>> [Unicode7.0]). The primary result of doing so is that uppercase
>>>> characters are mapped to lowercase characters. In applications
>>>> that prohibit conflicting nicknames, this rule helps to reduce
>>>> the possibility of confusion by ensuring that nicknames
>>>> differing only by case (e.g., "stpeter" vs. "StPeter") would not
>>>> be presented to a human user at the same time.
>>>>
>>>> ###
>>>>
>>>> (The foregoing was previously sent to the list.)
>>>>
>>>> ###
>>>>
>>>> OLD
>>>>
>>>> 2.3. Enforcement
>>>>
>>>> An entity that performs enforcement according to this profile MUST
>>>> prepare a string as described in Section 2.2 and MUST also apply
>>>> the rules specified in Section 2.1. The rules MUST be applied in
>>>> the order shown.
>>>>
>>>> After all of the foregoing rules have been enforced, the entity
>>>> MUST ensure that the nickname is not zero bytes in length (this is
>>>> done after enforcing the rules to prevent applications from
>>>> mistakenly omitting a nickname entirely, because when
>>>> internationalized characters are accepted, a non-empty sequence of
>>>> characters can result in a zero-length nickname after
>>>> canonicalization).
>>>>
>>>> 2.4. Comparison
>>>>
>>>> An entity that performs comparison of two strings according to
>>>> this profile MUST prepare each string and enforce the rules as
>>>> specified in Sections 2.2 and 2.3. The two strings are to be
>>>> considered equivalent if they are an exact octet-for-octet match
>>>> (sometimes called "bit-string identity").
>>>>
>>>> NEW
>>>>
>>>> 2.3. Enforcement
>>>>
>>>> An entity that performs enforcement according to this profile MUST
>>>> prepare a string as described in Section 2.2 and MUST also apply
>>>> the following rules specified in Section 2.1 in the order shown:
>>>>
>>>> 1. Additional Mapping Rule
>>>> 2. Normalization Rule
>>>> 3. Directionality Rule
>>>>
>>>> After all of the foregoing rules have been enforced, the entity
>>>> MUST ensure that the nickname is not zero bytes in length (this is
>>>> done after enforcing the rules to prevent applications from
>>>> mistakenly omitting a nickname entirely, because when
>>>> internationalized characters are accepted, a non-empty sequence of
>>>> characters can result in a zero-length nickname after
>>>> canonicalization).
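
A rough Python sketch of the proposed enforcement order follows. Several things here are assumptions rather than quotations from the draft: that the Additional Mapping Rule maps code points of general category Zs to U+0020, trims leading and trailing spaces, and collapses interior runs of spaces; that the Normalization Rule is NFKC; and that the Directionality Rule imposes nothing. The Section 2.2 preparation step is omitted entirely, and the function names are hypothetical.

```python
import unicodedata


def additional_mapping(text):
    """Assumed rule: map Zs code points to U+0020, trim, collapse runs of spaces."""
    spaced = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)
    return " ".join(part for part in spaced.split(" ") if part)


def enforce_nickname(text):
    """Apply the enforcement rules in order; note there is no case mapping here."""
    result = additional_mapping(text)               # 1. Additional Mapping Rule
    result = unicodedata.normalize("NFKC", result)  # 2. Normalization Rule (assumed NFKC)
    # 3. Directionality Rule: assumed to add no requirement in this sketch.
    if len(result.encode("utf-8")) == 0:
        # A non-empty input can still end up as a zero-length nickname.
        raise ValueError("nickname is empty after enforcement")
    return result


print(enforce_nickname("  St  Peter "))  # case is preserved by enforcement
```

An input consisting only of NO-BREAK SPACE (U+00A0) is non-empty on the wire but is rejected by this sketch, which is the situation the zero-length check is meant to catch.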
>>>>
>>>> 2.4. Comparison
>>>>
>>>> An entity that performs comparison of two strings according to
>>>> this profile MUST prepare each string as specified in Section 2.2
>>>> and MUST apply the following rules specified in Section 2.1 in the
>>>> order shown:
>>>>
>>>> 1. Additional Mapping Rule
>>>> 2. Case Mapping Rule
>>>> 3. Normalization Rule
>>>> 4. Directionality Rule
>>>>
>>>> The two strings are to be considered equivalent if they are an
>>>> exact octet-for-octet match (sometimes called "bit-string
>>>> identity").
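
The comparison procedure above might be sketched in Python as follows. The same caveats apply as for any illustration: the details of the Additional Mapping Rule (Zs code points mapped to U+0020, trimming, collapsing spaces), the choice of NFKC, the empty Directionality Rule, and UTF-8 as the octet encoding are assumptions, str.casefold() stands in for Unicode Default Case Folding, and the Section 2.2 preparation step is left out. Function names are hypothetical.

```python
import unicodedata


def additional_mapping(text):
    """Assumed rule: map Zs code points to U+0020, trim, collapse runs of spaces."""
    spaced = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)
    return " ".join(part for part in spaced.split(" ") if part)


def comparison_key(text):
    """Apply the four comparison rules in the listed order and return octets."""
    result = additional_mapping(text)               # 1. Additional Mapping Rule
    result = result.casefold()                      # 2. Case Mapping Rule
    result = unicodedata.normalize("NFKC", result)  # 3. Normalization Rule (assumed NFKC)
    # 4. Directionality Rule: assumed to add no requirement in this sketch.
    return result.encode("utf-8")


def nicknames_match(left, right):
    """Octet-for-octet comparison of the two derived keys."""
    return comparison_key(left) == comparison_key(right)


print(nicknames_match("StPeter", "stpeter"))
```

Because case folding happens only inside comparison_key(), the stored or displayed nickname keeps whatever case the user chose.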
>>>>
>>>> ###
>>>>
>>>> In addition, some variation on John's proposed text about toLowerCase
>>>> vs. toCaseFold might be appropriate at the end of Section 4; however,
>>>> I'm still not sure that is necessary if we move the case mapping rule
>>>> to the comparison operation.
>>>>
>>>> Peter
>>>>
>>>> On 10/27/15 8:09 PM, Peter Saint-Andre wrote:
>>>>> On 10/27/15 11:32 AM, John C Klensin wrote:
>>>>>> Response to Monday's note immediately below; response to today's
>>>>>> follows it. My apologies, but it is probably important to read
>>>>>> both. My further apologies for the length of this note, but I
>>>>>> think we are in deep trouble here,
>>>>>
>>>>> Internationalization always seems to be a matter of how deep the
>>>>> trouble
>>>>> is...
>>>>>
>>>>>> trouble that is aggravated by
>>>>>> precis-mappings and precis-nickname both being post-approval and
>>>>>> that, as far as I know, there are no future plans for PRECIS
>>>>>> work (having precis-nickname in AUTH48 just emphasizes that --
>>>>>> see comment at end).
>>>>>
>>>>> We had not planned to work on PRECIS because we thought we were done
>>>>> for a while. If that's not the case and we need to fix things, then
>>>>> so be it. Whether there is sufficient and continued energy for such
>>>>> work is another question. Personally I don't want us to have broken
>>>>> RFCs out there.
>>>>>
>>>>>> --On Monday, October 26, 2015 15:51 -0600 Peter Saint-Andre -
>>>>>> &yet <[email protected]> wrote:
>>>>>>
>>>>>>> My apologies for the delayed reply. Comments inline.
>>>>>>
>>>>>> A few remarks below... I can't tell whether we disagree or
>>>>>> whether at least one of us, probably me, is not being
>>>>>> adequately clear. (Material on which we fairly clearly agree
>>>>>> elided.)
>>>>>>
>>>>>>
>>>>>>> On 10/1/15 7:50 AM, John C Klensin wrote:
>>>>>>>> --On Wednesday, September 30, 2015 15:16 -0600 Peter
>>>>>>>> Saint-Andre - &yet <[email protected]> wrote:
>>>>>>> ...
>>>>>>>> Peter,
>>>>>>>>
>>>>>>>> While your proposed text is an improvement,
>>>>>>>
>>>>>>> Happy to hear it. All I intended was a slight clarification.
>>>>>>
>>>>>> But I'm not certain we are there yet...
>>>>>
>>>>> Agreed. The text I proposed addressed only a very small part of the
>>>>> problem.
>>>>>
>>>>>>>> the desire of many
>>>>>>>> people for a magic "just tell me what to do" formula, one that
>>>>>>>> lets them avoid understanding the issues, may call for a
>>>>>>>> little more:
>>>>>>>
>>>>>>> There is always a need for more when it comes to i18n.
>>>>>>
>>>>>> But I think it is a little more than that. I've heard several
>>>>>> times, including in PRECIS meetings, requests for "just tell me
>>>>>> what to do and make sure it isn't complicated" (or "I don't want
>>>>>> to have to think about, much less understand, the issues"). We
>>>>>> can debate whether giving in to those requests in the I18n case
>>>>>> is wise. I think it leads directly to conclusions equivalent to
>>>>>> "I understand my own script and writing system (or think I do)
>>>>>> and therefore, since all writing systems must be pretty much the
>>>>>> same, I understand all of the core issues in terms of my script
>>>>>> and understanding". That, in turn, leads directly to the "how
>>>>>> do you spell 'Zürich'?" and "all spellings of 'Zuerich' should
>>>>>> be treated as equivalent" discussions that sounded like they
>>>>>> dominated a BOF at IETF 93.
>>>>>>
>>>>>> Now I actually think it is reasonable for someone to ask for a
>>>>>> library that will do the job most of the time and that will
>>>>>> almost never cause their users or customers to get angry at
>>>>>> them. But, if we are going to call what we do "standards", they
>>>>>> should contain sufficient information that would-be library
>>>>>> authors can know what to do ... or understand that they are in
>>>>>> over their heads. And, for these particular cases, we may need
>>>>>> to explain, or help the library authors explain, why some cases
>>>>>> will fail and, indeed, get users mad at vendors.
>>>>>>
>>>>>>
>>>>>>>> (1) First, toCaseFold is _not_ toLowerCase. Saying "The
>>>>>>>> primary result of doing so is that uppercase characters are
>>>>>>>> mapped to lowercase characters" is true for toCaseFold,
>>>>>>>
>>>>>>> By "primary" I meant two things: (1) lowercasing is what
>>>>>>> happens to the preponderance of code points and (2) this is
>>>>>>> the result that most people care about.
>>>>>>
>>>>>> If I parse the above correctly, I think you are wrong. I think
>>>>>> what most people want, care about, and think they are getting,
>>>>>> is lower case conversion, i.e., an operation that preserves
>>>>>> lower case characters and converts upper case characters to the
>>>>>> equivalent lower case. toCaseFold isn't that operation. It is
>>>>>> a much more complex and subtle operation that, as well as
>>>>>> converting upper case characters to lower case, sometimes
>>>>>> converts lower case characters to different lower case
>>>>>> characters (or strings of them). It also requires a fairly good
>>>>>> understanding of Unicode (not just a relevant script) and
>>>>>> historical Unicode decisions to predict its behavior and to have
>>>>>> any hope of explaining that behavior to users. If one is
>>>>>> trying to compare (as distinct from converting), then toCaseFold
>>>>>> may be exactly what is wanted, but it is really hard to explain
>>>>>> or justify that in terms of "nicknames" or "aliases", which are
>>>>>> about conversion. And, if one hopes to explain what is going
>>>>>> on to users in terms of "lower casing", then toCaseFold is just
>>>>>> the wrong operation. That is what toLowerCase is for and the
>>>>>> two operations are just not equivalent.
>>>>>
>>>>> My recollection, quite possibly inaccurate or incomplete, from at
>>>>> least one and I think several in-person meetings of the PRECIS WG
>>>>> was: just use Unicode Default Case Folding because if you use
>>>>> anything else or try to roll your own you will be fubar forever. I
>>>>> do not recall any discussion of the issues you have raised in this
>>>>> thread (e.g., about the inadvisability of using case folding for
>>>>> anything but comparison operations) until the last few weeks.
>>>>> However, I freely admit that's probably because, through my own
>>>>> faults and ignorance, I didn't understand what you were saying.
>>>>>
>>>>>> FWIW and purely by coincidence wrt PRECIS and this document, I
>>>>>> had a conversation a few days ago with an expert on Arabic (and
>>>>>> Persian) calligraphy and writing systems (and good general
>>>>>> knowledge of writing systems) who is quite insistent that any
>>>>>> procedure we use for case-insensitive matching (e.g., case
>>>>>> folding) is discriminatory, inconsistent, and just
>>>>>> badly-thought-out if that same procedure doesn't treat isolated,
>>>>>> initial, and medial forms of the same character as equivalent.
>>>>>> He further strengthens his case (sic) by noting that Unicode
>>>>>> case folding maps U+03C2 (GREEK SMALL LETTER FINAL SIGMA,
>>>>>> unambiguously a lower-case character) to U+03C3 (GREEK SMALL
>>>>>> LETTER SIGMA), a relationship that depends entirely on
>>>>>> positional use and not case. He also believes the same
>>>>>> relationships should apply to all other scripts that make form
>>>>>> distinctions for some characters based on positions in a string
>>>>>> and for which Unicode has chosen to assign different code
>>>>>> points. Even if there were wide acceptance of his view, Unicode
>>>>>> stability principles would prevent changing toCaseFold (or
>>>>>> CaseFolding.txt), but this is more evidence that what toCaseFold
>>>>>> does and does not do is going to be hard to explain to either
>>>>>> casual users or to writing system experts whose primary
>>>>>> experience is not with the Greek-Latin-Cyrillic group.
>>>>>>
>>>>>> I don't think we want to say "these matching rules are somewhat
>>>>>> arbitrary and irrational, but, if you don't like it, blame
>>>>>> Unicode and not us", if only because it is our choice to use
>>>>>> those matching rules. More below.
>>>>>>
>>>>>>
>>>>>>> ...
>>>>>>>> (2) Second, probably as a result of having IDNA in the lead,
>>>>>>>> we've gotten sloppy about language and operations and should
>>>>>>>> probably start untangling that before it gets people in
>>>>>>>> trouble.
>>>>>>>
>>>>>>> Where is the right place to do that untangling? (I doubt that
>>>>>>> it is the precis-nickname document.)
>>>>>>
>>>>>> I agree that precis-nickname isn't the ideal place. I also
>>>>>> believe that you and it are the innocent victims of the
>>>>>> situation. At the same time, I don't believe IETF should be
>>>>>> producing incomplete, ambiguous, erroneous, or misleading
>>>>>> standards because no one could get around to doing the right
>>>>>> foundational work.
>>>>>
>>>>> Agreed. I too want to get this right, even though it's not a lot of
>>>>> fun and it's certainly more work than I thought I was signing up for
>>>>> at the NEWPREP BoF years ago.
>>>>>
>>>>>>>> The Unicode Standard, at least as I understand it, is fairly
>>>>>>>> clear that the most important (and really only safe) use of
>>>>>>>> toCaseFold is as part of a comparison operation.
>>>>>>>
>>>>>>> Thanks for noting that. For example, Section 5.18 of Unicode
>>>>>>> 8.0.0 says:
>>>>>>>
>>>>>>> Caseless matching is implemented using case folding, which
>>>>>>> is the
>>>>>>> process of mapping characters of different case to a
>>>>>>> single form, so
>>>>>>> that case differences in strings are erased. Case folding
>>>>>>> allows for
>>>>>>> fast caseless matches in lookups because only binary
>>>>>>> comparison is
>>>>>>> required. It is more than just conversion to lowercase.
>>>>>>
>>>>>> Right. But, again, when its use is appropriate (a very
>>>>>> controversial topic in itself with our painful IDNA history with
>>>>>> Final Sigma, Eszett and the case-independent versus
>>>>>> position-independent controversy called out above as examples)
>>>>>> that is "matches in lookups" (what I've described elsewhere as
>>>>>> "comparison only"). Not creating or defining nicknames or
>>>>>> aliases. And that _is_ a problem for this document.
>>>>>
>>>>> I'm not convinced that things are as bad as you think. If we say in
>>>>> draft-ietf-precis-nickname that the case mapping rule is to be
>>>>> applied only as part of comparison and not as part of enforcement -
>>>>> which I think is really what we care about (e.g., to prevent
>>>>> spoofing of users in chat rooms) - then I think we might be most of
>>>>> the way there.
>>>>>
>>>>>>>> Using your
>>>>>>>> example it is entirely reasonable to treat, "stpeter" and
>>>>>>>> "StPeter" as equivalent in a comparison operation, but
>>>>>>>> accepting one string and changing it to the other for display
>>>>>>>> may not be a really good idea. While that transformation may
>>>>>>>> be acceptable (although I would be surprised if there were no
>>>>>>>> people who share your surname who could consider "stpeter" or
>>>>>>>> "Stpeter" unacceptable and might even believe that "StPeter"
>>>>>>>> is an unacceptable substitute for "St. Peter"),
>>>>>>>
>>>>>>> I do receive email at [email protected] intended for
>>>>>>> [email protected] but that's a separate topic...
>>>>>>
>>>>>> One that is relevant because it "works" as a side-effect of a
>>>>>> decision Google has made about mailbox name equivalence, a
>>>>>> decision that, IMO, will sooner or later get someone into a lot
>>>>>> of trouble and, more important, a decision and matching rule
>>>>>> that PRECIS, AFAICT, does not allow and that IDNA unambiguously
>>>>>> forbids.
>>>>>>
>>>>>>>> it also points out the
>>>>>>>> dangers of using Basic Latin script examples to illustrate
>>>>>>>> situations in which even more extended Latin script, much less
>>>>>>>> other scripts, may raise more complex issues. Because IDNA
>>>>>>>> is essentially a workaround because changing the DNS
>>>>>>>> comparison rules was impractical for several reasons, we
>>>>>>>> ended up using toCaseFold to map characters and strings into
>>>>>>>> others in IDNA2003 but PRECIS implementations that do not
>>>>>>>> have the same constraints would, in general, be better off
>>>>>>>> confining the use of toCaseFold, or even toLowerCase, to
>>>>>>>> comparison operations.
>>>>>>>
>>>>>>> Unfortunately we didn't do that in RFC 7564 or RFC 7613. Does
>>>>>>> it make sense for this nickname specification to differ in
>>>>>>> this respect from the published RFCs? Shall we file errata
>>>>>>> against those documents? (This might apply only to RFC 7613,
>>>>>>> which says to apply case folding as part of the enforcement
>>>>>>> process - when exactly to apply case folding is not stipulated
>>>>>>> by RFC 7564.)
>>>>>>
>>>>>> To the extent to which this is a "botched that because the WG
>>>>>> didn't understand the issues well enough" conclusion, it would
>>>>>> be entirely reasonable to generate an updating RFC that repairs
>>>>>> 7613 and/or 7564, even doing so in an addendum to
>>>>>> precis-nickname if that is the only way to do that
>>>>>> expeditiously. Per the above, we really don't want to give
>>>>>> library routine writers bad instructions. As I understand it,
>>>>>> the current position of the RFC Editor and IESG is that
>>>>>> technical specification errors discovered in retrospect or after
>>>>>> people start using a spec are not appropriate topics for errata.
>>>>>> If the WG is not willing to do any of those things, then I
>>>>>> suggest that precis-nickname at least needs to contain a very
>>>>>> clear warning notice about this situation (see my response to
>>>>>> your question 1 below).
>>>>>
>>>>> I think we'll probably need to fix 7613 and 7564. I am hoping we can
>>>>> fix nickname now so that it is less incorrect than the other two.
>>>>> That doesn't necessarily mean we won't need to also further fix
>>>>> nickname later on.
>>>>>
>>>>> Granted, we were supposed to avoid this problem by working on all of
>>>>> the PRECIS specs simultaneously. Clearly we have not avoided the
>>>>> problem, so we need to solve it one way or another. If that means
>>>>> bis for them all, we need to deal with it.
>>>>>
>>>>>>>> (3) Because toCaseFold loses information when used for more
>>>>>>>> than comparison (for comparison, it merely contributes to
>>>>>>>> what some people would consider false positives for matching),
>>>>>>>> involves some controversial decisions, and, because of
>>>>>>>> stability requirements, cannot be changed even if the
>>>>>>>> controversies are resolved in other ways, we end up with,
>>>>>>>> e.g.,
>>>>>>>> toCaseFold ("Nuß") -> "nuss"
>>>>>>>> which is considered an acceptable transformation in some
>>>>>>>> places that identify themselves as speaking/using German and
>>>>>>>> two different unacceptable errors in others. Again, this will
>>>>>>>> almost always be much more serious if the transformation is
>>>>>>>> used to map and replace strings than if it is used to compare
>>>>>>>> (fwiw, that particular example is part of a continuing
>>>>>>>> disagreement between IDNA2008 and, among others, German
>>>>>>>> domain registry authorities on one side and UTC and UTR 46 on
>>>>>>>> the other).
>>>>>>>
>>>>>>> Agreed.
>>>>>>
>>>>>> See "warning notice" comment above and question 1 response below.
>>>>>>
>>>>>>>> (4) If the motivation is really to avoid confusion, the
>>>>>>>> correct confusion-blocking rule for Latin script (but not
>>>>>>>> others) and many languages that use it (but certainly not
>>>>>>>> all) involves moving beyond toCaseFold and making all
>>>>>>>> "decorated" characters (characters normally represented by
>>>>>>>> glyphs consisting of a Basic Latin character and one or more
>>>>>>>> diacritical or equivalent markings) compare equal to their
>>>>>>>> base characters, e.g., "á" not only matches "Á" but also
>>>>>>>> "a" and "A" and, as an unfortunate side-effect, maybe "À"
>>>>>>>> and "à" as well. This is bad news for languages in which
>>>>>>>> decorated Latin characters are used to represent phonetically
>>>>>>>> and conceptually different characters, not just pronunciation
>>>>>>>> variations. I am not qualified to evaluate "how bad". In
>>>>>>>> addition, extrapolations from this principle about Latin
>>>>>>>> script to unrelated scripts will almost certainly lead to
>>>>>>>> serious errors and/or additional confusion.
>>>>>>>
>>>>>>> I would not be comfortable going that far...
>>>>>>
>>>>>> In case it isn't clear, I would not be either. But it is where
>>>>>> getting sloppy about this stuff could easily take us. It is
>>>>>> worth noting that it also identifies one of the difficulties
>>>>>> with doing a global system to be applied to many types of
>>>>>> applications (like the PRECIS work) and then applying it in user
>>>>>> interface software that end users will expect to be localized to
>>>>>> their assumptions because it has been mapped or translated into
>>>>>> their language (if one normally speaks Upper Slobbovian but has
>>>>>> some familiarity with English, an application interface in
>>>>>> English will probably be expected to be "foreign", odd, and
>>>>>> maybe even inconsistent with whatever expectations exist; but,
>>>>>> if the interface is in Upper Slobbovian, the natural and
>>>>>> reasonable assumption will be that the matching should conform
>>>>>> to normal Upper Slobbovian conventions). FWIW, a matching rule
>>>>>> that says:
>>>>>>
>>>>>> (i) Two instances of a base character with the same
>>>>>> diacritical mark(s) match.
>>>>>> (ii) Two instances of a base character with different
>>>>>> diacritical mark(s) do not match.
>>>>>> (iii) Two instances of a base character, one with
>>>>>> diacritical mark(s) and the other without any decoration
>>>>>> match.
>>>>>>
>>>>>> is precisely correct and normal behavior for at least one
>>>>>> language that uses Latin script. It is also the normal practice
>>>>>> for at least one Latin script transcription system that is used
>>>>>> by a large fraction of a billion people (maybe more).
>>>>>
>>>>> That is indeed sobering.
>>>>>
>>>>>>>> More on this and Tom's question below...
>>>>>>>>
>>>>>>>>> On 9/29/15 3:28 PM, Tom Worster wrote:
>>>>>>>>>> Peter, Alexey,
>>>>>>>>>>
>>>>>>>>>> I think there is an ambiguity in the specification of case
>>>>>>>>>> mapping in RFC 7613 and draft-ietf-precis-nickname-19.
>>>>>>>>> ...
>>>>>>>>>> But there are 55 code points in Unicode 7.0.0 that change
>>>>>>>>>> under default case folding that are neither uppercase nor
>>>>>>>>>> titlecase characters, 12 of which are Lowercase_Letter. I
>>>>>>>>>> suspect this stems from a confusion between Unicode case
>>>>>>>>>> mapping and case folding.
>>>>>>
>>>>>> In the context of the above, a different way to say the same
>>>>>> thing is that people are looking at toCaseFold and assuming (and
>>>>>> explaining things in terms of) toLowerCase. toCaseFold works
>>>>>> the way it is expected to and those 55 code points are, more or
>>>>>> less, collateral damage to get to a matching algorithm that
>>>>>> favors false positives over false negatives and various edge
>>>>>> cases (including in "edge cases" languages spoken by, and script
>>>>>> variations used by, millions of people).
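
For anyone who wants to look at those code points directly, here is a hedged Python sketch that lists every code point that changes under case folding without being an uppercase (Lu) or titlecase (Lt) letter. It uses str.casefold(), i.e. full case folding against whatever Unicode version the Python runtime ships, so the total will not necessarily equal the 55 reported for Unicode 7.0.0. The function name is hypothetical.

```python
import sys
import unicodedata


def folded_but_not_upper_or_title():
    """Code points that case folding changes even though they are not Lu or Lt."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if ch.casefold() != ch and unicodedata.category(ch) not in ("Lu", "Lt"):
            found.append(ch)
    return found


changed = folded_but_not_upper_or_title()
lowercase_letters = [ch for ch in changed if unicodedata.category(ch) == "Ll"]
print(len(changed), "code points change;", len(lowercase_letters), "of them are Ll")
print("Unicode version in this runtime:", unicodedata.unidata_version)
```

Both LATIN SMALL LETTER SHARP S (U+00DF) and GREEK SMALL LETTER FINAL SIGMA (U+03C2) show up in the list, which matches the examples raised elsewhere in this thread.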
>>>>>
>>>>> Sadly I suspect that is an accurate description of the current state
>>>>> of affairs (modulo my comment above about PRECIS WG discussions at
>>>>> one or more IETF meetings).
>>>>>
>>>>>>> ...
>>>>>>> After all that, I have 3 questions:
>>>>>>
>>>>>> Personal opinions about answers...
>>>>>>
>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>> should make that change before the nickname I-D is published
>>>>>>> as an RFC?
>>>>>>
>>>>>> I think the clarification is an improvement and is important
>>>>>> enough to incorporate (I know that is the answer to a slightly
>>>>>> different question).
>>>>>>
>>>>>> However, I think it is inadequate without a serious warning
>>>>>> about the situation.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> That warning could appear in either this
>>>>>> document or RFC 7613 (or 7613bis) with a pointer from the other,
>>>>>> but, unless you want to revise 7613 now, this one is handy.
>>>>>
>>>>> I suspect that we need to revise 7613. I suspect that we might also
>>>>> need to revise 7564 (at least with respect to the order in which
>>>>> operations are applied, since there has been some confusion among
>>>>> implementers).
>>>>>
>>>>> Well, we always knew that we would need to revise them. Just not so
>>>>> soon.
>>>>>
>>>>>> Comment about possible text below.
>>>>>>
>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>> document is published as an RFC?
>>>>>>
>>>>>> Yes. If something is used for "enforcement", it should be lower
>>>>>> casing or something else that can be explained to people who are
>>>>>> ordinarily familiar with one or more of the scripts that make
>>>>>> case distinctions.
>>>>>>
>>>>>> However, viewed in the light of this discussion, the whole
>>>>>> "enforcement" concept becomes a little dicey, especially if, as
>>>>>> I believe but don't have time to verify, the transformations
>>>>>> performed by toLowerCase are not a proper subset of those
>>>>>> performed by toCaseFold.
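
One data point on the proper-subset question, sketched in Python. It assumes the runtime's Unicode tables are version 8.0 or later (where lowercase Cherokee letters exist) and that str.lower() and str.casefold() are acceptable stand-ins for toLowerCase and toCaseFold. The Cherokee pair is an illustrative choice, not something raised in this thread.

```python
CHEROKEE_CAPITAL_A = "\u13a0"  # CHEROKEE LETTER A
CHEROKEE_SMALL_A = "\uab70"    # CHEROKEE SMALL LETTER A (added in Unicode 8.0)


def lower_and_fold(ch):
    """Return the (lowercased, case-folded) forms of a single character."""
    return ch.lower(), ch.casefold()


for ch in (CHEROKEE_CAPITAL_A, CHEROKEE_SMALL_A):
    lowered, folded = lower_and_fold(ch)
    print("U+%04X" % ord(ch), "lower: U+%04X" % ord(lowered), "fold: U+%04X" % ord(folded))
```

If the assumptions hold, lowercasing changes the capital letter while case folding leaves it alone (and instead folds the small letter toward the capital), so lowercasing is not simply a subset of what folding does.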
>>>>>
>>>>> My initial thought is that case mapping doesn't belong in the
>>>>> nickname enforcement operation at all - only in the comparison
>>>>> operation.
>>>>>
>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>
>>>>>> I think that is necessary. Following up on the comment above, I
>>>>>> would prefer that the current Section 3.2.2 (3) of RFC 7613
>>>>>> either point to Unicode Lower Casing or contain a warning along
>>>>>> the lines of that below.
>>>>>
>>>>> Unlike the nickname profile (which I think can be cleaned up by
>>>>> moving the case mapping rule to the comparison operation and
>>>>> continuing to use Unicode Default Case Folding), I think you are
>>>>> right that for the UsernameCaseMapped profile we probably want
>>>>> Unicode Lower Casing. Thus the likely need, sooner rather than
>>>>> later, for 7613bis.
>>>>>
>>>>>>
>>>>>> ----------
>>>>>>
>>>>>> --On Tuesday, October 27, 2015 07:52 -0600 Peter Saint-Andre
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> This issue has greater urgency now because
>>>>>>> draft-ietf-precis-nickname is now in AUTH48...
>>>>>>>
>>>>>>> On 10/26/15 3:51 PM, Peter Saint-Andre - &yet wrote:
>>>>>>>
>>>>>>>> After all that, I have 3 questions:
>>>>>>>>
>>>>>>>> (1) Is my proposed text enough of a clarification that we
>>>>>>>> should make that change before the nickname I-D is published
>>>>>>>> as an RFC?
>>>>>>>
>>>>>>> I think so.
>>>>>>
>>>>>> See above.
>>>>>>
>>>>>>>> (2) Should we modify draft-ietf-precis-nickname so that case
>>>>>>>> folding is applied only as part of comparison and not as part
>>>>>>>> of enforcement? If so, should we make that change before this
>>>>>>>> document is published as an RFC?
>>>>>>>
>>>>>>> Although it seems to be the case that Unicode case folding is
>>>>>>> primarily designed for the purpose of matching (i.e.,
>>>>>>> comparison),
>>>>>>
>>>>>> "Seems" is a little weak. The Unicode Standard is really quite
>>>>>> specific about that.
>>>>>>
>>>>>>> I have a concern that applying the PRECIS case
>>>>>>> mapping rule after applying the normalization and
>>>>>>> directionality rules might have unintended consequences that
>>>>>>> we haven't had a chance to consider yet. The PRECIS framework
>>>>>>> expresses a preference (actually a hard requirement) for
>>>>>>> applying the rules in a particular order. We made a late
>>>>>>> change to the username profiles (RFC 7613), such that width
>>>>>>> mapping is applied first (in order to accommodate fullwidth
>>>>>>> and halfwidth characters in certain East Asian scripts).
>>>>>>> Making a late change to the nickname profile also concerns me,
>>>>>>> even though both of these late changes seem reasonable on the
>>>>>>> face of it. I will try to find time to think about this
>>>>>>> further in the next 24 hours.
>>>>>>
>>>>>> First, a hint for the consideration process: there is a reason
>>>>>> why Unicode now supports a unified case folding and
>>>>>> normalization operation. My recollection is that it is not only
>>>>>> more efficient to perform both operations at once (rather than
>>>>>> looking in one table and then the other), but that there are
>>>>>> some order-dependent or priority-dependent cases.
>>>>>>
>>>>>> The very fact that this issue exists (and is coming up again)
>>>>>> this late in the process (7613 published in August, WG winding
>>>>>> down and not, e.g., meeting next week) calls at least the PRECIS
>>>>>> quality of review and some fairly fundamental model issues into
>>>>>> question. I first raised that issue a rather long time ago but
>>>>>> have continued to hope that we have an approximation to "good
>>>>>> enough" without going back and rethinking everything.
>>>>>>
>>>>>> The right solution, IMO, is that, if RFC 7613 is to rationalize
>>>>>> or explain the operation in terms of converting upper case
>>>>>> characters to lower case, then it should be using toLowerCase
>>>>>> because that is what the operation does. After a quick look at
>>>>>> 7613, amending/updating it to simply convert to lower case would
>>>>>> be straightforward (and would not raise the ordering issue
>>>>>> called out above). It would presumably require another IETF
>>>>>> Last Call, however, and I'd hope we would see some serious
>>>>>> discussion within the WG (and with UTC) before making the change
>>>>>> and about how it is explained.
>>>>>>
>>>>>> If we are not willing to make a change
>>>>>
>>>>> I'm willing. It would, as you note, require some careful thinking and
>>>>> review to make sure that we got it (more) right this time.
>>>>>
>>>>>> that significant and/or
>>>>>> if we conclude that the WG (and perhaps the IETF) have
>>>>>> completely run out of energy for dealing with i18n issues [1],
>>>>>> then I suggest that we introduce some additional text. I've
>>>>>> just spent a half-hour trying to find the AUTH48 copy of
>>>>>> precis-nickname (aka RFC-to-be-7700), but the RFC Editor has
>>>>>> apparently changed naming conventions and the various queue
>>>>>> entry pages all point to the -19 I-D and not the current working
>>>>>> copy so I can't try to match text and insertion point to what is
>>>>>> there already.
>>>>>
>>>>> http://www.rfc-editor.org/authors/rfc7700.txt
>>>>>
>>>>>> The suggestion is a patch (and a hack), not a
>>>>>> good fix but something like it is probably the least drastic
>>>>>> measure that would yield something that doesn't contain
>>>>>> unexplained known defects.
>>>>>>
>>>>>> Rough version of suggested text (possibly to go after your
>>>>>> revised paragraph and following up my comments in my 1 October
>>>>>> note). Some of the terminology needs checking which I can do if
>>>>>> you want to go this route:
>>>>>>
>>>>>> 'Users of this specification should note that the
>>>>>> concept of "lower case conversion" is somewhat elusive
>>>>>> and more dependent on the conventions of different
>>>>>> languages and notation systems that use the same script
>>>>>> than may appear obvious at first glance, especially if
>>>>>> that glance is at Basic Latin characters (i.e., the
>>>>>> ASCII letter repertoire). Unicode provides two
>>>>>> different mapping procedures that produce lower-case
>>>>>> characters, but they have different effects and results
>>>>>> for many characters. The more conservative one,
>>>>>> typically appropriate when lower case forms
>>>>>> are needed, is actual lower-casing (embodied in the
>>>>>> Unicode operation toLowerCase). A more radical
>>>>>> operation, normally suitable only for string matching in
>>>>>> situations in which it is better to consider uncertain
>>>>>> cases as matching than to treat them as distinct, is
>>>>>> called "Case Folding" (Unicode operation toCaseFold).
>>>>>> While the two operations will often produce the same
>>>>>> results, Case Folding maps some lower case characters
>>>>>> into others and performs other transformations that may
>>>>>> be intuitively reasonable and expected for some users
>>>>>> and quite astonishing (or just wrong) to others. There
>>>>>> may be no practical alternative, especially if the
>>>>>> operations are to be used for mapping or enforcement, to
>>>>>> developers of PRECIS-dependent applications recognizing that the
>>>>>> cases in which the two yield different results require
>>>>>> careful understanding of the relevant user base and its
>>>>>> needs [2].'
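A small sketch of the difference between the two operations, assuming Python's `str.lower()` and `str.casefold()` as approximations of Unicode toLowerCase and toCaseFold respectively. The sample characters are picked for illustration; the helper name `folding_differs` is hypothetical.

```python
# Characters chosen for illustration; names follow the Unicode character names.
samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER SHARP S": "\u00df",
    "GREEK SMALL LETTER FINAL SIGMA": "\u03c2",
}


def folding_differs(ch: str) -> bool:
    """Return True when case folding and lowercasing disagree for ch."""
    return ch.lower() != ch.casefold()


for name, ch in samples.items():
    # ascii() keeps the output readable on terminals without Greek fonts.
    print(f"{name}: lower={ascii(ch.lower())} "
          f"casefold={ascii(ch.casefold())} "
          f"differs={folding_differs(ch)}")
```

For the ASCII letter the two operations agree. For the sharp s and the final sigma, both of which are already lower case, lowercasing leaves them alone while case folding maps them to other characters.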
>>>>>
>>>>> Thanks.
>>>>>
>>>>> I am not sure if we need something like that if we move case mapping
>>>>> (here, case folding) to the comparison operation only - but something
>>>>> like that might still be appropriate.
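A rough sketch of what "case folding in comparison only" could look like. This is not the nickname profile as specified. The function names `enforce` and `same_nickname` are hypothetical, the whitespace handling and the use of NFKC are simplifying assumptions, and the order of folding and normalization in the comparison key is itself an assumption.

```python
import unicodedata


def enforce(nickname: str) -> str:
    """Hypothetical enforcement step: trim and collapse whitespace,
    then apply NFKC. Case is deliberately preserved."""
    collapsed = " ".join(nickname.split())
    return unicodedata.normalize("NFKC", collapsed)


def same_nickname(a: str, b: str) -> bool:
    """Hypothetical comparison step: case folding is applied here only,
    to a temporary key that is never stored or displayed."""
    def key(s: str) -> str:
        return unicodedata.normalize("NFKC", enforce(s).casefold())
    return key(a) == key(b)


print(enforce("  Foo   Bar "))              # stored form keeps its case
print(same_nickname("Foo Bar", "foo bar"))  # comparison ignores case
```

With this split, the stored nickname keeps whatever case the user typed, and the more aggressive folding only affects whether two nicknames are considered the same.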
>>>>>
>>>>>>>> (3) Should we update RFC 7613 so that case folding is applied
>>>>>>>> only as part of comparison and not as part of enforcement?
>>>>>>>
>>>>>>> That is less urgent so I suggest that we address the nickname
>>>>>>> spec first.
>>>>>>
>>>>>> Unless you (or someone else here) have a plausible plan to
>>>>>> continue and revitalize the WG and assign it that revision work
>>>>>> (and bring everyone actively participating up to the level
>>>>>> needed to easily understand this discussion thread and feel
>>>>>> embarrassed for not spotting the problems), I think we need to
>>>>>> assume that this is our last shot. Absent an active and
>>>>>> committed WG, "do this first" could easily be equivalent to
>>>>>> "don't get around to the other, ever".
>>>>>
>>>>> As mentioned, I don't want to have broken RFCs out there.
>>>>>
>>>>>> I think that the particular set of issues that started this
>>>>>> thread is a known defect in the PRECIS specs, both nickname and
>>>>>> 7613, and that we are obligated to either fix the problems or at
>>>>>> least explain them. The above warning text is an attempt to
>>>>>> explain and identify the problems even if it does not actually
>>>>>> provide a solution. If it were published as part of
>>>>>> precis-nickname, it could include a statement to the effect that
>>>>>> it should also be treated as an update to 7613 or, if the IESG
>>>>>> and RFC Editor would agree in advance to accept, rather than
>>>>>> bury, the thing, I suppose we could publish it in
>>>>>> precis-nickname and create an erratum to 7613 indicating that it
>>>>>> should have included some form of that statement. Neither
>>>>>> option implies a huge amount of work to update 7613. But I
>>>>>> think that making the changes of (2) without doing anything
>>>>>> about (3) makes the two documents inconsistent with each other
>>>>>> and that would be an additional known defect.
>>>>>>
>>>>>> Procedural question: given that precis-nickname is in AUTH48 as
>>>>>> of yesterday and I don't see anything blocking publication next
>>>>>> week if you and Barry sign off on the revised text that the WG
>>>>>> hasn't seen,
>>>>>
>>>>> There is no revised text yet. That's why we're having this discussion.
>>>>>
>>>>>> does someone need to file a pro forma objection/
>>>>>> appeal to block that until this is sorted out and the WG has a
>>>>>> chance to review proposed publication text?
>>>>>
>>>>> I see no reason to invoke the specter of appeals quite yet. Everyone is
>>>>> working in good faith to do the right thing and get this mess cleaned
>>>>> up.
>>>>>
>>>>>> [1] I believe our collective inability to deal with the
>>>>>> within-script character forms that do not normalize to each
>>>>>> other because of language-dependent or other usage factors can
>>>>>> be taken as evidence of having run out of energy,
>>>>>
>>>>> Or in my case simple ignorance of some of the relevant issues and
>>>>> examples. It's not easy to know about all of this.
>>>>>
>>>>>> but it is
>>>>>> probably in the interest of finishing the PRECIS work to try to
>>>>>> treat that as a separate issue.
>>>>>
>>>>> Probably.
>>>>>
>>>>>> [2] Not unlike the reason to differentiate between NFC and NFKC
>>>>>> and understand the effects of each.
>>>>>
>>>>> Another thing that's not easy to grok in fullness.
>>>>>
>>>>> Peter