--On Wednesday, September 03, 2014 16:08 +0300 Andrew Sullivan
<[email protected]> wrote:

> On Wed, Sep 03, 2014 at 08:35:24AM -0400, John C Klensin wrote:
>> > 
>> > I don't see any case in those two strings.  What am I
>> > missing?
>> 
>> That applying ToCaseFold(fußball) yields "fussball".
> 
> Yes, I know.  But this is sort of my point.  If the matching
> behaviour of those two strings differ because of _case_
> treatment, it's a recipe for user puzzlement at best:
> 
> "Oh, that's because of case folding."
> 
> "Um, it's all lower case?"

Indeed.  We are in complete agreement... I was just trying to
explain how we got here.  And part of that is that our
practice has been to apply not "toLowerCase" to strings (which
many people think they understand even though it can be a little
ambiguous in common practice) but rather the case folding
operation, which few people not familiar with Unicode understand.
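
To make the difference concrete, here is a minimal sketch.  It
assumes Python's str.lower() and str.casefold() as stand-ins for
"toLowerCase" and for Unicode full case folding; the two helper
names are made up for illustration and are not from any
specification.

```python
# Illustration only: str.lower() and str.casefold() stand in for
# "toLowerCase" and Unicode full case folding (ToCaseFold).
# Both helper names are hypothetical.

def matches_by_lowercase(a: str, b: str) -> bool:
    """Compare two strings after simple lowercasing."""
    return a.lower() == b.lower()


def matches_by_casefold(a: str, b: str) -> bool:
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()


pairs = [
    ("Fussball", "fussball"),  # differs only in ASCII case
    ("fußball", "fussball"),   # all lower case, differs in sharp s
]

for a, b in pairs:
    print(a, b, matches_by_lowercase(a, b), matches_by_casefold(a, b))
```

Under lowercasing, the second pair does not match; under case
folding it does, even though neither string contains an upper-case
character.  That is exactly the "it's all lower case?" puzzle.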

>> I would question the "deep tradition" because, in reality, it
>> applies only to the ASCII repertoire, i.e., a subset of the
>> set of undecorated Latin characters.
> 
> Yes, but surely for someone German it's handy to be able to use
> CamelCase to make strings somewhat more memorable, and having
> it work when the string has no decorations and not work when
> there are decorations is, at least, hard to understand. 

Sure.  But that is, IMO, for two reasons.  First, we "trained"
them to expect those variations to match by requiring it for
ASCII, thereby setting a very bad precedent for the i18n cases.
Second, and more important, it is unclear to me why the users of
slightly-decorated Latin text (German included) should expect
case matching to work while we tell (in no particular order) 

(i) Those same Germans that strings containing "ö" and those
containing "oe" don't match.

(ii) Our Greek, Hebrew, and Arabic script-using colleagues that,
if Unicode supplies different code points for final (or other)
forms than whatever is considered the base, then strings
containing the final form are not equal to otherwise-identical
ones that don't.

(iii) Our ASCII-using colleagues that, while they can write
MyFavoriteMovie.example and treat it as equivalent to
"myfavoritemovie.example", their preferred form will usually be
preserved by the DNS but sometimes won't.

(iv) Our English-speaking colleagues that "color" and "colour"
are different, even though most of their everyday experience
says otherwise.

(v) Lots of groups that joespizza.example and
joeʼspizza.example (that is not a single quote/apostrophe) are
not equivalent to each other, that JoesPizza.example is
equivalent to the first but JoeʼsPizza.example is invalid, and
that pizzeriajoe and pizzeragiovanni are not equivalent to any
of them or each other.  FWIW, I would not be at all surprised if
some clever search engine did consider all of the above to be
plausible alternatives to each other.

(vi, ff)  <endless numbers of additional examples>


> The
> only responsible advice in light of IDNA2008, in my opinion,
> is, "Use domain names in lower case, always."

And that advice is part of what I meant in an earlier note when
I suggested that users avoid edge cases and cases in which they
expected the computer system to figure out what they meant and
do it.  Users who don't want surprises and name registrants or
creators who don't either will stick to lower case and strings
as uncomplicated as possible, and will expect strings to match
only if what is in the database and what they type are
identical.  A month ago, I would have written "modulo
normalization", but the recent discussions about U+08A1 and
related code points make even that suspect.
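
As a rough sketch of why "modulo normalization" is weaker than it
sounds, the following assumes Python's unicodedata module and NFC
as the normalization form; the helper name is hypothetical.  It
relies on U+08A1 having no canonical decomposition, which is the
point the recent discussions turned on.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# "o with diaeresis": precomposed form vs. base letter plus combining mark.
precomposed_o = "\u00f6"
decomposed_o = "o\u0308"

# Arabic beh with hamza above: single code point vs. beh plus combining hamza.
beh_hamza_single = "\u08a1"
beh_plus_hamza = "\u0628\u0654"

print(nfc_equal(precomposed_o, decomposed_o))
print(nfc_equal(beh_hamza_single, beh_plus_hamza))
```

The first comparison succeeds because NFC composes the sequence
into the precomposed character.  The second fails because
normalization does not relate the two forms, so strings that may
look the same to a reader still do not match.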

    john

_______________________________________________
precis mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/precis
