At 22:21 01/07/16 +0200, Patrik Fältström wrote:

>It's also the case that I only see "I support UTF-8" when a real solution
>with UTF-8 would be something like IDNA, with UTF-8 as the output and not
>ACE encoded strings, let's call it IDNU. :-)
Yes. Actually it would be quite easy to convert the IDNA draft to an IDNU draft, more or less by replacing ACE with UTF-8. Of course a second pass to smooth some rough edges may be required.

>Because of the fact that some software already do UTF-8 (without nameprep),
>and people on this list which should know better say "UTF-8 and not ACE"
>when they should point at the actual algorithm used, I am extremely worried
>that what I see on this list is IDNA (which include nameprep) or "just send
>UTF-8 without nameprep".

In reality, "just send UTF-8 with nameprep" is definitely what's needed, and it is much better than "just send UTF-8 without nameprep".

>If not, explain how the software which today send UTF-8 on the wire will
>stop doing that the day we release the IDNU proposal? Who will suffer and
>discover what software do nameprep and not?

This software will most probably be upgraded. If it's browsers, I'll definitely contribute by talking to the right people. I'm very sure browser vendors prefer UTF-8 with nameprep over ACE with nameprep.

Anyway, I think that nameprep is in various ways very important, but on the other hand, its importance has also been highly overestimated. So the amount of suffering is much more limited than it may seem. It is very important to understand that in most contexts, when somebody takes a proper domain name from paper and inputs it, the chance that the label is changed by nameprep (except for lowercasing, which was part of the UTF-8 proposal from the start) is *very* small. There are some well-known and important exceptions, such as half-width/full-width kana and some implementations of Vietnamese (windows-1258), but for many areas of the world, nameprep, in its relevant parts, just confirms what's done anyway.
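A short sketch with Python's unicodedata module illustrates the point. Note that this is only a rough approximation of nameprep, made up for illustration: the real profile also prohibits certain characters and applies bidi checks; here it is reduced to default lowercasing followed by Unicode normalization form KC.

```python
import unicodedata

def approx_nameprep(label: str) -> str:
    # Very rough approximation of nameprep, for illustration only.
    # The real profile does more (prohibited characters, bidi rules);
    # here: default Unicode lowercasing, then normalization form KC.
    return unicodedata.normalize("NFKC", label.lower())

# A typical label is unchanged except for lowercasing:
print(approx_nameprep("Example"))                       # example

# Half-width katakana KA (U+FF76) is one of the well-known
# exceptions: NFKC folds it to full-width KA (U+30AB).
print(approx_nameprep("\uFF76") == "\u30AB")            # True

# Much of NFKC's compatibility mapping is "garbage collection" of
# characters nobody would type into a domain name, e.g. U+3392
# SQUARE MHZ, which NFKC expands to the plain letters "MHz":
print(unicodedata.normalize("NFKC", "\u3392"))          # MHz
# NFC, by contrast, leaves such characters alone:
print(unicodedata.normalize("NFC", "\u3392") == "\u3392")  # True

# Default Unicode lowercasing is not locale-aware: a Turkish user
# expects dotless i (U+0131) as the lowercase of 'I', but gets:
print("I".lower())                                      # i
```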
The reason for this is that this property is *designed* into NFC, and that most of NFKC (but not all of it) is garbage collection: it covers many things that are in some way similar but that the user would never want to type in (because they really look different from the real thing) and that are in addition difficult to type; a lot of examples can be found, e.g., in the blocks U+32xx and U+33xx.

So what we very much need is a very clear definition of which names are acceptable and which names are not, and a high enough checking/compliance rate (somewhere between 50% and 90%) on the request side to put enough pressure on the registry side to make compliance on that side 100%.

What is also quite beneficial is to have some clear guidelines as to which characters should be mapped to others before lookup and which not. Half-width/full-width is a typical example. But currently, we are e.g. forbidding somebody who makes Turkish software to use the case mapping that a Turkish user would expect, just because we want to avoid problems if a user without any idea about Turkish casing rules ever uses that software.

The current tendencies of 'better check once too often than not often enough' (with which I agree in principle) and 'better uniform and sometimes wrong than according to user expectations' (about which I have serious doubts) seem to have led us to overshoot our goals.

Regards,   Martin.
