Basically, the work to install filters at the registries, and the work to write the next version of the nameprep spec can proceed in parallel, pretty much independently.
As George points out, the registries are going to have to start filtering IDN lookalikes, otherwise they will eventually face lawsuits from the "big boys" (as George so delightfully puts it). The ccTLDs will have a relatively easy task, while the gTLDs like .com will have the difficult task of deciding which subset of Unicode to allow. They will also have to go through their database, looking for lookalikes, and deleting them or transferring them to new owners, probably paying their previous owners back. The registrars might have to be involved in the money transaction too. What a mess. I don't envy the gTLDs. Maybe the Unicode Consortium could help them out by providing homograph tables.
One possible approach for the gTLDs is to halt IDN registration now. Then they can work on their filters, starting with a small subset of Unicode. After reopening IDN registration, they can grow the subset if there is enough demand for characters outside the initial subset.
If the gTLDs are going to do some serious subsetting, then they will probably also need to provide software to the registrars that will map users' characters into the subset, e.g. by converting a user's local charset to the subset of Unicode. Then again, this might be an area where registrars could compete with each other, to provide the most friendly software to the end-user (registrant).
On the other side, we have the nameprep spec, and the work required to rev it. As John Klensin points out in another email, nameprep will eventually have to be updated to include new Unicode characters. Nameprep specifies Unicode 3.2, but Unicode itself is already at 4.0.1, and may be even further along by the time we finish discussing and drafting nameprep bis (new version). Call it nameprep2.
Now, one item that is clearly on nameprep2's table is the new version of Unicode. Another item that could be considered is the banning of slash '/' homographs and others. This type of spoofing was recently discussed on the IDN list. Certain Unicode blocks, like the math characters, might also be banned instead of mapped as they are now. And I'm sure we would discuss mapping or banning the homographs, such as Cyrillic small 'a'. A lot of this is likely to be controversial, and some people might suggest that we leave the subsetting to the registries, since they have to do it anyway. So, instead of shrinking the character set, nameprep2 might just grow it (for the new version of Unicode). I don't know. We'll see.
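To make "mapped as they are now" concrete, here is a small Python sketch. It relies on Python's built-in "idna" codec, which implements the current (Unicode 3.2) nameprep plus Punycode; the helper name ace() is only for illustration. It shows a math letter being folded to plain ASCII, while Cyrillic small 'a' passes through untouched and gets its own ACE label.

```python
import unicodedata

def ace(label: str) -> bytes:
    """Run nameprep + Punycode on one label via Python's built-in codec."""
    return label.encode("idna")

math_bold_a = "\U0001D400"   # MATHEMATICAL BOLD CAPITAL A
cyrillic_a = "\u0430"        # CYRILLIC SMALL LETTER A

# The math letter is mapped: nameprep's NFKC + case folding turns it into "a".
print(unicodedata.name(math_bold_a), "->", ace(math_bold_a))

# The Cyrillic letter is neither mapped nor banned, so it is encoded as a
# distinct name that merely looks like "a" when displayed.
print(unicodedata.name(cyrillic_a), "->", ace(cyrillic_a))
```

Banning the math block would turn the first case into an error; mapping or banning homographs would change the second case, which is where the controversy starts.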
I'm not sure whether we would need a new ACE prefix if we are only adding characters (and not removing any). I'm too tired right now to think about it.
Erik
George W Gerrity wrote:
The two references below summarise much that has been said about the difficulty of dealing with the internationalisation of Domain Names. Let us agree once and for all:
1. The completely general problem is mathematically */and/* computationally intractable, even if we use fuzzy mapping;
2. The problem is a typical engineering challenge to find a workable solution, future-proofed as much as possible, which is minimally complex;
3. If the engineers (us?) don't solve it, the lawyers will have a field day, the courts will find expensive solutions, the cost of running the web will blow out, and all of us will have mud all over our faces;
4. Now is the time, when there are only a very few registered names with possible clashes, to do it before we */have/* to go through the painful process of unregistering names and upgrading TLD machine codes.
So let's sketch out an approach, using <.com.ru> as an example.
a) The <.com.ru> registrar only accepts Latin characters for that domain name, or only accepts Cyrillic characters, */no mix/*, and maps the two as equivalent. Case-equivalence mapping */may/* also be allowed, at a cost of more complexity. Let the registrar decide that, and let's be sure that, as far as possible, the issuing authority licensing the TLD to the registrar ensures legal protection for these */arbitrary/*, but fixed decisions.
b) The first filter selects for special attention those name tags whose codes (including diacritics, etc.) are neither all in the Cyrillic block nor all in the Latin block(s).
My guess is that at this point, only a few percent will require special attention.
c) At this point, the <.com.ru> registrar will need to exercise some common sense. For instance, it seems unreasonable that this domain should accept codes outside the Latin and Cyrillic code blocks, and if they do, then mixes should be strongly discouraged. Certainly, the use of, say, Hebrew vowel pointing with Latin codes, while perhaps acceptable in the Israel TLD, should be unacceptable in the Russia TLD. In fact, as a general rule, mixes of diacritics from one code block with code points from another should never be allowed.
Further rules can limit legal sequences of the allowed mixes. For instance, in alphabetic scripts such as Latin and Cyrillic, isolated code points from one script found in another make no sense unless spoofing is intended. Earlier, I suggested that a code-point string of a single script found mixed with strings of other scripts, should be of minimum length 2. One can also limit the number of separate substrings of an alternate script found interspersed with a dominant (national?) script.
These sorts of common-sense rules can be easily implemented and the computational overhead is minimal. Of course, owners of ridiculous trade marks (such as <U+004B U+0049 U+039B>, "KIΛ", for the brand name of the automobile "KIA") will disagree, but realism has to intrude somewhere into the free market economy.
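A minimal sketch of such a sieve in Python, under stated assumptions: the allowed set (Latin plus Cyrillic) and the minimum run length of 2 come from the text above, but the function names and the three-way accept/reject/review result are invented here for illustration. The script test uses the first word of the Unicode character name as a crude stand-in for a real script property, and combining marks are not handled (they would simply be rejected).

```python
import unicodedata
from itertools import groupby

ALLOWED = {"LATIN", "CYRILLIC"}   # assumed policy for a .com.ru-style registry
MIN_RUN = 2                       # minimum length of any single-script run

def script_of(ch):
    # Crude stand-in for a script property: the first word of the character
    # name ("LATIN", "CYRILLIC", "GREEK", ...).
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def sieve(label):
    """Return 'accept', 'reject', or 'review' for one name tag."""
    # Digits and hyphens belong to no script; leave them out of the test.
    letters = [ch for ch in label if not (ch.isdigit() or ch == "-")]
    runs = [(s, len(list(g))) for s, g in groupby(letters, key=script_of)]
    scripts = {s for s, _ in runs}
    if len(scripts) <= 1 and scripts <= ALLOWED:
        return "accept"            # a single allowed script: no attention needed
    if not scripts <= ALLOWED:
        return "reject"            # something outside the allowed blocks
    if any(n < MIN_RUN for _, n in runs):
        return "reject"            # isolated code point from the other script
    return "review"                # plausible mix: toss it to a human

print(sieve("kia"))                 # all Latin
print(sieve("KI\u039b"))            # Latin K, I plus Greek capital lamda
print(sieve("p\u0430ypal"))         # one Cyrillic letter inside Latin
```

Each label is scanned once, so the cost is linear in the label length, and only the "review" cases need an expert.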
The problems for universal TLDs (<.com>, <.net>) are far more complex, because they are required to accept all language scripts. At the TLD itself, one can allow a limited, but finite number of character strings to be equivalent, including the rule that script mixtures are inadmissible, but maybe case folding will be allowed.
Once again, however, application of some judicious sieve filters and rules about how mixed scripts may be composed can simplify the handling of the name tags. There are also sieve rules that can immediately throw out most inadmissible combinations, such as the string length rule mentioned above. Those strings remaining can be tossed to a human, who will be required to be an expert in orthography (nice new line of business for many on the Unicode list?).
Now, it doesn't make sense for these rules to be part of a standard on how to extend Domain names to use scripts other than Latin: they are much better handled as (algorithmic where possible) regulations specified by the authority for a given TLD, or set of TLDs, in the case of the universal TLDs.
By using this approach, and starting off with a set of rules that disallow most forms of script mixes, then where appeals to common sense and the wishes of a reasonable number of potential clients suggest a loosening of the rules, this can be done with little disruption to the existing state of affairs.
George ------
On 22 Feb 2005, at 08:40, Doug Ewell wrote:
Hans Aberg <haberg at math dot su dot se> wrote:
The suggestion I made was to use a function to detect confusables by declaring them equivalent, but retaining the full Unicode character set for representing the IDNs. If this is used at the registration level only, the only thing that happens when somebody enters a confusable is that it is rejected. There is a problem only when an authority admits parallel, confusable names to be registered.
Granted. The problem, as I have said so often, is determining what the set of "confusables" is. Don't just say a/а and o/ο, either; that's only the tip of the iceberg.
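A rough Python sketch of the registration-time check Hans describes. The CONFUSABLES table is a four-entry hypothetical stand-in (building the real one is exactly the hard part Doug points to), and the names skeleton() and Registry are invented for illustration. Names are stored in full Unicode; the equivalence is used only to reject a newcomer that collides with an existing name.

```python
# Hypothetical, deliberately tiny table: lookalike -> representative.
CONFUSABLES = {
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
    "\u043e": "o",   # CYRILLIC SMALL LETTER O
    "\u03bf": "o",   # GREEK SMALL LETTER OMICRON
    "\u0440": "p",   # CYRILLIC SMALL LETTER ER
}

def skeleton(label):
    """Map every character to the representative of its equivalence class."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())

class Registry:
    def __init__(self):
        self._by_skeleton = {}     # skeleton -> name as originally registered

    def register(self, label):
        """Return True if registered, False if a confusable name exists."""
        key = skeleton(label)
        if key in self._by_skeleton:
            return False           # the only effect: the newcomer is rejected
        self._by_skeleton[key] = label   # stored unchanged, in full Unicode
        return True

registry = Registry()
print(registry.register("paypal"))             # first come, first served
print(registry.register("p\u0430yp\u0430l"))   # Cyrillic lookalike of the above
```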
On 22 Feb 2005, at 07:03, Erik van der Poel wrote:
Hans Aberg wrote:
Sure you can change it: One can make the equivalence classes smaller, whenever one wants.
As a mathematician, one might be inclined to think that way. But here, we're not talking about theoretical mathematics. We're talking about network engineering. A totally different way of thinking.
You can't just change the mapping whenever you want because there are many (client and server) installations out there that can't be changed overnight (what is known in network parlance as a "flag day").
For example, even if a registry were to change their mapping, go through their entire database, and delete the names that are determined to be duplicates (however one might accomplish that), there will be people with the old version of the app, which uses the old mapping, and will not be able to find the name (since it has been deleted).
Now, this might be a good thing if the name is an evil spoof, but what about innocent registrations? What if two separate parties have an equally legitimate claim on a particular name? This happens a lot in the ASCII DNS, and basically, whoever got there first (or is willing to pay a lot of money) wins.
One way to continue to support these innocent duplicates is to use a different prefix (i.e. something other than xn--) in the new mapping, and keep the old names (with the old prefix) in the database (instead of deleting them). This way, the old clients continue to find the old innocent names.
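A toy Python model of that idea, as a sketch only. The new prefix "yn--" and the one-rule FOLD table are hypothetical, the zone is just a dict, and the key functions use Python's "punycode" codec without the rest of nameprep. Two innocent names that differ only in Latin versus Cyrillic 'a' stay reachable under the old prefix, but only one of them can own the folded name under the new prefix.

```python
OLD_PREFIX = "xn--"
NEW_PREFIX = "yn--"          # hypothetical prefix for the new mapping
FOLD = {"\u0430": "a"}       # hypothetical new rule: Cyrillic a folds to Latin a

def old_key(label):
    # Old mapping: case folding only, then Punycode under the old prefix.
    return OLD_PREFIX + label.lower().encode("punycode").decode("ascii")

def new_key(label):
    # New mapping: also fold the lookalike, then use the new prefix.
    folded = "".join(FOLD.get(ch, ch) for ch in label.lower())
    return NEW_PREFIX + folded.encode("punycode").decode("ascii")

latin = "caf\u00e9"          # Latin a
mixed = "c\u0430f\u00e9"     # Cyrillic a, registered innocently

zone = {old_key(latin): "site-1", old_key(mixed): "site-2"}  # kept, not deleted
for name, site in [(latin, "site-1"), (mixed, "site-2")]:
    zone.setdefault(new_key(name), site)   # new namespace: first one wins

print(zone.get(old_key(mixed)))   # old client still reaches site-2
print(zone.get(new_key(mixed)))   # new client lands on site-1 instead
```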
But what about the new clients? Now they will suddenly end up on a different Web site when the user clicks on a link. I suppose the user will just have to update their client, or the domain name owner will have to register a different name and update all the Web pages to point to the different name (assuming that they even have control over *all* of the Web pages that might contain a link to their site).
And so on. Do you get it now? You can't just change the mapping "whenever" you want. If you do this at all, you do it as few times as possible.
Now, you may point out that we are just getting started with IDN and that not very many names have been registered (and I may even agree with you), but it would still take a while to come up with a better mapping (and reach consensus on it -- shudder), and in the meantime, more names would be registered.
And this still would not negate my main point, which is that you can't do this "whenever" you want.
Erik
