<a href="http://www.paypаl.com/"
The above is from:
http://secunia.com/multiple_browsers_idn_spoofing_test/
As you can see, this HTML snippet uses a "numeric character reference" (the &#1072;, which decodes to U+0430 CYRILLIC SMALL LETTER A, visually identical to Latin "a" in most fonts) instead of the Punycode form of the name. Now, you may point out that HTML authors and editors should use the Punycode form, but the above shows that it is possible to type IDNs by hand.
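To make the mechanics concrete, here is a short sketch using Python's standard library: `html.unescape` resolves the numeric character reference, and the built-in "idna" codec applies RFC 3490 ToASCII (nameprep plus Punycode) the way a conforming client would before the DNS lookup. The exact spoofed hostname is taken from the snippet above; everything else is illustration.

```python
import html

# Decode the numeric character reference from the HTML snippet above.
host = html.unescape("www.payp&#1072;l.com")

# The second "a" of "paypal" is now U+0430, CYRILLIC SMALL LETTER A,
# which is visually identical to Latin "a" in most fonts.
assert host[8] == "\u0430"

# Python's "idna" codec implements RFC 3490 ToASCII (with nameprep),
# so encoding shows the ASCII form a client would actually look up.
ace = host.encode("idna")
print(ace)  # an ASCII-only name whose middle label begins with "xn--"
```

Note that the round trip is lossless: `ace.decode("idna")` gives back the mixed-script name, which is why the Punycode form alone does not tell a user anything was amiss.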
HTML has a long history of allowing users to type it manually, and, regrettably, the browsers have an equally long history of *accepting* practically anything that they type, whether it be well-formed SGML or not.
This is how we ended up with such a huge mass of garbage: the Web browsers took the "be liberal in what you accept..." rule a bit too literally. But I digress.
My point is that if it's possible to type IDNs manually into HTML, then you might end up with some HTML documents that use the very characters that you want to ban in the next rev of nameprep. I am certainly not claiming that there will be very many of those. All I'm asking is, how do we find out whether they are going to be a problem?
Erik
Erik van der Poel wrote:
I'm probably missing something, but if the apps are not currently warning the user when characters from blocks that ought to be banned appear, then people and/or tools may be generating references (e.g. HTML documents) to those characters without realizing it. Those characters are quietly mapped to some other base characters, which then work OK in the DNS lookup.
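The scenario described here, a reference containing a character that quietly maps to base characters and still resolves, is easy to reproduce. A minimal sketch with Python's "idna" codec (RFC 3490 ToASCII); the ligature and hostname are my own illustration, not from this thread:

```python
# LATIN SMALL LIGATURE FI (U+FB01) typed into a hostname label.
link_host = "\ufb01le.example"

# ToASCII (nameprep + Punycode) silently turns the ligature into "fi",
# so the DNS lookup succeeds against the plain-ASCII name -- and the
# author of the document never learns the character was rewritten.
assert link_host.encode("idna") == b"file.example"
```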
Maybe it is unlikely that a lot of such references would come to exist, and it wouldn't be such a burden to the user of a new app to see the occasional error e.g. when they click on such a link.
But how do you determine how many HTML documents contain bad characters in their links, and how do you decide that that number is low enough to make such a change to the spec?
[...] How do you know that such a change does not also impact the users of new clients accessing existing documents?
Oh wait, I know! Just get Google to do a survey in their cache?
John C Klensin wrote:
(i) A change that would largely impact what can be registered needs to be reflected and implemented only in 250-odd registries. The registry operators are mostly on their toes, communicate with each other, and many of them are pretty early in their implementation of IDNs and conservative about what they are permitting. Getting them to make changes is an entirely different sort of problem than, e.g., trying to change already-installed browsers or client plugins or getting people to upgrade them.
(ii) The main things I've seen in observing and working with registries that I didn't understand well enough a couple of years ago to argue forcefully are things that we might be able to change because the impact of whether someone was running an old or new version would not be large.

For example, IDNA makes some mappings that are dubious, not in the technical sense of whether the characters are equivalent, but in the human factors sense of whether treating them as equivalent leads to bad habits. To take a handy example from a Roman ("Latin")-based script, I now suspect that permitting all of those font-variant "mathematical" characters to map onto their lower-case ASCII equivalents is a bad idea, just because it encourages users to assume that, if something looks like a particular base character, it is that character. That, in turn, increases the perceptual window for these phishing attacks.

If, instead, we had simply banned those characters, creating an error if someone tried to use one rather than a quiet mapping into something else, we might have been better off. So I now think we should have banned them when IDNA and nameprep were defined and think I could have made that case very strongly had I understood the issues the way I do now.

Is it worth making that change today? I don't know. But I suggest that it would be possible to make it for two reasons: (a) such a change would not change the number of strings or characters that can be registered at all: only the base characters can actually appear in an IDNA string post the ToUnicode(ToASCII(char)) operation pair and (b) if I were a browser or other application producer, I'd be seriously considering warnings if any characters from those blocks appeared... something IDNA certainly does not prohibit.
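Both halves of this argument can be observed directly with Python's `encodings.idna.nameprep` (an implementation of RFC 3491). The first assertion shows the quiet font-variant folding; the `warn_on_math_chars` helper is purely an illustrative policy of the kind suggested in (b), using the Unicode Mathematical Alphanumeric Symbols block range (U+1D400..U+1D7FF):

```python
import encodings.idna

# MATHEMATICAL BOLD small letters spelling "paypal"
# (U+1D41A is bold "a"; the block runs U+1D400..U+1D7FF).
bold_paypal = "\U0001D429\U0001D41A\U0001D432\U0001D429\U0001D41A\U0001D425"

# nameprep quietly folds the font variants onto plain ASCII letters, so
# only the base characters survive -- exactly the silent equivalence
# that arguably should have been an error instead.
assert encodings.idna.nameprep(bold_paypal) == "paypal"

def warn_on_math_chars(label: str) -> bool:
    """Illustrative client-side policy only: flag any character from
    the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF)."""
    return any(0x1D400 <= ord(c) <= 0x1D7FF for c in label)

assert warn_on_math_chars(bold_paypal)
assert not warn_on_math_chars("paypal")
```

Since the mapping happens before registration, a warning like this loses nothing: no registered name can ever contain the pre-mapped characters.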
Changes that increase the number of registerable characters are problematic, but not that problematic if they don't pick up a character that now maps and make it "real" (which is the problem with deciding that upper case Omega is a good idea). Reducing the number of characters that can be registered --making a now-valid base character invalid-- would be a much harder problem.
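The Omega case can be checked the same way: today both GREEK CAPITAL LETTER OMEGA and OHM SIGN merely map under nameprep, so only the small omega is a possible base character, and making the capital "real" would change that set. A quick check with Python's RFC 3491 implementation:

```python
import encodings.idna

# GREEK CAPITAL LETTER OMEGA (U+03A9) and OHM SIGN (U+2126) both fold
# to GREEK SMALL LETTER OMEGA (U+03C9) under nameprep today, so only
# the small omega can ever appear in a registered label.
assert encodings.idna.nameprep("\u03a9") == "\u03c9"
assert encodings.idna.nameprep("\u2126") == "\u03c9"
```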
