<a href="http://www.paypаl.com/"
The above is from:
http://secunia.com/multiple_browsers_idn_spoofing_test/
As you can see, this HTML snippet uses a "numeric character reference" (the &#1072;, which decodes to U+0430 CYRILLIC SMALL LETTER A, visually identical to Latin "a" in most fonts) instead of the Punycode form of the name. Now, you may point out that HTML authors and editors should use the Punycode form, but the above shows that it is possible to type IDNs by hand.
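To make the mechanics concrete, here is a short sketch using Python's standard library: `html.unescape` resolves the numeric character reference, and the built-in "idna" codec applies RFC 3490 ToASCII (nameprep plus Punycode) the way a conforming client would before the DNS lookup. The exact spoofed hostname is taken from the snippet above; everything else is illustration.

```python
import html

# Decode the numeric character reference from the HTML snippet above.
host = html.unescape("www.payp&#1072;l.com")

# The second "a" of "paypal" is now U+0430, CYRILLIC SMALL LETTER A,
# which is visually identical to Latin "a" in most fonts.
assert host[8] == "\u0430"

# Python's "idna" codec implements RFC 3490 ToASCII (with nameprep),
# so encoding shows the ASCII form a client would actually look up.
ace = host.encode("idna")
print(ace)  # an ASCII-only name whose middle label begins with "xn--"
```

Note that the round trip is lossless: `ace.decode("idna")` gives back the mixed-script name, which is why the Punycode form alone does not tell a user anything was amiss.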
HTML has a long history of allowing users to type it manually, and, regrettably, the browsers have an equally long history of *accepting* practically anything that they type, whether it be well-formed SGML or not.
This is how we ended up with such a huge mass of garbage: the Web browsers took the "be liberal in what you accept..." rule a bit too literally. But I digress.
My point is that if it's possible to type IDNs manually into HTML, then you might end up with some HTML documents that use the very characters that you want to ban in the next rev of nameprep. I am certainly not claiming that there will be very many of those. All I'm asking is, how do we find out whether they are going to be a problem?
Erik
Erik van der Poel wrote:
I'm probably missing something, but if the apps are not currently warning the user when characters from blocks that ought to be banned appear, then people and/or tools may be generating references (e.g. HTML documents) to those characters without realizing it. Those characters are quietly mapped to some other base characters, which then work OK in the DNS lookup.
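The scenario described here, a reference containing a character that quietly maps to base characters and still resolves, is easy to reproduce. A minimal sketch with Python's "idna" codec (RFC 3490 ToASCII); the ligature and hostname are my own illustration, not from this thread:

```python
# LATIN SMALL LIGATURE FI (U+FB01) typed into a hostname label.
link_host = "\ufb01le.example"

# ToASCII (nameprep + Punycode) silently turns the ligature into "fi",
# so the DNS lookup succeeds against the plain-ASCII name -- and the
# author of the document never learns the character was rewritten.
assert link_host.encode("idna") == b"file.example"
```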
Maybe it is unlikely that a lot of such references would come to exist, and it wouldn't be such a burden to the user of a new app to see the occasional error e.g. when they click on such a link.
But how do you determine how many HTML documents contain bad characters in their links, and how do you decide that that number is low enough to make such a change to the spec?
[...] How do you know that such a change does not also impact the users of new clients accessing existing documents?
Oh wait, I know! Just get Google to do a survey in their cache?
John C Klensin wrote:
(i) A change that would largely impact what can be registered needs to be reflected and implemented only in 250-odd registries. The registry operators are mostly on their toes, communicate with each other, and many of them are pretty early in their implementation of IDNs and conservative about what they are permitting. Getting them to make changes is an entirely different sort of problem than, e.g., trying to change already-installed browsers or client plugins or getting people to upgrade them.
(ii) The main things I've seen in observing and working with registries that I didn't understand well enough a couple of years ago to argue forcefully are things that we might be able to change because the impact of whether someone was running an old or new version would not be large.

For example, IDNA makes some mappings that are dubious, not in the technical sense of whether the characters are equivalent, but in the human factors sense of whether treating them as equivalent leads to bad habits. To take a handy example from a Roman ("Latin")-based script, I now suspect that permitting all of those font-variant "mathematical" characters to map onto their lower-case ASCII equivalents is a bad idea, just because it encourages users to assume that, if something looks like a particular base character, it is that character. That, in turn, increases the perceptual window for these phishing attacks.

If, instead, we had simply banned those characters, creating an error if someone tried to use one rather than a quiet mapping into something else, we might have been better off. So I now think we should have banned them when IDNA and nameprep were defined and think I could have made that case very strongly had I understood the issues the way I do now.

Is it worth making that change today? I don't know. But I suggest that it would be possible to make it for two reasons: (a) such a change would not change the number of strings or characters that can be registered at all: only the base characters can actually appear in an IDNA string post the ToUnicode(ToASCII(char)) operation pair and (b) if I were a browser or other application producer, I'd be seriously considering warnings if any characters from those blocks appeared... something IDNA certainly does not prohibit.
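Both halves of this argument can be observed directly with Python's `encodings.idna.nameprep` (an implementation of RFC 3491). The first assertion shows the quiet font-variant folding; the `warn_on_math_chars` helper is purely an illustrative policy of the kind suggested in (b), using the Unicode Mathematical Alphanumeric Symbols block range (U+1D400..U+1D7FF):

```python
import encodings.idna

# MATHEMATICAL BOLD small letters spelling "paypal"
# (U+1D41A is bold "a"; the block runs U+1D400..U+1D7FF).
bold_paypal = "\U0001D429\U0001D41A\U0001D432\U0001D429\U0001D41A\U0001D425"

# nameprep quietly folds the font variants onto plain ASCII letters, so
# only the base characters survive -- exactly the silent equivalence
# that arguably should have been an error instead.
assert encodings.idna.nameprep(bold_paypal) == "paypal"

def warn_on_math_chars(label: str) -> bool:
    """Illustrative client-side policy only: flag any character from
    the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF)."""
    return any(0x1D400 <= ord(c) <= 0x1D7FF for c in label)

assert warn_on_math_chars(bold_paypal)
assert not warn_on_math_chars("paypal")
```

Since the mapping happens before registration, a warning like this loses nothing: no registered name can ever contain the pre-mapped characters.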
Changes that increase the number of registerable characters are problematic, but not that problematic if they don't pick up a character that now maps and make it "real" (which is the problem with deciding that upper case Omega is a good idea). Reducing the number of characters that can be registered --making a now-valid base character invalid-- would be a much harder problem.
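The Omega case can be checked the same way: today both GREEK CAPITAL LETTER OMEGA and OHM SIGN merely map under nameprep, so only the small omega is a possible base character, and making the capital "real" would change that set. A quick check with Python's RFC 3491 implementation:

```python
import encodings.idna

# GREEK CAPITAL LETTER OMEGA (U+03A9) and OHM SIGN (U+2126) both fold
# to GREEK SMALL LETTER OMEGA (U+03C9) under nameprep today, so only
# the small omega can ever appear in a registered label.
assert encodings.idna.nameprep("\u03a9") == "\u03c9"
assert encodings.idna.nameprep("\u2126") == "\u03c9"
```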
